Single-Channel Speech Enhancement Based on Adaptive Low-Rank Matrix Decomposition

The low-rank matrix decomposition (LMD) algorithm based on the maximum correntropy criterion (MCC) has recently shown its superiority over other algorithms in classification tasks (e.g., face recognition). We extend it to single-channel speech enhancement by exploiting the low-rank structure of speech signals in the time domain. However, a new issue arises: some residual noise remains in the enhanced speech because the algorithm is sensitive to the exact rank value. To address this issue, we propose a novel adaptive LMD (ALMD) algorithm in which an energy threshold technique adaptively updates the effective rank value of each frame of the speech matrix. The proposed ALMD algorithm achieves acceptable performance at low signal-to-noise ratio (SNR) levels without approximating the speech phase with the noisy phase. We compare the ALMD algorithm with common conventional algorithms in Gaussian white noise and non-Gaussian noise conditions. The simulation results demonstrate that the ALMD algorithm outperforms the tested baseline algorithms in terms of the segmental SNR (segSNR), the perceptual evaluation of speech quality (PESQ), and the short-time objective intelligibility (STOI) measure.


I. INTRODUCTION
With the development of speech technology, many speech applications have emerged, such as smartphones, hearing aids, and human-computer interaction. These applications play indispensable roles in human life [1], [2]. However, their performance is often severely degraded by ubiquitous background noise. Therefore, the design of an appropriate speech enhancement algorithm is of great importance. To address this challenging issue, many solutions have emerged during the last fifty years. According to the number of microphones, these approaches can be broadly divided into two categories: multi-channel speech enhancement and single-channel speech enhancement [3], [4]. Multi-channel speech enhancement requires an array of microphones to acquire the desired signal information [5], [6]. In this case, spatial diversity is exploited to separate the target speech signal from the mixed signals. Beamforming is one of the most commonly used technologies for spatial filtering. It uses the array geometry to form a beam toward the direction of the desired speech signal while placing nulls toward non-target directional interference sources [2], [7]. However, beamforming is limited by its reliance on accurate target source orientation information and acoustic transfer functions (ATFs).
The associate editor coordinating the review of this manuscript and approving it for publication was Stavros Ntalampiras.
Single-channel speech enhancement has attracted plenty of attention since it requires only one microphone and provides a more computationally appealing solution [8], [9]. Spectral subtraction (SS) [10], originally presented by Boll [11], is one of the most widely used methods due to its simplicity. The enhanced speech is obtained by subtracting the estimated noise power from the noisy power, where the noise power is estimated with the help of voice activity detection (VAD). Its major drawback is that it suffers from an annoying artifact called "musical noise". Wiener filtering (WF) [12] is a typical filtering algorithm. The main idea of WF is to obtain the target speech with a filter function derived from the minimum estimation error criterion [13]. It is sufficiently robust to additive Gaussian noise but not to non-Gaussian noise. Another popular line of work comprises several schemes based on the minimum mean square error (MMSE) [14], such as the MMSE short-time spectral amplitude estimator (MMSE-STSA) [15], the MMSE spectrum power estimator based on zero cross-terms (MMSE-SPZC) [16], and the MMSE log spectral amplitude estimator (MMSElog) [17]. They improve speech intelligibility and quality, especially in some noisy conditions. However, these MMSE-based methods are computationally expensive and their performance is sensitive to prior knowledge of the noise distribution. Another promising line of work is the subspace approach, which decomposes the noisy speech into a noise subspace and a speech subspace by employing the Karhunen-Loève transform (KLT) or singular value decomposition (SVD) [18], [19]. It works well under the assumption that the speech and noise subspaces are orthogonal, which often fails in actual scenarios [20].
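The spectral subtraction idea described above can be illustrated with a minimal single-frame sketch. It is not taken from [10] or [11]: the function name `spectral_subtract`, the spectral floor, and the assumption that a noise power spectrum `noise_psd` is already available (e.g., from VAD-selected noise-only frames) are all illustrative choices. Note that the noisy phase is reused, which is exactly the approximation that time-domain methods avoid.

```python
import numpy as np

def spectral_subtract(frame, noise_psd, floor=0.01):
    """Power spectral subtraction on one frame (illustrative sketch).

    frame     : 1-D array with the noisy time-domain samples
    noise_psd : estimated noise power per rfft bin (assumed to be given)
    floor     : fraction of the noisy power kept as a lower bound
    """
    spec = np.fft.rfft(frame)
    noisy_power = np.abs(spec) ** 2
    # Subtract the noise power and clip to a floor to avoid negative power.
    clean_power = np.maximum(noisy_power - noise_psd, floor * noisy_power)
    # Recombine the new magnitude with the NOISY phase.
    enhanced = np.sqrt(clean_power) * np.exp(1j * np.angle(spec))
    return np.fft.irfft(enhanced, n=len(frame))
```

Isolated spectral peaks that survive the subtraction are what listeners perceive as musical noise; the floor parameter is one common way to mask them.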
In recent years, a new matrix decomposition theory called robust principal component analysis (RPCA) [21]-[25] has been developed for speech enhancement. In RPCA, the observation matrix is separated into a low-rank component and a sparse component, representing speech and additive noise, respectively. Because of the temporal variability of the low-rank component, RPCA can perform well with nonstationary noise [24]. However, traditional RPCA has two drawbacks. The first is that reasonable performance requires prior knowledge of the signal source. The other is that RPCA exhibits high computational complexity because it applies a singular value decomposition (SVD) to large matrices. Various methods have been proposed to alleviate these problems [23], [26]-[29]. Riemannian robust principal component pursuit (R2PCP) [30] is one well-known improved RPCA algorithm, which uses a tailored Riemannian optimization to avoid a full SVD. However, it assumes that the rank value is fixed and it starts the iteration from a random point. Constrained low-rank and sparse matrix decomposition (CLSMD) [23] is a notable method for RPCA-based speech enhancement. The core idea of CLSMD is to incorporate rank and sparsity constraints into the decomposition of the noisy speech. It successfully deals with residual noise and converges quickly. However, it requires an approximation of the speech phase because it works in the time-frequency domain, and [31] has shown that an inaccurate phase decreases the quality of speech communication. This raises the question of whether there is an approach that avoids both the high computational cost of SVD and the need for phase processing.
For the purpose of comfortable voice communication, we develop a novel single-channel speech enhancement method based on adaptive low-rank matrix decomposition (ALMD), inspired by the low-rank matrix decomposition (LMD) algorithm [32]. In [32], the maximum correntropy criterion (MCC) [33] is adopted to eliminate the effects of outliers, whilst the combination of half-quadratic (HQ) [34] optimization and the greedy bilateral (GreB) [35] paradigm is used to accelerate the computation. Therefore, that approach not only avoids SVD but also displays state-of-the-art performance on face recognition. In addition, in the time domain, the speech signal can be considered a low-rank component because of its short-time stationarity, whereas the background noise can be considered an unknown corruption. Consequently, the LMD algorithm can be extended to speech enhancement. However, a new issue arises: LMD fails to obtain the desired denoising results because it is sensitive to the exact rank value. Here, an energy threshold method is exploited to adaptively update the effective rank value of the clean speech.
The main contributions of our work can be summarized as follows:
• To our knowledge, this is the first time that the MCC-based LMD algorithm has been extended to single-channel speech enhancement by exploiting the low-rank structure of the speech signal.
• The proposed method mainly exploits the low-rank structure of speech signals for matrix decomposition in the time domain. As a result, neither an accurate VAD process nor noise estimation is needed.
• Since the presented algorithm performs denoising only in the time domain, approximating the speech phase by the noisy phase is not required. Therefore, it is more conducive to real-time denoising.
• Owing to the adaptive estimate of the effective rank value, good performance can be obtained in recovering the low-rank speech components.
Experimental results reveal that the ALMD approach performs significantly better than several common baseline methods for various types of noise (white, pink, babble, F16, and hfchannel), especially in low signal-to-noise ratio (SNR) situations.
The rest of this paper is organized as follows. In Section II, the relevant theoretical basis of RPCA is briefly introduced. Then, speech enhancement based on the ALMD approach is described in Section III. In Section IV, we compare the proposed method with existing ones and analyze the numerical results. Finally, Section V concludes our work.

II. RPCA THEORY
Principal component analysis (PCA) [36], [37] is a well-known statistical method for signal processing that can accurately recover a low-dimensional matrix from high-dimensional observations. However, the performance of PCA degrades seriously when large noise or outliers occur. To address this robustness issue, a modified version called RPCA [38] was originally presented by J. Wright. The core idea is that a given noisy signal $Y \in \mathbb{R}^{N \times K}$ can be effectively divided into two parts: a sparse matrix $S \in \mathbb{R}^{N \times K}$ and a low-rank matrix $L \in \mathbb{R}^{N \times K}$ (i.e., $Y = L + S$). Because the speech signal has a sparse structure and uncorrelated white noise has a low-rank structure in the time-frequency domain, the RPCA approach can be applied to speech enhancement. In the RPCA enhancement process, $S$ and $L$ represent the speech and noise components, respectively. The convex optimization problem of RPCA can be described as
$$\min_{L,S} \; \|L\|_* + \lambda \|S\|_1 \quad \text{s.t.} \quad Y = L + S, \tag{1}$$
where $\lambda > 0$ denotes a weighting parameter that trades off the sparsity of $S$ against the low-rankness of $L$. Also, $\|\cdot\|_*$ and $\|\cdot\|_1$ denote the nuclear norm and the $l_1$-norm of a matrix, respectively. Here, the nuclear norm serves as a surrogate for the rank of $L$, and the $l_1$-norm serves as a surrogate for the sparsity of $S$. To solve the optimization problem in (1), various efficient algorithms have been presented, such as the augmented Lagrangian method (ALM) [39], the alternating direction method of multipliers (ADMM) [21], and the accelerated proximal gradient (APG) method [40]. However, since the delicate balance between sparsity and low-rankness may not be satisfied, the RPCA-based approach can fail to obtain the desired result.
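To make the decomposition in (1) concrete, the following is a minimal sketch of an inexact augmented Lagrangian iteration for RPCA. It is not the implementation used in [39] or in this paper: the function names, the default weight λ = 1/√max(N, K), and the penalty schedule are common choices assumed here for illustration. Each iteration alternates singular value thresholding (for L) and entry-wise soft thresholding (for S).

```python
import numpy as np

def soft_threshold(X, tau):
    """Entry-wise shrinkage: proximal operator of the l1-norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def singular_value_threshold(X, tau):
    """Shrink singular values by tau: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def rpca_ialm(Y, lam=None, tol=1e-7, max_iter=500):
    """Split Y into low-rank L and sparse S with an inexact ALM iteration."""
    n1, n2 = Y.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(n1, n2))      # assumed default weight
    spec_norm = np.linalg.norm(Y, 2)
    Z = Y / max(spec_norm, np.abs(Y).max() / lam)   # dual variable
    mu = 1.25 / spec_norm                     # assumed penalty schedule
    mu_max, rho = 1e7 * mu, 1.5
    L = np.zeros_like(Y)
    S = np.zeros_like(Y)
    for _ in range(max_iter):
        L = singular_value_threshold(Y - S + Z / mu, 1.0 / mu)
        S = soft_threshold(Y - L + Z / mu, lam / mu)
        R = Y - L - S                         # constraint residual
        Z = Z + mu * R
        mu = min(rho * mu, mu_max)
        if np.linalg.norm(R) <= tol * np.linalg.norm(Y):
            break
    return L, S
```

The full SVD inside `singular_value_threshold` is the step whose cost on large matrices motivates SVD-free alternatives.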
One notable work is CLSMD [23], which can effectively solve the RPCA problem by incorporating constraints on low-rankness and sparsity. The noisy matrix is described as $Y = L + S + O$, where $O$ represents a residual noise matrix. Before the CLSMD method is applied to recover $S$ and $L$, the spectral magnitude matrix of the noisy speech is smoothed to obtain $\hat{Y}$ as in (2), where $t$ and $k$ denote the time index and the frequency bin of $Y$, respectively. Then, CLSMD alternately solves two sub-problems, recovering $S$ and $L$ from $\hat{Y}$, as defined in (3), where $n$ represents the iteration number and $\|\cdot\|_F$ denotes the Frobenius norm (i.e., $\|E\|_F^2 = \sum_{i,j} e_{ij}^2$). Besides, rank(·) is the rank function of the estimated noise matrix; the rank is usually set to 1 or 2 because of the strong correlation among the column vectors of the background noise. Card(·) plays the role of $\|\cdot\|_1$ and is used to calculate the cardinality of the speech matrix. The drawback of the CLSMD approach is that it adopts the noisy phase to approximate the clean speech phase and uses a hard threshold operator to obtain $S$ and $L$.

III. SPEECH ENHANCEMENT BASED ON ALMD
In this section, we will introduce ALMD-based speech enhancement, which can alleviate the problems of CLSMD and RPCA. More specifically, we first give the problem formulation and then describe the ALMD method.

A. PROBLEM FORMULATION
The task of speech enhancement is considered in this section. Specifically, the enhanced speech is obtained in the time domain rather than the time-frequency domain, which addresses the shortcoming of the approximate speech phase in the CLSMD method. However, because speech is stationary only over short intervals, the noisy signal needs to be divided into $N$ frames ($y_n$, $n = 1, \ldots, N$). Let $y(t) \in \mathbb{R}^{K \times 1}$ denote a frame of the noisy signal vector, which is taken to be the sum of a clean speech vector $l(t) \in \mathbb{R}^{K \times 1}$ and a noise vector $s(t) \in \mathbb{R}^{K \times 1}$, where $K$ is the length of each frame. The additive model is given as
$$y(t) = l(t) + s(t). \tag{4}$$
To show the low-rank characteristics of the $K$-dimensional signals more clearly, a matrix transformation is performed on each frame of the signal, which gives
$$Y = L + S, \tag{5}$$
where $Y$, $L$, and $S$ are $M \times P$ matrices, $M = 2K/3 + 1$, and $P = K/3$. In the enhancement problem, the matrix $Y$ is known, while the matrices $L$ and $S$ are unknown. The ALMD algorithm is utilized to obtain the speech matrix $L$. After the estimate of $L$ is obtained, overlap-add synthesis is used to recover the enhanced speech signal. This process is depicted in Fig. 1.
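The text does not spell out the matrix transformation, so the sketch below makes an explicit assumption: since M + P - 1 = K for the stated sizes, each frame is arranged as a Hankel matrix with entries H[i, j] = y[i + j]. The helper names `frame_to_hankel` and `hankel_to_frame` are hypothetical, and P = K // 3 is used so that frame lengths not divisible by 3 still work. The inverse mapping averages the anti-diagonals, which is one common way to return from an estimated matrix to a frame.

```python
import numpy as np

def frame_to_hankel(y):
    """Arrange a length-K frame as an M x P Hankel matrix (assumed layout)."""
    y = np.asarray(y, dtype=float)
    K = len(y)
    P = K // 3                 # P = K/3 when K is divisible by 3
    M = K - P + 1              # equals 2K/3 + 1 in that case
    idx = np.arange(M)[:, None] + np.arange(P)[None, :]
    return y[idx]

def hankel_to_frame(H):
    """Recover a frame from an M x P matrix by averaging anti-diagonals."""
    M, P = H.shape
    K = M + P - 1
    idx = (np.arange(M)[:, None] + np.arange(P)[None, :]).ravel()
    sums = np.bincount(idx, weights=H.ravel(), minlength=K)
    counts = np.bincount(idx, minlength=K)
    return sums / counts
```

Under this layout a single sinusoid yields a rank-2 matrix, which illustrates why short, quasi-periodic speech frames are well approximated by low-rank matrices; the mapping is linear, so (4) carries over to (5).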

B. OPTIMIZATION ALGORITHM FOR ALMD
The core idea of ALMD is motivated by the MCC-based LMD method. However, since the speech energy differs from frame to frame, using a fixed rank value is not appropriate. We therefore describe the adaptive rank value estimate at the end of this section.

1) MCC AND HQ OPTIMIZATION
Correntropy [41] is a local similarity criterion, which is often used to handle non-Gaussian noise with large outliers. It also serves as a nonlinear similarity measure between two random variables $X$ and $G$, and it is symmetric. It is defined as
$$V_\sigma(X, G) = E\left[\kappa_\sigma(X - G)\right], \tag{6}$$
where $\kappa_\sigma(\cdot)$ and $E[\cdot]$ denote the kernel function and the expectation operator, respectively. Besides, $\sigma$ is the kernel size, which can be chosen by maximum likelihood density estimation. We only consider data $x$ that obey a Gaussian distribution, so $\kappa_\sigma(\cdot)$ is the Gaussian kernel given in (7). In practice, since only a finite set of samples $\{(x_i, g_i)\}_{i=1}^{n}$ is available and the joint probability density function (PDF) is unknown, a simple sample estimator of correntropy can be used instead of the expectation. It is expressed as
$$\hat{V}_\sigma(X, G) = \frac{1}{n} \sum_{i=1}^{n} \kappa_\sigma(x_i - g_i). \tag{8}$$
Owing to the locality of correntropy, the similarity of the two variables is primarily dictated by $\kappa_\sigma(x - g)$ along the line $x = g$ [41]. Therefore, a new cost function called MCC is used for training adaptive systems. It is given as
$$\max_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \kappa_\sigma(e_i), \tag{9}$$
where $\theta$ represents the set of adjustable parameters and $e_i$ are the errors of the system. Furthermore, the M-estimator [33] is also a maximum-likelihood-type method, defined as the minimization over $\theta$ of a robust loss of the errors. In that sense, the M-estimator is equivalent to MCC. To solve the M-estimation problem, we make use of HQ [34], [42] optimization.
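As a small illustration of why correntropy is robust to outliers, the sketch below evaluates the sample estimator for a pair of vectors. The unnormalised Gaussian kernel exp(-e²/(2σ²)) is an assumption; the kernel in (7) may carry a different normalisation constant. The function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(e, sigma):
    """Unnormalised Gaussian kernel (assumed form of the kernel)."""
    return np.exp(-np.asarray(e, dtype=float) ** 2 / (2.0 * sigma ** 2))

def correntropy(x, g, sigma=1.0):
    """Sample estimate of correntropy: mean kernel value of the errors."""
    return float(np.mean(gaussian_kernel(np.asarray(x) - np.asarray(g), sigma)))

def mean_squared_error(x, g):
    """Conventional second-order similarity, shown for comparison."""
    return float(np.mean((np.asarray(x) - np.asarray(g)) ** 2))
```

With this kernel, a single gross outlier among n samples can lower the correntropy by at most 1/n, whereas it can dominate the mean squared error entirely. This bounded influence is the property exploited by MCC.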
The HQ technique is a commonly used optimization approach for convex or non-convex minimization, which is widely used for signal and image reconstruction [42]. The HQ function is defined in (10), where $a$ is the adjustable parameter and $b$ is an auxiliary parameter of the adaptive system. The auxiliary parameter $b$ is obtained by (12), where $\delta(\cdot)$ is the minimizer function associated with the Welsch M-estimator.
Owing to its additive form, (10) is well suited to recovering corrupted data [34]. Here, we only consider this case. Let us define a loss function of $a$ as in (13), where $\phi(\cdot)$ serves as the dual function of $\psi(\cdot)$. MCC and HQ optimization can deal well with the LMD problem, as introduced in the following section.
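The alternation behind additive-form HQ optimization can be seen on a toy problem: estimating a location parameter from data with gross outliers. The sketch below is illustrative only. It assumes the Welsch loss ψ(e) = σ²(1 - exp(-e²/(2σ²))), for which the additive-form minimiser is δ(e) = e - e·exp(-e²/(2σ²)); the scaling used in the paper's equations may differ, and the function names are hypothetical.

```python
import numpy as np

def welsch_delta(e, sigma):
    """Additive-form HQ minimiser for the Welsch loss (assumed scaling)."""
    return e - e * np.exp(-e ** 2 / (2.0 * sigma ** 2))

def hq_location(x, sigma=1.0, n_iter=200):
    """Robust location estimate by alternating the two HQ sub-problems."""
    x = np.asarray(x, dtype=float)
    theta = float(np.median(x))          # starting point (an assumption)
    for _ in range(n_iter):
        # Step 1: auxiliary variable b absorbs the outlying part of the error.
        b = welsch_delta(x - theta, sigma)
        # Step 2: with b fixed the problem is quadratic, so the mean solves it.
        theta = float(np.mean(x - b))
    return theta
```

For small errors δ(e) is close to zero, so inliers are left untouched; for large errors δ(e) is close to e, so outliers are subtracted out before the quadratic step. The same mechanism is applied entry-wise to a residual matrix in the matrix setting.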

2) MCC-BASED LMD APPROACH
To address the other issue, namely the hard threshold operator in the CLSMD method, the MCC theory is adopted to provide an additional penalty for the updates of the LMD algorithm. Moreover, the method can also perform well in non-Gaussian noise conditions because of the properties of correntropy [41]. Therefore, the redefined noise $W = S + O$ is modeled by the MCC theory as in (14), where $\kappa_\sigma(\cdot)$ is the Gaussian kernel defined in (7). Based on the HQ optimization approach, an auxiliary variable $G$ is introduced, and $\phi(G)$ serves as the dual function of $\psi(W)$. Thus, $G$ can be easily obtained from (12) and (7), and it is given by (15), where the product is taken entry-wise. Therefore, the alternating sub-problems of the CLSMD solution (3) can be rewritten for the MCC-based LMD approach as in (16), where $G$ is obtained by (15) when $W = Y - L$ is fixed. In addition, the GreB [35] strategy is used for the update of $L$ in (16) and (3) to reduce failures caused by a biased estimate of the rank $r$.
In the GreB strategy, a bilateral factorization $X = UV$ is constructed to replace $X$. In addition, the alternating optimization of $U$ and $V$ is obtained via QR decomposition as in (17), where $i$ is the iteration number. The detailed MCC-based LMD approach is summarized in Algorithm 1. The convergence of this approach has been proven to be linear [32].
VOLUME 8, 2020
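The alternation between the auxiliary variable G and the low-rank factor can be sketched as follows. This is a simplified illustration, not Algorithm 1: it assumes the Welsch additive-form update G = W ⊙ (1 - exp(-W²/(2σ²))) with a fixed kernel size, a fixed rank r, and a plain alternating QR update X ≈ UV in place of the full greedy GreB scheme of [35]. The function names are hypothetical.

```python
import numpy as np

def bilateral_lowrank(X, r, n_iter=20, seed=0):
    """Rank-r fit X ~ U @ V by alternating QR updates (no SVD)."""
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((r, X.shape[1]))
    for _ in range(n_iter):
        U, _ = np.linalg.qr(X @ V.T)     # orthonormal basis of the column space
        V = U.T @ X                      # least-squares coefficients for that basis
    return U, V

def mcc_lmd(Y, r, sigma, n_outer=50):
    """Robust rank-r recovery by alternating G and L (illustrative sketch)."""
    U, V = bilateral_lowrank(Y, r)
    L = U @ V                            # non-robust starting point
    for _ in range(n_outer):
        W = Y - L                        # current residual
        # Entry-wise auxiliary variable: ~0 for small residuals, ~W for outliers.
        G = W * (1.0 - np.exp(-W ** 2 / (2.0 * sigma ** 2)))
        U, V = bilateral_lowrank(Y - G, r)
        L = U @ V
    return L
```

Because large residuals are moved into G rather than hard-thresholded, the low-rank fit is driven mainly by entries that agree with the current estimate. The kernel size σ must sit between the typical inlier residual and the outlier magnitude for this to work well.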

3) THE ADAPTIVE ESTIMATE OF THE EFFECTIVE RANK VALUE
Since the energy of different frames is time-varying, using a fixed rank value in Algorithm 1 reduces the recovery quality. Therefore, we introduce an energy threshold technique to update the effective rank value of the speech matrix $L$. Suppose that the speech signal $l(t)$ and the additive noise $s(t)$ are zero-mean Gaussian signals, i.e., $p(l) \sim N(0, \sigma_l)$ and $p(s) \sim N(0, \sigma_s)$. It can be readily obtained that

$$E\left[y(t)y^H(t)\right] = E\left[l(t)l^H(t)\right] + E\left[s(t)s^H(t)\right], \tag{18}$$
and (19) is the simplified version of (18). In addition, $\sigma_l$ and $\sigma_s$ can be taken as the powers of the speech and the noise, respectively. Let $l_k(t)$ represent the signal synthesized from the $k$ largest eigenvalues; we then obtain (20) and (21), where $\sigma(k)$ and $\sigma_1(k)$ represent the difference power and the power of $l_k(t)$, respectively. However, in the non-Gaussian noise case, (19) is not satisfied. The QR transformation can be used to obtain a unitary matrix from the noise matrix and to whiten the noisy matrix so that the above conditions hold. We use (21) for the adaptive determination of the effective rank value of the speech matrix $L$. According to (21), when $k$ is greater than the actual rank $r$ in Algorithm 1 (i.e., $\sigma(k) - \sigma_s < 0$), the speech is overestimated, because the recovered low-rank matrix $L$ contains not only speech components but also some noise components. In contrast, when $k$ is less than the actual rank $r$ (i.e., $\sigma(k) - \sigma_s > 0$), the speech is underestimated and the speech signal is distorted. Thus, the effective rank value of the speech matrix $L$ is obtained only when the quantity in (21) is equal to or close to zero. More specifically, the condition is given in (22), where $\tau$ is a small threshold. Therefore, when the value of $k$ satisfies the condition in (22), we obtain the effective rank $r$. This approach of adaptively estimating the effective rank value is summarized in Algorithm 2.
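The rank-selection rule can be sketched as below under stated assumptions; it is not a transcription of Algorithm 2. The sketch assumes that the noise power `noise_power` is known or estimated elsewhere, measures power as the mean-square value of the matrix entries, uses singular values of the frame matrix to obtain the power of the rank-k part, and accepts the smallest k whose residual power exceeds the noise power by no more than τ. The function name and the one-sided test are illustrative choices.

```python
import numpy as np

def effective_rank(Y, noise_power, tau=1e-3):
    """Smallest k whose residual power is within tau of the noise power."""
    s = np.linalg.svd(Y, compute_uv=False)
    total_power = np.sum(s ** 2) / Y.size        # mean-square value of Y
    kept_power = np.cumsum(s ** 2) / Y.size      # power of the rank-k part, k = 1, 2, ...
    diff_power = total_power - kept_power        # residual ("difference") power
    for k, d in enumerate(diff_power, start=1):
        # d > noise_power: speech energy is still left out (underestimation).
        # d <= noise_power + tau: the residual is explained by noise alone.
        if d - noise_power <= tau:
            return k
    return len(s)
```

Since the residual power decreases monotonically with k, the first k that passes the test is the one closest to the crossing point between underestimation and overestimation.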

Algorithm 2 The Approach of Adaptively Estimating the Effective Rank Value
For these experiments, we choose 30 different clean speech signals (sp01∼sp30) from the NOIZEUS database [43]. This database consists of 30 sentences with a sampling frequency of 8 kHz, produced by three female and three male speakers. In addition, the additive noise, which includes Gaussian white noise and different types of non-Gaussian noise (pink, babble, F16, and hfchannel), is taken from the NOISEX-92 database [44]. The clean speech signal is corrupted by these different types of noise at 0, 5, 10, and 15 dB. The synthesized noisy signals are divided into frames of 32 ms each, and a 50% frame overlap is used to ensure signal continuity. The window function is a 256-point Hamming window applied to suppress the Gibbs phenomenon. Overlap-add synthesis is performed on the final enhanced frames to obtain the reconstructed speech. Note that we perform the recovery for all frames in parallel.
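The data preparation described above can be sketched as follows. The helper names `mix_at_snr` and `split_frames` are hypothetical, and the sketch assumes the SNR is defined over the whole utterance; the scripts actually used for the experiments are not given in the paper. At 8 kHz, a 32 ms frame corresponds to 256 samples and a 50% overlap to a hop of 128 samples.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so that the mixture has the requested global SNR."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(clean)]
    target_ratio = 10.0 ** (snr_db / 10.0)
    scale = np.sqrt(np.sum(clean ** 2) / (np.sum(noise ** 2) * target_ratio))
    return clean + scale * noise

def split_frames(x, frame_len=256, hop=128):
    """Cut a signal into overlapping Hamming-windowed frames (one per row)."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)
```

Because every frame is processed independently after this step, the per-frame recovery can be run in parallel.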

D. PERFORMANCE METRICS
We use three objective performance metrics to quantify and analyze the performance of these algorithms, namely, the segmental SNR (segSNR) [1], the perceptual evaluation of speech quality (PESQ) [45], and the short-time objective intelligibility (STOI) measure [46]. The segSNR metric reflects the suppression of interfering noise and is defined as
$$\mathrm{segSNR} = \frac{10}{N} \sum_{n=1}^{N} \log_{10} \frac{\|l(n)\|^2}{\|l(n) - \hat{l}(n)\|^2}, \tag{23}$$
where $l(n)$ and $\hat{l}(n)$ denote the $n$-th frame of the clean and the reconstructed speech signal, respectively, and $N$ denotes the total number of speech frames. Moreover, the PESQ and STOI metrics are commonly used to predict speech distortion and intelligibility, respectively. Note that all three metrics are reported as averages.
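A minimal sketch of the segmental SNR computation is given below. The frame length, hop size, and the small constant `eps` that guards against division by zero are assumptions; some implementations additionally clamp each frame's value to a fixed range (for example, -10 to 35 dB), which is not done here. The function name `seg_snr` is hypothetical.

```python
import numpy as np

def seg_snr(clean, enhanced, frame_len=256, hop=128, eps=1e-12):
    """Average of per-frame SNR values in dB between clean and enhanced speech."""
    clean = np.asarray(clean, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    n_frames = 1 + (len(clean) - frame_len) // hop
    values = []
    for n in range(n_frames):
        seg = slice(n * hop, n * hop + frame_len)
        signal_energy = np.sum(clean[seg] ** 2)
        error_energy = np.sum((clean[seg] - enhanced[seg]) ** 2)
        values.append(10.0 * np.log10((signal_energy + eps) / (error_energy + eps)))
    return float(np.mean(values))
```

Averaging in the log domain weights quiet and loud frames equally, which is why segSNR is more sensitive to residual noise in low-energy segments than a global SNR.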

E. EXPERIMENTAL RESULTS AND DISCUSSIONS

1) PERFORMANCE IN GAUSSIAN WHITE NOISE
In this section, we present the performance comparisons for speech degraded by Gaussian white noise. The results are shown in Fig. 2.
Fig. 2(a) shows the relationship between the average segSNR and the input SNR in the Gaussian white noise case. As expected, the average segSNR achieved by all methods increases with the input SNR. It can also be seen that the average segSNR obtained by the ALMD approach is higher than that of the other methods (MMSE-STSA, the minimum mean square error estimator of the magnitude-squared spectrum (MSS-MMSE), RPCA, CLSMD, and LMD) across the SNR range of 0-15 dB. This may be attributed to the fact that the effective rank technique is valuable for noise suppression. It is interesting to note that the CLSMD solution provides very little noise attenuation when the SNR exceeds 10 dB. The MMSE-STSA method displays a lower average segSNR than the other methods, since it is highly sensitive to the correctness of VAD. Fig. 2(b) shows the average PESQ scores obtained by the different methods. The ALMD method has significantly higher average PESQ scores than the baselines for the input SNR range of 0-15 dB in 5 dB steps. It is also clear that the average PESQ scores obtained by MSS-MMSE increase rapidly as the SNR increases. This is due to the fact that this method achieves satisfactory results by incorporating a priori SNR uncertainty [16]. In terms of average STOI scores, Fig. 2(c) shows that the enhancement method presented herein also yields significantly higher STOI scores than the other five algorithms. In contrast, the LMD scheme, which operates with a fixed rank, has very poor average STOI scores. These results show that the ALMD method outperforms the baselines in the Gaussian white noise environment.
Fig. 3 shows the waveforms of the reconstructed speech ("sp01") obtained by the different methods. The reconstruction is performed under very severe noise, with additive Gaussian white noise at −5 dB, where the differences between the enhancement schemes are easier to see. Fig. 3 clearly shows that MMSE-STSA and RPCA lead to notable speech distortion in the heavy white noise condition (−5 dB). We believe that this may be due to the following facts: MMSE-STSA needs accurate VAD detection and RPCA needs prior knowledge of the signal source, and neither is available in strong noise conditions. Owing to the effective rank value estimate and the fact that a noisy phase approximation is not needed, the ALMD method yields a significantly better waveform than the other methods (MSS-MMSE, CLSMD, and LMD). Unfortunately, there is still some distortion. This may be because the strong noise (−5 dB) interferes with the estimate of the effective rank. Besides, MSS-MMSE and CLSMD demonstrate similar noise suppression. Comparatively, the proposed method still has better performance in the white noise condition at −5 dB SNR.

2) PERFORMANCE IN NON-GAUSSIAN NOISE
Next, we present the performance comparisons between our proposed algorithm and the baselines in several non-Gaussian noise scenarios (colored pink noise, nonstationary babble noise, F16 noise, and stationary hfchannel noise). First, we adopt the average segSNR metric to reflect the suppression of the different types of non-Gaussian noise. The results are presented in Fig. 4. Fig. 4 shows that the ALMD scheme has a substantially higher average segSNR than the other methods when the SNR exceeds 10 dB, especially in the F16 noise condition. However, its average segSNR is comparable to that obtained by the MSS-MMSE approach at the lower SNR levels (0 dB and 5 dB). The CLSMD method achieves a lower average segSNR than ALMD and MSS-MMSE for the SNR range of 0-10 dB in several noise conditions (pink, babble, and F16), but it achieves a relatively higher average segSNR improvement in the hfchannel noise condition. This might be because hfchannel noise is stationary. Consequently, unlike highly nonstationary noise, hfchannel noise can be treated as a low-rank component [47].
Second, the average PESQ scores are used to reflect the speech distortion of the proposed ALMD approach for the four types of non-Gaussian noise. The results are shown in Fig. 5.
Fig. 5 clearly shows that our proposed method obtains substantially higher average PESQ scores in the different non-Gaussian noise conditions over the input SNR range of 0-15 dB, except in the nonstationary babble noise case at 0 dB and 5 dB SNR. However, the proposed method performs similarly to the MSS-MMSE solution for SNRs below 15 dB (i.e., 0 dB, 5 dB, and 10 dB). In comparison, the proposed method still performs well in the different types of non-Gaussian noise conditions in terms of speech distortion. However, as mentioned earlier, the estimate of the effective rank is affected under strong noise conditions. Thus, the enhanced speech still contains a portion of the noise. This explains why the average PESQ scores obtained by the proposed scheme are sometimes lower than those obtained by the MSS-MMSE method at 0 dB.
Last, the average STOI scores are used to reflect the speech intelligibility of the presented approach in the four types of non-Gaussian noise conditions. The results are shown in Fig. 6.
Fig. 6 clearly shows that our approach achieves significantly higher average STOI scores. The improvement in STOI scores indicates that the proposed ALMD scheme can successfully recover the speech signal and that intelligibility is effectively improved, whereas several of the reference algorithms cannot preserve the original speech in the reconstruction. The above results demonstrate that the ALMD scheme preserves speech intelligibility in various types of non-Gaussian noise conditions, especially in low-SNR situations (0 dB and 5 dB).
As can be seen clearly from Fig. 7, the MMSE-STSA, RPCA, CLSMD, and LMD methods suffer a disappointing loss of speech information. This is because the preconditions for the effective operation of these methods, such as Gaussian-distributed noise, prior knowledge of the signal, a noise rank of 1, and a correct value of the effective rank, are no longer satisfied in the strong non-Gaussian F16 noise condition (−5 dB). In addition, the time-domain waveform achieved by MSS-MMSE is comparable to that of the ALMD solution. However, it loses some speech information in the high-frequency part. Comparatively, in the case of non-Gaussian F16 noise, the ALMD method achieves better performance than the baselines at −5 dB. However, the method still leaves considerable residual noise, which we will address in future work.

V. CONCLUSION
In this work, we present a novel speech enhancement algorithm based on the ALMD scheme, which exploits the low-rank characteristics of speech signals in the time domain. The energy threshold technique is developed to adaptively update the effective rank value of each frame of the speech matrix. Additionally, performing the enhancement in the time domain avoids phase loss, which helps to improve speech quality and intelligibility. The experimental results confirm that our improved algorithm outperforms the five baseline algorithms in terms of segSNR, PESQ, and STOI, especially in the case of white noise. Moreover, the proposed method yields significantly higher PESQ and STOI scores than the baselines in the non-Gaussian noise conditions, and it has low complexity without evident loss of speech intelligibility or quality. More significantly, our proposed method avoids the need for noise estimation, the VAD process, and the approximation of the clean speech phase by the noisy phase.