Speech Source Separation Using Variational Autoencoder and Bandpass Filter

Speech source separation is essential for speech-related applications because it enhances the input speech signal before it reaches the main processing model. Most current approaches to this task focus on separating speech from common high-frequency noises or from one particular background sound. They cannot remove signals that overlap with human speech in its frequency range. To deal with this problem, we propose a hybrid approach combining a variational autoencoder (VAE) and a bandpass filter (BPF). This method can extract and enhance the speech signal from a mixture of many elements, such as the speech signal, high-frequency noises, and many different kinds of background sounds that interfere with the speech. Experimental results show that our model can effectively extract the speech signal, with 15.02 dB in Signal to Interference Ratio (SIR) and 12.99 dB in Signal to Distortion Ratio (SDR). In addition, the passband can be adjusted to select the frequency range of the output signal for a particular application such as gender classification.


I. INTRODUCTION
In many speech-related applications, the quality of the input speech signal plays a significant role in the whole system because it directly affects the workflow of the main model. To improve the speech signal quality, or enhance the speech signal, it is necessary to separate the speech from the raw input signal. This means that the raw input signal should be separated into a speech signal and a remaining signal called the interfering signal. This process is called Speech Source Separation (SSS) [1], a specific case of Blind Source Separation (BSS) [2]. SSS is one of the most important tasks in the pre-processing phase since it determines whether the signal pushed into the main algorithm is good or not. In reality, the interfering signal includes background sounds and noises. The background sounds are sounds that always exist in the environment and interfere with human speech, such as music, traffic, or television sound. It cannot be known which background sounds mask the speech signal, or when. These sounds, when mixed into the speech, can cause large deviations in the computed results. In addition, many kinds of noise usually exist in the input signal, including thermal noise and noise caused by the operation of signal receivers. If these unexpected elements still exist in the speech signal when it is passed into the main model, the final results will be distorted.
The associate editor coordinating the review of this manuscript and approving it for publication was Jenny Mahoney.
The term ''blind'' in BSS implies that no information about the noise or background is given, which causes many difficulties when canceling the background sounds. Since there is no limit on the background, its frequencies range from very low to very high and therefore overlap with the frequency distribution of the main signal. Similar to the background, noise can exist at any frequency range and anywhere in the signal. Backgrounds and noises differ mainly in two aspects: amplitude and distribution. Common backgrounds, such as music or traffic sounds, have large amplitudes and distinctive distributions, so they are easily recognized by the human ear. In contrast, noises do not have a clear distribution and are much smaller in amplitude than human speech. Both background sounds and noises mask the speech signal in different ways, so it is a big challenge to reduce their impact and enhance the speech signal.
In this research, we propose an effective approach to extract the speech signal from a raw signal. It combines a Variational Autoencoder (VAE) [3], [4] and a Bandpass Filter (BPF). First, we use a VAE to capture the bottleneck features [5] in the input signal. This network retains most of the important information about the content and prosody of the speech. Then, the signal is filtered by a BPF to keep only the frequency range of human speech, which is the part that is useful for the application. Overall, the model has two main components: a non-deep VAE network and a BPF. This combination not only removes most of the interference that background sounds and noises add to the input sound but also preserves most of the important information in the speech signal that high-level applications need.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Looking back at the development of the SSS problem, many effective methods have been proposed and applied in industry, but none of them solves the problem completely. Each is designed to deal with a particular interfering signal, such as high-frequency noise or music. Generally, these methods are good approaches for SSS, but each of them still has limitations when applied in the real world. The following paragraphs cluster them into three groups and discuss their applicability.
The first group includes the works using transforms [6], [7] or digital filters [8], [9]. Basically, if the background sounds or noises are identified with high reliability, they can be removed with a particular filter [10], [11]. This approach is usually applied when the frequencies of the interfering signals are very high or very low. In such cases, a digital filter is a good choice for removing the noises. When the frequency range of the interfering signals overlaps with that of the speech signal, more modern methods such as the wavelet transform [12] or a filter bank [13] can be applied. These approaches work better than a single filter but do not completely remove the backgrounds and noises.
The second group uses component analysis techniques as the main approach. Independent component analysis (ICA) [14]-[16] and its variants [17]-[20] are the representatives of this approach. Unlike the digital filter, ICA focuses on separating all the components in the signal, so it is usually used when the interfering signals are background sounds such as music or radio sounds. Although it can be applied to difficult cases, this method is not a perfect solution for the SSS problem. Because an ICA model is stored as a matrix, the capacity of the model, that is, the total number of cases the model can cover, is limited [16], [18]. This means that one particular set of ICA parameters only works for one particular set of components, so one ICA model cannot deal with an unknown background.
While the first approach focuses on denoising the signal and the second approach focuses on separating the background sounds, the third and newest approach focuses on learning the distribution of the speech signal and then reconstructing it. To do this, in 2018 Leglaive et al. [21] first used a VAE to separate the speech signal from the mixed signal. Because this approach concentrates on learning the distribution of the speech signal [22], [23], the result, unlike in the two approaches above, does not depend on the background and noises [24], [25]. This means that the solution can be built once and then used many times with many different interfering signals.
In this research, we inherit the strengths of the third and the first approaches and combine them to form our solution. We do not choose the second approach for the combination because it only solves SSS in particular cases. In the combination, the first component is a VAE, which can learn the speech distribution and reconstruct the main content of the speech signal. The second component is a filter that removes all out-of-voiceband elements from the signal reconstructed by the first component. This component makes our approach different from the pure VAE approach. In a VAE model, the main content of the speech is kept and then reconstructed, but with a background that the model has not seen before, the high frequencies are difficult to remove completely. This fact motivates us to apply a filter after processing the signal with a VAE. With this method, we can remove all very high frequencies in the reconstructed signal, so our solution can extract only the speech elements from the mixed signal.
The remainder of this article is structured in three main sections. Section II presents work related to the problem of blind source separation. We also summarize some signal transforms because they are the essential method for translating a signal from the time domain to the frequency domain and back. Our proposed model is described in detail in Section III, where we present the mathematical basis, model architecture, and training method for the model. In Section IV, we design some experiments to validate our method. After training the model, we compare our results with other works to identify the strengths and weaknesses of our approach.

II. BLIND SOURCE SEPARATION AND SPEECH SOURCE SEPARATION

A. SOURCE SEPARATION IN SIGNAL PROCESSING
Given a mixed signal, the main task is to separate the mixture into N independent signals. No information about the mixture or its elements is available. Moreover, the way the elements are mixed is not known, so the mixture can be linear or nonlinear. In speech-mixed signals, one element is the pure speech produced by a human, and the others include background sounds from the environment such as TV sound, music, fan sound, or traffic sound. In that case, the BSS problem is how to extract the speech sound and all other background sounds from the mixture.

1) BLIND SOURCE SEPARATION
The traditional BSS description is formulated to solve the cocktail party problem [26]. This means there are m sound sources, say a human voice and background sounds, and n recording devices. In most cases, m is larger than n, so the whole system is underdetermined and non-linear approaches should be used to reconstruct the sources. In other cases, the problem can be solved better because more information is provided, but these cases are not common in the real world. At home or in the office, the number of sources, which corresponds to the number of backgrounds, is large while the number of recording devices is usually one.
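Let s(t) and x(t) denote the sets of individual sources and recorded sounds, respectively:
$$s(t) = [s_1(t), s_2(t), \ldots, s_m(t)]^T, \qquad x(t) = [x_1(t), x_2(t), \ldots, x_n(t)]^T$$
Each element of x(t) is considered a combination of all sources $s_i(t)$ in s(t), so this can be rewritten as follows:
$$x_j(t) = \sum_{i=1}^{m} a_{ji}\, s_i(t), \qquad \text{or} \qquad x(t) = A\, s(t)$$
where $A = [a_{ji}]$ is the mixing matrix. In practice, each $x_j(t)$ is masked by noise $\gamma_j(t)$, so the BSS problem can be described by:
$$\tilde{x}(t) = A\, s(t) + \gamma(t)$$
Because the noise signal $\gamma_j(t)$ can be handled effectively by digital filters, the main work in the BSS problem is finding the inverse of matrix A:
$$x(t) = F(\tilde{x}(t))$$
where F(.) is a noise filter, and the source elements s(t) can be found from x(t) and the inverse of matrix A:
$$s(t) = A^{-1} x(t)$$
When the signals are represented in the discrete domain, the equations above can be rewritten as the following three equations:
$$\tilde{x}[n] = A\, s[n] + \gamma[n], \qquad x[n] = F(\tilde{x}[n]), \qquad s[n] = A^{-1} x[n]$$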
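As a minimal sketch of the linear mixing model described in this subsection, the following NumPy example mixes two synthetic sources with a known matrix A and recovers them by inverting A. The sources, the matrix values, and the noise level are illustrative assumptions; in a real BSS setting A is unknown and, with fewer recorders than sources, not invertible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical sources (rows): a sine tone and a square wave.
t = np.linspace(0.0, 1.0, 8000, endpoint=False)
s = np.vstack([np.sin(2 * np.pi * 220 * t),
               np.sign(np.sin(2 * np.pi * 50 * t))])

# Assumed mixing matrix A (known here only for illustration).
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])

x = A @ s                                          # noiseless recordings
x_noisy = x + 0.01 * rng.standard_normal(x.shape)  # recordings masked by noise

# Assuming the noise has been filtered out perfectly (we reuse x),
# inverting A recovers the sources.
s_rec = np.linalg.inv(A) @ x
max_error = float(np.max(np.abs(s_rec - s)))
print(max_error)
```

The example uses as many recordings as sources so that A is square; the single-microphone case discussed above does not admit this direct inversion, which is the motivation for learned approaches.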

2) SPEECH SOURCE SEPARATION
In this work, we focus on the SSS problem. In many real-world applications such as speech recognition, speaker recognition, or voice virtual assistants, the end devices receive speech signals from users and then return a response. It is hard to record human voices in a clean environment because background sounds exist everywhere in the house, so the speech signal needs to be separated from the recorded sound; this task is called speech separation. Unlike BSS, SSS does not consider all elements of the sources s[n]. We only focus on the speech signal, so SSS can be described as a specific case of BSS that separates the mixture into the speech signal and the remaining signal. In some situations, we do not care about the remaining element, so speech separation means speech extraction. Figure 1 illustrates the speech separation problem with three waveforms corresponding to three signals. The first is pure human speech, recorded in a professional recording room, so it contains no noise or background sounds. The second signal is a clipped trumpet (an instrument) sound. This sound is clear and clean. We then mix these two signals with Gaussian noise to form the third waveform. In speech separation, the main target is to extract the first signal from the third signal.

B. EVALUATION METHOD
Following Vincent et al. [27], [28], in the BSS problem the estimated signal of a source signal can be described as a mixture of four elements:
$$s_{est}(t) = s_{target}(t) + e_{inter}(t) + e_{noise}(t) + e_{artif}(t) \tag{12}$$
or:
$$s_{est}(t) = s_{target}(t) + e(t)$$
with:
$$e(t) = e_{inter}(t) + e_{noise}(t) + e_{artif}(t)$$
In these formulas, $s_{target}(t)$, $e_{inter}(t)$, $e_{noise}(t)$, and $e_{artif}(t)$ are the expected signal, the interference of more than one source in the mixture, the noise, and the environment background like music or electric fan sounds, respectively. In speech separation, we only consider speech signals in the mixture, so we do not need to estimate and decompose the other sources.
To evaluate the performance of the separation process, Vincent et al. [27] propose several measures, including the Source to Distortion Ratio (SDR) and the Source to Interferences Ratio (SIR). They are closely related; the only difference is that SDR reflects the total distortion introduced by both the interfering signal and the processing method, while SIR measures the distortion introduced by the background sound [13]. With $\tilde{x}$ and $s_{est}$ denoting the input and output of the whole model, the total distortion D is computed from these two signals, with ||.|| and · denoting the second norm and the dot product, respectively. The estimated signal $s_{est}$ can also be rewritten so that the total energy of the noise e can be computed. When $s_{est}[n] \cdot \tilde{x}[n] \to 0$, $D \to +\infty$, and the SDR can be described approximately in terms of D. With $D_{inter}$ denoting the distortion caused by the interfering signal, SIR is given by the equation below:
$$SIR = 10 \times \log_{10} \frac{1}{D_{inter}} \tag{20}$$
In our experiments, the interfering sounds are particular sounds that are mixed into the speech sound, so we can compute $e_{inter}$ easily by computing the energy of these sounds and then compute the SDR and SIR.
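As a rough illustration of how such ratios can be computed when the clean speech and the interfering sounds are known, the sketch below uses orthogonal projections in the style of the BSS_eval decomposition. The function name `sdr_sir` and the toy vectors are assumptions for illustration, and the projection-based definitions may differ in detail from the exact formulas used in this paper.

```python
import numpy as np

def sdr_sir(s_est, s_true, interferers):
    """Return (SDR, SIR) in dB from a projection-based error decomposition."""
    s_est = np.asarray(s_est, dtype=float)
    s_true = np.asarray(s_true, dtype=float)
    # Target component: projection of the estimate onto the true speech.
    s_target = (s_est @ s_true) / (s_true @ s_true) * s_true
    # Projection of the estimate onto the span of speech and interferers.
    basis = np.column_stack(
        [s_true] + [np.asarray(b, dtype=float) for b in interferers])
    coeffs, *_ = np.linalg.lstsq(basis, s_est, rcond=None)
    e_inter = basis @ coeffs - s_target   # error explained by the interferers
    e_total = s_est - s_target            # all distortion in the estimate
    target_energy = np.sum(s_target ** 2)
    sdr = 10.0 * np.log10(target_energy / np.sum(e_total ** 2))
    sir = 10.0 * np.log10(target_energy / np.sum(e_inter ** 2))
    return float(sdr), float(sir)

# Toy example: speech, one background, and a small artifact component.
s = np.array([1.0, 0.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0, 0.0])
artifact = np.array([0.0, 0.0, 1.0, 0.0])
sdr, sir = sdr_sir(s + 0.1 * b + 0.01 * artifact, s, [b])
print(round(sdr, 2), round(sir, 2))
```

In the toy example the SIR is slightly higher than the SDR because the SDR also counts the artifact component as distortion.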
In addition, we apply the Perceptual Evaluation of Speech Quality (PESQ) [29] to measure the quality of the output speech signal. This measure is an industry standard and is widely applied by voice device manufacturers. Because this research aims to apply the proposed approach in applications, we use PESQ to evaluate the output voice.

C. VOICEBAND OF THE SPEECH SIGNAL
Although the audible range of the human ear in the frequency domain extends from infrasound (20 Hz) to ultrasound (20,000 Hz), the real distribution of speech elements is not uniform. Figure 2 clearly illustrates this skewed distribution. In this case, the densest area is from $f_{min}$ ∼ 20 Hz to $f_{max}$ ∼ 4,000 Hz. Phonetic research shows that most of the content in speech lies in the range under f ∼ 3,400 Hz [30], [31]. The value of f may differ between studies, but it always concentrates around 3,400 Hz. This range is usually called the voice-band. On the other hand, the frequency elements under $f_{max}$ ∼ 3,400 Hz present the speaker properties or other suprasegmental features such as dialect or emotion. Whether a person can recognize a voice as that of an acquaintance depends on the distribution of the signal in this range. So as not to miss any part of the speech, we analyze the signal in the range from $f_{begin}$ = 20 Hz to $f_{end}$ = 5,000 Hz to ensure that we capture all information from the input signal in the frequency domain.

III. OUR APPROACH FOR SSS: COMBINING A VAE AND A BPF
Our solution for the BSS problem is the combination of a VAE network with a Chebyshev filter in the frequency domain. The mixed signal is transformed by the Short Time Fourier Transform (STFT), then pushed to the processing block, and finally processed by the Inverse Short Time Fourier Transform (ISTFT) algorithm to reconstruct it in the time domain. The whole process is illustrated in Figure 3.
From left to right, the blocks are the source signal s, the mixture $\tilde{x}$, the processing blocks, and the estimated signal $s_{est}$ at the end. The model mainly processes the input in the frequency domain, then reconstructs the signal in the time domain via the ISTFT algorithm. Finally, the performance and quality of the model are validated by comparing the estimated signal with the original speech signal. The STFT is given by the formula:
$$F(\tau, \omega) = \sum_{n=-\infty}^{\infty} \tilde{x}[n]\, w[n - \tau]\, e^{-j\omega n}$$
In the STFT formula, the left part presents the amplitude of each element in the time-frequency domain. In particular, F(τ, ω) is the amplitude at time τ of frequency ω. The window function w[.] in the right part defines where and how the sub-range of the signal is taken to be represented in the frequency domain. In this research, we use the Blackman-Harris function, a cosine-sum window from the same family as the Hamming function.
From the frequency domain, the signal is converted back to the time domain using the ISTFT.
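The STFT and ISTFT steps can be sketched with NumPy as follows. This is a minimal illustration, not the implementation used in this work: the 16 kHz sampling rate, the helper names, the zero padding at the edges, and the weighted overlap-add inversion are assumptions, and the number of frequency bins it produces (161 for a 320-sample frame) is not the 128 coefficients used in the experiments.

```python
import numpy as np

def blackman_harris(N):
    """4-term Blackman-Harris window (periodic form), a cosine-sum window."""
    n = np.arange(N)
    a0, a1, a2, a3 = 0.35875, 0.48829, 0.14128, 0.01168
    return (a0 - a1 * np.cos(2 * np.pi * n / N)
            + a2 * np.cos(4 * np.pi * n / N)
            - a3 * np.cos(6 * np.pi * n / N))

def stft(x, frame_len, hop):
    """Windowed FFT of overlapping frames; returns a (frames, bins) array."""
    w = blackman_harris(frame_len)
    pad = frame_len - hop                      # zero padding protects the edges
    x = np.concatenate([np.zeros(pad), x, np.zeros(pad)])
    n_frames = 1 + int(np.ceil((len(x) - frame_len) / hop))
    x = np.pad(x, (0, (n_frames - 1) * hop + frame_len - len(x)))
    frames = np.stack([x[i * hop:i * hop + frame_len] * w
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(F, frame_len, hop, length):
    """Weighted overlap-add inverse of stft(); returns `length` samples."""
    w = blackman_harris(frame_len)
    frames = np.fft.irfft(F, n=frame_len, axis=1)
    total = (len(frames) - 1) * hop + frame_len
    y = np.zeros(total)
    norm = np.zeros(total)
    for i, frame in enumerate(frames):
        y[i * hop:i * hop + frame_len] += frame * w
        norm[i * hop:i * hop + frame_len] += w ** 2
    y = y / np.maximum(norm, 1e-12)            # undo the squared-window weight
    pad = frame_len - hop
    return y[pad:pad + length]

fs = 16000                                     # assumed sampling rate
frame_len, hop = fs * 20 // 1000, fs * 10 // 1000   # 20 ms frames, 10 ms hop
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)
F = stft(x, frame_len, hop)
x_rec = istft(F, frame_len, hop, len(x))
err = float(np.max(np.abs(x - x_rec)))
print(F.shape, err)
```

Dividing by the accumulated squared window makes the round trip exact up to floating-point error, so any change between input and output in the full model comes from the processing block and not from the transform pair.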

B. VARIATIONAL AUTOENCODER
In the frequency domain, we aim to convert the input (a speech signal with noise) to the output (a clean speech signal). We build a machine learning model that can identify whether a harmonic element belongs to the original speech signal or to the noise and then tries to keep all elements of the speech signal.
With this approach, the model can filter and hold most of the useful information needed to reconstruct the original speech. In this work, the VAE, a well-known variant of the autoencoder, is used as the main processor. An autoencoder [32], [33] is a special kind of neural network that is usually used to extract features or denoise the input. Ideally, the input and output of an autoencoder are the same because the network aims to compress the input data and then reconstruct it. To do this, the network contains two sub-networks named the encoder and the decoder, linked by a small layer called the Code, which is smaller than the input and output layers. The encoder network compresses all information in the input layer into the Code layer, and the decoder network then reconstructs the information from the Code layer at the Output layer. If the value at the output layer is nearly equal to the input, the main information can be reconstructed from the Code layer; in other words, the Code layer contains most of the important information of the input layer.
Let $\mathcal{X}$, $\mathcal{H}$, x, h, y, σ, w, b denote the data space, code space, input, code value, output, activation function, weight, and bias. A 3-layer autoencoder can be formulated as follows:
$$h = \sigma(w x + b), \qquad x \in \mathcal{X},\ h \in \mathcal{H} \tag{23}$$
$$y = \sigma'(w' h + b')$$
where σ', w', and b' are the activation function, weight, and bias of the decoder. Then the loss function is:
$$\mathcal{L}(x, y) = \|x - y\|^2 \tag{26}$$
In particular, the loss value for the ith point is obtained by adding the point index to the equation above:
$$\mathcal{L}(x_i, y_i) = \|x_i - y_i\|^2 \tag{27}$$
In a real application design, an autoencoder can contain n > 1 layers in both the encoder and decoder sub-networks. In this case, the formulas to compute the values at the hidden layers are similar to formula (23); the only difference is that the output of one layer is the input of the next layer, so the formula can be described as a nested function:
$$h_i = \sigma_i(w_i h_{i-1} + b_i), \qquad h_0 = x$$
where $h_i$, $\sigma_i$, $w_i$, $b_i$ are the value, activation function, weight, and bias at the ith hidden layer. Assuming there are n layers in the autoencoder, including one input layer, one output layer, and n − 2 hidden layers, we denote by p(.) and q(.) the encoder and decoder networks. Generally, the loss function is described by:
$$\mathcal{L}(x) = \|x - q(p(x))\|^2$$
or:
$$\mathcal{L}(x_i) = \|x_i - q(p(x_i))\|^2$$
for the ith sample. For the activation function, we use Leaky ReLU [34], [35] for all layers except the output layer:
$$\sigma(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases}$$
where α is a small positive slope. At the output layer, we do not use any specific activation function because the main purpose of this layer is to reconstruct the value at the input, so we apply the Identity function for this layer:
$$\sigma(x) = x$$
In an autoencoder model, the value at the code layer for a particular input is a fixed vector, but this representation does not correctly reflect the real world. Consider a human sound like /s/: each person pronounces this sound differently, but the signals share some properties that help people recognize the sound exactly. If the sound is represented by a probability distribution, it describes the signal better than a fixed number or vector. That is the idea of a VAE, a more powerful variant of the autoencoder. Unlike in traditional autoencoders, the code layer in a VAE is assumed to be generated from a prior distribution whose parameters are computed by the encoder network.
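The deterministic autoencoder described above (Leaky ReLU at the hidden layers, identity at the output layer, squared-error loss) can be sketched as a forward pass in NumPy. The layer sizes follow one configuration mentioned later in the experiments; the slope `alpha=0.01`, the random initialization, and the function names are assumptions for illustration, and no training is performed.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for positive inputs, alpha * x otherwise (alpha assumed)."""
    return np.where(x > 0, x, alpha * x)

def init_layers(sizes, rng):
    """Random (weight, bias) pairs for consecutive layer sizes."""
    return [(rng.standard_normal((n_out, n_in)) * np.sqrt(1.0 / n_in),
             np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(x, layers):
    """Nested evaluation h_i = sigma_i(w_i h_{i-1} + b_i), with h_0 = x."""
    h = x
    for i, (w, b) in enumerate(layers):
        a = w @ h + b
        is_output = i == len(layers) - 1
        h = a if is_output else leaky_relu(a)   # identity at the output layer
    return h

rng = np.random.default_rng(0)
layers = init_layers([256, 80, 35, 80, 256], rng)
x = rng.standard_normal(256)
y = forward(x, layers)
loss = float(np.sum((x - y) ** 2))              # squared reconstruction error
print(y.shape, loss)
```

Using the identity at the output lets the network produce negative values, which is needed because the real and imaginary STFT coefficients are signed.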
In most real cases, the prior distribution is a Gaussian (normal) distribution, so the value at the code layer can be sampled from N(µ, Σ), as in the left illustration in Figure 4 [36].
Let X, Q(.), z, P(.), and f denote the input, encoder network, code value, decoder, and output. When training the network, if z is sampled from a distribution, it is a random variable. As a result, the parameters of the encoder Q(.) cannot be updated via the backpropagation algorithm, so the network cannot learn. To solve this problem, z is computed as follows:
$$z = \mu + \Sigma \times \epsilon$$
where ε is a new random variable sampled from the N(0, 1) distribution. With this trick, the error from z can be propagated to the encoder Q(.) through µ and Σ, whose errors are easily computed via z. The objectives of a VAE [36] are to reconstruct the input and to maintain the Gaussian distribution in the code layer, so the loss function at the ith point is the sum of these two elements:
$$\mathcal{L}_i = \|X_i - f_i\|^2 + KL\big(q(x)\,\|\,p(x)\big) \tag{34}$$
where KL(q(x)||p(x)) is the Kullback-Leibler divergence [37], [38] of two probability distribution functions. This measure is used to compute how much these two distributions differ:
$$KL\big(q(x)\,\|\,p(x)\big) = \sum_{x} q(x) \log \frac{q(x)}{p(x)}$$
or:
$$KL\big(q(x)\,\|\,p(x)\big) = -\sum_{x} q(x) \log \frac{p(x)}{q(x)}$$
When z = µ + Σ × ε and ε ∼ N(0, 1), the KL term of the loss function is rewritten as follows:
$$KL = -\frac{1}{2} \sum_{l=1}^{L} \left(1 + \log \Sigma_l^2 - \mu_l^2 - \Sigma_l^2\right) \tag{37}$$
Finally, the total loss of the VAE is the sum of formulas (26) and (37), so it can be rewritten as follows:
$$\mathcal{L} = \sum_{j=1}^{J} (X_j - f_j)^2 - \frac{1}{2} \sum_{l=1}^{L} \left(1 + \log \Sigma_l^2 - \mu_l^2 - \Sigma_l^2\right)$$
where L and J are the lengths of the code layer and the output layer, respectively. Because a VAE is a kind of multi-layer neural network, its training and inference processes do not differ from those of a normal neural network. We use the backpropagation algorithm [39], [40] to adjust all parameters in the network in the training phase and forward propagation for inference.
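The sampling trick and the two loss terms can be illustrated with a short NumPy sketch. It assumes that the second encoder output is a per-dimension standard deviation (called `sigma` here), that the prior is a standard normal, and that the reconstruction term is a squared error; these names and choices are assumptions for illustration, not a description of the trained model.

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), elementwise."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL divergence between N(mu, sigma^2) and N(0, 1)."""
    return float(-0.5 * np.sum(1.0 + np.log(sigma ** 2) - mu ** 2 - sigma ** 2))

def vae_loss(target, output, mu, sigma):
    """Squared reconstruction error plus the KL regularization term."""
    reconstruction = float(np.sum((target - output) ** 2))
    return reconstruction + kl_to_standard_normal(mu, sigma)

rng = np.random.default_rng(0)
mu = np.zeros(35)
sigma = np.ones(35)
z = reparameterize(mu, sigma, rng)        # 35-dimensional code sample
target = rng.standard_normal(256)
total = vae_loss(target, target, mu, sigma)
print(z.shape, total)
```

Because the randomness is isolated in `eps`, `z` is a deterministic function of `mu` and `sigma`, which is what allows gradients to flow back to the encoder.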
We use the VAE to process the signal in the frequency domain. This means the input of the VAE is the STFT of the mixed signal, and the expected output is the STFT of the speech signal. We then use this frequency representation to compute the speech signal in the time domain.
Our network setting differs slightly from the network described above: the target output is not the same as the input. In our model, the input is the mixed signal while the output is the pure speech signal. In other words, the model receives a mixture of the main signal (here, a speech signal) and some unwanted signals, and processes it to return the original speech signal. The details of our design are specified in the experimental results.
The inference speed depends on the complexity of the model, including the VAE network and the filter. Because the filter runs at a fixed cost, the model complexity mainly depends on the VAE network. Let n and k denote the number of layers and the maximum layer size in this network. An input sample, corresponding to a k-dimensional vector, is passed through n multiplications between a k-dimensional vector and a k × k matrix. The cost of each multiplication is O(k²), so the total computational complexity is O(nk²). In most cases, because n is not a large number, the model is not computationally expensive.
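To make the O(nk²) bound concrete, the following sketch counts the multiplications for one configuration reported in the experiments, [256, 80, 35, 80, 256], where the code layer has 70 nodes (35 means and 35 variances). The helper name `multiply_count` is hypothetical, and bias additions and activations are ignored.

```python
def multiply_count(layer_sizes):
    """Number of scalar multiplications for a chain of dense layers."""
    return sum(n_in * n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

encoder = [256, 80, 70]   # 70 outputs: 35 means and 35 variances
decoder = [35, 80, 256]   # the decoder consumes the 35-dimensional sample z
exact = multiply_count(encoder) + multiply_count(decoder)

n, k = 4, 256             # four weight matrices, largest layer size 256
bound = n * k ** 2

print(exact, bound)       # prints 49360 262144
```

The exact count is well below the bound because the hidden layers are much smaller than k.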

C. BANDPASS FILTER
After reconstructing the speech signal via the autoencoder network, we use a BPF to eliminate all frequency elements that are outside the common range of human speech. In particular, most of the human speech signal lies in the range from 50 Hz to 5000 Hz. With this process, the main energy of the speech signal is kept while the remaining elements are ignored. We choose the Chebyshev type 1 filter [41] for this research because it is accurate enough and fast.
The BPF is designed to pass the signal in the band between 50 Hz and 5000 Hz. It is the combination of a high pass filter with a cutoff frequency of 50 Hz and a low pass filter with a cutoff frequency of 5000 Hz. The remainder of this section describes the low pass filter in detail; the high pass filter is designed in a similar way.
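The gain, or amplitude response, of the Chebyshev low pass filter is:
$$G_n(\omega) = \frac{1}{\sqrt{1 + r^2\, T_n^2\!\left(\frac{\omega}{\omega_0}\right)}}$$
where r is the ripple factor, $T_n$ is the nth order Chebyshev polynomial, and $\omega_0$ is the cutoff frequency. Parameter r is determined by:
$$r = \sqrt{10^{\delta/10} - 1} \tag{40}$$
where δ is the passband ripple in dB, a constant that is usually set to a small number and gives the difference between the maximum and minimum values of the gain in the passband region (Fig. 5). $T_n$ is the nth order Chebyshev polynomial [42], which is defined recursively:
$$T_0(x) = 1, \qquad T_1(x) = x, \qquad T_{n+1}(x) = 2x\,T_n(x) - T_{n-1}(x)$$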
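The gain formula for the Chebyshev type 1 low pass filter can be evaluated directly with the standard library. The sketch below computes the analog prototype gain only; the function names are hypothetical, and the order 4 and 3 dB ripple are taken from the hyperparameters reported in the experiments.

```python
import math

def chebyshev_t(n, x):
    """nth order Chebyshev polynomial via T_{k+1} = 2x T_k - T_{k-1}."""
    t_prev, t_curr = 1.0, x
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr

def lowpass_gain(freq, cutoff, order, ripple_db):
    """Gain of the analog Chebyshev type 1 low pass prototype at `freq`."""
    r = math.sqrt(10.0 ** (ripple_db / 10.0) - 1.0)   # ripple factor
    return 1.0 / math.sqrt(1.0 + (r * chebyshev_t(order, freq / cutoff)) ** 2)

gain_at_cutoff_db = 20.0 * math.log10(lowpass_gain(5000.0, 5000.0, 4, 3.0))
gain_at_8k_db = 20.0 * math.log10(lowpass_gain(8000.0, 5000.0, 4, 3.0))
print(gain_at_cutoff_db, gain_at_8k_db)
```

At the cutoff frequency the polynomial equals 1, so the gain is exactly the ripple value below the passband maximum; above the cutoff the polynomial grows quickly and the gain drops steeply.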

IV. EXPERIMENTAL RESULT

A. DATASET
In this work, we use trumpet sounds, water sounds, and traffic sounds as the background sounds. These backgrounds are chosen for their properties. First, we evaluate our model with a clear and clean background, so we choose an instrument sound (a trumpet in this case). The spectrum of this background has a clear and stable distribution. It is different from the distribution of human speech, so this test is the easiest case for evaluating our method. Second, we test our approach with an unclean background: we choose many kinds of water sound, including waterfall, rain, stream, and running water from a faucet. The distributions of the water sounds are not stable and sometimes interfere with the speech signal. This test is more difficult than the test with an instrument. The final background is a more complicated sound: traffic. This kind includes many different sources such as engine sounds, car horns, and blowing wind. The recorded sound is a mixture of many known and unknown elements. This is the most difficult test case for our method.
To represent the main speech signal, we use TIMIT [43]. This dataset contains recordings from 630 speakers. Each speaker is recorded 10 times with 10 different long sentences. There are eight main dialects of English in TIMIT, which helps us evaluate our approach in many cases with different kinds of speech sounds. For each dialect, we randomly choose a number of samples from the dataset (depending on the particular experiment) and then mix them with the background sound to form the mixed sound. After that, the mixed sound is mixed with random Gaussian noise to create the mixture. This sound is quite similar to sound in the real world, where the recorded sound is always masked by random noise during the recording process. Let us assume that for a speech signal s[n], we get its mixture $\tilde{x}[n]$. Thus, we can represent a pair of input and output of the whole model as follows:
$$(\tilde{x}[n],\ s[n]) \tag{42}$$
The background sounds are cloned into many versions with different powers by multiplying the sound signal with an array of real random numbers. Each version corresponds to one level of background magnitude. This means that when we multiply the signal by a large number, the background in the mixture is very loud, possibly louder than the speech sound. In this way, we can evaluate whether our design can extract the speech signal from a noisy environment.
In addition, we use the VIVOS dataset [44], a commonly used dataset for Vietnamese speech recognition, to check whether our proposed approach works independently of the language. This dataset is organized similarly to TIMIT. The differences between the two datasets include the language, the number of speakers, and the number of recordings per speaker. In the experiments, we use a random subset of these datasets instead of all of their recordings. This helps us validate the model many times with many distinct test cases.
In the testing phase, we mixed the background sounds and Gaussian noise into the speech sounds at 3 dB and 10 dB in terms of SDR. We chose these parameters to simulate common environments in the real world.
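Mixing at a chosen level can be sketched by scaling the interference so that the speech-to-interference power ratio matches a target in dB. The function name `mix_at_db`, the random stand-in signals, and the interpretation of the 3 dB and 10 dB levels as a plain power ratio are assumptions for illustration.

```python
import numpy as np

def mix_at_db(speech, interference, target_db):
    """Scale `interference` so that 10*log10(P_speech / P_interf) = target_db."""
    p_speech = np.mean(speech ** 2)
    p_interf = np.mean(interference ** 2)
    scale = np.sqrt(p_speech / (p_interf * 10.0 ** (target_db / 10.0)))
    return speech + scale * interference, scale

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)        # stand-in for a speech recording
background = rng.standard_normal(16000)    # stand-in for a background sound

mixture, scale = mix_at_db(speech, background, 3.0)
achieved_db = 10.0 * np.log10(np.mean(speech ** 2)
                              / np.mean((scale * background) ** 2))
print(round(float(achieved_db), 6))
```

A lower target value produces a louder background, which corresponds to the harder test conditions described above.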

B. HYPERPARAMETER FOR MODEL
At the STFT block, we set the window size to 20 milliseconds, and two consecutive frames overlap by 10 milliseconds. Each frame is then multiplied by the Blackman-Harris window function. Next, the frame is transformed by the STFT into 128 coefficients. These coefficients are complex numbers, so we replace them with 256 coefficients consisting of 128 real and 128 imaginary parts. The signal is thus represented by an array of 256-dimensional vectors.
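The conversion between 128 complex coefficients and a 256-dimensional real vector can be sketched as below. The ordering (all real parts first, then all imaginary parts) and the function names are assumptions; the text only states that the 256 values consist of 128 real and 128 imaginary coefficients.

```python
import numpy as np

def complex_to_real(frame):
    """Stack real and imaginary parts: 128 complex values -> 256 real values."""
    return np.concatenate([frame.real, frame.imag])

def real_to_complex(vec):
    """Inverse of complex_to_real: first half real parts, second half imaginary."""
    half = len(vec) // 2
    return vec[:half] + 1j * vec[half:]

rng = np.random.default_rng(0)
frame = rng.standard_normal(128) + 1j * rng.standard_normal(128)
vec = complex_to_real(frame)
restored = real_to_complex(vec)
print(vec.shape, np.allclose(restored, frame))
```

Keeping both parts preserves the phase, so the network output can be converted back to complex coefficients and passed to the ISTFT.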
We use many configurations for our autoencoder, with layers of various sizes. Supposing we are processing a frame of a sample, let us denote the input and output of the autoencoder network by $X_{ij}$ and $S_{ij}$, where X and S are the representations of the input and output in (42), and i and j are the index of the speech signal sample and the index of the frame within the sample. This means that the autoencoder transforms the STFT values of the mixture into those of a pure speech signal.
After reconstructing the speech signal with the VAE, we apply a 4th order Chebyshev BPF to remove all out-of-range frequency elements. We set the passband ripple to 3 dB and the stopband to −40 dB. In addition, the low frequency $f_{min}$ is set to 20 Hz and the high frequency $f_{max}$ to 5000 Hz.
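A filter with these settings can be sketched with SciPy's `signal.cheby1`. This is an illustration under assumptions, not the filter used in the experiments: the 16 kHz sampling rate is assumed, the filter is applied in the time domain, and the −40 dB stopband value is not a parameter of a type 1 design, so it is not modeled here.

```python
import numpy as np
from scipy import signal

fs = 16000                                   # assumed sampling rate in Hz

# 4th order Chebyshev type 1 bandpass, 3 dB passband ripple, 20-5000 Hz.
sos = signal.cheby1(4, 3, [20, 5000], btype="bandpass", fs=fs, output="sos")

# Frequency response at one in-band and one out-of-band frequency.
freqs = np.array([1000.0, 7500.0])
_, response = signal.sosfreqz(sos, worN=freqs, fs=fs)
gain_db = 20.0 * np.log10(np.abs(response))
print(gain_db)

# Apply the filter to a test signal containing both tones.
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 7500 * t)
y = signal.sosfilt(sos, x)
```

The second-order-sections form is used because the very low 20 Hz band edge can make a single transfer-function representation numerically fragile.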

1) ONE DIALECT VERSUS ONE BACKGROUND
There are 8 dialects in TIMIT, and we use all of them in this experiment. For each dialect, we randomly chose ten people with one hundred utterances. We then mixed all of these utterances with the trumpet sounds and Gaussian noise and used 90 samples in the training phase and 10 samples in the testing phase. After the model had converged, we reconstructed the ten remaining samples to get the output signals and finally compared them with the ground truths.
As can be seen in Table 1, all 8 cases show that our model can effectively extract the speech sound from the mixture because the SIR and SDR are positive. Although the results are good, they are not stable. For dialect 4, the result is very high, while for dialects 5 and 8, the results are much lower. In speech separation, the difficulty of the problem is determined by the difference between the human sound and the background sounds. If the properties and distributions of these sounds are similar, the model separates them imperfectly. The sounds of dialects 5 and 8 have high tones, and their distribution in each range of the frequency domain partly overlaps with that of the trumpet sounds. As a result, the results are not good in these cases. This also means that the optimal VAE configuration and hyperparameters are not the same for every dialect, and we should try many configurations to find the best solution for a particular dialect.

2) MANY DIALECTS VERSUS ONE BACKGROUND
In this experiment, we randomly chose 100 utterances from TIMIT and then mixed them with trumpet sounds at many different amplitudes and with Gaussian noise. After that, we processed them with our proposed model to extract the speech signal. The results are shown in Table 2. The results in this experiment are generally better than in the previous experiment. This can be explained by the fact that the data distribution in these experiments is not the same. When trained with one dialect, the model is biased toward that dialect and falls into a non-universal solution. If the model is trained with many different dialects, it can learn a more diverse distribution from the data. In addition, some dialects are more common and therefore include more samples than the others. When we selected randomly from the whole dataset, the samples per dialect were not balanced but followed the real distribution. As a result, the result in this case differs from that of the previous experiment; in particular, it is better and closer to real applications.

3) ONE DIALECT VERSUS MANY BACKGROUNDS
The main difference between a VAE and a filter or an ICA model is what the model learns. While the filter and the ICA model learn how to remove the interfering signal, the VAE learns the distribution of speech signals. As a result, the VAE can work with many different kinds of background sounds and noises, because it treats these signals as the residual left after the separation process. In a mixture, the VAE can identify where the speech signal is and then extract it from the mixed signal. To evaluate this ability, in this experiment we masked the speech sounds with many kinds of backgrounds and Gaussian noise and then processed them with the proposed model. Table 3 shows that all SDR and SIR values are positive. Although the results differ between backgrounds, the difference is not significant. This experiment demonstrates that the proposed approach does not depend on the interfering sounds.

4) MANY DIALECTS VERSUS MANY BACKGROUNDS
As in experiment 3, the purpose of this test is to check whether the proposed model can learn the distribution of human speech. In the previous experiment, we used only one dialect for each test case. That test was much easier than this one, because the joint distribution of all 8 dialects is more complicated than that of any single dialect. If the model works well in this test, it can be inferred that the approach could be extended and applied to many real applications.
The results in table 4 show that the combination of a VAE and a BPF can be applied to many dialects and many backgrounds. Although the SDR and SIR values are not very high, the results are stable across the test cases. This is evidence that the proposed model can be used in real applications.

5) SPEED OF SEPARATING PROCESS
In this test case, we implemented several configurations of our VAE to show the relationship between the complexity, speed, and performance of our models. In these configurations, the middle layer is the latent layer, or code layer, z. The numbers shown in table 5 are the numbers of latent dimensions, not the actual numbers of nodes in use. Specifically, the size of the code layer is twice the size given in table 5, because each latent dimension requires two parameters: one for the mean and one for the variance. For example, in case 1, the configuration [256,80,35,80,256] means that the code layer has 35 dimensions and 70 nodes: 35 nodes for the means and 35 nodes for the variances.
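The relationship between the latent dimension and the number of code-layer nodes can be illustrated with a small NumPy sketch based on the configuration [256,80,35,80,256]. The weights are random placeholders, the variable names are hypothetical, and parameterizing the variance through its logarithm is a common convention assumed here, not something stated in the text.

```python
import numpy as np

layer_sizes = [256, 80, 35, 80, 256]
latent_dim = layer_sizes[len(layer_sizes) // 2]  # 35 latent dimensions
code_nodes = 2 * latent_dim                      # 70 nodes: means + variances

rng = np.random.default_rng(0)
# Placeholder activations of the 80-unit hidden layer for a batch of 4 frames.
h = rng.normal(size=(4, layer_sizes[1]))
# Placeholder (untrained) weights of the code layer.
W = rng.normal(scale=0.1, size=(layer_sizes[1], code_nodes))
b = np.zeros(code_nodes)

code = h @ W + b
mean = code[:, :latent_dim]       # first 35 nodes: means
log_var = code[:, latent_dim:]    # last 35 nodes: log-variances (assumption)

# Reparameterization: sample z = mean + std * eps with eps ~ N(0, I).
eps = rng.normal(size=mean.shape)
z = mean + np.exp(0.5 * log_var) * eps

print(code.shape, z.shape)
```

The code layer therefore outputs 70 values per input frame, while the sampled latent vector z passed to the decoder has 35 dimensions.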
We reused the data from the first case of experiment IV-C2. That set contains 100 utterances randomly chosen from the TIMIT dataset. We used 90 of them for training and ran inference on the remaining 10. From each testing sample, a 1-second segment was randomly extracted before being passed into the processing block. In this experiment, we used an Intel Core i5 3.0 GHz CPU with 8 GB of RAM to evaluate the processing speed of the proposed method. The results for the different configurations are shown in table 5.
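The measurement procedure described above (random 1-second segments, timed on a CPU) could be sketched as follows. The helper names `random_crop` and `mean_runtime` are hypothetical, and `placeholder_separator` is only a stand-in for the trained model, which is not reproduced here.

```python
import time
import numpy as np

def random_crop(signal, fs, seconds, rng):
    """Extract a random contiguous segment of the given duration."""
    n = int(fs * seconds)
    start = rng.integers(0, len(signal) - n + 1)  # upper bound is exclusive
    return signal[start:start + n]

def mean_runtime(fn, segment, repeats=10):
    """Average wall-clock time in seconds of fn(segment) over several runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(segment)
    return (time.perf_counter() - start) / repeats

def placeholder_separator(x):
    # Stand-in for the separation model; it only does some arithmetic.
    return x - np.mean(x)

rng = np.random.default_rng(0)
fs = 16000                              # TIMIT sampling rate
utterance = rng.normal(size=3 * fs)     # synthetic 3-second signal
segment = random_crop(utterance, fs, 1.0, rng)
print(f"{mean_runtime(placeholder_separator, segment):.6f} s per 1-second segment")
```

Averaging over several repeats reduces the effect of timer resolution and background load on the reported processing time.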
From the results of the five test cases in table 5, we can conclude that the performance of our approach depends mainly on two factors: the depth of the VAE and the size of the code layer z. The deeper the VAE, the better the result, and the size of the code space should be chosen carefully. If the code is too large, the model is larger and runs more slowly. It also maps the data into a sparse space, so the reconstruction phase works ineffectively. If the code is too small, it cannot retain all the information needed to reconstruct the original signal. To choose the optimal size, we need to analyze the data and experiment with several configurations.

6) A TEST CASE WITH WHOLE TIMIT DATASET
In this test, we used the third VAE configuration from experiment IV-C5 for training and testing. We used the whole TIMIT dataset, with its default split into training and testing sets.
Table 6 compares our model with other methods, including wavelet [12], time-frequency filter bank [13], ICA [17], and VAE [25]. Our model achieves 12.99 dB in SDR, which is the highest score, and 15.02 dB in SIR. These results are good evidence of the efficiency of the proposed hybrid approach.
In our model, the VAE component plays the main role in the separation process. When only the VAE is applied for SSS, the SDR is close to the best result reported in [13]. The result with only the BPF is much lower than with only the VAE in terms of SDR, SIR, and PESQ. When the two components are combined to form the full model, it achieves a higher PESQ than the result in [25].

7) AN IMPLEMENTATION FOR MULTILINGUAL SPEECH
This experiment aims to check whether the VAE approach depends on the language. In this test, we used the VIVOS dataset in addition to TIMIT. VIVOS is commonly used for Vietnamese speech recognition research. It contains over 28,000 utterances recorded from nearly 50 people. We ran this test in the same way as test IV-C2. We randomly chose five subsets from VIVOS with 100 samples per set, the same size as each test case with TIMIT, and divided them into training and testing sets of the same sizes as in the first experiment. We also mixed 50 random samples from the VIVOS subset with 50 random samples from the TIMIT subset and used them as the third data subset. The detailed results are shown in table 7.
The results in table 7 show that the proposed model can efficiently extract speech signals from the mixture. Both SDR and SIR are positive and high in all cases. This means that our approach does not depend on the language and depends only on the distribution of the data in the dataset.
Although the experimental results with both TIMIT and VIVOS are positive, the results on VIVOS are slightly lower than those on TIMIT. This can be explained as follows. Vietnamese (the language of the VIVOS dataset) is complicated in its phonetics and phonology. It has 6 tones, so speakers use more high-frequency content to express the tone. As a result, the total energy in Vietnamese speech is spread over a wider frequency range than in English (the language of TIMIT). To get a better result in Vietnamese, we need to change the configuration to use a larger code layer and a higher f_max in the filter. In general, our proposed model can be applied to many languages with small changes in the model configuration. In other words, our method is mostly independent of language.
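The effect of raising the upper cutoff of the filter can be illustrated with an idealized bandpass filter. The sketch below uses a brick-wall mask in the FFT domain, which is an assumption chosen for brevity and not necessarily the filter design used in the proposed model; the function name `bandpass_fft` and the cutoff values of 300 Hz, 3400 Hz, and 7000 Hz are illustrative only.

```python
import numpy as np

def bandpass_fft(x, fs, f_min, f_max):
    """Idealized bandpass: zero all FFT bins outside [f_min, f_max] Hz."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    mask = (freqs >= f_min) & (freqs <= f_max)
    return np.fft.irfft(spectrum * mask, n=len(x))

# Toy signal: a 1 kHz component plus a 6 kHz component, 1 second at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
low = np.sin(2 * np.pi * 1000 * t)
high = np.sin(2 * np.pi * 6000 * t)
x = low + high

narrow = bandpass_fft(x, fs, 300.0, 3400.0)  # removes the 6 kHz component
wide = bandpass_fft(x, fs, 300.0, 7000.0)    # higher f_max keeps it
```

With the narrower passband only the 1 kHz component survives, while the wider passband preserves the high-frequency content as well, which is the kind of adjustment suggested above for speech with more high-frequency energy.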
On the other hand, the lowest performance belongs to the mixed dataset. In general, the distributions of English (in TIMIT) and Vietnamese (in VIVOS) are not the same in the frequency domain, which makes it harder for the model to learn the general distribution. Despite this weaker result, both SDR and SIR are positive, which means that our model can be applied in some cases where the speech contains more than one language.
Figure 6 shows 4 waveforms: the original speech, the background sound, the mixture, and the reconstructed sound. As can be seen, the reconstructed (estimated) signal has a form very similar to the original and very different from the mixture. This demonstrates that our model can reconstruct the original speech from the mixture well.

8) APPLYING THE PROPOSED SSS MODEL TO GENDER RECOGNITION
We integrated our model as a preprocessing component into a gender recognition system. Many current applications, such as voicebots and recommendation systems, use user information such as gender to suggest suitable content. To evaluate the performance of our model, we compared three kinds of test data: clean, mixture, and reconstructed. In this experiment, we used the TIMIT training set for training. We then mixed the TIMIT test set with trumpet sounds and Gaussian noise to form the mixture, and finally separated the speech from the mixture with our proposed method. All three kinds of test samples were passed to the recognizers to verify whether our model works. The detailed results are shown in table 8. The results of the 9 recognizers are very similar, which means that our model is stable and usable with many different algorithms. When the recognizers are applied to the mixture, their accuracy is much lower than on clean data because of the background and noise. On the data processed by our model, the results improve significantly and come close to those on clean data. This experiment is evidence that our approach can be used in a real application.

V. CONCLUSION
In this work, we propose a new design for speech separation that combines a variational autoencoder (VAE) and a bandpass filter (BPF). With this combination, our model can remove interfering signals and noise not only outside the voiceband but also where they overlap with the speech signal. Specifically, we use the VAE, a generative model, to reduce the impact of the overlapping elements on the main signal, and the BPF to remove all elements outside the voiceband. The experimental results show that our approach is more effective than many previous works. Across many tests, our model achieves good results, with high positive values, in signal to distortion ratio (SDR), signal to interference ratio (SIR), and perceptual evaluation of speech quality (PESQ). It works with many kinds of dialects, backgrounds, and noises. In addition, because the approach does not depend on language, it can be reconfigured and tuned for many real applications. Finally, our last experiment shows that the approach can serve as a preprocessing component in a real application such as gender classification, and that it works stably with many different algorithms in the main model.