Blind Monaural Source Separation on Heart and Lung Sounds Based on Periodic-Coded Deep Autoencoder

Auscultation is the most efficient way to diagnose cardiovascular and respiratory diseases. To reach accurate diagnoses, a device must be able to recognize heart and lung sounds from various clinical situations. However, recorded chest sounds are a mixture of heart and lung sounds. Thus, effectively separating these two sounds is critical in the pre-processing stage. Recent advances in machine learning have improved monaural source separation, but most of the well-known techniques require paired mixed sounds and individual pure sounds for model training. As the preparation of pure heart and lung sounds is difficult, special designs must be considered to derive effective heart and lung sound separation techniques. In this study, we propose a novel periodicity-coded deep autoencoder (PC-DAE) approach to separate mixed heart-lung sounds in an unsupervised manner, based on the assumption that heart rate and respiration rate have different periodicities. The PC-DAE benefits from deep-learning-based models by extracting representative features and considers the periodicity of heart and lung sounds to carry out the separation. We evaluated PC-DAE on two datasets. The first one includes sounds from the Student Auscultation Manikin (SAM), and the second was prepared by recording chest sounds in real-world conditions. Experimental results indicate that PC-DAE outperforms several well-known separation methods in terms of standardized evaluation metrics. Moreover, waveforms and spectrograms demonstrate the effectiveness of PC-DAE compared to existing approaches. It is also confirmed that, by using the proposed PC-DAE as a pre-processing stage, heart sound recognition accuracies can be notably boosted. The experimental results confirm the effectiveness of PC-DAE and its potential for use in clinical applications.


I. INTRODUCTION
Recently, biological acoustic signals have been enabling various intelligent medical applications. For example, the biological acoustic signals of the heart and lung can facilitate tasks such as diagnosing cardiovascular and respiratory diseases and monitoring sleep apnea syndrome [1][2][3][4][5][6][7][8]. Previous studies have already investigated the physical models of heart and lung sound generation and the corresponding classification mechanisms. For example, signal processing approaches (e.g., normalized average Shannon energy [9] and high-frequency-based methods [10]) and machine-learning-based models (e.g., neural network (NN) classifiers [11] and decision trees [12]) have been used to perform heart disease classification based on acoustic signals.
In addition, the information of S1-S2 and S2-S1 intervals has been adopted to further improve the classification accuracies [12], [13]. On the other hand, Gaussian mixture models [13], NN classifiers [14], and support vector machines [15], along with various types of acoustic features (e.g., power spectral density values and the Hilbert-Huang transform [16]), have been utilized to carry out lung sound recognition [17,18]. However, medical applications using such biological acoustic signals still face several challenges.
To reach accurate recognition, sound separation is one of the most important pre-processing steps. Because the measured signal is usually a mixture of heart and lung sounds, and pure heart/lung acoustic signals are generally not accessible, effectively separating heart and lung sounds is very challenging. The frequency range of normal heart sounds (first (S1) and second (S2) heart sounds) is mainly 20-150 Hz, and some high-frequency murmurs may reach 100-600 Hz, or even 1000 Hz [19]. On the other hand, the frequency range of normal lung sounds is 100-1000 Hz (tracheal sounds range from 850 Hz to 1000 Hz), and abnormal lung sounds such as the adventitious sounds of wheezes span a wide frequency range of 400-1600 Hz. The frequency ranges of heart and lung sounds can therefore be highly overlapped. This results in interference between the acoustic signals and may degrade the auscultation and monitoring performance. With an increasing demand for various acoustic-signal-based medical applications, effective heart and lung sound separation techniques have become fundamental, although challenging.
Sound separation techniques for heart and lung sounds have been studied extensively, and numerous methods have been proposed so far. For example, the studies in [22][23][24][25][26] focus on the adaptive filtering approach, while Mondal et al. [27,28] use empirical mode decomposition methods. Hossain and Hadjileontiadis et al. [29,30] proposed using the discrete wavelet transform to filter interference. Pourazad et al. [31] derived an algorithm that transforms the signal to the time-frequency domain via the short-time Fourier transform (STFT) and combines it with the continuous wavelet transform (CWT) to filter out heart sound components with a band-pass filter.
However, the above-mentioned traditional filtering approaches encounter difficulties due to the overlapped frequency bands. The works in [32][33][34] proposed blind source separation algorithms, including independent component analysis (ICA) and its extensions, in which prior knowledge of the sources is not required. Nevertheless, the ICA-based methods require at least two sensors and thus do not work for devices that have only a single channel [35][36][37]. Moreover, the assumption of independence between the heart and lung sound sources is somewhat optimistic.
Recently, supervised monaural (single-channel) nonnegative matrix factorization (NMF) was adopted to separate different sources [35,38]. It was recognized for its capability of handling overlapping frequency bands [39,40]. More recently, deep learning approaches have been used for source separation [40][41][42][43]. Although these deep models directly decompose the mixture into the target sources and outperform the NMF approach, they depend on supervised training data. However, in biomedical applications, training data of pure heart/lung acoustic signals are difficult or too expensive to measure.
To overcome the mentioned challenges, this paper proposes a periodicity-coded deep autoencoder (PC-DAE) approach, an unsupervised-learning-based mechanism to effectively separate the sounds of heart and lung sources. The proposed algorithm first adopts the DAE model [40][44][45][46] to extract highly expressive representations of the mixed sounds. Next, by applying modulation frequency analysis (MFA) [47] to the latent representations, we can group the neurons based on their properties in the modulation domain and then perform separation on the mixed sound. The advantage of PC-DAE is that labeled training data (more specifically, paired mixed sounds and individual pure sounds) are not required, in contrast to typical learning-based approaches. It also exploits the periodicity structure to provide better separation performance than the traditional methods. The remainder of this paper is organized as follows. In Section 2, we review the NMF and DAE algorithms. In Section 3, the proposed PC-DAE is introduced in detail. In Section 4, we present the experimental setup and results, where two datasets were designed and used to test the proposed PC-DAE model. The first one consists of phonocardiogram signals from the Student Auscultation Manikin (SAM) database [48], and the second one was prepared in a real-world condition. Experimental results confirm that PC-DAE effectively separates the mixed heart-lung sounds and outperforms related works, including direct-clustering NMF (DC-NMF) [35], PC-NMF [49], and deep clustering (DC) [45], in terms of three standardized evaluation metrics, qualitative comparisons based on separated waveforms and spectrograms, and heart sound recognition accuracy.

II. RELATED WORKS
Numerous methods have been proposed to separate the heart and lung sound signals. Among them, the NMF is a notable one that has been applied to separate different sounds [35,38]. The DAE model is another well-known approach. Based on the model architecture, the DAE can be constructed by a fully connected architecture, termed DAE(F), or by a fully convolutional architecture, termed DAE(C). In this section, we provide a review of the NMF algorithm and the DAE(F) and DAE(C) models.

A. Non-negative matrix factorization (NMF)
The conventional NMF algorithm factorizes a matrix $V$ into two matrices, a dictionary matrix $W$ and an encoded matrix $H$. The product of $W$ and $H$ approximates $V$, and all entries of these matrices are nonnegative. NMF-based source separation can be divided into two categories, namely supervised (where individual source sounds are provided) and unsupervised (where individual source sounds are not accessible). Supervised NMF-based approaches require a pre-trained, fixed spectral matrix $W = [W_1, \ldots, W_A]$, where $A$ is the number of sources and each block $W_i$ captures the characteristics of one sound source [35,50]. To perform the separation, the recording that contains multiple sounds is first factorized by NMF into $W$ and $H$. Then $H$ is divided into $A$ blocks, $H = [H_1; \ldots; H_A]$. Multiplying $W_i$ and $H_i$ ($i = 1, \ldots, A$) yields the individual sound sources.
For unsupervised NMF-based approaches, since individual source sounds are not available, some statistical assumptions must be applied. An intuitive approach is to cluster the vectors in $H$ into several distinct groups. A particular sound can then be reconstructed from one group of vectors in $H$ along with $W$. The work of Lin et al. [49], on the other hand, designed PC-NMF using another concept, which is to incorporate the periodicity property of distinct source sounds into the separation framework. More specifically, PC-NMF treats the rows of the encoded matrix $H$ as time vectors and uses their differences in periodicity to separate the biological sounds. Because heart and lung sounds have different periodic characteristics (heart rate and respiration rate are very different), the mixed heart-lung sound can be separated through a PC-NMF model, as will be presented in Section 4.
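The factorization and group-wise reconstruction described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in [35] or [49]: it uses the classic multiplicative updates for the squared Euclidean cost, and the function names `nmf` and `reconstruct_group`, the iteration count, and the random initialization are all illustrative choices.

```python
import numpy as np


def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative matrix V (F x N) by W (F x rank) @ H (rank x N)."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = V.shape
    W = rng.random((n_rows, rank)) + eps
    H = rng.random((rank, n_cols)) + eps
    for _ in range(n_iter):
        # Multiplicative updates for the squared Euclidean cost; every factor
        # is non-negative, so W and H stay non-negative throughout.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def reconstruct_group(W, H, component_ids):
    """Rebuild the part of V explained by one group of components."""
    return W[:, component_ids] @ H[component_ids, :]
```

Once the components have been assigned to sources (by direct clustering or by periodicity), calling `reconstruct_group` with the indices of one group gives the spectrogram estimate for that source; the estimates of all groups sum to `W @ H`.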

B. Deep Autoencoder (DAE)
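The DAE has two components, an encoder $E(\cdot)$ and a decoder $D(\cdot)$. Figure 1 shows the architecture of a DAE(C) model. If the encoder and decoder have $K_E$ and $K_D$ layers, respectively, the total number of layers in the DAE is $K_{All} = K_E + K_D$. The encoder encodes the input $\mathbf{x}$ into the middle latent space, $\mathbf{l}^{(K_E)} = E(\mathbf{x})$, and the decoder reconstructs the input as $\hat{\mathbf{x}} = D(\mathbf{l}^{(K_E)})$. The reconstructed output $\hat{\mathbf{x}}$ is expected to be approximately equal to $\mathbf{x}$. The mean squared error (MSE) is generally used to measure the difference between $\hat{\mathbf{x}}$ and $\mathbf{x}$, and the DAE is trained by minimizing the MSE. As mentioned earlier, by using fully connected and fully convolutional architectures, we can build DAE(F) and DAE(C), respectively [51][52][53].

Fig. 2 shows the neuron connections of the $k$-th and $(k+1)$-th layers for the two types of DAE. Fig. 2(a) presents the fully connected layer, where each neuron in the $(k+1)$-th layer is connected with all neurons in the $k$-th layer. Fig. 2(b) and (c), respectively, present the convolutional and deconvolutional connections, where each neuron in the $(k+1)$-th layer is only partially connected with the neurons in the $k$-th layer.

As can be seen from Fig. 2(a), the DAE(F) forms the encoder and decoder from fully connected units, as shown in Eqs. (1) and (2), where $\mathbf{W}_E^{(k)}$ and $\mathbf{W}_D^{(k)}$ represent the encoding and decoding matrices, $\mathbf{b}_E^{(k)}$ and $\mathbf{b}_D^{(k)}$ are the bias terms, and $\sigma(\cdot)$ is the activation function. For the encoder, we have

$$\mathbf{l}^{(k+1)} = \sigma\left(\mathbf{W}_E^{(k)} \mathbf{l}^{(k)} + \mathbf{b}_E^{(k)}\right), \quad k = 0, \ldots, K_E - 1, \quad \mathbf{l}^{(0)} = \mathbf{x}, \tag{1}$$

where $\mathbf{l}^{(K_E)} \in \mathbb{R}^{M}$, and $M$ stands for the total number of neurons in the latent space. For the decoder, we have

$$\mathbf{l}^{(k+1)} = \sigma\left(\mathbf{W}_D^{(k)} \mathbf{l}^{(k)} + \mathbf{b}_D^{(k)}\right), \quad k = K_E, \ldots, K_{All} - 1, \quad \hat{\mathbf{x}} = \mathbf{l}^{(K_{All})}. \tag{2}$$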
In DAE(C), the encoder is formed by convolutional units, as shown in Eq. (3), which execute the convolutional function $F_C(\cdot)$. Each encoding layer has $J$ filters, $\mathbf{W}_1^{(k)}, \ldots, \mathbf{W}_J^{(k)}$, with $\mathbf{W}_j^{(k)} \in \mathbb{R}^{L \times I}$, where $L$ is the kernel size and $\mathbf{W}_{j,i}^{(k)}$ is the $i$-th channel of $\mathbf{W}_j^{(k)} = [\mathbf{W}_{j,1}^{(k)}, \ldots, \mathbf{W}_{j,I}^{(k)}]$. Each neuron in the $j$-th feature map of the $(k+1)$-th layer, $\mathbf{l}_j^{(k+1)}$, is the sum of the element-wise products of $\mathbf{W}_{j,i}^{(k)}$ and the receptive fields of all previous feature maps $\mathbf{l}_i^{(k)}$, obtained by the convolution operation, and $b_j^{(k)}$ denotes the bias term. The corresponding convolution operation is shown in Fig. 3(a). The decoder is formed by deconvolutional units, as shown in Eq. (4). During deconvolution, all of the $k$-th layer's feature maps $\mathbf{l}^{(k)}$ first go through zero-padding and then the deconvolution process (with function $F_D(\cdot)$). Each decoding layer likewise has $J$ filters with kernel size $L$ and $I$ channels, each neuron in the $(k+1)$-th layer, $\mathbf{l}_j^{(k+1)}$, is the sum of the element-wise products of $\mathbf{W}_{j,i}^{(k)}$ and the receptive fields of all previous feature maps $\mathbf{l}_i^{(k)}$, obtained by the deconvolution operation, and $b_j^{(k)}$ denotes the bias term. The corresponding deconvolution operation is shown in Fig. 3(b). For the encoder, we have

$$\mathbf{l}_j^{(k+1)} = \sigma\left(\sum_{i=1}^{I} F_C\left(\mathbf{W}_{j,i}^{(k)}, \mathbf{l}_i^{(k)}\right) + b_j^{(k)}\right), \quad k = 0, \ldots, K_E - 1, \quad \mathbf{l}^{(0)} = \mathbf{x}, \tag{3}$$

where $\mathbf{l}_j^{(k)}$ is the $j$-th feature map in the $k$-th layer, and $I$ is the total number of channels. For the decoder, we have

$$\mathbf{l}_j^{(k+1)} = \sigma\left(\sum_{i=1}^{I} F_D\left(\mathbf{W}_{j,i}^{(k)}, \mathbf{l}_i^{(k)}\right) + b_j^{(k)}\right), \quad k = K_E, \ldots, K_{All} - 1, \quad \hat{\mathbf{x}} = \mathbf{l}^{(K_{All})}, \tag{4}$$

where $K_{All}$ denotes the total number of layers in the DAE(C).

With the trained DAE, the periodic analysis is applied to the latent representations to identify two disjoint groups of neurons corresponding to heart and lung sounds. The basic concept is to exploit the temporal information of sources with different periodicities. To classify the temporal information by periodicity, the coded matrix is transformed into the periodic coded matrix $\mathbf{P}$ via a modulation frequency analyzer (MFA). Here, we adopted the discrete Fourier transform (DFT) to perform MFA. The periodic coded matrix presents clear periodicity characteristics. Because heart and lung sounds have different periodicities, the whole coded matrix can be split into a heart coded matrix and a lung coded matrix on the basis of $\mathbf{P}$. Afterwards, each source coded matrix is passed through the decoder to reconstruct the LPS sequences of the separated heart sound, $\mathbf{Y}^{\text{heart}}$, and lung sound, $\mathbf{Y}^{\text{lung}}$. The output LPS features are then converted back to waveform-domain signals by applying the inverse short-time Fourier transform (ISTFT).
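As a rough illustration of the fully connected encoder and decoder reviewed in Section II-B, the following NumPy sketch applies one affine map plus activation per layer and measures the reconstruction MSE. It is only a sketch under several assumptions: the layer sizes are toy values chosen for brevity rather than the configuration used in the experiments, the weights are random and untrained, ReLU is assumed as the activation, and the helper names `init_layers`, `forward`, and `mse` are hypothetical.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def init_layers(sizes, rng):
    """One (weight, bias) pair per consecutive pair of layer sizes."""
    return [
        (rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(x, layers):
    """Apply l <- relu(W @ l + b) layer by layer."""
    for W, b in layers:
        x = relu(W @ x + b)
    return x


def mse(x, x_hat):
    return float(np.mean((x - x_hat) ** 2))


rng = np.random.default_rng(0)
encoder = init_layers([301, 64, 16], rng)  # toy sizes (assumption)
decoder = init_layers([16, 64, 301], rng)
x = rng.random(301)                        # stand-in for one spectral frame
latent = forward(x, encoder)               # middle latent representation
x_hat = forward(latent, decoder)           # reconstruction of the input
print(latent.shape, x_hat.shape, mse(x, x_hat))
```

Training would adjust the weights to reduce `mse(x, x_hat)` over many frames; because the target equals the input, no separated reference sounds are needed at this stage.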

A. Periodic Analysis Algorithm
In this section, we present the details of the MFA. Fig. 4 illustrates the overall PC-DAE framework. First, we train a DAE(F) or DAE(C) model with the encoder and decoder as shown in Eqs. (1) and (2) or Eqs. (3) and (4), respectively. Then, we input the sequence of mixed heart-lung sounds, $\mathbf{X}$, to obtain the latent representations. The collection of latent representations over the time sequence forms the matrix $\mathbf{L} = [\mathbf{l}_1, \mathbf{l}_2, \ldots, \mathbf{l}_N]$. Thus, we obtain

$$\mathbf{L} = \begin{bmatrix} l_{1,1} & \cdots & l_{1,N} \\ \vdots & l_{j,n} & \vdots \\ l_{M,1} & \cdots & l_{M,N} \end{bmatrix}, \tag{5}$$

where $l_{j,n} \in \mathbb{R}$, $j$ is the neuron index with $1 \le j \le M$, $n$ is the time stamp with $1 \le n \le N$, and $N$ is the total number of frames. We assume that, among the latent representations, some neurons are activated by heart sounds and the others by lung sounds. Based on this assumption, we can separate mixed heart-lung sounds in the latent representation space. To determine whether each neuron is activated by heart or lung sounds, we transpose the original $\mathbf{L}$ to obtain $\mathbf{L}^T$ ($T$ denotes matrix transpose). Thus, we obtain

$$\mathbf{L}^T = [\mathbf{l}'_1, \ldots, \mathbf{l}'_j, \ldots, \mathbf{l}'_M], \tag{6}$$

where $\mathbf{l}'_j = [l_{j,1}, \ldots, l_{j,N}]^T$ is the trajectory of the $j$-th neuron over time. With $\mathbf{L}^T$, we intend to cluster the entire set of neurons into two groups, one group corresponding to heart sounds and the other to lung sounds. More specifically, when a pure heart sound is input to the DAE, only the group of neurons corresponding to heart sounds is activated, and the group corresponding to lung sounds is deactivated. When a pure lung sound is input to the DAE, on the other hand, the group of neurons corresponding to lung sounds is activated, and the group corresponding to heart sounds is deactivated. The strategy to determine these two groups of neurons is based on the periodicity of heart and lung sounds.
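To make the periodicity idea concrete, the sketch below takes the time trajectory of each latent neuron and computes the magnitude of its DFT, which exposes the dominant modulation frequency of that neuron. It is an illustration on synthetic data, not the paper's Algorithm 1: the function name `modulation_spectra`, the mean removal before the DFT, and the two artificial trajectories are assumptions, while the frame rate follows from an 8 kHz sampling rate and a 128-sample frame shift.

```python
import numpy as np


def modulation_spectra(L, frame_rate):
    """L holds one neuron per row (M x N). Returns (freqs, P), P of shape (N//2 + 1, M)."""
    centered = L - L.mean(axis=1, keepdims=True)  # assumption: drop the DC offset
    P = np.abs(np.fft.rfft(centered, axis=1)).T   # one modulation spectrum per column
    freqs = np.fft.rfftfreq(L.shape[1], d=1.0 / frame_rate)
    return freqs, P


frame_rate = 8000 / 128           # frames per second (8 kHz audio, 128-sample shift)
t = np.arange(1000) / frame_rate  # 1000 frames = 16 s
L = np.vstack([
    1.0 + np.cos(2 * np.pi * 1.25 * t),  # "heart-like" neuron, 75 cycles per minute
    1.0 + np.cos(2 * np.pi * 0.25 * t),  # "lung-like" neuron, 15 cycles per minute
])
freqs, P = modulation_spectra(L, frame_rate)
dominant = freqs[np.argmax(P, axis=0)]
print(dominant)  # dominant modulation frequency per neuron: 1.25 Hz and 0.25 Hz
```

Neurons whose modulation spectra peak near the heart rate and those that peak near the respiration rate end up with clearly different columns in `P`, which is what the subsequent clustering step relies on.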
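Algorithm 1 shows the detailed procedure of the periodic analysis. To analyze the periodicity of each submatrix $\mathbf{l}'_j$, we form the periodic coded matrix $\mathbf{P} = [\mathbf{p}_1, \ldots, \mathbf{p}_j, \ldots, \mathbf{p}_M]$ by applying the MFA to $\mathbf{l}'_j$, as shown in Eq. (7).

$$\mathbf{p}_j = \left|\mathrm{MFA}(\mathbf{l}'_j)\right| \tag{7}$$

When the DFT is used to carry out MFA, we have $\mathbf{p}_j \in \mathbb{R}^{N/2+1}$, and $\mathbf{P}$ can be clustered into two groups. There are numerous clustering approaches available, and we used the sparse NMF clustering method to cluster the vectors in $\mathbf{P}$ into two groups [55]. Eq. (8) shows the clustering process by NMF, which is also achieved by minimizing an error function. On the basis of the largest score in each column $\mathbf{h}_j$ of the encoding matrix $\mathbf{H}_P$, the cluster assignment of $\mathbf{p}_j$ can be determined.

$$\min_{\mathbf{W}_P, \mathbf{H}_P \ge 0} \left\|\mathbf{P} - \mathbf{W}_P \mathbf{H}_P\right\|_F^2 + \lambda \left\|\mathbf{H}_P\right\|_1 \tag{8}$$

where $\mathbf{W}_P$ represents the cluster centroids, $\mathbf{H}_P = [\mathbf{h}_1, \ldots, \mathbf{h}_j, \ldots, \mathbf{h}_M]$ represents the cluster membership, $\mathbf{h}_j \in \mathbb{R}^{k}$, $k$ is the number of clusters (basis vectors), $\lambda$ represents the sparsity penalty factor, and $\|\cdot\|_F$ and $\|\cdot\|_1$ denote the Frobenius norm and the L1 norm, respectively. After obtaining the coded matrix of each source, $\mathbf{L}^{\text{heart}}$ and $\mathbf{L}^{\text{lung}}$, we decode it as in Eqs. (9) and (10).

$$\hat{\mathbf{Y}}^{\text{heart}} = D\left(\mathbf{L}^{\text{heart}}\right) \tag{9}$$

$$\hat{\mathbf{Y}}^{\text{lung}} = D\left(\mathbf{L}^{\text{lung}}\right) \tag{10}$$

In the proposed approach, we compute the ratio masks of these two sounds, which are defined in Eqs. (11) and (12), where the division is element-wise.

$$\mathbf{M}^{\text{heart}} = \frac{\hat{\mathbf{Y}}^{\text{heart}}}{\hat{\mathbf{Y}}^{\text{heart}} + \hat{\mathbf{Y}}^{\text{lung}}} \tag{11}$$

$$\mathbf{M}^{\text{lung}} = \frac{\hat{\mathbf{Y}}^{\text{lung}}}{\hat{\mathbf{Y}}^{\text{heart}} + \hat{\mathbf{Y}}^{\text{lung}}} \tag{12}$$

With the estimated $\mathbf{M}^{\text{heart}}$ and $\mathbf{M}^{\text{lung}}$, we obtain the heart LPS $\mathbf{Y}^{\text{heart}}$ and lung LPS $\mathbf{Y}^{\text{lung}}$ by Eqs. (13) and (14).

$$\mathbf{Y}^{\text{heart}} = \mathbf{M}^{\text{heart}} \odot \mathbf{X} \tag{13}$$

$$\mathbf{Y}^{\text{lung}} = \mathbf{M}^{\text{lung}} \odot \mathbf{X} \tag{14}$$

where $\odot$ denotes element-wise multiplication. Then $\mathbf{Y}^{\text{heart}}$ and $\mathbf{Y}^{\text{lung}}$, along with the original phase, are used to obtain the separated heart and lung waveforms.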
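The grouping and masking steps of this section can be sketched as follows. This is a hedged illustration rather than the authors' code: `sparse_nmf_labels` uses simple multiplicative updates with an L1 penalty as one common way to realize sparse NMF clustering (the exact solver of [55] may differ), `decode` is a hypothetical callable standing in for the trained decoder, and the mask computation assumes non-negative decoder outputs.

```python
import numpy as np


def sparse_nmf_labels(P, k=2, lam=0.1, n_iter=300, eps=1e-9, seed=0):
    """Group the columns of a non-negative matrix P (D x M) into k clusters.

    Heuristic multiplicative updates for ||P - W H||_F^2 + lam * ||H||_1;
    column j is assigned to the row of H with the largest activation.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((P.shape[0], k)) + eps
    H = rng.random((k, P.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ P) / (W.T @ W @ H + lam + eps)
        W *= (P @ H.T) / (W @ H @ H.T + eps)
        scale = np.linalg.norm(W, axis=0, keepdims=True) + eps
        W /= scale    # unit-norm centroids ...
        H *= scale.T  # ... while keeping the product W @ H unchanged
    return np.argmax(H, axis=0)


def ratio_mask_separation(X, L, heart_neurons, decode, eps=1e-9):
    """Split mixed features X (F x N) using latent matrix L (M x N).

    heart_neurons: boolean vector of length M (True = heart group).
    decode: hypothetical decoder mapping an (M x N) latent matrix to (F x N).
    """
    keep_heart = heart_neurons[:, None].astype(L.dtype)
    Y_heart = decode(L * keep_heart)          # lung neurons deactivated
    Y_lung = decode(L * (1.0 - keep_heart))   # heart neurons deactivated
    total = Y_heart + Y_lung + eps
    return (Y_heart / total) * X, (Y_lung / total) * X
```

Which of the two clusters is the heart group still has to be decided; one plausible rule (an assumption here) is to pick the cluster whose average modulation spectrum peaks at the higher frequency, since the heart rate is normally faster than the respiration rate.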

A. Experimental setups
In addition to the proposed PC-DAE(F) and PC-DAE(C), we tested some well-known approaches for comparison, including direct-clustering NMF (DC-NMF), PC-NMF, and deep clustering based on DAE (DC-DAE). PC-NMF and PC-DAE share a similar functionality, with PC-DAE performing the clustering on the latent representations for heart and lung sound separation. For a fair comparison, the DC-NMF, PC-NMF, and DC-DAE implemented in this study were carried out in an unsupervised manner. For all the methods, the mixed spectrograms were used as the input, and the separated heart and lung sounds were generated at the output.
The DAE(F) model consisted of seven hidden layers with 1024, 512, 256, 128, 256, 512, and 1024 neurons, respectively. The encoder of the DAE(C) model consisted of three convolutional layers: the first layer had 32 filters with a kernel size of 1 × 4, the second layer had 16 filters with a kernel size of 1 × 3, and the third layer had 8 filters with a kernel size of 1 × 3. The decoder comprised four layers: the first layer had 8 deconvolutional filters with a kernel size of 1 × 3, the second layer had 16 deconvolutional filters with a kernel size of 1 × 3, the third layer had 32 deconvolutional filters with a kernel size of 1 × 4, and the fourth layer had 1 deconvolutional filter with a kernel size of 1 × 1. Both convolution and deconvolution units adopted a stride of 1. Rectified linear units were used in the encoder and decoder, and the optimizer was Adam.

The unsupervised NMF-based methods were used as the baselines, where the basis number of NMF was set to 20, and the L2 norm was used as the cost function. The NMF approach first decomposes the input spectrogram $V$ into the basis matrix $W$ and the weight matrix $H$, where $W$ serves as the sound basis (including both heart and lung sounds), and $H$ contains the weighting coefficients:

$$V_{ij} \approx (WH)_{ij} = \sum_{a} W_{ia} H_{aj}, \tag{15}$$

where $V_{ij}$ is the $ij$-th component of $V$ (a matrix that contains multiple sound sources), and $W_{ia}$ and $H_{aj}$ are the $ia$-th component of $W$ and the $aj$-th component of $H$, respectively. For unsupervised source separation, the weighting coefficient matrix $H$ is clustered into several distinct groups. When performing separation, the target source of interest can be reconstructed by using the group of vectors in $H$ that corresponds to the target source. Because the clustering is directly applied to the weighting matrix, we refer to this approach as DC-NMF and use it as the first baseline system. Rather than clustering directly, PC-NMF [49] clusters the vectors in $H$ based on the periodicity of the individual sound sources; PC-NMF was implemented as the second baseline.
Recently, a deep clustering technique [56] that combines a deep learning algorithm and a clustering process has been proposed and confirmed effective for speech [45] and music [46] separation. The fundamental theory of deep clustering is similar to that of DC-NMF, except that the clustering is applied to the latent representations instead of the weighting matrix. Because the deep-learning models first transform the input spectrograms into more representative latent features, the clustering of latent features can provide superior separation results. In this study, we implemented a deep clustering approach as another comparative method. We used the model architecture of DAE(C) as the deep-learning-based model when implementing the deep clustering approach; hence, the approach is termed DC-DAE(C).
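For all the separation methods conducted in this study, we can obtain separated heart and lung sounds. We used the pure heart and lung sounds as references to compute the separation performance and adopted three standardized evaluation metrics, namely the signal-to-distortion ratio (SDR), the signal-to-interference ratio (SIR), and the signal-to-artifacts ratio (SAR) [57]. In a source separation task, there are three types of noise: (1) noise due to missed separation ($e_{\text{interf}}$), (2) noise due to the reconstruction process ($e_{\text{artif}}$), and (3) perturbation noise ($e_{\text{noise}}$). The computations of SDR, SIR, and SAR are presented in Eqs. (16)-(19), where $\hat{s}(t)$ is the estimated result and $s_{\text{target}}(t)$ is the target.

$$\hat{s}(t) = s_{\text{target}}(t) + e_{\text{interf}}(t) + e_{\text{noise}}(t) + e_{\text{artif}}(t) \tag{16}$$

$$\mathrm{SDR} = 10 \log_{10} \frac{\left\|s_{\text{target}}\right\|^2}{\left\|e_{\text{interf}} + e_{\text{noise}} + e_{\text{artif}}\right\|^2} \tag{17}$$

$$\mathrm{SIR} = 10 \log_{10} \frac{\left\|s_{\text{target}}\right\|^2}{\left\|e_{\text{interf}}\right\|^2} \tag{18}$$

$$\mathrm{SAR} = 10 \log_{10} \frac{\left\|s_{\text{target}} + e_{\text{interf}} + e_{\text{noise}}\right\|^2}{\left\|e_{\text{artif}}\right\|^2} \tag{19}$$

For all three metrics, higher scores indicate better source separation results.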
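For intuition about how these three scores relate to the error terms, the following NumPy sketch computes a simplified version of them. It is not the full procedure of [57]: it assumes time-invariant gains (no distortion filters), ignores the perturbation-noise term, and obtains the decomposition by least-squares projection of the estimate onto the reference sources. The function name `simple_bss_metrics` is hypothetical.

```python
import numpy as np


def simple_bss_metrics(estimate, target, interferers, eps=1e-12):
    """Return (SDR, SIR, SAR) in dB for 1-D signals, using plain projections."""
    sources = np.vstack([target] + list(interferers))  # (n_sources, T)
    # Part of the estimate explained by the target alone.
    s_target = (estimate @ target) / (target @ target) * target
    # Part explained by all reference sources together (least squares).
    coeffs, *_ = np.linalg.lstsq(sources.T, estimate, rcond=None)
    explained = coeffs @ sources
    e_interf = explained - s_target   # leakage from the other sources
    e_artif = estimate - explained    # whatever no source can explain

    def ratio_db(num, den):
        return 10.0 * np.log10(np.sum(num ** 2) / (np.sum(den ** 2) + eps))

    sdr = ratio_db(s_target, e_interf + e_artif)
    sir = ratio_db(s_target, e_interf)
    sar = ratio_db(s_target + e_interf, e_artif)
    return sdr, sir, sar


t = np.arange(8000) / 8000.0
heart = np.sin(2 * np.pi * 50 * t)   # synthetic stand-ins for the references
lung = np.sin(2 * np.pi * 200 * t)
estimate = heart + 0.1 * lung        # heart estimate with 10% lung leakage
sdr, sir, sar = simple_bss_metrics(estimate, heart, [lung])
print(round(sdr, 2), round(sir, 2))  # both close to 20 dB in this toy case
```

In this toy case the estimate contains no artifacts, so SAR is very large and SDR is governed by the interference alone.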
We conducted experiments using two datasets. In the first dataset, the heart and lung sounds were collected from SAM, which is standard equipment for teaching and learning heart and lung sounds [48]. Fig. 5 shows the model of SAM. The SAM attempts to simulate the real human body and has many speakers inside its body at positions corresponding to the organs. The SAM can generate clean heart or lung sounds at different locations. We used the iMEDIPLUS electronic stethoscope [58] to record heart and lung sounds in an anechoic chamber. The heart sounds used in this experiment included normal heart sounds with two beats (S1 and S2). The lung sounds in this experiment included normal, wheezing, rhonchi, and stridor sounds. Both heart and lung sounds were sampled at 8 kHz. The two sounds were mixed at different signal-to-noise ratio (SNR) levels (-6 dB, -2 dB, 0 dB, 2 dB, and 6 dB), using the pure heart sound as the target signal and the pure lung sound as the noise signal. All the sounds were converted into the spectral domain by applying the short-time Fourier transform (STFT) with a frame length of 2048 and a frame shift of 128. Because high-frequency parts may not provide critical information for further analyses, we only used bins 0-300 (corresponding to 0-1170 Hz) in this study.
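The mixing and frequency-bin settings above can be checked with a short NumPy sketch. It is illustrative only: the function names `mix_at_snr` and `bin_to_hz` are hypothetical, the signals are random stand-ins for real recordings, and the SNR is defined here on mean signal power over the whole excerpt, which may differ from the exact procedure used to build the dataset.

```python
import numpy as np


def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so that 10*log10(P_signal / P_noise) equals snr_db, then add it."""
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return signal + gain * noise, gain


def bin_to_hz(k, fs=8000, n_fft=2048):
    """Center frequency of STFT bin k for sampling rate fs and FFT length n_fft."""
    return k * fs / n_fft


rng = np.random.default_rng(0)
heart = rng.standard_normal(8000)        # stand-in for a pure heart sound
lung = 3.0 * rng.standard_normal(8000)   # stand-in for a pure lung sound
mixed, gain = mix_at_snr(heart, lung, snr_db=-6)
print(bin_to_hz(300))                    # 1171.875
```

With an 8 kHz sampling rate and a 2048-point frame, neighboring bins are about 3.9 Hz apart, so bin 300 sits at roughly 1172 Hz, in line with the upper limit of about 1170 Hz quoted above.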

B. Latent space analysis of a selected case
In this section, we use a sample mixed sound to detail every step in the PC-DAE system. Fig. 6 shows the overall procedure of the PC-DAE. By comparing Fig. 6(f) and (g), we observe a peak in the low-frequency part in Fig. 6(g), whereas the peak is located in a higher-frequency part in Fig. 6(f). The results suggest that these two neurons should be clustered into two different groups. We apply the same procedures (trajectory extraction and DFT) to all the neurons in the DAE. The neurons that possess shorter and longer periodicities are clustered into two distinct groups. Finally, given a mixed sound, we first extract the latent representation; to extract heart sounds, we then keep the neurons that correspond to heart sounds and deactivate the neurons that correspond to lung sounds, and vice versa.
To further verify the effectiveness of the PC clustering approach, we compare the DC and PC clustering approaches by qualitatively analyzing the clustering results. To facilitate a clear visual comparison, we adopted principal component analysis (PCA) [60] to reduce the latent representations to 2-D and then drew the scatter plots in Fig. 7.
The figure shows the spectrograms of two mixed heart-lung sounds and the clustering results of the latent representations. By observing Fig. 7(a), (c), and (e), we note that the heart and lung sounds show clearly different time-frequency properties (as shown in Fig. 7(a)). In this case, both the DC (as shown in Fig. 7(c)) and PC (as shown in Fig. 7(e)) clustering approaches can effectively group the latent features corresponding to lung and heart sounds into two distinct groups. Consequently, satisfactory separation results can be achieved by both the DC and PC approaches. Next, by observing the results of Fig. 7(b), (d), and (f), since the stridor sound is highly overlapped with the heart sound (as shown in Fig. 7(b)), the DC clustering approach (as shown in Fig. 7(d)) cannot effectively group the latent representations into two distinct groups. On the other hand, the PC clustering approach (as shown in Fig. 7(f)) can successfully cluster the latent representations into two distinct groups and consequently yield better separation results.
Please note that any particular time-frequency representation method can be used to perform MFA. The present study adopts the DFT as a representative method. Other time-frequency representation methods, such as the CWT [29][30][31][61] and the Hilbert-Huang transform [62][63][64], can also be used. When using these methods, suitable basis functions or prior knowledge need to be carefully considered. In this study, we focus on the DFT and will further explore other time-frequency representation methods in the future.

C. Quantitative evaluation based on source separation evaluation metrics
Next, we compare the separation performance obtained using Eqs. (9) and (10) with that obtained using Eqs. (13) and (14). The results are shown in Fig. 8. Since Eqs. (9) and (10) directly estimate the heart and lung sounds, the results using Eqs. (9) and (10) are termed "Direct". On the other hand, because Eqs. (13) and (14) estimate the heart and lung sounds by a ratio mask function, the results are termed "Mask". We tested the performance using both PC-DAE(F) and PC-DAE(C). From the results in Fig. 8, we observe that the results of "Mask" consistently outperform those of "Direct", except for the heart sound SIR of PC-DAE(F), which confirms the effectiveness of using a ratio mask function to perform separation instead of direct estimation. In the following discussion, we only report the PC-DAE separation results using the ratio mask functions of Eqs. (13) and (14). Tables 1 and 2 show the evaluation results of heart and lung sounds, respectively, for the proposed PC-DAE(F) and PC-DAE(C) and the comparative methods. The separation performance is consistent for heart and lung sounds. From the two tables, we observe that the SDR, SIR, and SAR scores mostly increase with increasing SNR levels. Meanwhile, we note that PC-NMF outperforms DC-NMF, and PC-DAE(C) outperforms DC-DAE(C), confirming that the periodicity property provides better separation performance than direct clustering. We also observe that the deep-learning-based approaches, namely DC-DAE(C) and PC-DAE(C), outperform their NMF-based counterparts, namely DC-NMF and PC-NMF, verifying the effectiveness of deep learning models over shallow models in extracting representative features. Finally, we observe that PC-DAE(C) outperforms PC-DAE(F), suggesting that the convolutional architecture can yield better performance than the fully connected architecture for this sound separation task.

D. Qualitative comparison based on separated waveforms and spectrograms
In addition to the quantitative comparison, we also present waveforms and spectrograms of a sample sound to visually compare the separation results. We selected a sample sound in which the heart sound (treated as the signal) and the wheezing lung sound (treated as the noise) were mixed at an SNR of 6 dB. Fig. 9 shows the waveforms of the sample sound, where Fig. 9(a) shows the mixed sound. Fig. 9(b) shows the pure heart sound (left panel) and lung sound (right panel) before mixing. Fig. 9(c), (d), (e), (f), and (g) show the separated results of DC-NMF, PC-NMF, DC-DAE(C), PC-DAE(F), and PC-DAE(C), respectively. From Fig. 9, we observe that PC-DAE(C) can separate the heart and lung sounds more effectively than the other methods; the trends are consistent with those shown in Tables 1 and 2.
Next, in Fig. 10, we show the spectrograms of the same sample sound shown in Fig. 9. Fig. 10(a) presents the mixed sound, Fig. 10(b) shows the pure heart and lung sounds, and Fig. 10(c) to (g) are the separated results. From Fig. 10(a), we observe that the two sounds are highly overlapped in the lower frequency region. We also notice that PC-NMF suppresses interference better in the high-frequency region of the lung sounds, whereas PC-DAE(F) performs better in the overlapped frequency band and yields improved heart sound quality. PC-DAE(F) and PC-DAE(C) performed the best with minimal artificial noise. Generally speaking, the two PC-DAE approaches outperformed the other approaches, yielding clear separation spectrograms.

E. Real application in first heart sound (S1) and second heart sound (S2) recognition
We used another dataset to further evaluate the proposed algorithm in a more realistic scenario. Real mixed heart-lung sounds were collected from National Taiwan Hospital, and the proposed PC-DAE was used to separate the heart and lung sounds. Because it is not possible to access the pure heart and lung sounds corresponding to the mixed heart-lung sounds, the SDR, SIR, and SAR scores cannot be used as the evaluation metrics in this task. Instead, we adopted the recognition accuracies of the first heart sound (S1) and second heart sound (S2) to determine the separation performance. We adopted a well-known S1 and S2 recognition algorithm from [10,65], which considers frequency properties and the assumption of S1-S2 and S2-S1 intervals. We believe that this alternative metric is convincing and valuable since the S1-S2 recognition accuracy has already been used as a crucial index for doctors to diagnose the occurrence of diseases [66,67].
This dataset includes 3 different age groups, namely 0-20 (childhood and adolescence), 21-65 (adulthood), and over 66 (senior citizens). Each group has 6 cases, including 3 males and 3 females, and each case has 7 mixed heart-lung sounds (10 sec each). Based on this design, we can determine whether the proposed approach is robust against variations in age and gender (accordingly covering people with different physiological factors, such as blood pressure and heart rate). Table 3 shows the recognition accuracies before and after performing heart-lung sound separation.
To visually investigate the S1-S2 recognition performance, we present the waveforms along with the recognition results in

V. CONCLUSION
The proposed PC-DAE is derived based on the periodicity properties of the signal to perform blind source separation in a single-channel recording scenario. Different from the conventional supervised source separation approach, PC-DAE does not require supervised training data. To the best of our knowledge, the proposed PC-DAE is the first work that combines the advantages of deep-learning-based feature representations and the periodicity property to carry out heart-lung sound separation. The results of this study indicate that a periodic analysis algorithm is effective in improving the separation of sounds with overlapped frequency bands. The results also show that PC-DAE provides satisfactory separation results and achieves superior quality compared to several related works. Moreover, we verified that by using the proposed PC-DAE as a pre-processing step, the heart sound recognition accuracies can be considerably improved. In our current work, the number of sources in the signal must be defined in advance. However, in most cases, determining the exact number of sources is difficult. Hence, identifying an effective way to determine the number of sources is important future work. In the present study, we consider the condition where only sounds recorded by an electronic stethoscope are available. We believe that this experimental setup is close to most real-world clinical scenarios. In the future, we will extend the proposed PC-DAE to conditions where additional physiological data are available, such as ECG, photoplethysmogram, and blood pressure signals.

Fig. 2. Relation between hidden layers in a fully connected layer, convolutional layer, and deconvolutional layer.

Fig. 3. (a) Convolutional and (b) deconvolutional operations.

III. THE PROPOSED METHOD
The proposed PC-DAE is a DAE-based unsupervised sound source separation method. When performing separation, the recorded sounds are first transformed into spectral and phase parts via the short-time Fourier transform (STFT). The spectral features are converted to the log power spectrum (LPS) [52], where $X = [x_1, \ldots, x_n, \ldots, x_N]$ denotes the input, and $N$ is the number of frames of $X$. Then the DAE encodes the mixed heart-lung LPS by $E(\cdot)$ to convert $X$ into the matrix of latent representations. The decoder, $D(\cdot)$, then reconstructs the latent representations back to the original spectral features. The back-propagation algorithm [54] is adopted to train the DAE parameters to minimize the MSE scores. Because the input and output are the same, the DAE can be trained in an unsupervised manner.
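As a rough illustration of the feature extraction described above, the following sketch computes the LPS and phase of a waveform with a plain NumPy STFT, and defines the MSE used as the reconstruction objective. The frame length, hop size, sampling rate, and the synthetic test signal are assumptions chosen for the example; they are not the settings used in this study, and the DAE itself is not implemented here.

```python
import numpy as np

def stft_lps(signal, frame_len=256, hop=128, eps=1e-10):
    """Return (LPS, phase), each shaped (freq_bins, n_frames).

    frame_len, hop, and eps are illustrative assumptions.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the waveform into overlapping, windowed frames (one per column).
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)],
        axis=1,
    )
    spectrum = np.fft.rfft(frames, axis=0)
    lps = np.log(np.abs(spectrum) ** 2 + eps)  # log power spectrum
    phase = np.angle(spectrum)                 # kept for waveform resynthesis
    return lps, phase

def reconstruction_mse(x, x_hat):
    """Mean squared error between input features and their reconstruction."""
    return float(np.mean((x - x_hat) ** 2))

# Synthetic stand-in for a recorded chest sound (hypothetical values).
fs = 8000
t = np.arange(2 * fs) / fs
mixed = np.sin(2 * np.pi * 60 * t) + 0.5 * np.sin(2 * np.pi * 300 * t)

X, phase = stft_lps(mixed)
print(X.shape)
```

In an autoencoder setting, `reconstruction_mse(X, D(E(X)))` would be the quantity minimized by back-propagation, with `X` serving as both input and target.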
Fig. 6(a) and (b) show the spectrograms of pure heart and lung sounds, respectively. Fig. 6(c) shows the latent representation extraction process. For demonstration purposes, we selected two specific neurons, one corresponding to heart sounds and the other corresponding to lung sounds, and plotted their trajectories along the time axis in Fig. 6(d) and (e), respectively. Examining Fig. 6(d) and (e), we first observe that their periodicity properties align well with those of Fig. 6(a) and (b), respectively. We also observe that the two neurons follow different trajectories, and that the periodicity of the heart sound differs from that of the lung sound. Next, we applied the DFT to the trajectories of Fig. 6(d) and (e) and obtained Fig. 6(f) and (g), respectively, to capture the periodicity more explicitly. Notably, the x-axis of Fig. 6(a), (b), (d), and (e) is time (s), while the x-axis of Fig. 6(f) and (g) is frequency (Hz). In temporal signal analysis, the signals in Fig. 6(f) and (g) are termed the MFA [59] of Fig. 6(d) and (e). By converting the trajectory into the modulation domain, the periodicity can be observed more easily.
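The DFT step above can be sketched as follows: the activation trajectory of one latent neuron is treated as a time series sampled at the frame rate, and the magnitude of its DFT gives a modulation spectrum whose peak indicates the dominant periodicity. The frame rate and the two synthetic trajectories (a faster "heart-like" one and a slower "lung-like" one) are hypothetical values for illustration only.

```python
import numpy as np

def modulation_spectrum(trajectory, frame_rate):
    """Magnitude DFT of one latent-neuron trajectory along the time axis."""
    centered = trajectory - np.mean(trajectory)  # remove the DC offset
    magnitude = np.abs(np.fft.rfft(centered))
    freqs = np.fft.rfftfreq(len(trajectory), d=1.0 / frame_rate)
    return freqs, magnitude

def dominant_modulation_frequency(trajectory, frame_rate):
    """Frequency (Hz) at which the modulation spectrum peaks."""
    freqs, magnitude = modulation_spectrum(trajectory, frame_rate)
    return float(freqs[np.argmax(magnitude)])

# Hypothetical setup: 50 latent frames per second, 20 s of activations.
frame_rate = 50.0
t = np.arange(1000) / frame_rate
heart_like = 1.0 + np.cos(2 * np.pi * 1.2 * t)  # assumed 1.2 Hz periodicity
lung_like = 1.0 + np.cos(2 * np.pi * 0.3 * t)   # assumed 0.3 Hz periodicity

heart_peak = dominant_modulation_frequency(heart_like, frame_rate)
lung_peak = dominant_modulation_frequency(lung_like, frame_rate)
print(heart_peak, lung_peak)
```

Two neurons whose trajectories look broadly similar in time can therefore be told apart by where their modulation spectra peak.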

Fig. 6. Analyses of latent representations of sample sounds. (a) and (b) are the spectrograms of the pure heart and lung sounds, respectively, where the x-axis is time (s) and the y-axis is frequency (Hz); (c) presents the latent representation extraction based on the DAE model; (d) and (e) are trajectories of two latent neurons, where the x-axis is time and the y-axis is the activation value.

Fig. 7. Spectrograms of two mixed heart-lung sounds and the clustering results of latent representations.(a) and (b) are the spectrograms of two mixed heart and lung sounds; (c) and (d) are the DC clustering results of the latent representation; (e) and (f) are the PC clustering results of the latent representation.

Fig. 8. Average separation results over different SNR conditions.(a) and (c) show the heart sound separation results using PC-DAE(C) and PC-DAE(F), respectively; (b) and (d) show the lung sound separation results using PC-DAE(C) and PC-DAE(F), respectively.

Fig. 9. The waveform of a mixed sample. The y-axis is the amplitude of the signals, and the x-axis is the time index (s). From (b) to (g), the left and right panels are heart sound and lung sound, respectively.

Fig. 11. The waveforms of two sound samples and the corresponding S1-S2 recognition results. (a) A mixed heart-lung sound with normal heart sound and normal lung sound. (b) A mixed heart-lung sound with abnormal heart sound and abnormal lung sound. (c) and (d) are the separated results corresponding to (a) and (b), respectively. The recognized S1 and S2 results are marked with green and red symbols, respectively.

Fig. 11(a) and (b) are two sound samples, where Fig. 11(a) is a mixed heart-lung sound with normal heart and lung sounds, and Fig. 11(b) is a mixed heart-lung sound with abnormal heart sound (weak periodicity) and abnormal lung sound (rhonchi). Fig. 11(c) and (d) show the S1-S2 recognition results after performing heart-lung sound separation, corresponding to Fig. 11(a) and (b), respectively. From Fig. 11(a) and (b), we note that the S1-S2 recognition results are poor for the mixed sounds. The recognition performance is notably improved with the separated heart sounds, as can be seen from Fig. 11(c) and (d), confirming the capability of the proposed PC-DAE to separate heart sounds from mixed sounds.
$\|\cdot\|_1$ represents the L1-norm, and $\|\cdot\|_F$ represents the Frobenius distance. On the basis of the encoding matrix, each clustering result is determined by the largest score and takes a label in {heart, lung}. According to the assigned clustering results, the latent representation is separated into a heart part and a lung part by deactivating the submatrices that do not belong to the respective target.
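The assign-then-deactivate step described above can be sketched as follows. The sketch assumes that a score matrix with one row per latent unit and one column per source (heart, lung) is already available; how those scores are obtained is not reproduced here. The function name, array shapes, and example values are hypothetical.

```python
import numpy as np

def split_latent(latent, scores):
    """Split a latent matrix into heart and lung parts.

    latent: (n_units, n_frames) activations (assumed layout).
    scores: (n_units, 2) clustering scores; column 0 = heart, column 1 = lung.
    """
    labels = np.argmax(scores, axis=1)   # largest score decides the cluster
    heart_mask = (labels == 0)[:, None]  # broadcast the mask over frames
    # Deactivate (zero out) the units that do not belong to the target source.
    latent_heart = np.where(heart_mask, latent, 0.0)
    latent_lung = np.where(~heart_mask, latent, 0.0)
    return labels, latent_heart, latent_lung

# Hypothetical example: 4 latent units, 3 frames.
latent = np.arange(12, dtype=float).reshape(4, 3)
scores = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]])

labels, latent_heart, latent_lung = split_latent(latent, scores)
print(labels)
```

Because every unit is assigned to exactly one source, the two masked matrices sum back to the original latent matrix; each can then be passed through the decoder to obtain the corresponding separated spectral features.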

Table 1
Evaluation results of separated heart sounds generated by the proposed PC-DAE(F) and PC-DAE(C) compared with three conventional approaches in terms of SDR, SIR, and SAR. Avg denotes the average scores over five SNRs.

Table 2
Evaluation results of separated lung sounds generated by the proposed PC-DAE(F) and PC-DAE(C) compared with three conventional approaches in terms of SDR, SIR, and SAR. Avg denotes the average scores over five SNRs.

Table 3
Recognition accuracies of mixed heart-lung sounds and separated heart sounds with different age and gender groups.