Separation of the Aortic and Pulmonary Components of the Second Heart Sound via Alternating Optimization

An algorithm for blind source separation (BSS) of the second heart sound (S2) into aortic and pulmonary components is proposed. It recovers aortic (A2) and pulmonary (P2) waveforms, as well as their relative delays, by solving an alternating optimization problem on the set of S2 sounds, without the use of auxiliary ECG or respiration phase measurement data. This unsupervised and data-driven approach assumes that the A2 and P2 components maintain the same waveform across heartbeats and that the relative delay between onset of the components varies according to respiration phase. The proposed approach is applied to synthetic heart sounds and to real-world heart sounds from 43 patients. It improves over two state-of-the-art BSS approaches by 10% normalized root mean-squared error in the reconstruction of aortic and pulmonary components using synthetic heart sounds, demonstrates robustness to noise, and recovery of splitting delays. The detection of pulmonary hypertension (PH) in a Brazilian population is demonstrated by training a classifier on three scalar features from the recovered A2 and P2 waveforms, and this yields an auROC of 0.76.


I. INTRODUCTION A. MOTIVATION
Cardiovascular diseases are the leading cause of death worldwide [1].Among the different cardiovascular diseases, those associated with pulmonary hypertension (PH) represent a significant burden for healthcare systems because of difficulties in the early recognition of their symptoms as well as the complexity and cost of the corresponding treatments [2].
The associate editor coordinating the review of this manuscript and approving it for publication was Kin Fong Lei .
Direct measurement of pulmonary artery pressure (PAP) enables the diagnosis of PH and heart failure, but a non-invasive gold standard for direct measurement does not currently exist.Direct measurement via Swan-Ganz right heart catheterization or a CardioMEMS sensor implant [3] is invasive, expensive, and only suitable for patients with a high probability of PH.Recent research efforts for non-invasive estimates of pulmonary pressures use Doppler echocardiography.While echocardiography is relatively low cost and has minimal risk, it presents significant limitations for PAP estimation.Echocardiography requires specialized equipment and a trained clinician.PAP estimates cannot be obtained via echocardiography in approximately 50% of patients with normal PAP, nor in 10-20% of patients with elevated PAP, nor in patients with absence of tricuspid regurgitation [4].Furthermore, Doppler-based PAP estimates present an average error of 30% when compared with the ''gold standard'' offered by right heart catheterization [5].Moreover, patients receive Doppler PAP estimates when they have been referred to echocardiography and these patients typically already present significant symptoms of PH.Cardiac auscultation with a digital stethoscope offers a more timely way to perform PH screening, such as in pointof-care settings where PH exists but is asymptomatic.Digital cardiac auscultation is lower cost test with minimal training requirements and wide applicability.
The second heart sound (S2) in particular can be analyzed to estimate the PAP.The S2 sound is formed by the vibrations that occur as a result of the aortic and pulmonary valve closures, and the subsequent vibrations that occur in the cusps and in the walls and blood columns of the vessels and ventricles.The S2 sound can be modeled as a summation of A2 and P2 components.The A2 component represents sound emitted by the aortic valve closure, and the P2 component represents the pulmonary valve closure.In medical literature, the audible delay between the onset of the A2 and P2 components is known as splitting [6].The delay increases during inspiration and decreases during exhalation due to respiration's effect on thoracic pressure [7], and analysis of splitting is clinically relevant for diagnosis of diseases and heart conditions including hypertension.The manual detection of PH via traditional cardiac auscultation is highly dependent on the expertise and subjective evaluation of the operator, where the tasks are to first recognize subtle differences in amplitude and clicks of the A2 and P2 components, and then decide if the spacing and amplitude is out of normal range.Fig. 1 visualizes an example human S2 sound.
Evidence from the literature suggests that accurate estimation of the PAP can benefit by computer methods to separate the A2 and P2 components of the S2 sound [8].From a signal processing perspective, the separation of A2 and P2 components is challenging due to the overlap of signals in the time and frequency domain, and due to the difficulty in attributing the observed sound to the heart valve that caused it.Moreover, standard methods for blind source separation, such as independent component analysis (ICA) [9] and morphological component analysis (MCA) [10], impose an assumption of independence between components that does not model the conditional dependence of the A2 and P2 components on the splitting delay caused by respiration.
In this work, we propose a novel algorithm for the separation of the A2 and P2 components of S2 heart sounds.The main outputs of the algorithm, visualized in Fig. 1, are the A2 and P2 waveforms and the splitting delay.The main contributions of the paper are: 1) The proposal of a novel, non-parametric and unsupervised S2 separation algorithm that uses alternating optimization to separate a collection of S2 sounds into a proposed unshifted A2 waveform and unshifted P2 waveform (see Fig. 1b), and a set of splitting delays and corresponding shifted waveforms (see Fig. 1c); 2) The analysis of the proposed S2 separation algorithm and its comparison with the state-of-the-art separation approaches on a set of synthetically generated heart sounds; 3) The application of the proposed separation algorithm to real-world cardiac auscultation data to detect PH and extract relevant features.The remainder of this paper is organized as follows.Related works are considered in Section I-B.Section II contains the description of the proposed approach for S2 source separation.Section III empirically validates the separation algorithm on synthetic signals.Section IV applies the proposed method to real-world heart sounds for pulmonary hypertension detection.Finally, the obtained results are discussed in Section V and conclusions are drawn in Section VI.

B. RELATED WORK
We identify four different classes of approaches that have appeared in the literature in recent years to separate S2 sounds into their components: (1) standard BSS algorithms that do not account for specific characteristics of the S2 sound; (2) ad hoc, predetermined waveforms; (3) signal decompositions; and (4) methods based on splitting statistics/dynamics.
A first class of approaches uses standard blind source separation (BSS) algorithms to detect A2 and P2 components of S2 sounds [11], [12].ICA [9] assumes that the components constituting the observed signal are statistically independent and synchronized in the time domain.These two assumptions do not fit well with the underlying generation of the S2 sound.For instance, the onset of P2 can be inferred through its conditional dependence on the onset of A2 and the respiratory phase.A general mathematical model of shifted ICA [13] can model splitting statistics.In this work we leverage some tools from shifted ICA to simultaneously capture the A2 and P2 waveforms and their delays.Explicit separation of the A2 and P2 components with BSS is an open problem.
A second class of approaches models the A2 and P2 components via an ad hoc pre-determined waveform generating function, whose parameters are fitted to the observed S2 data [4], [14], [15], [16], [17], [18], [19].For instance, transient nonlinear chirp models were developed as a result of analysis of isolated A2 and P2 components extracted from heart sounds recorded in pigs [4], [14], [20].Information regarding the concentration of the components around their instantaneous frequencies was obtained from the visual inspection of the signal's Wigner-Ville distribution.A fully automatic approach to separating the S2 sound into its components considered Gaussian chirplets [15] fitted to the observed S2 signal using a statistically optimal null filter [16].Similarly, [17] uses the continuous wavelet transform to decompose the observed S2 sounds using a nonlinear transient chirp as a mother wavelet.The A2 and P2 components have also been modeled with Gabor wavelets [19].More recently, the A2 and P2 components have been modeled via Gaussian windowed sinusoids, whose parameters (amplitude, location, and width of the Gaussian windows as well as sinusoids frequency and initial phase) are obtained by solving a least-squares optimization problem [18].By design, the methods in this class of approaches are limited by the need to define a hand-crafted waveform or waveform generating function.Such ad hoc approaches are not flexible enough to model the natural variation of S2 sounds.
A third class of approaches uses signal decomposition as a way of retrieving A2 and P2 components [21], [22], [23], [24].A Fourier transform of manually cropped regions of the S2 was used to analyze A2 and P2 components [25].The continuous and discrete wavelet transforms have been used to visually represent A2 and P2 components and manually find their relative delay by visual inspection [21], [22].The Hilbert transform has also been utilized, along with a wavelet transform, for the analysis of the A2 and P2 splitting delay [24].A slightly more automated method for detecting peaks of the A2 and P2 components and their relative delay uses the Hilbert vibration decomposition (HVD) followed by an analysis of the smoothed pseudo Wigner-Ville distribution [23].However, this work still uses visual inspection to choose a number of components retained from the HVD.
A fourth class of approaches has recently emerged to provide fully automatic separation of S2 into A2 and P2 components, via analysis of splitting delays due to respiration.In particular, the work by Tang et al. [26] exploits an assumption that, across many heartbeats, the A2 component occupies the same position in the S2, whereas P2 component experiences a relative shift according to the respiration phase.Therefore, the A2 components are estimated by averaging the collected S2 sounds and the P2 components are obtained by subtracting the derived A2 component from the S2 sound.Secondly, the assumption that A2 is obtained by averaging is a restrictive assumption that makes the algorithm sensitive to alignment of S2 sounds.Other work attempts to recover the A2 and P2 components using a joint multivariate Gaussian mixture model (GMM) from S2 sounds [27].However, this approach is a supervised machine learning approach that requires the availability of ground truth annotated A2 and P2 components.An annotated dataset of A2 and P2 sounds is difficult or impossible to collect on human subjects or in standard clinical practice because it is not clear how exactly to separate the superimposed A2 and P2 components, and the model was not tested on real-world data.
The proposed blind source separation algorithm overcomes shortcomings of the surveyed literature by extending previous preliminary results presented in [28] to separate the A2 and P2 components of the S2.While ad hoc pre-determined waveform restrictions limit the utility of some of the approaches previously presented in the literature, the proposed method does not pre-define the waveform.Some approaches study splitting statistics to obtain A2 and P2, but they require visual inspection and manual human intervention.In contrast, the proposed method is fully automated.Some of the fully automated approaches suffer from the lack of available data; the proposed method is unsupervised and requires only S2 heart sound data.

II. PROPOSED METHOD
The proposed algorithm recovers the aortic (A2) and pulmonary (P2) components as well as their relative delays from a set of S2 heart sounds.The overall approach and its underlying modeling assumptions are discussed in detail.Fig. 2 visually summarizes the approach applied to Initialize P2 component estimates (See Eq. 4, 5, and 6)  real-world data, and Algo. 1 presents the proposed method as an algorithm.

A. SIGNAL MODEL AND SETUP
We are presented with the observation of a series of M S2 sounds obtained from a single heart sound recording of a given patient.It is assumed that suitable pre-processing has been applied to first extract and align the S2 sounds in the heart sound audio recording.An example preprocessing approach is described in Section IV-D.The observed S2 signals are assumed to be sampled with period T s .Moreover, N samples are collected for each S2 sound, where N depends on the number of heartbeats in the recording.For example, Fig. 1 depicts M = 5 recorded S2 sounds, each of N = 200 samples, and sampling period of T s = 1 millisecond.Assumptions and Motivation: It is assumed that each observed S2 sound can be described as a sum of recovered A2 and P2 components, or informally s(•) = a(•) + p(•).As a blind method, it is assumed that no prior information is disclosed about the shape of the waveforms a(•) and p(•).It is also assumed that the stethoscope and subject do not move relative to each other during recording and that the volume of the S2 signal is approximately unchanged during recording.The model used to recover A2 and P2 components is motivated by the observation that respiratory activity modulates the morphology of S2 sounds through changes in the pleural pressure and the pulmonary blood flow.Namely, at late inspiration and early expiration, the relative time distance (i.e., the split) between the A2 and P2 components is increased due to the elevated pressure in the right ventricle, whereas the morphology of different S2 sounds is kept approximately constant during apnea1 [7].For this reason, we assume that the A2 and P2 components in different S2 sounds of the same recording, respectively, have the same shape but different delays.The observed S2 sounds are modeled as follows: where m = 1, . . ., M indexes the heart sounds of the patient, t = 0, . . ., N − 1 indexes samples of the signal, and the τ •,m terms identify delays (offsets) from the start of the S2 sound.
Objective: The objective of the considered source separation method is to recover the waveforms a(•) and p(•) and the delays τ A,m and τ P,m from the observation of the S2 signal s m (tT s ), for all m = 1, . . ., M observed S2 sounds.
The method proposed for separating the A2 and the P2 components of the recorded S2 sounds formulates a least-squares optimization problem whose solution is approximated via an alternating iterative approach.The considered optimization problem is expressed as follows: which aims to minimize the mean-squared error (MSE) between the S2 observations and the superposition of the separated A2 and P2 components.This objective assumes that the A2 and P2 sounds in different heartbeats have the same respective shapes but different delays.
After an initialization step, the optimal solution (2) will be approximated by an alternating iterative algorithm which alternates between finding the sets of delays { τA,m }, { τP,m } that minimize the MSE with a given pair of waveforms â(•), p(•), and alternately finding the waveforms â(•), p(•) that minimize the MSE for given sets of delays { τA,m }, { τP,m }.As it will be shown in the remainder of this section, the proposed alternating optimization approach simplifies the problem in (2) and outputs an inferred shape of the waveforms associated with the A2 and P2 components as well as their corresponding delays in each heartbeat.

B. INITIALIZATION
The initial estimates of waveforms and delays associated to A2 and P2 components are obtained by applying the method described in [26], which computes an estimate of the A2 component by simply averaging the observed S2 sounds.This method assumes the A2 components appear with the same delay in each observed S2 sound, while the contributions of the corresponding P2 components have different delays in each S2 sound, and are zeroed out by averaging.Therefore, we can write the initial estimate â(0) (tT s ) as: where we have introduced the use of the superscript (•) (i)  to denote the estimate at the i-th iteration of the algorithm.
According to the model used in [26], the delays associated to the A2 components are the same in all S2 sounds.Therefore, without loss of generality, we can assume in our initialization that τ (0) A,m = 0 for all m = 1, . . ., M .The initial values p(0) (•) and τ (0) P,m are obtained by a heuristic approach in three steps.First, an unaligned P2 representation is obtained by subtraction in Eq. ( 4) for all M delays.Second, delays τ (0) P,m are obtained in Eq. ( 5) by finding the position in time of the maximum cross correlation between the P2 representations of the first heartbeat (m = 1) and each of the M heartbeats.Third, the P2 waveform is obtained in Eq. ( 6) by averaging an aligned P2 representation: where ⊗ is the cross correlation operator, and where Eq. ( 4) and Eq. ( 5) are evaluated for all m ∈ {1, . . ., M }.

C. DELAY ESTIMATION
Given the estimates â(i) (•), p(i) (•) of the A2 and P2 waveforms obtained from the i-th iteration of the algorithm, the corresponding delay estimates { τ (i) A,m }, { τ (i) P,m } are obtained as the solution of the following problem: where optimization can be performed independently for the different indexes m.In addition, it is possible to express an optimization problem equivalent to (7) in the frequency domain, by leveraging Parseval's identity, which says that the integral of the square of a function in the time domain equals the sum of squares of the function's Fourier coefficients.We denote with S m (fF s ), Â(i) (fF s ), and P(i) (fF s ) the discrete Fourier transform (DFT) of time-domain signals s m (tT s ), â(i) (tT s ), and p(i) (tT s ), respectively, where F s = 1 N DFT T s and N DFT ≥ N is the number of data points used to compute the DFT.Then, the frequency-domain delay optimization problem for a given index m can be written as: where the gradient with respect to [ τ (i) A,m , τ (i) P,m ] can be expressed in closed form.Based on this observation, the delay estimates { τ (i) A,m }, { τ (i) P,m } are obtained by approximating the solution of (10) via projected gradient descent.At each gradient descent iteration, the delays are projected to a physiologically feasible range.The time split between A2 and P2 components usually ranges between 10 ms and 50 ms [26], and we therefore constrain delay values in the interval [−100, 100] ms.

Given the current delay estimates τ (i)
A,m , τ (i) P,m , and the observations s m (tT s ), the signal waveforms associated to the (i + 1)-th iteration, â(i+1) (•), p(i+1) (•) are obtained again by minimizing the MSE, that is, they are given by the solution the following optimization problem: Assuming that the admissible delay estimates are always obtained as multiples of the sampling period T s , rather than interpolated between samples, the problem in (11) can be cast as a standard linear least-squares problem, which is overdetermined when M > 2. The problem can be expressed in matrix form as follows: where the column vector s ∈ R M •N concatenates all N samples of all M observed S2 sound signals of a given recorded signal, the column vector a ∈ R N contains the samples of the estimated signal a(•), and the column vector p ∈ R N contains the samples of the estimated signal p(•).
The binary ''shift'' matrix X (i) ∈ {0, 1} M •N ×2N is formed by M × 2 blocks of dimension N × N , each containing a ''shifted version'' of the N × N identity matrix, where the values of such shifts depend on the delays τ (i) A,m and τ (i) P,m .
Namely, we can write where j,m ∈ {0, 1} N ×N for j = {A, P} and the element in the k-th row and ℓ-th column of the matrix T (i) j,m is given by for j = {A, P} and m = 0, . . ., M − 1, where δ k,ℓ is the Kronecker delta symbol.When the matrix X (i) has full column rank, the corresponding linear least-squares problem (12) has a unique solution which is given by where (•) † represents the Moore-Penrose pseudoinverse.
After I steps, the obtained unshifted waveforms a I and p I , such as those shown in Fig. 1b, are subsequently obtained in shifted form in the next sub-section.

E. TERMINATION
The algorithm is terminated after a fixed number I of iterations of alternating delay and waveform estimation processes.After the last iteration, the recovered waveforms a I and p I are combined with the delays [τ A,m ] and [τ P,m ] to obtain one shifted waveform for each delay: ) In Eq. 17, the recovered shifted P2 waveforms p(tT s − τ P,m ) * are not obtained by simply considering p(I) (tT s − τ (I ) P,m ), but rather from the difference of the shifted A2 from the S2 signal.This choice has precedent in the literature, where existing approaches obtain the P2 via subtraction [4], [26].

III. NUMERICAL RESULTS WITH SYNTHETIC DATA
This set of experiments examines the performance of the proposed algorithm when compared to the unsupervised approach of [26], and the supervised approach of [27] for the separation of synthetic S2 sounds.The three evaluated methods share a similar set of assumptions regarding the variability of A2 and P2 components in different S2 sounds from a same heart sound recording, thus allowing a meaningful and fair comparison.
Subsection III-A evaluates how well the A2 and P2 waveforms are recovered as the difficulty of the separation task increases.Subsection III-B evaluates the robustness to additive white Gaussian noise.Subsection III-C evaluates how well the predicted splitting delays correspond to the true splitting delays from the observed synthetic S2 signals.

Generation of Synthetic S2 signals:
The numerical results shown in this section are obtained by considering synthetic signals generated using the model described in [18].In particular, the S2 sounds are obtained as the superposition of A2 and P2 components modeled via Gaussian windowed sinusoids.The model is parameterized with respect to the frequency and phase of the sinusoidal signals as well as the amplitude, location, and width of the Gaussian window functions.These parameters are sampled from distributions obtained by fitting phonocardiogram (PCG) recordings obtained from a population of 150 subjects with age above 40 that were referred to a coronary computed tomography angiography with a low or intermediate pre-test probability for coronary artery disease (see [18] for details).
Proposed Model Hyperparameters: In all the reported results, the proposed alternating optimization method is run with a fixed number of iterations, I = 10, and the delay optimization step consists of 10 gradient descent iterations with step size γ = 10 −12 applied to frequency-domain signals computed over N DFT = 2048 samples.The method in [27] is trained using 100 randomly generated heart sounds, each containing 40 heartbeats.The separation results for the three considered methods are averaged over 200 randomly generated heart sound recordings, each containing 40 heartbeats.

A. SEPARATION ACCURACY IN THE PRESENCE OF BIAS
A series of tests of increasing difficulty are conducted to evaluate how well the synthetic A2 and P2 waveforms are recovered from observed S2 signals.In the real world, some patients can have wider splitting delays between A2 and P2 components than others [6], [7].When the splitting delay is small, the A2 and P2 components overlap and the separation problem is more difficult.To better understand how the models might be biased by real-world variability across patients, it is reasonable to test recovery when the splitting delay is smaller versus when it is larger.Secondly, within any given patient, inhalation typically increases the splitting delay and exhalation decreases the delay.We assume that, over a set of S2 sounds, the delays of S2 and P2 components, respectively, are uniformly distributed.To simultaneously capture statistics of inhalation within any given patient's S2 sound recording and expose bias to the natural variation of splitting found across different patients, a maximum delay max ∈ {10, 15, 20, . . ., 80} ms between the A2 and P2 components is introduced.For each maximum splitting delay max , a dataset of S2 sounds is generated by sampling splits from the uniform distribution U(0, max ) ms.These tests are intended to capture splitting characteristics across patients, while assuming uniformly distributed splits within any patient.
The empirical measurement describing how well the A2 and P2 components are recovered is determined by normalized root mean-squared error (NRMSE), defined as where a(•) and p(•) are the ground truth synthesized waveforms and a(•) * and p(•) * the predicted waveforms, and τ A , τ P , τ * A and τ * P are each vectors of length M containing the respective ground truth or predicted delay values.Note that τ * A is a zeros vector.Fig. 3 visualizes the NRMSE values of the proposed separation algorithm and the methods in [26] and [27] under varying max .As expected, all methods show that NRMSE error increases as the separation task is made harder by greater overlap of the A2 and P2 components.The proposed approach recovers the synthetic A2 and P2 waveforms more precisely than the methods in [26] and [27] for all considered A2-P2 splits.Even when then the signals are largely overlapping in the time domain, where the separation problem is designed to be more difficult, the proposed method reconstructs waveforms with the lowest error (e.g. when max = 20 ms, NRMSE is less than 0.2).The improvement over the related methods can be explained by observing that a key assumption in these baseline methods is relaxed in this work.Namely, for small values of max , the assumption that expected value of the P2 is approximately a flat signal is too restrictive.For instance, Eq. ( 19) is explicitly defined in [26] in order for their method to obtain the A2 component by averaging the S2; our approach initializes A2 by The proposed method is better for SNR > 16 dB.Error increases as noise makes the signal increasingly random.The proposed unsupervised method can be outperformed by the supervised method [27] after significant noise corruption, and outperforms [26].
averaging, but then modifies it with alternating optimization.The delays obtained via the proposed approach increase flexibility and degrees of freedom of accurately recover the A2 and P2 waveforms.

B. ROBUSTNESS TO NOISE
Heart sound recordings are often subject to noise.A series of experiments test robustness to progressively increasing amounts of noise.In particular, when independent additive white Gaussian noise (AWGN) distorts the A2 and P2 components of the synthetically generated S2 sounds at different signal-to-noise ratio (SNR) levels, then the S2 sounds of different heartbeats are less well modeled by shifted versions of the same waveforms.Noise therefore challenges the key modeling assumption that the waveform is static across heartbeats.Fig. 4 reports the NRMSE obtained with the considered separation algorithms for different SNR values and for max = 40 ms.As shown in Fig. 4, the proposed approach outperforms the method in [26] for all considered SNR values greater than zero.The proposed approach does not require the A2 sounds must be perfectly aligned in all the observed S2 recordings, hence its superior performance.The proposed unsupervised method outperforms the supervised GMM-based separation method in [27] for levels of noise below approximately 16 dB, while the supervised approach gives empirically lower separation errors when there is significant noise (SNR values below 16 dB).Note that the considered GMM-based method requires access to separated A2 and P2 components for training, and this data is difficult or impossible to collect in real-world human heart sound data and the technology for practical application of the supervised method does not yet exist.With similar reasoning, continuing research on noise cancellation approaches would improve the observed SNR, 34638 VOLUME 12, 2024 Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.and this serves to benefit the proposed algorithm over other methods.

C. RECOVERY OF A2-P2 SPLIT DELAYS
Time delays between the A2 and P2 components of an S2 sound are clinically relevant, and estimation of delays should be accurate and minimally biased.We compare recovered time delays * m produced by the considered separation algorithms to corresponding ground truth time delays m on a dataset of M = 120000 synthetic heart sounds.
All delays are computed via the same procedure.Given as input the A2 and P2 components of either a ground truth signal or recovered signal, a delay is computed as the time difference between the main peak of the P2 component and that of the A2 component.For each component, its main peak is computed using the homomorphic envelogram [29] of the component [26].A homomorphic envelogram of a signal is a strictly positive representation of the signal that smooths out splits and serrated peaks in the signal, and it is obtained by applying a low pass filter to the natural logarithm of the component and then exponentiating.For instance, the homomorphic envelogram of the P2 component is h P,m (tT s ) = exp(L(ln(p(tT s − τ P,m )))) for a low pass filter L(•).In Eq. ( 20), the main peak is obtained as the weighted sum of the time instants, with weights given by the normalized homomorphic envelogram.Denoting h A,m (tT s ) and h P,m (tT s ) for the homomorphic envelograms of the A2 and P2 components of the m-th recorded S2 sounds, the ground truth A2-P2 split is computed as . Digiscope Collector technology used to record the heart sounds [33] in the considered real-world dataset.
Given a vector of M ground truth delays , and M predicted delays * , the normalized mean squared error is The scatter plots in Fig. 5 compare the ground truth delays of the A2-P2 splits as obtained from the generative signal model to the corresponding estimates obtained from the source separation algorithms.The proposed separation method has the lowest error.It has an NRMSE of 0.05 compared to 0.16 for the algorithm in [27] and 0.26 for that of [26].It also has the lowest R 2 coefficient of determination.

IV. APPLICATION TO REAL-WORLD DATA
We also tested the proposed separation algorithm on realworld data, in order to assess its potential in a real, clinical environment.In this scenario, in the presence of only PCG recordings, ground truth values for the waveforms associated with the A2 and P2 components of S2 sounds are not directly available.In the absence of separated components for training, the supervised GMM-based separation algorithm of [27] cannot be considered for comparison in this section.Moreover, the separation performance of the proposed method is gauged indirectly by evaluating the statistical significance for detecting PH of features extracted from the resulting separated A2 and P2 components.
The features analyzed are chosen to reflect the observation that specific characteristics of the S2 have been linked to the pulmonary blood pressure.In particular, features extracted from separated A2 and P2 components, as the time split, peak-to-peak ratio, energy ratio, etc., are widely recognized as potential indicators of PH [5], [30], [31].In fact, the analysis of the S2 has been studied as a potential way to provide non-invasive estimates of the PAP [4], [32].

A. MATERIALS
The heart sounds used for this study were collected in the Real Hospital Português, Brazil, using a Littmann 3200 electronic stethoscope, and the DigiScope Collector software [33] (Fig. 6).Auscultation was performed over the second left intercostal space, collecting sounds of duration approximately 60 seconds in a quiet environment.All collected heart sounds were collected at 4 kHz and the amplitudes were quantized to 8-bit resolution.The heart sounds were anonymized and analyzed in Portugal with the approval of Real Hospital Português and the Ethics Committees of University of Porto, Portugal (approval reference number 391.391 of September 6th, 2013).
The considered dataset comprises a total of 43 heart sound recordings collected from 43 adult patients, 11 of which were diagnosed with PH, while the remaining 32 were not diagnosed with PH.

B. PRE-PROCESSING
The raw PCG signals were first filtered with high-pass and low-pass Butterworth filters of order two with cut-off frequencies equal to 25 Hz and 400 Hz, respectively, in order to keep the main spectral information associated to both A2 and P2 components while rejecting out-of-band noise.Then, the spike removal algorithm presented in [34] was applied to the filtered signals to address ambient noise, and then each recording was normalized to zero mean and unit variance.

C. SEGMENTATION
The pre-processed PCG signals were segmented to determine the exact localization of S2 sounds using the combined convolutional neural network (CNN) and hidden Markov model (HMM) approach presented in [35].However, further manual correction was applied, due to the presence of segmentation errors induced by variable heart rates and due to the presence of significant systolic murmurs in some of the patients affected by PH.

D. S2 ALIGNMENT
The S2 segments extracted from each PCG recording are first aligned, in order to allow the use of the initialization procedure described in Section II-B and in order to facilitate the extraction of the A2 and P2 components.Alignment is performed in two stages: (i) coarse alignment and (ii) fine alignment.
Coarse alignment is obtained by computing first the homomorphic envelogram [36] of the PCG to determine the time intervals of S2 sounds where the majority of the energy is contained.The homomorphic envelograms of the different S2 sounds extracted from a given PCG are aligned using a cross-correlation analysis.In particular, for each of the S2 sounds, the alignment shift was obtained as the lag index that maximizes the correlation between the first S2 sound of the recording and the given S2 sound within a search window of 200 ms centered around the zero-lag index.
Then, fine alignment is obtained by applying the approach again, and directly to the filtered PCG signal of the coarsely aligned S2 segments.Also in this case, the lag index that maximizes the cross-correlation between the first S2 in the recording and the given S2 sound is searched for within a window of 200 ms centered around the zero-lag index.Moreover, only segments corresponding to a normalized correlation coefficient greater than 0.2 are retained, in order to exclude S2 segments affected by excessive artifacts and noise levels.
Note that the proposed two-step alignment procedure mainly aims to increase robustness against noise by considering the homomorphic envelogram of the S2 sounds in the first step.Fig. 7 reports an example of S2 sounds extracted from a PCG recording of the considered dataset before and after the two-steps alignment procedure.Fig. 1 shows an example of the source separation into the A2 and P2 components.

E. DISCRIMINATING PH
First, the separation of A2 and P2 components is performed by applying the procedure described in Section II or the method described in [26] to the aligned S2 segments of a PCG recording.
Then, features are extracted from the obtained A2 and P2 components in order to discriminate heart sounds recorded from patients with and without PH.In particular, it has been observed that, most commonly, heart sounds of patients with PH present louder P2 components [31], along with larger A2-P2 time splits [4].
The features extracted from the recorded PCG signals are also extracted from the outputs of our proposed separation algorithm, and also the algorithm of [26].They are: • Mean time split: We compute the mean relative time delay between A2 and P2 components from the homomorphic envelope as it was defined in Eq. 20, • Energy ratio: This feature is obtained by computing the average over the S2 sounds contained in a PCG of the ratio between the energy of the P2 component and the energy of the corresponding S2 sounds, i.e., • Peak-to-peak ratio: This feature is obtained by computing the average over the S2 sounds contained in a PCG of the ratio between the peak-to-peak amplitude of the P2 component and of the corresponding S2 sound, i.e., Table 1 demonstrates that features derived from the proposed A2 and P2 components are predictive of PH.The first three rows verify that each of the three features independently are useful for PH via two sample Kolmogorov-Smirnov (KS) statistical significance tests.The KS tests evaluate the null hypothesis that there is no difference, given the feature and dataset, of whether a subject has PH.The KS test supports varying size samples; there are 11 PH negative and 32 PH positive samples in each significance test.The remaining rows evaluate all three features jointly with a predictive model trained to detect PH.The classifier model is a kernel ridge regression classifier with two degree polynomial kernel trained on the three features described in the table, and it was evaluated via leave one out cross validation with metrics including the area under the ROC curve (auROC) and Average Precision (AP).For comparison to the method in [26], the first two steps in Fig. 2 and the features obtained are identical for both methods, and only the obtained A2 and P2 components depend on the method.The results suggest that the proposed model is at least as accurate as the method of [26].The features analyzed verify that the A2 and P2 components obtained by our proposed method contain information useful for discrimination of PH.

V. DISCUSSION
The proposed separation method stems from the observation that heart sounds are modulated by the respiratory activity.This observation has also been made in previous state-ofthe-art approaches in [26] and [27].In particular, inspiration increases displacement between the A2 and P2 components while exhalation decreases displacement.This observation implies that averaging aligned and segmented S2 signals in order to reduce noise may introduce significant distortion to P2 components.In fact, the method in [26] relies on simple averaging to completely remove P2 components from S2 sounds, thus obtaining A2 estimates.
The proposed approach enhances this formulation with a more flexible mathematical framework to recover the A2 and P2 components.It does not require any prior information on the specific shape of such waveforms, and it does not impose the assumption that averaging removes the P2 components.The proposed method is shown to have accurate and robust separation performance, including when aortic and pulmonary components have small delay between them.
The computational complexity of the proposed alternating optimization method is mainly dominated by a pseudoinverse, which has computational complexity of O(MN 3 )2 and a frequency-domain projected gradient descent with complexity O(MN log N ).
Future work: The proposed method achieves automatic separation of previously segmented S2 sounds.Future work should automate the pre-processing steps to denoise, segment and align the S2 signals.A future adaptation of this work could improve robustness to severe noise or could consider an objective that relaxes the assumption during optimization that the waveform generating function is identical across heartbeats.

VI. CONCLUSION
We have presented a novel unsupervised blind source separation algorithm to identify the A2 and P2 components and a set of delays representative of respiration rate from S2 recordings.The proposed approach has the advantage of flexibly modeling aortic and pulmonary components without using predefined waveforms.The reconstruction of the A2 and P2 components is defined in terms of an optimization problem whose solution is approximated via alternating optimization of a least squares objective.
The proposed separation method is shown to provide accurate and robust reconstructions of the aortic and pulmonary components, and the method can outperform previous approaches in the literature that were formulated on similar premises.We report consistent improvement of 10 percent lower normalized root mean-squared error (NRMSE) over the state-of-the-art in the reconstruction of aortic and pulmonary components on a synthetic dataset.Moreover, the proposed approach has higher coefficient of correlation (R 2 ) and lower NRMSE than comparable related works when recovering delays between components on a synthetic dataset.The proposed method allows for the extraction of discriminative features from A2 and P2 components for detection of pulmonary hypertension from real-world heart sounds, and on the evaluated dataset, the features were shown give superior classification performance to a comparable competing method.
The proposed approach needs only segmented S2 heart sounds; it does not require a respiration rate measurement, annotations of the A2 and P2 components, or other auxiliary data.The presented separation method establishes the foundation for a series of improvements that will be the focus of future research.

FIGURE 1 .
FIGURE 1.The novel unsupervised algorithm analyzes S2 sounds (1a) to recover A2 and P2 waveforms (1c, left) and associated time delay (1c, right).The proposed alternating optimization method obtains a single P2 and A2 waveform for all heartbeats (1b), and the shifted P2 in Fig.1c.The shifted P2 is obtained as the difference of the A2 waveform from the recorded S2 sound for each heartbeat.

FIGURE 2 .
FIGURE 2. Application to real-world data: Recovery of A2 and P2 components and splitting statistics via an unsupervised alternating optimization algorithm.Our main contribution is the alternating optimization step.

FIGURE 3 .
FIGURE 3. Recovers synthesized A2 and P2:The proposed algorithm is consistently best at recovery, for all tested max , including when the A2 and P2 components have small delay between them.The x-axis represents S2 sounds with wide or narrow splitting (i.e., large or small max , respectively) between A2 and P2 signals.All models favor wide splitting.

FIGURE 4 .
FIGURE 4. Robust to distortion by noise:The proposed method is better for SNR > 16 dB.Error increases as noise makes the signal increasingly random.The proposed unsupervised method can be outperformed by the supervised method[27] after significant noise corruption, and outperforms[26].

FIGURE 5 .
FIGURE 5. Superior recovery of split delays between the synthesized A2 and P2 components.Compared to approaches[26] (left) and[27] (middle), the proposed method (right) has largest R 2 correlation and lowest error.Each plot visualizes a scatter plot of 120k samples.Red line is ideal.

FIGURE 7 .
FIGURE 7. Example S2 sounds extracted from a PCG recording of the considered real-world dataset before (left) and after (right) the alignment procedure.Periodicity of respiratory phase is visible in the aligned S2 sounds.

TABLE 1 . Meaningful features for PH detection on real world data are
[26]ined via our proposed method.The first three rows of the table verify that the features extracted from proposed A2 and P2 components of the S2 sound are each independently discriminative of PH.The values in these first three rows are p-values corresponding to two sample Kolmogorov-Smirnov tests of PH and not-PH patients.The remaining rows report predictive performance of a PH classifier trained jointly on the three features.Columns compare the algorithm presented in[26]with the proposed method.