Exploring the Impact of Signal Quality Enhancement on Heart Sound Classification Models

Limited cardiology resources increase the urgency for automated heart disease screening for the general public. Heart sound diagnostic models have been recently employed as a cost-effective solution for the initial screening of heart disease. Noise in heart sound recordings, however, can reduce the performance of such data-driven models. Various quality enhancement approaches have been adopted to alleviate the destructive impact of noise on model performance. One approach is universal noise reduction which applies denoising techniques to recordings, irrespective of their noise level. The second approach is targeted noise reduction, which applies denoising solely to recordings deemed to need it, based on an assessment of signal noise level. The third approach is filtering where instead of noise reduction, the quality of recordings is assessed and the signals falling below a minimum threshold of quality are discarded. This study aims to understand which quality enhancement approach leads to a more accurate heart sound classification. We developed multiple data-driven models using different classifiers and feature representations and analyzed the impact of quality enhancement on the accuracy of those models. The results indicate that noise reduction is associated with an overall performance drop in classification models. We observe that both universal and targeted noise reduction have a destructive impact on models’ performance. However, filtering improves the accuracy of the models, in particular, for the clinically important abnormal class. The findings of this study can be leveraged to inform the design decisions for the pre-processing of heart sound recordings and consequently optimize downstream classification performance.


I. INTRODUCTION
Cardiovascular diseases are the leading cause of death in the world. Around 18 million people die of heart disease every year, which accounts for one-third of deaths globally [1].

Most cardiovascular diseases are manageable if diagnosed early [2]. However, about 75% of heart disease-related mortalities occur in regions of the world where access to medical facilities with advanced diagnostic tests is limited [1]. As a result, there is an urgent need for low-cost and pervasive diagnostic systems that can be used for initial heart disease screening.

The associate editor coordinating the review of this manuscript and approving it for publication was Essam A. Rashed.
Heart sound signals contain valuable information that can reveal a whole host of pathologies. While specialized diagnostic systems may not be accessible to the general public, heart sounds can be easily captured using digital stethoscopes or mobile phone devices. Therefore, automatic analysis of heart sounds using data-driven classification systems has been an active area of research in recent years, with the prospect of providing wide-reaching heart diagnostic services at a lower cost. A wide variety of data-driven models have been proposed to distinguish normal from abnormal heart sounds [3], [4], [5], [6], [7], [8]. Classification models have also been developed to identify specific types of heart abnormalities [9], [10], [11], [12]. Such diagnostic models can be used as pre-screening tools for heart disease, especially in situations where access to trained medical professionals is limited.
Heart sound signals are typically captured using digital stethoscopes. Mobile phones have also recently been employed as heart sound acquisition devices [13]. While such devices generally have in-built noise reduction technologies [14], [15], they can still capture a significant amount of noise, especially in uncontrolled environments where ambient noises are typically present. Even in controlled environments, artifacts with an internal source, such as cough or digestive sounds, can interfere with heart sounds. Therefore, capturing noise-free heart sound signals in real-world settings is very difficult, if not impossible. This is less of a problem when clinicians examine a patient, as they can recognize and filter noise [16]. It can, however, pose a significant challenge for automated diagnostic systems.
Noise and disturbances can mask heart sounds and consequently change their morphological characteristics [17], [18]. As a result, they can reduce the accuracy of data-driven classification models [19]. One quality enhancement approach that has been frequently employed to mitigate the negative impact of noise on heart sound signals is universal noise reduction. Universal noise reduction aims to reduce the noise in captured signals by applying denoising techniques to recordings, irrespective of their noise level. Such denoising techniques include a range of methods, from simpler ones like applying low-pass or band-pass filters [3], [4], [20], to more sophisticated methods such as wavelet analysis denoising [19], [21], [22], [23]. However, it has been shown that universal noise reduction leads to information loss in signals and subsequently reduces the accuracy of heart sound classification models [24].
The second approach is targeted noise reduction, where recordings are first assessed in terms of quality. Noise reduction techniques are then applied solely to recordings deemed to need it, while recordings with an acceptable noise level remain intact. This approach minimizes the information loss caused by applying noise reduction to acceptable-quality recordings, as occurs in universal noise reduction.
The third quality enhancement approach is filtering. Unlike the previously mentioned approaches, filtering does not apply noise reduction techniques to the recordings. Instead, poor-quality signals are distinguished from acceptable-quality ones using signal processing or machine learning techniques [18], [25], [26], [27], [28]. In other words, prior to classification, recordings are assessed up front: those that do not meet a minimum threshold of quality are discarded, while recordings deemed of acceptable quality proceed through to the classification model.
While various approaches have been adopted in the field to enhance the quality of heart sound recordings, how they influence the performance of heart sound classification models has remained unexplored. Gaining a deeper understanding of the impact of quality enhancement on data-driven models' performance allows us to make informed design decisions for the pre-processing of heart sound recordings, and thus optimize downstream diagnostic models' accuracies. Therefore, in this study, we investigate which heart sound quality enhancement approach is more effective in reducing the misclassification rate of classification models. We first validate the results of previous work regarding the negative impact of universal noise reduction on the accuracy of heart sound classification models. Then, we explore whether either targeted noise reduction or filtering can be more effective than universal noise reduction in improving the accuracy of models. We develop multiple classification models using different feature representations and classifiers. The models are then evaluated using recordings enhanced by the aforementioned quality enhancement approaches. This experiment enables us to understand how each approach influences the accuracy of data-driven models.
Through experimental analysis, we demonstrate that both universal and targeted noise reduction are associated with a performance drop in classification models. We also show that filtering out the recordings that fall below a minimum threshold of quality is a more effective quality enhancement approach than noise reduction. The findings of this study contribute towards informing the design decisions for the pre-processing of heart sound signals and optimizing the accuracy of heart sound classification models.
The remainder of this paper is structured as follows: Section II presents the background, which includes an overview of heart sound enhancement techniques. Section III details the dataset, quality enhancement approaches, and data-driven classification models employed in this study. In Section IV, results are provided and discussed. Conclusions and future directions are presented in Section V.

II. BACKGROUND

A. NOISE REDUCTION
Various methods have been proposed to improve the input signal by reducing noise in heart sound recordings. One of the most widely used techniques is filtering. Low-pass [20] and band-pass [3], [4] filters have been used to remove high- and low-frequency noises from heart sound signals. The limitation of this technique is that, in some cases, the frequency range of heart sounds can overlap with noise frequencies. In such cases, filtering may remove information salient for accurate diagnostics from heart sound recordings [29].
Wavelet analysis is another widely used technique for denoising heart sound signals. Messer et al. [21] have stated that wavelet coefficients of heart sounds are much larger than those of noise, and as a result, coefficients smaller than a specific level can be considered noise and discarded. Ali et al. [30] tried to determine optimal wavelet denoising parameters for heart sound recordings. They evaluated different wavelet families and decomposition levels on recordings from the Pascal heart sound dataset [31] and achieved the best denoising results using the Daubechies 10 wavelet family at the 4th decomposition level. The selection of the optimal decomposition level and threshold value can largely affect the performance of wavelet denoising. Therefore, some studies have tried to determine the values of these parameters adaptively. Vaisman et al. [23] proposed an adaptive wavelet denoising method for fetal heart sounds, in which the wavelet decomposition level was determined adaptively based on specific heart sound features. Jain and Tiwari [19] proposed a wavelet denoising method with adaptive thresholding for heart sound signals, in which the threshold value was estimated based on the statistical parameters of the signal.
While noise reduction algorithms can reduce the noise in heart sound signals, applying such techniques universally, irrespective of the noise level of signals, can be harmful. Information that is salient for accurate classification can be removed, and this information loss can reduce the performance of classification models employing such signals [24]. To address this challenge, we recently proposed a quality assessment and enhancement pipeline for heart sound recordings captured using mobile phones [32]. This pipeline includes multiple stages in which the quality of the recordings is assessed, and noise reduction is applied to specific signals deemed to need it. A group of clinicians was surveyed in order to evaluate the proposed pipeline's impact on the diagnosability of heart sound recordings. However, the impact of the pipeline on the classification models' performance was not explored.

B. FILTERING
Several methods have been proposed that use signal processing or machine learning techniques to analyze the quality of recordings and filter out poor-quality signals. Unlike noise, which has a random nature, heart sounds are periodic. Some studies have leveraged this property of heart sounds to detect parts of the recordings that are noise-free or less noisy than the others. Li et al. [25] proposed a method to detect the sub-sequence of a heart sound signal with the highest quality. They used the degree of periodicity as a quality score to detect clean parts of the recordings. Kumar et al. [18] proposed a method to detect parts of a recording that are free of ambient and internal body noises. Their algorithm first detects a small part of the recording that shows periodicity in both time and frequency domains, and then uses that as a reference to distinguish noise from heart sounds. This method achieved 96% sensitivity and 98% specificity. The methods mentioned above are not applicable to recordings with strong continuous noise, as there may be no sub-sequence of acceptable quality.
Some studies have developed data-driven models to classify heart sound signals based on their quality. Springer et al. [26] computed nine signal quality indices for each heart sound signal and used those as input features to train a logistic regression model to distinguish good- from poor-quality recordings. Their model achieved an accuracy of 82% on heart sound signals captured using a mobile phone and 87% on signals captured using a digital stethoscope. Grooby et al. [27] proposed a data-driven signal quality assessment method for neonatal heart sounds. They extracted 186 statistical, time- and frequency-domain features from heart sound recordings and used those to train four different classifiers. They evaluated their method using a local heart sound dataset and achieved an accuracy of 93%. Tang et al. [28] extracted ten features, including but not limited to the degree of periodicity, kurtosis, and energy ratios, and employed those to train a support vector machine (SVM) classifier. This method was evaluated using a large dataset of 7893 recordings collected from publicly available datasets and was able to classify heart sounds into acceptable and unacceptable classes with an overall accuracy of 94%. Although some data-driven quality classification models have achieved acceptable accuracies, we should note that they generally need a large amount of annotated data to train on, which can be hard to obtain. Also, classifying recordings into good and poor quality as a mechanism for handling input quality may not be suitable where signals are captured in an environment or way that gives consistently noisy recordings, as they will simply be categorized as poor quality and consequently discarded.

III. APPROACH AND METHODS
The goal of this study is to investigate the impact of quality enhancement on the performance of heart sound classification models. To remove bias towards any specific feature representation or classifier, we create multiple classification models using different feature representations and classifiers. To understand which quality enhancement approach is more effective in reducing the misclassification rate of the classification models, we explore three scenarios:
• Universal noise reduction: Noise reduction is applied to all recordings irrespective of their noise level, and classification models are evaluated using the denoised recordings.
• Targeted noise reduction: A heart sound quality metric is used to identify recordings deemed to need denoising. Noise reduction is then applied to those recordings, and the denoised recordings are used to evaluate the classification models.
• Filtering: The quality of recordings is first assessed using multiple heart sound quality metrics. The models are then evaluated using recordings that meet a minimum threshold of quality.
By comparing the evaluation results of the models in these three scenarios with their respective baselines, where models are evaluated on original, unenhanced recordings, we determine the most effective quality enhancement approach.
In Section III-A, we provide the details of the quality enhancement approaches. Section III-B details the dataset used in our experiments. In Section III-C, we give the details of the developed classification models.

A. QUALITY ENHANCEMENT APPROACHES

1) UNIVERSAL NOISE REDUCTION
This approach applies a noise reduction algorithm to all recordings in the test set, irrespective of their noise level. For this purpose, we use a wavelet denoising technique. We chose wavelet denoising as it has been widely used in the field as a noise reduction technique for heart sound data. Table 1 summarizes the parameters of the wavelet denoising method, which have been adapted from [30].
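The wavelet denoising step can be sketched as follows. For brevity, this illustration uses a hand-rolled Haar wavelet with the universal soft threshold, rather than the Daubechies 10 family at decomposition level 4 adapted from [30]; a production implementation would typically use a wavelet library such as PyWavelets.

```python
import numpy as np

def haar_dwt(x):
    """One level of the Haar discrete wavelet transform."""
    n = len(x) - (len(x) % 2)            # truncate to an even length
    even, odd = x[:n:2], x[1:n:2]
    approx = (even + odd) / np.sqrt(2)
    detail = (even - odd) / np.sqrt(2)
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of one Haar DWT level."""
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2)
    x[1::2] = (approx - detail) / np.sqrt(2)
    return x

def wavelet_denoise(signal, levels=4):
    """Wavelet threshold denoising: decompose, soft-threshold the
    detail coefficients, then reconstruct the signal."""
    approx, details = np.asarray(signal, dtype=float), []
    for _ in range(levels):
        approx, d = haar_dwt(approx)
        details.append(d)                # details[0] is the finest scale
    # Universal threshold, with noise level estimated from the
    # finest-scale detail coefficients (median absolute deviation).
    sigma = np.median(np.abs(details[0])) / 0.6745
    thr = sigma * np.sqrt(2 * np.log(len(signal)))
    details = [np.sign(d) * np.maximum(np.abs(d) - thr, 0) for d in details]
    for d in reversed(details):
        approx = haar_idwt(approx, d)
    return approx
```

Swapping in a different wavelet family only changes the analysis/synthesis filters; the thresholding logic stays the same.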
TABLE 1. Parameters and their corresponding values for wavelet analysis denoising algorithm (adapted from [30]).

2) TARGETED NOISE REDUCTION
As shown in Fig. 1, targeted noise reduction applies a denoising algorithm to recordings of the test set deemed to need it, while recordings with acceptable quality (clean recordings) remain intact. For denoising, we use the same algorithm as in universal noise reduction (Section III-A1).
To identify the candidate recordings for noise reduction, we use a heart sound quality metric called the frequency band ratio [28], which represents the ratio of the frequency band magnitudes in the lower frequency range (between 24 Hz and 200 Hz) to the magnitudes of all frequency bands of the signal. Unlike noise, which has a wide frequency range, heart sounds are low-frequency sounds [28]. Therefore, the frequency band ratio can indicate the level of noise in a recording. The value of this metric is larger for higher-quality heart sound recordings and smaller for noisier recordings. A recording is categorized as clean if its frequency band ratio is equal to or higher than 0.45. Otherwise, it is classified as noisy, and noise reduction is applied to such signals. To determine the above threshold, the frequency band ratio was computed for all recordings in the training set, and the median value of this metric was identified. Using the median value of the frequency band ratio as the quality threshold allows us to split the data into two balanced sets, where the split above the threshold contains higher-quality recordings on average than the split below it. Such a data split enables us to limit the application of targeted noise reduction to the noisier recordings.
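As an illustration, the frequency band ratio can be approximated from the magnitude spectrum of a recording. The band edges (24–200 Hz) follow the description above, but the use of a plain FFT magnitude spectrum is an assumption of this sketch, not the exact implementation of [28].

```python
import numpy as np

def frequency_band_ratio(signal, fs, low=24.0, high=200.0):
    """Ratio of spectral magnitude in the low-frequency band (where
    heart sounds concentrate) to the total spectral magnitude."""
    mags = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    band = (freqs >= low) & (freqs <= high)
    return mags[band].sum() / mags.sum()
```

With this sketch, a clean low-frequency heart sound yields a ratio close to 1, whereas broadband noise spreads magnitude across the whole spectrum and pulls the ratio well below the 0.45 clean/noisy threshold used here.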

3) FILTERING
In this approach, we do not apply noise reduction techniques to recordings; instead, we discard the recordings that fall below a minimum threshold of quality. Fig. 2 illustrates the pipeline for the filtering approach.
As shown in Fig. 2, first, the duration of the recordings is checked: recordings equal to or longer than 5 seconds are acceptable, while recordings shorter than 5 seconds are rejected. This threshold for the length of recordings was determined based on a survey we carried out with a group of clinicians regarding the impact of noise and degradations on the diagnosability of heart sound signals [32]. Most survey respondents (92%) stated that a diagnosable heart sound recording must include at least six heartbeat cycles. According to [33], the average duration of a heartbeat cycle is 0.8 seconds, which means that a 5-second recording contains six heartbeats on average. The impact of signal duration on the performance of heart sound classification models was also investigated in our recent study [34]. The results of that study showed that short-duration recordings should be discarded, as they can reduce the accuracy of classification models.
Afterward, we use two signal quality metrics, the frequency band ratio (discussed in Section III-A2) and the degree of periodicity [35], to identify recordings that are too noisy or do not contain heart sounds. The degree of periodicity shows how periodic the signal is. According to Li et al. [35], a heart sound signal with a lower noise level has a greater degree of periodicity. Therefore, we can use this metric to detect non-periodic signals.
A recording is categorized as acceptable quality if the following condition holds:
• Degree of Periodicity ≥ 2.0 AND Frequency Band Ratio ≥ 0.3
Otherwise, the recording is discarded and is not presented to the model for classification.
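Putting the duration check and the two metric thresholds together, the filtering decision reduces to a simple gate. The function below is an illustrative sketch; the metric values are assumed to be computed beforehand (the degree of periodicity as in [35] and the frequency band ratio as in [28]).

```python
def passes_quality_gate(duration_s, periodicity, band_ratio,
                        min_duration=5.0, min_periodicity=2.0,
                        min_band_ratio=0.3):
    """Filtering rule of the pipeline: reject short recordings first,
    then require BOTH quality metrics to clear their thresholds."""
    if duration_s < min_duration:
        return False
    return periodicity >= min_periodicity and band_ratio >= min_band_ratio
```

Recordings for which this gate returns False form the reject set and never reach the classifier.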
The thresholds for these two metrics were determined by performing a subjective listening test and analyzing its results. To carry out the test, the aforementioned metrics were first computed for the recordings of the training set. Then, 93 recordings were selected from the training set. The choice of these recordings was based on the values of the frequency band ratio and periodicity metrics, to ensure a set of samples with a diverse range of quality. Finally, we listened to the selected recordings and categorized them into three groups based on their quality: a) good quality, b) borderline quality, and c) poor quality. The distributions of these three groups of samples were then analyzed. Based on the results of this test, we chose thresholds that give a good separation between poor-quality and good-quality recordings. In other words, the chosen thresholds allow us to identify and consequently discard most of the poor-quality recordings while retaining the majority of good-quality ones.
Fig. 3 illustrates the phonocardiogram (PCG) and Log-spectrogram of an acceptable-quality heart sound recording, while Fig. 4 depicts the PCG and Log-spectrogram of a poor-quality recording.

B. DATASET
We use the PhysioNet heart sound dataset [36] to train and evaluate the classification models. The PhysioNet dataset was published in 2016 as part of the PhysioNet/CinC Computing in Cardiology Challenge. In the last few years, PhysioNet has been used as the gold-standard dataset to develop and evaluate heart sound classification models. This dataset contains 3240 recordings from healthy and pathologic subjects with a sampling frequency of 2000 Hz. Of these 3240 recordings, 2575 belong to the normal class and 665 were labeled as abnormal. The normal recordings were captured from healthy subjects, and the abnormal ones were collected from patients with a confirmed cardiac diagnosis. The recordings last from 5 seconds to just over 120 seconds. Some recordings of this dataset were labeled as unsure, meaning that the annotators were not sure about the label of those recordings. In this study, we excluded those recordings from the dataset. The recordings were captured in clinical and non-clinical environments. Therefore, many of them are corrupted by various noise sources, such as speech, stethoscope motion, breathing, and intestinal activity. PhysioNet includes a variety of normal and abnormal heart sounds contaminated with a wide range of noises and disturbances of various intensities. These characteristics make PhysioNet a good choice for our experiments. The recordings of this dataset were randomly apportioned into training (63%), validation (7%), and test (30%) sets.

C. CLASSIFICATION MODELS
Fig. 5 illustrates the pipeline for training and evaluating the heart sound classification models. As shown in Fig. 5, this pipeline includes three components: pre-processing, feature extraction, and classification. In the following, we provide a summary of the role and function of these components.

1) PRE-PROCESSING
The first step of pre-processing is fixed-length segmentation, in which the recordings are split into 5-second segments. The last segment of each recording is kept only if its duration is equal to or longer than 2.5 seconds. Segments shorter than 5 seconds are zero-padded for the convolutional neural network (CNN) models to ensure that inputs to these models are of the same length. The train, validation, and test sets include 8400, 933, and 3967 segments, respectively. To prevent data leakage, all segments of a given recording are placed in either the train, validation, or test set. Amplitude normalization is then applied to all segments. Lastly, quality enhancement (detailed in Section III-A) is applied if appropriate. It should be noted that this study aims to evaluate the influence of applying quality enhancement at the inference phase on the performance of the classification models. Therefore, quality enhancement is only applied to recordings in the model evaluation phase, while in the training phase, this step is skipped.
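The fixed-length segmentation rules above can be sketched as follows (a minimal numpy illustration; `fs` defaults to the 2000 Hz sampling rate of the dataset):

```python
import numpy as np

def segment_recording(signal, fs=2000, seg_s=5.0, min_last_s=2.5,
                      zero_pad=True):
    """Split a recording into fixed-length segments. The trailing
    segment is kept only if it is at least `min_last_s` long; when
    kept, it is zero-padded to the full segment length (as required
    for fixed-size CNN inputs)."""
    seg_len = int(seg_s * fs)
    segments = []
    for start in range(0, len(signal), seg_len):
        seg = signal[start:start + seg_len]
        if len(seg) < seg_len:                       # trailing remainder
            if len(seg) < int(min_last_s * fs):
                break                                # too short: discard
            if zero_pad:
                seg = np.pad(seg, (0, seg_len - len(seg)))
        segments.append(seg)
    return segments
```

For example, a 13-second recording yields two full 5-second segments plus a 3-second remainder that is kept and zero-padded, while a 12-second recording yields only the two full segments (its 2-second remainder is discarded).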

2) FEATURE EXTRACTION
After pre-processing the recordings, linear- and Mel-scaled short-time Fourier transform (STFT) features are extracted from the signals. STFT is the most widely used time-frequency feature representation for heart sound classification [37]. Spectrogram representations are then computed from the STFTs; we call these features Log-spectrogram and Mel-spectrogram. The Librosa Python library [38] was used to extract these features from the recordings. The window length, hop length, and number of Mel bands were fixed at 256, 128, and 128, respectively. The spectrograms are then normalized using Z-score normalization. To reduce the computational cost of training the support vector machine (SVM) models, the average values of the features across the time axis are computed for these models (as in [39]).
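To make the feature shapes concrete, the sketch below computes a Log-spectrogram with the stated parameters (window 256, hop 128) together with the time-averaged variant used for the SVM models. It is a pure-numpy illustration rather than the Librosa calls used in the paper, and the Hann window choice is an assumption of the sketch.

```python
import numpy as np

def log_spectrogram(signal, win_len=256, hop_len=128):
    """Linear-frequency log-magnitude STFT with Z-score normalization."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop_len
    frames = np.stack([signal[i * hop_len:i * hop_len + win_len] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)).T     # (freq bins, frames)
    log_spec = np.log(spec + 1e-10)
    # Z-score normalization over the whole spectrogram.
    return (log_spec - log_spec.mean()) / (log_spec.std() + 1e-10)

def svm_features(spectrogram):
    """Average across the time axis: one fixed-size vector per segment."""
    return spectrogram.mean(axis=1)
```

For a 5-second segment at 2000 Hz (10000 samples), this yields a 129 × 77 spectrogram for the CNN models and a 129-dimensional vector for the SVM models.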

3) CLASSIFICATION
The extracted features are used to train and evaluate the classifiers. Two different classifiers are employed: SVM and CNN. Both have been frequently used in the field to develop heart sound classification models. Given that two different features are extracted (Log-spectrogram and Mel-spectrogram), four different classification models are developed: Log-SVM, Mel-SVM, Log-CNN, and Mel-CNN.
For the SVM models, we use the implementation of the SVM provided in the Scikit-learn library [40] with default hyperparameters. The SVM models are trained on the training set and evaluated on the test set.
For the CNN models, we use the same architecture as in [34]. For clarity, we provide a summary of the architecture. Fig. 6 depicts the architecture of the CNN model with the Mel-spectrogram as input. As shown in Fig. 6, this architecture includes three convolutional layers, each followed by a max-pooling layer. To reduce overfitting, each max-pooling layer is followed by a dropout layer with a rate of 0.5. After the convolutional and max-pooling layers, a fully connected layer with 100 neurons is used, which is also followed by a dropout layer with a rate of 0.5. The final layer of this architecture is a Softmax layer, which outputs the probability distribution over the potential outcomes (normal or abnormal). The numbers of kernels in the three convolutional layers are fixed at 16, 32, and 64, respectively. The stride is fixed at 1 for all convolutional layers. Kernel sizes of (3, 3) and (2, 2) are used for the convolutional and max-pooling layers, respectively. ReLU is used as the activation function for the convolutional and fully connected layers. We used the TensorFlow deep learning library to implement the CNN models.
To train the CNN models, Adam optimization [41] with a learning rate of 0.001 and a cross-entropy objective function are used. Training is performed on the training set for a maximum of 100 epochs. To avoid overfitting, the training process is stopped if the validation loss does not decrease for 10 consecutive epochs. The models are then evaluated on the test set. This process is repeated ten times, and the average and standard deviation are reported for each evaluation metric.
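The early-stopping rule can be captured by a small framework-agnostic helper (an illustrative sketch; in practice, TensorFlow's built-in `EarlyStopping` callback serves the same purpose):

```python
class EarlyStopping:
    """Stop training when the validation loss has not improved for
    `patience` consecutive epochs (10 in our setup)."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience
```

The training loop calls `should_stop` once per epoch with the current validation loss and breaks as soon as it returns True.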

4) EVALUATION METRICS
The performance of the models is measured using two metrics, both of which have been frequently used in the field to measure the accuracy of heart sound diagnostic models. The first one is recall, which is used to quantify the performance of the models on each class and is calculated using the following formula:

Recall = TP / (TP + FN)

where TP and FN are the number of true positives and false negatives in the test set, respectively. The second metric is unweighted average recall (UAR) [42], which is used to measure the overall performance of the models and is calculated using the following equation:

UAR = (1 / N_c) Σ_{i=1}^{N_c} Recall_i

where Recall_i is the recall for the ith class and N_c stands for the number of classes, which is two (normal and abnormal).
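The two metrics can be computed directly from label lists, as in this minimal sketch:

```python
def recall_per_class(y_true, y_pred, labels=("normal", "abnormal")):
    """Recall for each class: TP / (TP + FN)."""
    recalls = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        recalls[label] = tp / (tp + fn) if tp + fn else 0.0
    return recalls

def uar(y_true, y_pred, labels=("normal", "abnormal")):
    """Unweighted average recall: the mean of the per-class recalls,
    so both classes count equally regardless of class imbalance."""
    recalls = recall_per_class(y_true, y_pred, labels)
    return sum(recalls.values()) / len(labels)
```

Because UAR weights both classes equally, a model cannot inflate it by favoring the majority (normal) class, which matters for the imbalanced PhysioNet data.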

IV. RESULTS AND DISCUSSION
In this section, we analyze the evaluation results of the classification models for three scenarios: universal noise reduction, targeted noise reduction, and filtering. For each scenario, the evaluation results are compared with the baseline, where the models are evaluated on unenhanced recordings.

A. SCENARIO 1: UNIVERSAL NOISE REDUCTION
Noise reduction was applied to all recordings of the test set, and the models were evaluated using the entire denoised test set. Table 2 compares the recall of the normal and abnormal classes, as well as the UAR, of the models evaluated on the denoised test set with their respective baselines, where the models were tested on unenhanced recordings. The UAR results are also illustrated in Fig. 7. As shown in Fig. 7, applying universal noise reduction to recordings reduces the overall performance of all models with respect to their baselines. Fig. 8 compares the class-level results for the universal noise reduction scenario with their respective baselines. We can see that, for all models, universal noise reduction reduces the recall of both the normal and abnormal classes. This drop in recall is slightly larger for the abnormal class than for the normal class. Although applying noise reduction to heart sound recordings reduces the noise in the signals, it can also attenuate murmurs and extra heart sounds, which are indicative of abnormality [24]. As a result, some abnormal heart sounds can be misclassified as normal, which in turn leads to a higher misclassification rate for the abnormal class. It is worth mentioning that the cost of misclassification is higher for the abnormal class than for the normal class, as a potential pathology would remain undiagnosed.
Universal noise reduction is the standard approach for reducing noise and artifacts in heart sound recordings. Various techniques, such as low-pass or band-pass filters [3], [4], [20], and wavelet denoising algorithms [19], [21], [22], [23], have been employed to reduce or eliminate noise in recordings and consequently increase the performance of data-driven models. However, the aforementioned results suggest that applying such techniques universally, irrespective of the noise level of recordings, can have a negative impact on the overall performance of classification models. This finding is in line with the results of another study [24], in which the impact of universal noise reduction on heart sound classification models was explored, and it was shown that direct application of such techniques to heart sound signals leads to a reduction in the accuracy of data-driven models.

B. SCENARIO 2: TARGETED NOISE REDUCTION
Targeted noise reduction was applied to the recordings of the test set deemed to need it. Out of the 3967 recordings, 1976 were denoised and 1991 remained unenhanced. The models were evaluated on the denoised recordings. Table 3 compares the recall of the normal and abnormal classes, as well as the UAR, of the models with their respective baselines. The UAR results are also illustrated in Fig. 9. As shown in Fig. 9, targeted noise reduction reduces the UAR values of all models. As we can see in Table 3, similar to universal noise reduction, the drop in recall is slightly larger for the abnormal class than for the normal class. These results suggest that noise reduction is not beneficial, even when it is applied only to the noisier recordings of the test set while acceptable-quality recordings remain unenhanced.

C. SCENARIO 3: FILTERING
The quality of the recordings in the test set was assessed using multiple heart sound quality metrics. Out of 3967 recordings, 678 were rejected (the reject set), and 3289 were determined to be of acceptable quality. Fig. 10 compares the class distribution of the recordings (in percentage) in the entire test set with that of the reject set. We can see that the class distribution is the same in these two sets. The models were evaluated on the acceptable-quality recordings. The third section of Table 2 summarizes the recall for the normal and abnormal classes, as well as the UAR, of the models. Fig. 11 compares the UAR values of the models with their respective baselines, where the models were evaluated on the unenhanced test set. As shown in Fig. 11, filtering out the recordings that do not meet a minimum threshold of quality leads to a slight increase in the overall performance of all models. As summarized in Table 2, the SVM and CNN models show a 3% and 2% increase in UAR compared to their baselines, respectively. Class-level results also show that filtering does not change the recall of the normal class in three out of four models (the exception being the Log-SVM model). The recall of the abnormal class, however, increases for all models; this increase is 7% and 5% for the SVM and CNN models, respectively. These results indicate that rejecting the recordings that fall below a minimum quality threshold can increase the overall accuracy of heart sound classification. As mentioned earlier, in the filtering scenario, recordings that do not meet a minimum quality threshold are discarded and, as a result, are not presented to the classification models. This means that the models are evaluated on a smaller set of recordings (3289) than in the baseline, where the same models are evaluated on the entire test set (3967 recordings). In other words, in the filtering scenario, the models are evaluated on recordings of better quality on average compared to the baseline. Therefore, it
is expected that the models achieve higher performance than their respective baselines.However, we should note that in the filtering scenario, we reject a small percentage of the recordings that are either short-duration or too noisy.By rejecting such potentially undiagnosable recordings, we can ensure that we only present diagnostic-quality heart sounds to the models for classification into target classes.
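For illustration, the filtering stage can be sketched as a simple quality gate applied before evaluation. The duration and noise criteria below are hypothetical stand-ins for the quality metrics used in this study, with illustrative thresholds; they are not the exact metrics or values from our pipeline:

```python
import numpy as np

def passes_quality_gate(recording, fs, min_duration_s=5.0, max_hf_ratio=0.5):
    """Hypothetical quality gate: reject a recording if it is too short or if
    too much of its energy sits in high frequencies (a crude noise proxy).
    Metric choices and thresholds are illustrative only."""
    if len(recording) / fs < min_duration_s:  # duration criterion
        return False
    # Noise criterion: ratio of first-difference energy to signal energy.
    hf_ratio = np.mean(np.diff(recording) ** 2) / (np.mean(recording ** 2) + 1e-12)
    return hf_ratio <= max_hf_ratio

def filter_test_set(recordings, fs):
    """Split recordings into an accept set (passed to the classifier)
    and a reject set (discarded before evaluation)."""
    accept, reject = [], []
    for rec in recordings:
        (accept if passes_quality_gate(rec, fs) else reject).append(rec)
    return accept, reject
```

Only the accept set is scored by the classifiers, which mirrors the evaluation protocol of this scenario.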
Out of the 678 recordings in the reject set, 246 (36%) were rejected due to noise, 379 (56%) due to duration, and 53 (8%) due to both noise and duration. To investigate whether applying noise reduction to the filtered recordings is beneficial, we also denoised the recordings of the reject set and evaluated the models on them. The models performed worse on the denoised recordings than on the unenhanced ones, mirroring what we observed in scenarios 1 and 2.
As mentioned in Section II-B, filtering is a heart sound quality enhancement approach that, instead of applying denoising techniques to recordings, aims to distinguish acceptable-quality recordings from poor-quality ones. Various filtering methods have been proposed based on signal processing [18], [25] or data-driven quality classification [26], [27], [28]. Our results show the potential of filtering techniques to reduce the misclassification rate of heart sound classification models.

D. OVERALL ANALYSIS
Noise reduction has been employed as a standard quality enhancement approach for heart sound recordings in the field. The results of our experiments, however, indicate that denoising the recordings, in both the targeted and universal noise reduction scenarios, impairs the accuracy of data-driven models. This drop in accuracy is more pronounced for the abnormal class, which is the more important class for the heart sound classification task. We applied a noise reduction technique (wavelet analysis denoising) to all recordings of the test set to enhance their quality and, consequently, improve the performance of the models; we observed the opposite outcome, with the misclassification rate of the models increasing. We also explored whether limiting noise reduction to the noisier recordings, while keeping acceptable-quality recordings intact, is a better approach than universal noise reduction. The results indicate that targeted noise reduction is similarly harmful to model accuracy. A plausible explanation is that, in addition to noise and artifacts, denoising can remove information salient to accurate diagnosis from the recordings, thereby reducing the accuracy of the models. Based on this finding, we suggest investigating the impact of a noise reduction technique on model accuracy before adopting it as a default pre-processing stage in heart sound classification pipelines.
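To make the mechanism concrete, the following is a minimal, numpy-only sketch of soft-threshold wavelet denoising using a single-level Haar transform and the universal threshold. The wavelet analysis denoising applied in our experiments is more elaborate, but the principle, thresholding the detail coefficients and reconstructing, is the same; note how the soft threshold suppresses any low-energy detail, whether it is noise or diagnostically relevant signal content:

```python
import numpy as np

def haar_soft_denoise(x):
    """One-level Haar wavelet denoising with soft thresholding.
    Assumes an even-length signal; illustrative sketch only."""
    a = (x[0::2] + x[1::2]) / np.sqrt(2)       # approximation coefficients
    d = (x[0::2] - x[1::2]) / np.sqrt(2)       # detail coefficients
    sigma = np.median(np.abs(d)) / 0.6745      # robust noise estimate
    t = sigma * np.sqrt(2 * np.log(len(x)))    # universal threshold
    d = np.sign(d) * np.maximum(np.abs(d) - t, 0.0)  # soft threshold
    y = np.empty_like(x)
    y[0::2] = (a + d) / np.sqrt(2)             # inverse Haar transform
    y[1::2] = (a - d) / np.sqrt(2)
    return y
```

In practice, multi-level decompositions with smoother wavelets (e.g. Daubechies family) are used, but the thresholding step, and hence the risk of discarding salient signal detail, is shared.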
The results also show that assessing the quality of recordings and discarding the signals that do not meet a minimum quality threshold can improve the performance of classification models. Filtering out poor-quality recordings proved more effective than noise reduction at reducing the misclassification rate of the data-driven models. This may stem from the fact that, unlike noise reduction, filtering does not modify the content of the signals, so the morphological characteristics of the heart sounds remain unchanged. In situations where re-capturing the heart sounds is an option, or where the available recordings are long enough, filtering out the recordings (or segments of them) that fall below a minimum quality threshold can be a more effective quality enhancement approach than noise reduction. Although the gain in overall performance (UAR) was modest, we saw a 5% to 7% increase in the recall of the clinically important abnormal class for each model, which reflects an improved capability to detect heart problems. Based on these results, we recommend employing filtering as an alternative to noise reduction in heart sound classification pipelines.
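Throughout this paper, overall performance is reported as UAR (unweighted average recall), the mean of the per-class recalls. Unlike plain accuracy, UAR weights the minority abnormal class equally with the majority normal class, which is why recall gains on the abnormal class translate directly into UAR gains:

```python
import numpy as np

def unweighted_average_recall(y_true, y_pred):
    """UAR: mean of per-class recalls, computed over the classes
    present in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))
```

For example, a classifier that recalls all normal recordings but only half of the abnormal ones scores a UAR of 0.75, regardless of how imbalanced the test set is.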
In the targeted noise reduction and filtering approaches, we used different quality assessment metrics with predetermined thresholds to identify acceptable-quality recordings. We should note, however, that designing a universal heart sound quality metric is beyond the scope of this study. The quality thresholds used here were determined in line with our research question and based on the characteristics of the heart sound recordings in the PhysioNet dataset. As a result, they are not necessarily applicable to other datasets and should not be taken as a universal definition of quality.

E. LIMITATIONS
In this research work, we used two classification models to explore how quality enhancement affects the performance of data-driven models. The feature representations and classifiers used in this study have been frequently employed by other researchers in the field to develop heart sound classification models. We used SVM and CNN as representatives of traditional and deep learning classifiers, respectively. The similar trends in the results of these two models suggest that the findings of this study are independent of the employed classifiers and feature representations; nevertheless, quality enhancement may affect other classification models differently. Also, as stated in Section III-A, we used wavelet analysis as the denoising method for the universal and targeted noise reduction approaches. Wavelet denoising has been widely employed as a pre-processing technique to reduce noise in heart sound recordings, and we saw that applying it reduces the performance of the classification models; however, other noise reduction techniques, which were not explored in this study, may not affect the employed classification models in the same way. Lastly, as stated in Section III-C, we split the heart sound recordings into fixed-length segments to train and evaluate the classification models; that is, recordings were not segmented into heartbeat cycles. The findings of this study may therefore only hold for segmentation-free heart sound classification pipelines. The impact of applying heartbeat segmentation to heart sound recordings can be investigated in future work.
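The segmentation-free splitting mentioned above can be sketched as follows; the 3-second window length is an illustrative choice, not necessarily the one used in this study:

```python
import numpy as np

def fixed_length_segments(recording, fs, seg_s=3.0):
    """Cut a recording into fixed-length windows without locating
    heartbeat cycles; the incomplete tail is dropped."""
    seg_len = int(seg_s * fs)
    n_segs = len(recording) // seg_len
    return np.asarray(recording[: n_segs * seg_len]).reshape(n_segs, seg_len)
```

Each resulting window is treated as an independent classification example, which avoids the need for a heartbeat segmentation algorithm but discards cycle alignment information.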

V. CONCLUSION
This study investigated the impact of quality enhancement on the performance of heart sound classification models. Three quality enhancement approaches were explored: universal noise reduction, targeted noise reduction, and filtering. The results indicate that noise reduction (universal or targeted) is associated with an overall performance drop in heart sound classification models. We therefore recommend avoiding noise reduction on heart sound recordings, or thoroughly analyzing its impact on the models before including such techniques in the classification pipeline. We also observed that assessing the quality of heart sound signals and filtering out recordings that do not meet a minimum quality threshold is an effective approach to reducing the misclassification rate of data-driven models, in particular for the clinically important abnormal class. We therefore recommend using filtering techniques as a pre-processing step in classification pipelines to improve the performance of data-driven models. The findings of this study can inform design decisions for the pre-processing of heart sound recordings and consequently optimize downstream classification performance.
In this study, we used a combination of two feature representations and two classifiers to compare quality enhancement approaches. In future work, we will extend this study by exploring the impact of quality enhancement on a more extensive set of classification models. We also used a particular denoising technique (wavelet denoising) for universal and targeted noise reduction of the heart sound recordings; we plan to include a wider set of denoising techniques to investigate how the choice of noise reduction algorithm affects the results of our study.

FIGURE 5. Pipeline for training and evaluation of the heart sound classification models. The quality enhancement stage is applied only to recordings in the model evaluation phase and is ignored in the model training phase.

FIGURE 6. Architecture of the CNN model with Mel-spectrogram as input (Mel-CNN model).

FIGURE 7. Universal noise reduction decreases the overall performance (UAR) of all classification models with respect to their baselines.

FIGURE 8. Universal noise reduction reduces the recall of the models on both normal and abnormal classes with respect to their baselines. The drop in recall is slightly larger for the abnormal class than for the normal class.

FIGURE 9. Targeted noise reduction decreases the overall performance (UAR) of all classification models with respect to their baselines.

TABLE 3. Evaluation results of the classification models in the targeted noise reduction scenario and their respective baselines.

FIGURE 10. The reject set has a class distribution similar to that of the entire test set (78% of the recordings belong to the normal class and 22% to the abnormal class).

FIGURE 11. Filtering out the recordings that fall below a minimum quality threshold increases the overall performance (UAR) of all classification models with respect to their baselines.

TABLE 2. Evaluation results of the models in the universal noise reduction and filtering scenarios and their respective baselines.