Phonocardiogram Signal Based Multi-Class Cardiac Diagnostic Decision Support System

A Phonocardiogram (PCG) signal represents the murmurs and sounds produced by vibrations during a cardiac cycle. The acoustic wave generated by the beating heart propagates through the chest wall and can be easily recorded by a low-cost handheld digital device called a stethoscope. It provides information such as the heart rate, intensity, tone, quality, frequency, and location of various components of the cardiac sound. Due to these characteristics, phonocardiogram signals can be used to detect heart conditions at an early stage in a non-invasive manner. In previous studies, the Convolutional Neural Network (ConvNet) is the most studied architecture, fed by features such as Mel Frequency Cepstral Coefficients (MFCC), Chroma Energy Normalized Statistics (CENS), and the Constant-Q Transform (CQT). This work proposes a ConvNet model trained with the Hybrid Constant-Q Transform (HCQT) for heart sound beat classification. The CQT, Variable-Q Transform (VQT), and HCQT are extracted from each phonocardiogram signal as acoustic features and, together with the dominant MFCC features, fed into five-layer regularized ConvNets. After analyzing the literature in this domain, it can be stated that this is the first time the HCQT is utilized for PCG signals. The findings of the experiments demonstrate that the HCQT is more effective than the standard CQT and its other variants. The accuracy of the proposed system on the validation dataset is 96% for multi-class classification, significantly outperforming the other models. The source code is available in the GitHub repository https://github.com/shamiktiwari/PCG-signal-Classification-using-Hybrid-Constant-Q-Transform to support the research community.


I. INTRODUCTION
As per the fact sheet available from the WHO, CVD claims the lives of around 17.9 million people each year, 31% of all deaths annually, which makes CVD the number one cause of death. Most deaths due to CVD occur in middle- and low-income countries, where medical facilities are either not easily available or very costly [1]. Diagnosis at an early stage is the only way to decrease the death rate due to CVD. There are many invasive and non-invasive methods of diagnosing CVD. Invasive techniques are costly, painful, and not readily available everywhere, especially in remote areas, whereas non-invasive methods of diagnosing CVD at an early stage are less expensive and painless. ECG and PCG are two such non-invasive ways to diagnose CVD, but their analysis requires an expert doctor of this domain, who is not readily available in remote areas [2]. When the sounds and murmurs occurring during the cardiac cycle are represented diagrammatically, the result is called a phonocardiogram. These vibrations generate an acoustic wave that propagates through the chest wall. A stethoscope, a low-cost handheld digital device, is used to record the information carried by this acoustic wave. It gives an estimate of parameters like the heart rate, intensity, tone, quality, frequency, and location of various components of the cardiac sound, which helps in diagnosing CVD in a non-invasive manner [3]. Recent advances in computing have enabled researchers to design decision support systems that can be utilized to diagnose CVD at an early stage, even in the absence of an expert. Machine learning and deep learning algorithms have allowed us to create decision support systems that can help doctors and can also be used by laypeople in the absence of doctors [4].
The authors have proposed a hybrid constant-Q transform-based classification model to acquire more detailed information from PCG signals in this work. Acoustic features from the PCG signal are fed to the ConvNet model for learning. The following are the key contributions of the proposed work:
• Propose hybrid constant-Q transform (HCQT) based acoustic features for PCG signals.
• Compare the HCQT features to other acoustic features and recommend the best feature set for PCG signal classification.
The paper is structured as follows: Section 2 discusses the different models found in the literature for the automatic diagnosis of CVD from PCG. Section 3 details the sound features used for classification, the classifier, an insight into the proposed model, and the phonocardiogram signal dataset used for training and testing. Section 4 describes the simulation environment and the results generated by the proposed model. Section 5 presents a discussion and analysis of the results. Section 6 closes the paper with concluding remarks.

II. LITERATURE REVIEW
An overview of the different types of automatic heart disease diagnostic models based on PCG signals, along with the datasets used and the accuracy levels they achieved, is given below in Table 1.
Though a lot of research has been carried out in the last five years on designing automatic heart disease diagnosis models from PCG signals, many areas are yet to be explored. This has motivated the model proposed in Section 3.

III. MATERIAL AND METHODS
This section presents a detailed overview of the sound feature extraction methods, the classification model, the dataset used, and the proposed model utilized in this work.

A. MEL FREQUENCY CEPSTRAL COEFFICIENTS (MFCCs)
In audio and speech signal processing, the short-term power spectrum of a sound is represented by the Mel Frequency Cepstrum (MFC). It is based on a non-linear Mel frequency scale and a linear cosine transform of the logarithmic power spectrum. Collectively, the MFCC coefficients make up the MFC. The feature extraction process of MFCC is composed of the following steps [23]:
1. Pre-emphasis: the raw signal is passed through a high-pass filter to amplify its high-frequency content.
2. Framing: the signal is divided into short overlapping frames.
3. Windowing: each frame is multiplied by a window function (e.g., Hamming) to reduce spectral leakage.
4. Fast Fourier Transform: the power spectrum of each frame is computed via the FFT.
5. Mel Filter Bank: the power spectrum is passed through a bank of T triangular filters spaced on the Mel scale, where a frequency f in Hertz is mapped to the Mel scale as shown in (1):

$$M(f) = 2595\,\log_{10}\!\left(1 + \frac{f}{700}\right) \tag{1}$$

Here, the frequency term is denoted by $f$, while the Mel-scale frequency is denoted by $M(f)$.
6. Discrete Cosine Transform and Log Compression: the logarithm is applied to the filter bank energies received in step 5, followed by the DCT. Finally, MFCC(n) is computed as shown in (2):

$$\mathrm{MFCC}(n) = \sum_{t=1}^{T} \log\bigl(MF(t)\bigr)\,\cos\!\left(\frac{n\,(t - 0.5)\,\pi}{T}\right) \tag{2}$$

where MFCC(n) is the n-th MFCC coefficient derived from specific audio sections using T triangular filters, and MF(t) is the t-th filter's Mel-spectrum. The heartbeat spectrogram obtained by MFCC is shown in Fig. 1.
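As a minimal illustration of this extraction step with the Librosa library used later in this work (the file name and the choice of 13 coefficients are illustrative assumptions, not the paper's exact settings):

```python
import librosa

# Load a PCG recording and extract its MFCC matrix.
y, sr = librosa.load("pcg_signal.wav", sr=None)     # keep the native sampling rate
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 coefficients per frame
print(mfcc.shape)                                   # (n_mfcc, number of frames)
```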
B. CONSTANT-Q TRANSFORM (CQT) AND ITS VARIANTS
J. C. Brown introduced the CQT in 1991. It is a technique that transforms a signal from the time domain to the frequency domain. It differs from the Fourier transform, however, in that its central frequencies are geometrically spaced and the corresponding Q-factors are equal. The CQT is commonly defined as a 1/24-octave filter bank, but it is not restricted to 24 bins per octave; 12, 36, or 48 bins per octave can also be used. Unlike the DFT, the central frequencies of analysis are not uniformly distributed but aligned with the notes of the equally tempered scale, which makes the CQT well suited to sound processing [25], [26]. Furthermore, the constant Q-factor of the CQT effectively improves the resolution in low-frequency regions. For the n-th frame of the CQT, the frequency component of the k-th semitone can be stated as in (3):

$$X^{\mathrm{CQT}}(k, n) = \frac{1}{N_k}\sum_{m=0}^{N_k - 1} x(n + m)\, w_{N_k}(m)\, e^{-j 2\pi Q m / N_k} \tag{3}$$
where $Q = 1/(2^{1/\beta} - 1)$ is a constant whose value depends on the number of spectral lines in a single octave ($\beta$), $N_k = \lceil Q f_s / f_k \rceil$ is the window length of the $k$-th bin with center frequency $f_k$ and sampling rate $f_s$, and $w_{N_k}$ is a window function of length $N_k$. The ability of the constant-Q transform to provide equal frequency support to all semitones, with a variable number of bins among them, is its main advantage. However, it has drawbacks, one of which is the absence of consistent temporal resolution at lower frequencies. This trade-off can be alleviated by variants of the CQT, i.e., the VQT and the HCQT. Compared to the CQT, the VQT provides better temporal resolution at lower frequencies. A new parameter $\gamma$ is introduced to allow a gradual drop of the bins' Q-factors as they approach low frequencies: the bandwidth of the $k$-th bin becomes $B_k = \alpha f_k + \gamma$ with $\alpha = 2^{1/\beta} - 1$, so that the Q-factor is $Q_k = f_k / B_k$ [27], [28].
When $\gamma = 0$, the Q-factor reduces to the constant of the constant-Q case. The additional parameter $\gamma$ can be understood as a Hertz offset, and it is normally set as low as possible, e.g., around 30 Hz. Intuitively, $\gamma$ has a stronger relative influence at lower frequencies, where the bandwidth is small, and fades at higher frequencies. The hybrid CQT, on the other hand, is made up of two CQT variants. In the time domain, the frame shift is assumed to contain $L$ samples. Then, the $k_c$-th filter is selected such that it fulfills the condition $N[k_c] = 2L$ [29], [30].
Frequencies above $f_{k_c}$ are treated as high frequencies, whereas those below $f_{k_c}$ are treated as low frequencies. The high-frequency section of the hybrid CQT uses the filter bank of the high-frequency part of the CQT to filter the short-time Fourier transform-based spectrogram, while the regular CQT is used directly for the low-frequency section. Compared to the CQT, the HCQT is more computationally efficient. A visual comparison of the CQT, VQT, and HCQT is presented in Fig. 2.
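All three transforms are available in Librosa, the library used for feature extraction in this work. A hedged sketch of computing them side by side follows; the file name and parameter values (number of bins, bins per octave, $\gamma$) are illustrative, not the paper's exact settings:

```python
import librosa

# Load a PCG recording at its native sampling rate.
y, sr = librosa.load("pcg_signal.wav", sr=None)

C_cqt  = librosa.cqt(y, sr=sr, n_bins=84, bins_per_octave=12)            # constant Q-factor
C_vqt  = librosa.vqt(y, sr=sr, n_bins=84, bins_per_octave=12, gamma=30)  # Q drops toward low f
C_hcqt = librosa.hybrid_cqt(y, sr=sr, n_bins=84, bins_per_octave=12)     # pseudo-CQT at high f
```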

C. CONVOLUTIONAL NEURAL NETWORK (ConvNet)
CNN has brought a revolution in the domain of computer vision, achieving remarkably better results than traditional classification algorithms. Deep learning is a sub-class of machine learning based on Deep Neural Networks (DNNs); the word "deep" indicates the presence of more than one hidden layer in the neural network architecture. The CNN, also known as the ConvNet model, is one such type of deep neural network. It is made up of primarily three layers: a convolution layer, a pooling layer, and a dense (fully connected) layer [31], [32]. The first layer, i.e., the convolutional layer, is the essential building block of a ConvNet and performs the mathematical operation of convolution. In the continuous domain, the convolution of two functions $f$ and $g$ is given as in (4):

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau \tag{4}$$

In the discrete case, the same is expressed as in (5):

$$(f * g)[n] = \sum_{m=-\infty}^{\infty} f[m]\, g[n - m] \tag{5}$$

For a digital image, this extends to the 2-D convolution in (6):

$$(f * g)[i, j] = \sum_{m}\sum_{n} f[m, n]\, g[i - m, j - n] \tag{6}$$

Here, the function $g$ represents a filter applied to the input image $f$. 2-D convolution works by sliding the convolution filter over the input image; the number of pixels the filter moves at each step is called the stride. At each spatial location, the convolution between the corresponding part of the image and the filter is computed. The outcome is a 2-D array called a feature map. This feature map is then passed through a non-linear activation layer such as Softmax, the Rectified Linear Unit (ReLU), or Randomized Leaky ReLU. The pooling layer, also known as the subsampling layer, is another major component of a ConvNet. Its purpose is to reduce the spatial size of the activation map so as to reduce the number of parameters needed for further processing; it is applied to each feature map independently. Max pooling is the most widely used pooling method.
Finally, the result of the last pooling layer is received by a fully connected layer and used to categorize images into labels. This is the component of the ConvNet where discriminative learning is performed; it behaves like a multi-layer perceptron that learns weights and identifies image classes.
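As a concrete illustration of the strided 2-D convolution in (6), a minimal NumPy sketch producing a feature map might look as follows (the function and variable names are ours):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid 2-D convolution of a single-channel image with a filter, per (6)."""
    kernel = np.flipud(np.fliplr(kernel))   # true convolution flips the filter
    H, W = image.shape
    kh, kw = kernel.shape
    out_h = (H - kh) // stride + 1          # output height of the feature map
    out_w = (W - kw) // stride + 1          # output width of the feature map
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)   # dot product at this location
    return out
```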

D. PROPOSED PCG SIGNAL CLASSIFICATION MODEL USING ACOUSTIC FEATURES
The offered method for phonocardiogram signal classification using a ConvNet is depicted in Fig. 3. The raw data is provided in Waveform Audio File Format (WAV), encoding the phonocardiogram signals. To pass these sound waves to the ConvNet model, each phonocardiogram signal is converted into an image, i.e., a 2-D spectrogram. Spectrograms are convenient representations of these heartbeat recordings because they capture the intensity of the frequencies throughout a given sound; thus, they are effective representations of an audio recording. In this work, the authors have proposed the use of MFCC-, CQT-, VQT-, and HCQT-based spectrograms for phonocardiogram signal classification.
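A sketch of this WAV-to-spectrogram-image conversion, using Librosa and Matplotlib, is shown below for the HCQT case; the file names are illustrative assumptions:

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("heartbeat.wav", sr=None)           # raw PCG recording
C = librosa.hybrid_cqt(y, sr=sr)                         # HCQT coefficients
C_db = librosa.amplitude_to_db(np.abs(C), ref=np.max)    # convert magnitude to dB

fig, ax = plt.subplots()
librosa.display.specshow(C_db, sr=sr, x_axis="time", y_axis="cqt_hz", ax=ax)
ax.set_axis_off()                                        # keep only the image itself
fig.savefig("heartbeat_hcqt.png", bbox_inches="tight", pad_inches=0)
plt.close(fig)
```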

E. PHONOCARDIOGRAM SIGNAL DATABASE
The authors have used the freely available open-access dataset on Kaggle [33], originating from the PASCAL heart sounds classification challenge. Two datasets named A and B were generated through this challenge [16]. Dataset A contains variable-length sounds (varying from 1 to 30 seconds) recorded through a digital stethoscope in real-life situations with background noise. Dataset A was partitioned into four classes, named normal, extra heart sound, murmur, and artifact, while dataset B was partitioned into three classes: normal, extra-systole, and murmur. In this work, the authors have merged both datasets into a single dataset consisting of all five classes.
The number of phonocardiogram signals in the normal, murmur, artifact, extra-systole, and extrahls classes is 255, 114, 40, 37, and 16, respectively. Since the number of heartbeat signals in each class is very low, audio augmentation is performed on the raw audio signals: noise injection, time shifting, pitch variation, and speed variation are applied to generate augmented phonocardiogram signals. After audio augmentation, the number of phonocardiogram signals in the normal, murmur, artifact, extra-systole, and extrahls classes is 2555, 1146, 400, 378, and 158, respectively. The augmented dataset is partitioned into training and testing datasets with an 80:20 ratio. A spectrogram represents the PCG signal waves, as shown in Fig. (4-8), which present the HCQT spectrograms for the artifact, extrahls, extra-systole, murmur, and normal classes, in that order. Red shades depict the amplitude of a PCG signal in a spectrogram. The spectrogram of a normal PCG signal shows a strong regular sequence of amplitudes, i.e., lub dub. The murmur PCG signal displays a noisy amplitude sequence greater than that of the normal and extra-systole PCG signals. In the extra-systole PCG signal, the amplitude is greater than in the normal PCG signal but less than in the murmur PCG signal.
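A minimal sketch of the four augmentations applied here, using NumPy and Librosa; the amounts (noise level, shift duration, pitch steps, speed factor) are illustrative assumptions:

```python
import numpy as np
import librosa

def augment(y, sr):
    """Return four augmented variants of a raw PCG signal."""
    noisy   = y + 0.005 * np.random.randn(len(y))               # noise injection
    shifted = np.roll(y, int(0.2 * sr))                         # time shift by 0.2 s
    pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # pitch variation
    sped    = librosa.effects.time_stretch(y, rate=1.1)         # speed variation
    return noisy, shifted, pitched, sped
```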

IV. EXPERIMENT & RESULTS
Four separate ConvNet models, termed ConvNet-MFCC, ConvNet-CQT, ConvNet-VQT, and ConvNet-HCQT, are designed with MFCC, CQT, VQT, and HCQT spectrograms, respectively. To build the proposed ConvNet models, Keras, an open-source Python library that can run on top of machine learning backends such as TensorFlow, has been used. In addition, the Librosa library in Python is used for generating the MFCC, CQT, VQT, and HCQT spectrograms.
The ConvNet models used in this phonocardiogram signal classification work have four convolutional layers. The first convolutional layer has 32 filters of size 5 × 5, the second has 64 filters of size 5 × 5, the third has 64 filters of size 5 × 5, and the last has 32 filters of size 5 × 5. A subsampling layer using max pooling follows each of the first two convolutional layers; these max-pooling layers have a size of 2 × 2 with a stride of 2 × 2. The final layer of the ConvNet model is a fully connected layer of five units with a softmax activation function; these five units correspond to the five classes of the phonocardiogram signal classification problem.
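A Keras sketch of this architecture is given below. The ReLU activations and the placement of the two dropout layers (whose 0.4 rate and purpose are described in the next paragraph) are our assumptions, and the 128 × 130 input shape is taken from the MFCC spectrogram size stated there:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(128, 130, 1)),                       # spectrogram image input
    layers.Conv2D(32, (5, 5), activation="relu"),            # 32 filters, 5 x 5
    layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
    layers.Conv2D(64, (5, 5), activation="relu"),            # 64 filters, 5 x 5
    layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),
    layers.Conv2D(64, (5, 5), activation="relu"),            # 64 filters, 5 x 5
    layers.Dropout(0.4),                                     # placement assumed
    layers.Conv2D(32, (5, 5), activation="relu"),            # 32 filters, 5 x 5
    layers.Dropout(0.4),                                     # placement assumed
    layers.Flatten(),
    layers.Dense(5, activation="softmax"),                   # five PCG classes
])
```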
Additionally, two dropout layers with a 0.4 drop rate are used to avoid overfitting. The size of the MFCC spectrogram images is 128 × 130. After design, the model is compiled with the Adam gradient-based optimizer and a cross-entropy loss to calculate the prediction error; a learning rate of 0.0001 is used. The optimizer uses backpropagation to update the weights of the neurons: it computes the derivative of the loss function with respect to each weight and subtracts it from the weight. A categorical cross-entropy loss function is utilized due to the multi-class nature of the problem, which has the form given by (7):

$$L = -\frac{1}{N}\sum_{i=1}^{N} \log\left(\frac{e^{W_{y_i}^{T} x_i + b_{y_i}}}{\sum_{j} e^{W_{j}^{T} x_i + b_j}}\right) \tag{7}$$

where $W$ is the weight matrix, $x_i$ is the $i$-th training sample, $y_i$ is the class label for the $i$-th training sample, $b$ is the bias term, $N$ is the sample count, and $W_j$ and $W_{y_i}$ are the $j$-th and $y_i$-th columns of $W$. Training uses 300 epochs with a batch size of 128.

Fig. (9-12) shows the accuracy and loss curves for the training and test sets during the training of the ConvNet models. The shape and dynamics of these learning curves are studied to diagnose the behavior of a ConvNet model; the three common dynamics observed in learning curves are under-fitting, over-fitting, and optimal fitting. From these plots, it can be verified that the ConvNet-HCQT model offers an optimal fit in comparison to the other models.

Fig. 13 presents the results of these experiments in terms of confusion matrices. A confusion matrix is an N × N matrix in which the rows represent the true categories and the columns represent the categories assigned by the model. The number $n_{i,j}$ at the intersection of the i-th row and j-th column equals the number of cases from the i-th phonocardiogram signal class that have been categorized as belonging to the j-th phonocardiogram signal class. It is extremely useful for measuring precision, recall, F-score, accuracy, and, most importantly, the AUC-ROC curve. All these performance metrics are computed and presented in the next section to compare all four models.
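Continuing the architecture sketch above, the compilation and training settings described here, followed by computation of the confusion matrix, might be expressed as follows (x_train, y_train, x_test, y_test are placeholder arrays of spectrogram images and one-hot labels):

```python
from tensorflow import keras
from sklearn.metrics import confusion_matrix

# Adam with learning rate 0.0001 and categorical cross-entropy, as described.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
history = model.fit(x_train, y_train, epochs=300, batch_size=128,
                    validation_data=(x_test, y_test))

y_pred = model.predict(x_test).argmax(axis=1)   # predicted class indices
y_true = y_test.argmax(axis=1)                  # true class indices
cm = confusion_matrix(y_true, y_pred)           # rows: true, columns: predicted
print(cm)
```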
$$P = \frac{T_+}{T_+ + F_+}, \quad S = \frac{T_+}{T_+ + F_-}, \quad Accuracy = \frac{T_+ + T_-}{T_+ + T_- + F_+ + F_-}, \quad F\text{-}Score = \frac{2 \cdot P \cdot S}{P + S}$$

where $T_+$, $T_-$, $F_+$, and $F_-$ are the truly predicted positive cases, truly predicted negative cases, false-positive cases, and false-negative cases, respectively, $P$ is the precision, and $S$ is the sensitivity (recall). The results in terms of the above performance measures are given in Table 2 and clearly show that ConvNet-HCQT beats the other models. The average accuracy achieved using HCQT is 96%, whereas it is 93%, 94%, and 94% for the ConvNet-MFCC, ConvNet-CQT, and ConvNet-VQT models, respectively. The performance of the ConvNet-CQT and ConvNet-VQT models is the same, but superior to that of ConvNet-MFCC; MFCC features have been widely used in the past for heartbeat sound classification. In comparison to earlier work, the experimental results show that the proposed strategy achieves good outcomes: the proposed method outperforms the PhysioNet/Computing in Cardiology Challenge 2016's reported best accuracy of 0.86 for normal/abnormal binary classification. Table 3 provides the overall accuracies of the best models presented in the PhysioNet/Computing in Cardiology challenge [35]; the accuracies of these models are considerably inferior to those of the proposed multi-class classification model using HCQT.
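These per-class metrics can be computed directly from the placeholder predictions of the previous sketch with scikit-learn; the class-name order passed here is our assumption:

```python
from sklearn.metrics import accuracy_score, classification_report

# Per-class precision (P), recall/sensitivity (S), and F-score, plus accuracy.
names = ["normal", "murmur", "artifact", "extra-systole", "extrahls"]
print(classification_report(y_true, y_pred, target_names=names))
print("accuracy:", accuracy_score(y_true, y_pred))
```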
To further confirm the robustness of these phonocardiogram signal classification models, ROC curves are plotted in Fig. 14, with the false-positive rate on the x-axis and the true-positive rate on the y-axis. The top-left corner of the plot is the ''perfect'' point, with a true-positive rate of one and a false-positive rate of zero, which implies that a larger AUC is generally better [36]. It is evident from the ROC curves that the ConvNet-HCQT model performs better than the other models, which the AUC values of these ROC plots confirm. For the MFCC-based ConvNet model, the micro-average and macro-average areas are 1.00 and 0.99, respectively; with the HCQT-based ConvNet model, the macro-average area improves slightly to 1.00. The areas under the curve for the artifact, extrahls, extra-systole, murmur, and normal classes are 1.00, 1.00, 0.99, 1.00, and 0.99, respectively. It can be noticed that the AUC improves slightly for these classes with HCQT-based features in comparison to the other features.
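A sketch of computing the per-class and micro-average ROC/AUC values with scikit-learn, continuing the placeholder names from the earlier sketches (y_score is the model's softmax output on the test set):

```python
import numpy as np
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize

y_score = model.predict(x_test)                      # softmax probabilities
y_bin = label_binarize(y_true, classes=np.arange(5)) # one-vs-rest binarization
for k in range(5):
    fpr, tpr, _ = roc_curve(y_bin[:, k], y_score[:, k])
    print(f"class {k}: AUC = {auc(fpr, tpr):.2f}")
# Micro-average: pool all class decisions into a single ROC curve.
fpr_mi, tpr_mi, _ = roc_curve(y_bin.ravel(), y_score.ravel())
print(f"micro-average AUC = {auc(fpr_mi, tpr_mi):.2f}")
```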
V. DISCUSSION
Commonly used time-frequency transformations and features such as the DFT, DWT, and MFCC have extensively supported various acoustic recognition systems. Though they are suitable for most acoustic analyses, they are not customized to any particular problem, so it may be valuable to investigate features from other time-frequency transformations such as the CQT, VQT, and HCQT. The CQT is a dominant feature in acoustic signal processing: it transforms a time-domain signal into the frequency domain, is similar to the Short-Time Fourier Transform (STFT), and is almost identical to the complex Morlet wavelet transform. The hybrid CQT is a more computationally efficient version of the CQT: it utilizes the pseudo-CQT for the higher frequencies, where the hop length is larger than half the filter length, and the full CQT for the lower frequencies. The findings of the experiments show that the HCQT is more effective than the traditional CQT and the variable-Q transform.
In this study, an effort is made to suggest the best acoustic features for phonocardiogram signal classification. Fig. 15 presents the comparison of the HCQT-based ConvNet model with the others. The results prove that HCQT outperforms the other time-frequency acoustic features. It would be fascinating to investigate a larger number of architectural configurations and filter banks, as well as hyperparameter sets, in the future.

VI. CONCLUSION
Diagnosis at an early stage is the only way to decrease the mortality rate due to CVD. However, a lack of awareness of routine health checkups and the unavailability of resources at low cost are major hurdles in the early diagnosis of CVD. The situation worsens in developing countries, where population density is high and doctors are not available in remote locations. To address these issues, the authors have offered the design of a decision support system that utilizes PCG signals for the early diagnosis of CVD. PCG signals can be captured by a small, low-cost handheld device called a stethoscope. In this work, a multi-class phonocardiogram signal database with five classes, namely extra heart sound, artifact, extra-systole, normal, and murmur heartbeats, is used to design the phonocardiogram signal classification model. The authors have designed a PCG signal classification model with a new acoustic feature, the HCQT, which is formed by combining two CQTs of dissimilar resolutions to treat the high-frequency bins of the conventional CQT. Analysis of the results proves that the HCQT is superior to generally applied acoustic features like MFCC, CQT, and VQT. Through the proposed work, the authors have achieved an accuracy of 96% in the multi-class classification of PCG signals.
In future work, the authors plan to ensemble multiple spectrograms to obtain more discriminative stacked features. Classification accuracy may be further improved by using other deep learning architectures such as Recurrent Neural Networks (RNNs). Moreover, the authors also plan to combine ECG signals with PCG signals to design a multi-modality model using these acoustic features.