Conv-Random Forest-Based IoT: A Deep Learning Model Based on CNN and Random Forest for Classification and Analysis of Valvular Heart Diseases

Cardiovascular diseases are growing rapidly across the world; around 70% of the world's population suffers from them. The research work presented here is grouped into the classification and analysis of heart sounds. We define a new squeeze-network-based deep learning model, convolutional random forest (Conv-RF), for real-time valvular heart sound classification and analysis on an industrial Raspberry Pi 4B. The proposed electronic stethoscope is Internet enabled using an ESP32 and a Raspberry Pi. This Internet of Things (IoT)-based model is also low cost and portable, and can reach distant remote places where doctors are not available. For the classification part, multiclass classification is performed for seven types of valvular heart sounds. Among the ensemble methods evaluated, the RF classifier scored good accuracy on small training sets. The CNN-based squeeze net model achieved a decent accuracy of 98.65% after its hyperparameters were optimized for heart sound analysis. The proposed IoT-based model overcomes the drawbacks faced individually by the squeeze network and the RF: combining the CNN-based squeeze net model with the RF classifier improved classification accuracy. The squeeze net model plays a pivotal part in feature extraction from heart sounds, and the RF classifier predicts class labels in the class prediction layer. Experimental results on several datasets, namely the Kaggle dataset, the PhysioNet challenge, and the Pascal challenge, showed that the Conv-RF model works best. The proposed IoT-based Conv-RF model is also applied to selected subjects of different age groups and genders having a history of heart disease. The Conv-RF method scored an accuracy of 99.37 ± 0.05% on the different test datasets, with a sensitivity of 99.5 ± 0.12% and a specificity of 98.9 ± 0.03%. The proposed model is also compared with current state-of-the-art models in terms of accuracy.


I. INTRODUCTION
Deep learning finds many applications in natural language processing (NLP), speech recognition, pattern recognition, image analysis, and medical image diagnosis. In medical image diagnosis, heart sound diagnosis and early screening prove to be very effective, and challenging as well. Sparse networks find better application in heart sound analysis, as they are better suited to learning local features in a heart sound than deep neural networks. Deep networks generally have many added layers, which, in turn, makes their time complexity higher. We defined our work based on the following two hypotheses.
1) A squeeze-network-based CNN model is used as the feature learning part, which scored very high accuracy on heart sound analysis problems.
2) The random forest (RF) algorithm is used for the classification part, which attained better accuracy in classifying heart sound disorders than single decision tree (DT) and gradient boosting (GB) methods.
In classification problems, ensemble methods are primarily used and are more suitable than other supervised machine learning algorithms. The objective of this research work is to develop an Internet-enabled, low-cost, portable, and accurate Raspberry Pi-based electronic stethoscope system. The proposed stethoscope system can be applied to subjects even in remote places for the measurement and analysis of valvular heart sounds. The said stethoscope is also ear contactless, since auscultation can be done through a Bluetooth-connected microspeaker.
Mainly, ensemble methods are of three types: bagging, stacking, and boosting.
In the bagging algorithm, the input training set is subdivided into different random samples, and each sample is fed to a DT. The results of the DTs are combined through voting, and finally the output is generated, as presented in Fig. 2.
In the stacking ensemble method, the input training set is fed to different models, and their predictions are combined to produce the final predicted output from the ensemble model, as shown in Fig. 2. In the boosting algorithm, the input training set is fed to different models sequentially; the effectiveness of the next model is improved using the previous model's output prediction. Finally, all the predicted outputs from the models are combined to produce the final predicted output, as given in Fig. 3. Ensemble machine learning methods like RF, DT, GB, and extreme GB (XGB) have found a special place in classification problems, as they achieve a reasonable amount of accuracy on large training datasets while keeping training time low.
Supervised machine learning methods like Naïve Bayes (NB), support vector machine (SVM), and multilayer perceptron (MLP) are also used in classification problems, but they are not suitable for large training datasets.
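As a concrete illustration of this comparison, the sketch below trains a single DT, an RF, and a GB classifier on a small synthetic multiclass dataset, assuming scikit-learn is available. The dataset size and hyperparameters are placeholders for illustration, not the paper's experimental setup.

```python
# Sketch: comparing ensemble classifiers on a synthetic multiclass dataset.
# Assumes scikit-learn; data and parameters are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# synthetic stand-in for extracted heart sound features (5 classes)
X, y = make_classification(n_samples=500, n_features=20, n_classes=5,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15, random_state=0)

for name, clf in [("DT", DecisionTreeClassifier(random_state=0)),
                  ("RF", RandomForestClassifier(n_estimators=100, random_state=0)),
                  ("GB", GradientBoostingClassifier(random_state=0))]:
    clf.fit(X_tr, y_tr)
    print(name, round(accuracy_score(y_te, clf.predict(X_te)), 3))
```

On data like this, the RF typically outperforms the single DT because averaging many decorrelated trees reduces overfitting.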

II. PAPER ORGANIZATION
Section III provides a literature study of different deep learning methods used in heart sound analysis. Section IV briefs the methods and materials used in this research paper. Section V highlights the result analysis of the research work. Eventually, Section VI summarizes the conclusions and future scope of the research.

III. LITERATURE REVIEW
In this section, a detailed discussion on heart sound analysis for classification is carried out. As per the literature survey, many research works have been carried out in this area. Dwivedi et al. [1] did a study on methods for automation in heart sound analysis and classification. Many researchers have worked on various techniques for the classification of heart sounds. Mishra et al. [2] researched identifying the basic heart sound segments by applying the CNN method; this study has limited scope for the analysis of S3 heart sounds. Mishra et al. [3] derived a novel method for the separation of heart and lung sounds of the phonocardiogram signal for the identification of the S3 heart sound; however, this method has a few limitations in real-time applications. Muduli et al. [4] derived a novel algorithm for the extraction of biomedical signals from noisy measurements by sparse recovery analysis, and Barma et al. [5] studied S2 heart sounds by calculating the time period and the energy of normalized cardiac sounds; however, they could not distinguish heart sounds. Dewangan et al. [6] and Mishra et al. [7] did research work on heart sound analysis using the wavelet transform algorithm, which has specific restrictions in online PCG signal analysis. Othman and Khaleel [8] worked on PCG signal analysis using the Shannon energy envelope and DWT features, but could not distinguish the heart sounds accurately. Lubaib and Muneer [9] worked on heart defect analysis and classification using pattern recognition, but the adopted technique is entirely based on echo imaging. Singh and Cheema [10] worked on PCG signal analysis using classification by feature extraction, but the method has limited scope in real-time applications. Ahmad et al. [11] worked on heart sound analysis using a soft-computing fuzzy classifier model based on Mamdani-type fuzzy computation; the authors worked on a standard cardiac sound repository using an offline method that was not validated with human subjects. Roy et al. [12] reviewed the papers on discrimination of cardiac signals. Gupta et al. [13] researched the various stages in cardiac sound signal analysis. However, there has been no significant work on designing a real-time screening device for valvular diseases. Apart from these works, a few other current research works related to CNN-based PCG signal analysis and classification are highlighted in Table 1.

IV. METHODS AND MATERIALS

A. CARDIAC SOUND BANK DESCRIPTION
Heart sound samples [50] used for the valvular heart sound analysis have been considered from four heart sound repositories, as in [50] and [51].
1) https://github.com/yaseen21khan/Classification-of-Heart-Sound-Signal-Using-Multiple-Features [15], [17], [50]. A brief explanation of this cardiac sound bank is shown as Heart Sound Dataset 1 in Table 2. Five types of cardiac sound samples are considered, namely, NS, MR, MS, MVP, and AS. Every cardiac sound lasts for a duration of 5-10 s, has a sampling frequency of 44 000 Hz, and has a bandwidth of 65-500 Hz.
2) The Kaggle dataset [49], [50], [51] is also used for the valvular heart sound analysis. Kaggle's heart sound repository contains a collection of NS data and heart sound data containing murmurs. Cardiac Sound Dataset 2, as mentioned in Table 3, is obtained from the Pascal Heart Sound Repository-Dataset B [15], [36], [50]. The cardiac sound samples last for a duration of 3-9 s, have a sampling frequency of 45 000 Hz, and have a bandwidth of 68-512 Hz.
Table 4 highlights a detailed description of the PhysioNet Challenge training set [16], [37], [50], which contains six training databases (A through F) comprising a collection of 3128 cardiac sound samples. The cardiac sound samples last for a duration of 4-8 s, have a sampling frequency of 47 000 Hz, and contain a bandwidth of 72-492 Hz.

B. METHODOLOGY ADOPTED IN THE PCG SIGNAL ANALYSIS
The methodology used in the cardiac sound analysis is described as follows.
Fig. 4 describes the schematic diagram of heart sound signal classification [22], [30], [50]. The preprocessing block comprises normalization and filtering operations. The normalized heart sound goes to the filtering block, where a bandpass filter of bandwidth 70-450 Hz removes unwanted background noise. A time frame of 3 s is considered for every heart sound. The various in-depth features are captured from the preprocessed signal s(t), and eventually, categorization of the cardiac sound signal is done for validation of the adopted model. The heart sound samples [50] are split into training data (85%) and test data (15%). The training samples are then further split, with 15% held out as validation data and the remainder used for training the software model. Fig. 5(a) highlights the block diagram of the display unit, and Fig. 5(b) is the experimental setup of the hardware system, which uses a conventional stethoscope chest piece, preamplifiers, filters, a 7-inch touchscreen LCD, and an industrial Raspberry Pi 4B.
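The preprocessing steps above (normalization, 70-450 Hz band-pass filtering, 3 s framing) can be sketched as follows. The sampling rate, filter order, and input signal here are illustrative assumptions, not the authors' recording parameters.

```python
# Sketch of the described preprocessing: normalization, 70-450 Hz band-pass
# filtering, and segmentation into 3 s frames. FS and the filter order are
# assumed placeholders; the input is a dummy signal, not a real PCG record.
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 4000            # assumed sampling rate (Hz)
LOW, HIGH = 70, 450  # band-pass edges taken from the text (Hz)

def preprocess(pcg, fs=FS, frame_sec=3):
    pcg = pcg / (np.max(np.abs(pcg)) + 1e-12)                   # normalization
    sos = butter(4, [LOW, HIGH], btype="bandpass", fs=fs, output="sos")
    filtered = sosfiltfilt(sos, pcg)                             # zero-phase filter
    n = fs * frame_sec                                           # samples per frame
    frames = [filtered[i:i + n] for i in range(0, len(filtered) - n + 1, n)]
    return np.array(frames)

frames = preprocess(np.random.randn(FS * 10))   # 10 s of dummy signal
print(frames.shape)                             # (3, 12000): three 3 s frames
```

Zero-phase filtering (`sosfiltfilt`) is chosen here so the filter does not shift the timing of S1/S2 events within a frame.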
Fig. 6 provides the schematic representation of the proposed hardware model. The PCG signal is captured through the PCG signal acquisition module, which contains a stethoscope chest piece, preamplifiers, filters, and buffer amplifier circuits, followed by the industrial Raspberry Pi 4B as the processing and display unit. Fig. 7 is the block diagram of the proposed electronic stethoscope, which contains an input sensor unit and a computational unit. The real-time PCG signal is captured with a chest piece and microphone. The captured PCG signal goes through a preamplifier of gain 20, followed by a 50-Hz notch filter to reject electrical interference. The processed signal is fed to an analog tunable band-pass filter with a 32-472 Hz passband; the PCG signal [51], [52] generally lies within 32-472 Hz for both normal and abnormal sounds.
A unity-gain buffer amplifier is used for impedance matching. The signal-conditioned output goes to the node MCU (ESP32), which is WiFi enabled and contains a 12-bit ADC with a sampling frequency of 44.1 kHz. Finally, the converted digital PCG signal [19], [33] goes to the Raspberry Pi over WiFi for further signal processing and analysis. The classified heart sound is displayed on a 7-inch LCD screen attached to the Raspberry Pi and heard through a Bluetooth-enabled microspeaker.
Preprocessing Unit: Preamplifier, notch filter, bandpass filter, and unity-gain buffer.
Frequency-Domain Features: DWT.
Classification Unit: The proposed CNN model, built in Python 3.9.2, stored and executed on the Raspberry Pi 4B.
The valvular cardiac samples used for the study of PCG signal analysis are broadly categorized into the following classes:
1) NS; 2) MS; 3) AS; 4) MR; 5) AR; 6) MVP; 7) EXT.
Acoustic stethoscopes based on sensors such as diaphragms and piezoelectric crystals work on the principle of converting sound pressure into electrical energy. They suffer from distortion in the output electrical signal, so electronic stethoscopes have been developed that incorporate capacitive electret microphone sensors for better efficiency and stability, as given in Fig. 7.
Message queuing telemetry transport (MQTT) is a standard protocol used for sending the processed heart sound data from the ESP32 (MQTT client) to the Raspberry Pi (MQTT broker) through a WiFi connection. The MQTT broker and subscriber are on the same device (the Raspberry Pi). The input sensor unit and computational unit are connected through WiFi and are Internet enabled, as shown in Fig. 8. Thus, they form an Internet of Things (IoT) system.
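The MQTT transport itself (e.g., a publish/subscribe pair built with a client library such as paho-mqtt on a topic like `pcg/raw`) is not detailed in the paper, so the sketch below shows only a hypothetical payload scheme for the ESP32-to-Raspberry Pi link: 12-bit ADC samples packed into a compact message body on the client side and unpacked on the broker side. The encoding is an assumption for illustration, not the authors' wire format.

```python
# Hypothetical MQTT payload scheme for the ESP32 -> Raspberry Pi link.
# 12-bit ADC readings (0..4095) are packed as little-endian uint16 values;
# the subscriber on the Raspberry Pi reverses the packing. Stdlib only.
import struct

def pack_samples(samples):
    """Client side: pack 12-bit ADC readings into a message payload."""
    assert all(0 <= s < 4096 for s in samples)          # 12-bit range check
    return struct.pack(f"<{len(samples)}H", *samples)   # little-endian uint16

def unpack_samples(payload):
    """Broker side: decode the payload back to a list of ADC readings."""
    return list(struct.unpack(f"<{len(payload) // 2}H", payload))

msg = pack_samples([0, 2048, 4095])
print(unpack_samples(msg))   # [0, 2048, 4095]
```

Packing two bytes per sample keeps a 3 s frame at 44.1 kHz to roughly 265 kB, which is comfortably within typical MQTT message limits.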
Features [13] of the cardiac sound used for the overall analysis are as follows.

C. PROPOSED CONV-RANDOM FOREST LEARNING ALGORITHM
Let X = {(x_j, y_j); 1 ≤ j ≤ T}, where T denotes the length of the training dataset and y_j denotes the class label of the vector x_j. The proposed method for the Conv-RF is explained as follows.
2) If required, pad the N items of every training element, x_j, with zeros so that each sample can be framed as a matrix of shape √N × √N. 4) Fix the metrics of the convolutional stage as: a) the total number of conv layers, L; b) the output depth, Z; c) the filter sizes, K(l), in every layer; and d) the filter strides, s_k(l). 5) In every conv layer l, a convolution and an additive bias are applied to the input for each feature map f ∈ {1, . . ., f(l)}. Thus, the output γ_i(l) of the lth layer for the ith feature map is computed from the output of the previous layer, γ_i(l − 1). For every layer l in 1 . . . L, the convolutions producing γ_i(l) are

γ_i(l) = ∅( Σ_j k_ij(l) ∗ γ_j(l − 1) + B_i(l) )

where ∅ denotes the ReLU activation function, B_i(l) is a bias matrix, and k_ij(l) is the filter connecting the jth feature map of layer l − 1 to the ith feature map of layer l. The time complexity of the RF classifier plays an important part in determining the total floating-point operations (FLOPs) and the number of trainable parameters in the proposed model.
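The per-layer computation described above (convolve each input map, sum, add a bias, apply ReLU) can be sketched in plain numpy. Shapes and values here are illustrative, not the network's actual filter configuration.

```python
# Minimal numpy sketch of one conv layer: for each output map i,
# gamma_i(l) = ReLU( sum_j conv(gamma_j(l-1), k_ij(l)) + B_i(l) ).
# Uses "valid" 2-D convolution (cross-correlation form), stride 1.
import numpy as np

def conv2d_valid(x, k):
    """Single-channel valid convolution (cross-correlation form)."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(x[r:r + kh, c:c + kw] * k)
    return out

def conv_layer(prev_maps, filters, biases):
    """prev_maps: list of gamma_j(l-1); filters[i][j] = k_ij(l); biases[i] = B_i(l)."""
    relu = lambda z: np.maximum(z, 0.0)
    return [relu(sum(conv2d_valid(g, k) for g, k in zip(prev_maps, krow)) + b)
            for krow, b in zip(filters, biases)]

x = [np.arange(16.0).reshape(4, 4)]                       # one 4x4 input map
maps = conv_layer(x, filters=[[np.ones((3, 3))]], biases=[0.0])
print(maps[0].shape)   # (2, 2): valid 3x3 conv of a 4x4 input
```

The nested loops make the windowed sum explicit; a real implementation would use a vectorized convolution routine instead.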

D. PROPOSED CNN-BASED SQUEEZE NETWORK
A CNN-based Squeeze Net [22], [31] contains many intermediate layers and fire modules between the input and output nodes. The said CNN model can work with any real-world problem having a large amount of data. Neural networks [8], [17], [47] are very helpful in providing answers to challenges witnessed in real life. A deep network reacts to the inputs provided, performs difficult computations on them, and eventually generates output. Backpropagation is the working algorithm for training these deep learning models [23], [30]. The skeleton structure of the convolutional neural network-based deep learning network is provided in Fig. 10, and the model summary is presented after training and validation of the dataset through this proposed model. Fig. 9(a) is the block diagram of a fire block used in a squeeze network, which combines a squeeze filter and an expand filter; the outputs from both filters are concatenated. Fig. 9(b) provides the skeleton of the squeeze filter and expand filter used. The squeeze filter contains three 1×1 convolutions, whereas the expand filter comprises four 1×1 convolutions and four 3×3 convolutions. Basically, it is a sparsely connected network having a maxpool and multiple convolutions of kernel sizes 1, 3, and 5 at the same layer, followed by a concatenation operation over all filter outputs.
Table 5 provides the proposed Squeeze Net architecture using five fire blocks, two convolutional layers with a ReLU activation function, two maxpool layers, an input layer, and an output layer with a softmax activation function.
Fig. 10 provides the entire architecture of the proposed Squeeze Net model used for the valvular heart sound analysis. In this model, two convolutional layers, three maxpool layers, and five fire blocks are used, followed by one global average pool layer and an output layer with a softmax activation function.
The third and fourth hidden layers have fire block modules, followed by a maxpool layer as the fifth layer. The sixth and seventh hidden layers contain fire block modules, again followed by a maxpool layer and a fire block module. The second conv layer is followed by the global average pool layer. The output layer comprises five nodes using the softmax activation function to categorize five classes of cardiac sounds [24], [29].
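The fire block structure described above (a 1×1 "squeeze" stage feeding parallel 1×1 and 3×3 "expand" stages whose outputs are concatenated) can be sketched at the shape level in numpy. The channel widths below are illustrative, not the exact filter counts of the authors' network.

```python
# Shape-level numpy sketch of a fire block: 1x1 squeeze convolution, then
# parallel 1x1 and 3x3 expand convolutions concatenated on the channel axis.
# Weights are random; channel widths are illustrative assumptions.
import numpy as np

def conv1x1(x, w):                        # x: (H, W, Cin), w: (Cin, Cout)
    return np.maximum(x @ w, 0.0)         # pointwise convolution + ReLU

def conv3x3_same(x, w):                   # w: (3, 3, Cin, Cout), "same" padding
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    H, W, _ = x.shape
    out = np.zeros((H, W, w.shape[-1]))
    for r in range(H):
        for c in range(W):
            patch = xp[r:r + 3, c:c + 3, :]
            out[r, c] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return np.maximum(out, 0.0)           # ReLU

def fire_block(x, s, e1, e3):             # squeeze / expand channel widths
    cin = x.shape[-1]
    sq = conv1x1(x, np.random.randn(cin, s) * 0.1)          # squeeze stage
    a = conv1x1(sq, np.random.randn(s, e1) * 0.1)           # 1x1 expand
    b = conv3x3_same(sq, np.random.randn(3, 3, s, e3) * 0.1)  # 3x3 expand
    return np.concatenate([a, b], axis=-1)                  # channel concat

out = fire_block(np.random.randn(8, 8, 16), s=4, e1=8, e3=8)
print(out.shape)   # (8, 8, 16): e1 + e3 output channels
```

The squeeze stage is what keeps the parameter count low: the expensive 3×3 filters operate on the reduced channel width `s` rather than on the full input depth.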
Characteristic plots are taken for the proposed system with normal and abnormal cardiac sound samples, and these are provided in Figs. 11 and 12.
Fig. 11 presents the cross-entropy loss of the adopted squeeze network during the training and validation stages applied on dataset 1. The curve of cross-entropy loss against the number of epochs (i.e., 100) shows that the loss reduces as the number of epochs grows. Table 6 shows that the proposed CNN-based squeeze network achieved an accuracy of 98.65%.

E. RANDOM FOREST ALGORITHM
RF is an established and popular supervised ensemble machine learning method used mainly in real-time classification-based fields [22], [25], [50]. This method is also tested here for its effectiveness in classifying normal and abnormal cardiac samples.
In Fig. 13, a cardiac sound repository is split into training and test data. The training data is then fed to 800 estimators (decision trees) for producing predictions. The eventual predicted result for the valvular cardiac sound [50] is generated via the mean of all the predictions obtained from the estimators. The effectiveness of the RF system is evaluated, and various parameters are selected for the computation of its efficiency, as provided in Table 7. In Section V, the results of Figs. 20 and 21 show that the model categorizes the cardiac samples more effectively and with better accuracy than the other classifiers.
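The RF configuration described above might be sketched as follows, assuming scikit-learn; the feature vectors here are random stand-ins for the extracted heart sound features, and only the estimator count (800) is taken from the text.

```python
# Sketch of the described RF setup: 800 estimators, predictions aggregated
# across the trees. Assumes scikit-learn; features are synthetic stand-ins
# for the extracted heart sound features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 32))      # placeholder feature vectors
y = rng.integers(0, 2, size=300)    # normal vs abnormal labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15, random_state=0)
rf = RandomForestClassifier(n_estimators=800, random_state=0).fit(X_tr, y_tr)
print(rf.predict(X_te).shape)       # one predicted label per test sample
```

For class labels, scikit-learn's forest aggregates by averaging per-tree class probabilities, which is the "mean of all the predictions" behavior described above.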

F. ARCHITECTURE OF THE ADOPTED CONV-RANDOM FOREST
The proposed Conv-RF model is given in Fig. 15. It contains six layers: 1) input layer; 2) data preprocessing layer; 3) conv layers; 4) reshape layer; 5) class prediction layer; and 6) output layer.
All layers are grouped into two sections: one for feature extraction and another for the prediction of class labels. Each layer has a different function.
Feature Extraction Section: This section extracts the important information from the data samples undergoing training. It contains three layers: 1) the input layer; 2) the data preprocessing layer; and 3) the conv layers. The prediction accuracy of the proposed system relies on efficient feature learning.
Details of each layer are explained as follows.
1) Input Layer: The input layer takes input from the standard heart sound samples in the proposed model. It is considered that a training dataset, X, contains a set of tuples (x_j, y_j), where j denotes the index into the dataset, x_j is a √N × √N feature matrix, and y_j is the class label for the vector x_j. If the training set is in the mentioned shape, it is passed directly to the conv layers of the feature extraction stage; otherwise, it is transformed in the data preprocessing layer.
2) Data Preprocessing Layer: In the data preprocessing layer, a square matrix shape is assumed for the tensors in the convolution block. If the input data is not in matrix shape, with size √N × √N, it is converted by adding zeros as required. Various data types are also converted in this layer.
3) Convolutional Layer: The convolutional layers play a significant role in the proposed model. They are responsible for learning the features of the input by applying the convolution method of the squeeze network. The data is a tensor of shape (√N, √N, z(l)), where z(l) is the total number of filters in layer l. The pooling layer transforms the convolution layer outcome: it downsamples the feature maps and restricts overfitting of the proposed model by replacing each rectangular window with its highest or mean value. For instance, if (γ_i^l)_{m,n} is an output of the earlier layer with activation function ∅, then P(·) is a pooling function that acts on γ_i^l by passing it through a pooling method with stride S_p and a w_p × h_p pooling window. Typically, pooling places windows at nonoverlapping positions in every feature map and keeps one item per window, so that the feature maps are subsampled. Two types of pooling are mainly used: 1) average pooling and 2) max pooling. In max pooling, the highest value of every window is kept. Hence, the output of the max-pooling function is

(P(γ_i^l))_{m,n} = max_{0 ≤ a < h_p, 0 ≤ b < w_p} (γ_i^l)_{m·S_p + a, n·S_p + b}

where the max is taken over the max-pooling window of the mentioned shape.
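The max-pooling operation described above (non-overlapping windows, keeping the maximum of each) can be sketched in a few lines of numpy; the 2×2 window size is an illustrative choice.

```python
# Numpy sketch of max pooling with non-overlapping w_p x h_p windows
# (stride equal to the window size); the maximum of each window is kept.
import numpy as np

def max_pool(fmap, wp=2, hp=2):
    H, W = fmap.shape
    out = np.zeros((H // hp, W // wp))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = fmap[r * hp:(r + 1) * hp, c * wp:(c + 1) * wp].max()
    return out

x = np.arange(16.0).reshape(4, 4)
print(max_pool(x))   # max of each 2x2 window: [[5, 7], [13, 15]]
```

Each output element depends only on a small local window, which is what makes the feature maps smaller while keeping the strongest local responses.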
Predicting the Class Labels: RF is an ensemble algorithm that is better than a single DT because it decreases overfitting by averaging the outcomes. It also handles large numbers of data elements better than a single DT.
Reshape Layer: This layer converts the convolutional layer tensor output into the required vector form.
Class Prediction Layer: The major application in this layer is to predict the class using RF.
Output Layer: The output layer receives the predicted class result, from which the accuracy of the model can be computed. Table 8 provides a summary of volunteers of different age groups and genders with past medical history.
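Putting the sections above together, the Conv-RF pipeline is: feature extraction, a reshape to flat vectors, and an RF class prediction layer. The sketch below follows that structure, but substitutes a fixed random projection with ReLU for the squeeze-net convolutions so it stays self-contained; data, labels, and dimensions are synthetic placeholders.

```python
# End-to-end sketch of the Conv-RF pipeline: feature extraction (here a fixed
# random projection + ReLU standing in for the squeeze-net convolutions),
# a reshape layer flattening tensors to vectors, and an RF prediction layer.
# Assumes scikit-learn; all data are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
frames = rng.normal(size=(200, 32, 32))    # sqrt(N) x sqrt(N) input matrices
labels = rng.integers(0, 5, size=200)      # five heart sound classes

# feature extraction stage (stand-in for the conv layers) + reshape layer
W = rng.normal(size=(32 * 32, 64)) * 0.05
feats = np.maximum(frames.reshape(200, -1) @ W, 0.0)   # ReLU features

# class prediction layer: RF on the extracted feature vectors
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(feats, labels)
print(rf.predict(feats[:3]).shape)         # one class label per input frame
```

In the actual model, `feats` would instead be taken from the penultimate layer of the trained squeeze network, with the RF replacing the softmax output layer.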
Table 9 highlights PCG signal analysis made on selected volunteers in different postures, such as sitting, standing, and supine with different locations like upper right sternal border (URSB), upper left sternal border (ULSB) position, and lower left sternal border (LLSB).
Table 10 is the analysis of PCG recordings done with the developed stethoscope on the selected volunteers. Table 11 is the architecture of the proposed Conv-RF model adopted in the valvular cardiac sound analysis.
Fig. 17 presents the accuracy of the said Conv-RF system during the training and validation stages applied on dataset 2. The curve of accuracy against the number of epochs shows that accuracy rises as the number of epochs grows for the adopted model. Fig. 18 presents the cross-entropy loss of the adopted Conv-RF model during the training and validation stages applied on dataset 2. The curve of cross-entropy loss against the number of epochs shows that the loss reduces as the number of epochs grows for the used CNN model. The electronic stethoscopes available in the market cost around €300-€399, whereas the total expenditure incurred for the development of the proposed system for predicting heart diseases is only around €220. Since auscultation is done through a Bluetooth-enabled speaker, the system is safe to use for patients as well as health professionals. The said stethoscope is also very easy to use, as it is AI enabled. A comparison study of the developed stethoscope with other stethoscopes, based on various factors like ease of use, price, safety and protection, and digital storage, is also provided.

V. RESULT ANALYSIS
Table 13 provides a comparison of the runtime (in seconds) of the Conv-RF model with the RF method and the Squeeze Net-based CNN model for the different datasets used in the heart sound analysis. The runtime environment of the proposed model is the Raspbian operating system with the Thonny IDE on the Raspberry Pi 4B, using Python version 3.9.2.
Table 14 provides a comparison of the runtime (in seconds) of the Conv-RF model with other ensemble learning methods: the DT, GB, and XGB methods.
The proposed model is also compared with SVM and MLP classifiers for the different datasets used in the valvular heart sound analysis.

VI. CONCLUSION AND FUTURE SCOPE
Experimental data showed that the Conv-RF model provides decent results in terms of accuracy, sensitivity, recall, and F1-score. It is also observed that the runtime (in seconds) of the Conv-RF method is the lowest among all methods for the different datasets used. The low runtime is very helpful for fast early screening of any kind of valvular cardiac disorder. The proposed modified squeeze network is highly compatible with the ensemble RF method in terms of their respective architectural breakdown. A few limitations of the proposed model remain: the ambient noise cannot be removed completely, and movement of the microphone chest piece cable generates noise during auscultation. As future scope, the auscultation time, which is currently around 2 min, requires further minimization, and more volunteers with clinical assessment are needed for statistical validation of the developed model.

VII. DISCUSSION
In the hardware development part, this work deals with the design of an Internet-enabled electronic stethoscope.The proposed electronic stethoscope is based on the combination of Raspberry Pi and ESP32.
As described in Section IV, the signal-conditioned output goes to the WiFi-enabled node MCU (ESP32), which digitizes the PCG signal with its 12-bit ADC at a 44.1-kHz sampling frequency. The digitized PCG signal is sent over WiFi to the Raspberry Pi for further signal processing and analysis; the classified heart sound is displayed on the 7-inch LCD screen attached to the Raspberry Pi and heard through a Bluetooth-enabled microspeaker.
In the software development part, a novel CNN-based convolutional-RF algorithm is developed.The squeeze network is used as a CNN model for feature extraction and the RF method is used as a classifier in the output part.The combination of both of them proved to be a decent valvular heart sound classification algorithm in this field.
by T. S. Roy et al. in 2023.

FIGURE 4. Methodology used in heart sound classification.

FIGURE 6. Schematic representation of the proposed electronic stethoscope system.

1) Proposed Conv-RF Method: The machine learning methods proposed to classify the heart sounds are as follows: a) RF; b) GB; c) XGB. All deep learning-based algorithms are written in Python ver. 3.9.2 using the Thonny Python editor (Linux). A brief description of the proposed algorithm is given under the software development of the proposed deep learning model.

FIGURE 7. Schematic of the input sensor unit and computational unit used in the proposed electronic stethoscope.

FIGURE 8.

7) Start with a new training sample set for the class prediction layer: X_new = {(γ_i(l), y_j); 1 ≤ j ≤ T}. 8) In the reshape layer, random data are drawn from the given dataset. 9) Next, the proposed method produces a DT for each considered random sample and obtains the predicted output from each formed DT. 10) Voting is carried out over every predicted outcome. 11) Eventually, the maximum-voted prediction is taken as the final predicted outcome. 12) Computing the time complexity of the classifier: the train-time complexity of the RF is O(d · log(d) · c · f), where c = count of DTs, d = count of sound samples in the training set, and f = count of features in the sound samples. The prediction-time complexity of the RF is O(depth of tree · c), with depth of tree = 5.
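Steps 9)-11) above (each tree predicts a label; the most-voted label wins) can be sketched with the standard library; the class labels are illustrative.

```python
# Sketch of steps 9)-11): each DT votes a class label, and the label with
# the most votes becomes the final prediction. Stdlib only; labels are
# illustrative heart sound classes.
from collections import Counter

def majority_vote(tree_predictions):
    """Return the label predicted by the most trees (step 11)."""
    return Counter(tree_predictions).most_common(1)[0][0]

print(majority_vote(["MS", "MS", "AS", "MR", "MS"]))   # MS
```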

FIGURE 10. Proposed CNN-based squeeze net system description.

FIGURE 11. Characteristics curve of loss versus epoch during training and validation in the CNN-based squeeze net.

TABLE 6. Description of squeeze network model effectiveness.

Fig. 14 provides the accuracy of the modified RF system during the training and validation stages [50]. The curve of accuracy versus training sample size depicts that the accuracy during the training and validation stages converges toward one as the training sample size grows for the improved RF model.

FIGURE 14. Characteristics curve of accuracy versus epoch in RF.

Fig. 15 is the architectural block diagram of the convolutional-RF model. The proposed Conv-RF model comprises a feature learning part and a class label prediction part.

FIGURE 15. Architectural block diagram of the proposed Conv-RF model.

FIGURE 16. Operational process flow diagram in the proposed Conv-RF model.

Fig. 16 shows the operational process flow diagram in the proposed Conv-RF model, which highlights the different tensor transformations in the input convolutional layer and the reshape layer. Table 11 is the architecture of the proposed Conv-RF model.

Fig. 21 is the comparison of the developed Conv-RF model with other CNN-based models for different datasets.

FIGURE 21. Conv-RF model versus other models for different datasets.

TABLE 8. Description of the volunteers [50], [51].

TABLE 9. Proposed Conv-RF model applied on test volunteers [50], [51].
Experimental Methods and Results: In this experiment, various standard datasets have been considered, and fivefold cross-validation is used for the training and test samples. The results from each testing set are recorded, and the average value is computed. The ensemble RF classifier result is compared with other ensemble methods, such as the single DT, GB, and XGB methods. The result obtained with the proposed convolutional-RF algorithm is compared with different CNN-based models like LeNet-5, AlexNet, VGG16, VGG19, DenseNet121, Inception Net, Residual Net, Xception Net, and ConvXGB. It is found that the Conv-RF method gives the best result in terms of valvular heart sound analysis.

TABLE 11. Proposed Conv-RF architecture.

Figs. 19 and 20 highlight the comparison of the RF algorithm with other ensemble algorithms and other models for the different datasets used in this article.