Neonatal Bowel Sound Detection Using Convolutional Neural Network and Laplace Hidden Semi-Markov Model

Abdominal auscultation is a convenient, safe and inexpensive method to assess bowel conditions, which is essential in neonatal care. It helps early detection of neonatal bowel dysfunctions and allows timely intervention. This paper presents a neonatal bowel sound detection method to assist the auscultation. Specifically, a Convolutional Neural Network (CNN) is proposed to classify peristalsis and non-peristalsis sounds. The classification is then optimized using a Laplace Hidden Semi-Markov Model (HSMM). The proposed method is validated on abdominal sounds from 49 newborn infants admitted to our tertiary Neonatal Intensive Care Unit (NICU). The results show that the method can effectively detect bowel sounds, with an accuracy of 89.81% and an area under curve (AUC) score of 83.96%, outperforming 13 baseline methods. Furthermore, the proposed Laplace HSMM refinement strategy is shown to be capable of enhancing other bowel sound detection models. The outcomes of this work have the potential to facilitate future telehealth applications for neonatal care. The source code of our work can be found at: https://bitbucket.org/chirudeakin/neonatal-bowel-sound-classification/


I. INTRODUCTION
Bowel sound, i.e., peristalsis sound, denotes the gurgling, rumbling, or growling noises from the abdomen, which are produced by the movement of food, gas and fluids during intestinal peristalsis [1]. It contains valuable information about the gastrointestinal condition. A bowel motility disorder can be reflected as an abnormality or lack of peristalsis sound. Auscultation of bowel sound is therefore a routine assessment in neonatal care [2], which is particularly important for newborns who are admitted to the NICU. The assessment enables early diagnosis of bowel dysfunctions of newborns, such as feeding intolerance and intestinal obstruction [3], [4], and Necrotising Enterocolitis (NEC), a potentially fatal disease affecting 6% of infants born with a birth weight under 1500 grams [5]. Extremely preterm newborns have an immature gut and frequent dysmotility, shown clinically by feed intolerance, bilious gastric aspirates, and abdominal distension. Most of these infants have recovering respiratory distress syndrome needing nasal Continuous Positive Airway Pressure (CPAP). Nasal CPAP commonly distends the gut with gas into the stomach, leading to "CPAP belly syndrome" [6]. This is frequently associated with increased dysmotility and clinical confusion with early NEC. Alteration of bowel sounds is a part of the assessment for early NEC development.
Traditionally, neonatal bowel condition has been qualitatively assessed by placing a standard acoustic stethoscope on the abdominal wall of the neonate. By listening to the abdominal sounds in all four quadrants (right upper, right lower, left upper and left lower quadrants), the bowel condition can be roughly determined based on whether the auscultated bowel sound is frequent, diminished, or absent. Although the assessment is simple and easy to perform, it relies heavily on the skill and knowledge of the on-site specialist, and the application of the stethoscope is limited by the varying quality and irreproducibility of the collected bowel sounds. The digital stethoscope (DS) provides a more dynamic and objective way to monitor neonatal bowel conditions. In contrast to the acoustic stethoscope, the digital stethoscope converts acoustic sounds to electrical signals and allows storage and transmission of the signals for computerized processing, medical crowdsourcing, and telehealth applications. However, despite the improvements in auscultation introduced by the digital stethoscope, interpretation of neonatal bowel sounds still imposes a heavy demand on workload and skilled personnel [7]. Automatically detecting and locating the bowel sounds in digital stethoscope recordings would enable a more objective, effective, and efficient diagnosis [8].
Automatic detection of bowel sound has been investigated in previous studies. A Multi-layer Perceptron (MLP)-based method for long-term abdominal sound monitoring was presented in [9]. The study explored the rationale of using time-frequency features and wavelet-adapted parameters for bowel sound recognition. However, the evaluation appeared to suffer from data leakage, as the training and testing samples were allowed to be segmented from the same recording. A bowel motility detection model was proposed in [10]. While the model allowed real-time abdominal sound analysis and result notification, it was based only on energy features, which are very sensitive to noise. Since neonatal bowel sounds are inherently weak, the impact of noise is a major concern in this approach. The model was later extended in [11] by adding spectral features and introducing Naive Bayesian and minimum statistics. Although a decent overall accuracy was reported, the application of this model was still limited by the bowel sound detection sensitivity (35.23% in total). The feasibility of using Higher Order Statistics for bowel sound detection was discussed in [12]. The method was proven effective and noise robust on synthetic data, but its performance on real-world data was limited because it misclassified all non-peristalsis samples as peristalsis. Bowel sound recognition using a Support Vector Machine (SVM) classifier was investigated in [8]. The method provided a better-balanced sensitivity and specificity compared to the aforementioned studies. Yet, the model was only evaluated on ideal laboratory data without any talking or equipment noise. The applied value of the model in real-world scenarios is yet to be explored.
arXiv:2108.07467v3 [cs.SD] 1 Jun 2022
Deep learning has shown great promise in many biomedical applications [13], [14]. Significant research is still underway to realize the full potential of deep learning in biomedicine and to address its potential challenges [15]. It was applied in the latest study on automated detection of bowel sounds [1]. In that study, a human voice recognition algorithm based on Mel Frequency Cepstrum Coefficient (MFCC) features and a Long Short-Term Memory (LSTM) [16] neural network was applied to bowel sound detection. This novel attempt was theoretically justified and experimentally proven feasible. However, audio recordings were segmented into 0.1-second pieces for processing in that work. An LSTM seems unlikely to extract useful time dependency from such short segments.
Despite the progress made, none of the existing methods has proven feasible and effective for the detection of neonatal bowel sound. There are no current benchmarks for objective neonatal peristalsis detection and characterisation. Due to the weakness of neonatal intestinal peristalsis, the interference of irregular breath and heart sounds, and the various environment and equipment noises in the NICU [17], abdominal sounds recorded from newborns are usually of poor quality, which presents unique challenges in identifying and characterizing bowel sounds.
In this work, we propose an effective method to automatically identify and locate bowel sounds in neonatal abdominal auscultatory recordings, which is a key step towards bowel condition diagnosis. Specifically, our method leverages both Deep Learning (DL) and a Laplace Hidden Semi-Markov Model (HSMM) for neonatal bowel sound classification. Here, the DL model provides probabilistic outputs based on the high-order semantic information present in the acoustic signals, whereas the HSMM uses these probabilistic outputs and temporal patterns to provide optimized linear sequence labeling information. Together, these complementary components significantly improve the classification performance. Evaluation on a real neonatal bowel sound dataset shows that our method produces a superior result (89.81%) compared to the second-best method (86.86%) in terms of accuracy. To the best of our knowledge, this is the first study on automated "neonatal" bowel sound detection.
In summary, the main contributions of this paper are as follows: (1) we design a novel CNN architecture to classify neonatal bowel sounds into two classes (P and NP), which, we believe, is the first study of its kind; (2) we optimize the probabilistic outputs of the proposed CNN model using the Laplace HSMM to improve the performance further; and (3) we demonstrate the superiority of the proposed approach (CNN model + Laplace HSMM) against several well-established popular ML methods and other DL methods using popular evaluation measures. The rest of this paper is organized as follows. Section II describes our neonatal bowel sound detection method in detail. Section III presents the experimental results. Discussion is given in Section IV. Section V concludes the entire work.

II. METHODS
This section presents our methods for data collection, signal processing, neonatal peristalsis sound detection, and model evaluation. Specifically, the peristalsis sound detection method consists of two modules: CNN for initial sound segment classification and a Laplace HSMM for classification optimization.

A. Data Acquisition and Annotation
A total of 49 newborn infants admitted to our tertiary NICU for management of prematurity were recruited for the study during the period from February 2018 to September 2018. The newborn infants were selected based on the following inclusion criteria: 1) preterm or term infants on full feeds > 120 mL/kg/day; 2) infants not critically unwell, tolerating full enteral feeds and without any known diagnosis of bowel disease; 3) infants on continuous positive pressure ventilation (CPAP). Newborn infants who were intubated, ventilated, on mechanical ventilation, requiring any inotropes, or having significant chromosomal anomalies were excluded from this study. Informed consent was obtained from each study participant's parent. The study received human ethics approval from Western Sydney Local Health District Human Research Ethics Committee on 12 Apr 2018 (LNR/17/WMEAD/516).
The 3M™ Littmann digital stethoscope (Model 3200 with Bluetooth, 3M, USA) [18] was used for abdominal sound auscultation in this study. We used the diaphragm mode, which amplifies sounds from 20-2000 Hz but emphasizes sounds between 100-500 Hz. For each recruited infant, who was placed supine on the bed, an uninterrupted 60-second abdominal sound recording was captured by applying the diaphragm of the digital stethoscope on the abdominal wall in the right lower quadrant (RLQ). The rationale behind this is that the abdominal sound from the RLQ area suffers less interference from heart and lung sounds. Fig. 1 shows the digital stethoscope used and the auscultation position.
The recorded abdominal sounds were transmitted to a research laptop via Bluetooth for visual interpretation and annotation. Annotation was created and validated on ELAN [19], a research tool specially designed for annotating audio and video data, by our neonatologists to indicate the onset and duration of bowel sound, heart sound, and various NICU noises in a recording. Simultaneous ultrasound was performed to guide annotation.

B. Preprocessing
Each annotated abdominal sound recording was sliced into overlapping segments by applying a rectangular window function of 6-second length. In our ablative study in Section III-E, a segment length of 6 s yielded the best results. The step between the onsets of successive windows was set to 0.1 second. The 6-second segment length is long enough to provide useful insights for neonatologists to understand the bowel condition and also short enough to achieve a precise location of bowel sound in a recording. After segmentation, a total of 16,401 abdominal sound segments were obtained. Those segments containing bowel sound were labeled as P (peristalsis), whereas the others were labeled as NP (non-peristalsis). The numbers of P and NP segments were 14,410 and 1,991, respectively.
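The windowing and labeling described above can be sketched as follows. This is a minimal illustration only: the function names, the annotation format (a list of bowel-sound onset/offset pairs in seconds), and the rule that any overlap with an annotated bowel sound makes a segment P are assumptions made for the example, not details confirmed by the text.

```python
def sliding_segments(duration_s, window_s=6.0, step_s=0.1):
    """Return (onset, offset) pairs in seconds for windows fully inside the recording."""
    # Work in integer milliseconds to avoid floating-point drift in the onsets.
    dur, win, step = (round(v * 1000) for v in (duration_s, window_s, step_s))
    return [(t / 1000, (t + win) / 1000) for t in range(0, dur - win + 1, step)]


def label_segment(segment, bowel_events):
    """Label a segment P if it overlaps any annotated bowel-sound interval, else NP."""
    start, end = segment
    overlaps = any(ev_start < end and ev_end > start for ev_start, ev_end in bowel_events)
    return "P" if overlaps else "NP"


# Hypothetical annotation: bowel sounds at 1.0-2.5 s and 40.0-41.0 s in a 60 s recording.
events = [(1.0, 2.5), (40.0, 41.0)]
segments = sliding_segments(60.0)
labels = [label_segment(seg, events) for seg in segments]
print(len(segments), labels.count("P"), labels.count("NP"))
```

Because each 6 s window is shifted by only 0.1 s, neighbouring segments share most of their audio, which is one reason the evaluation later separates training and testing data by patient.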
We calculated the MFCC features to represent each abdominal sound segment. MFCC is commonly used in automatic human speech recognition [20]. It takes into account human perception sensitivity at appropriate frequencies by converting the conventional frequency ($f$) to the Mel scale ($M(f)$). The conversion rule is shown in Equation 1.
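For reference, a widely used form of the Hz-to-mel conversion is sketched below. The constants 2595 and 700 are the common textbook values and are an assumption here; the exact variant used in this work may differ.

```latex
M(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)
```

Under this form, $f = 1000$ Hz maps to approximately 1000 mel, and the scale is nearly linear below that point and logarithmic above it.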
The reason we use MFCC on abdominal sound is that auscultation is highly dependent on the perception of medical experts. Representing abdominal sound with MFCCs facilitates the development of an automated model that interprets abdominal sounds in a way similar to how medical experts auscultate them. More importantly, it has been proven that abdominal sounds are very similar to human speech in terms of the predictability of the changes of spectrum versus time [1].
The hyper-parameter settings used in the calculation of MFCC can be found in Table I. A mean operation is performed to summarize the calculated coefficients in each time-frame to obtain a one-dimensional sequence of 24 MFCC values, which is the input to the proposed bowel sound detection model. This sequence length produced the best results in our ablative study of various sequence lengths described in Section III-D.
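One possible reading of the mean operation above is sketched here: the coefficients within each time frame are averaged, so that a segment with 24 frames yields a sequence of 24 values. The function name and the choice of 13 coefficients per frame are hypothetical, since Table I is not reproduced in this text, and the authors' exact summarization may differ.

```python
def summarize_frames(mfcc_frames):
    """Average the coefficients within each time frame, giving one value per frame."""
    return [sum(frame) / len(frame) for frame in mfcc_frames]


# Hypothetical MFCC matrix: 24 time frames, each holding 13 coefficients.
mfcc_frames = [[float(t + k) for k in range(13)] for t in range(24)]
sequence = summarize_frames(mfcc_frames)
print(len(sequence))
```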

C. Convolutional Neural Network-based Classification
Convolution is useful in learning high-level representations of data and is commonly applied to time series, image, and video data [21]. In this work, we propose a CNN to distinguish peristalsis from non-peristalsis sound. The architecture of the proposed model is shown in Fig. 2. Our proposed model accepts an MFCC sequence of length 24 as input. The most important part of the CNN is the single-channel convolution block, in which convolutions with a kernel size of 8 and 256 filters are first applied to the input sequence to extract high-level features. We choose 8 as the best kernel size in our work based on the empirical study (refer to Table VI). The convolution block is specially designed to capture the burst characteristic of bowel sound. It has been proven that bursting bowel sound can lead to an extremely uneven energy distribution along the time axis of an abdominal sound recording [22]. Since the mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum, the uneven energy distribution will also be reflected in the MFCC accordingly. Hence, by applying convolutions with an appropriate kernel that captures subtle energy changes in successive time frames, our proposed CNN model can learn useful energy patterns to effectively characterize bowel sound. A total of 4 convolutional layers are included in our proposed model. Each convolution operation is followed by additional layers: Dropout is used to overcome the problem of over-fitting, and a Max Pooling layer is applied to reduce the dimension and obtain spatially invariant feature maps. Furthermore, inspired by the recent CNN architecture proposed by Anders et al. [23], which suggested using convolutional layers after the Max Pooling layer, we adopt a similar ordering in our work to capture the discriminating semantic information. The learned patterns obtained through several intermediate layers (e.g., Conv1D, Max Pooling, etc.) impart hierarchical semantic information related to the input MFCC sequence. Eventually, the network estimates a probability distribution over the peristalsis and non-peristalsis classes with the Softmax function [24]. The detailed hyper-parameters used in our work are presented in Table II. To tune the hyper-parameters in this study, we first find the best optimizer in terms of classification accuracy while keeping the other parameters fixed at randomly chosen values (e.g., train/validation=0.3, lr=0.01, decay=1e-03, number of epochs=200 and batch size=32). After identifying the optimizer with the highest classification accuracy, we proceed to select the best learning rate (lr) while keeping the remaining parameters fixed. We repeat this process one parameter at a time until all optimal hyper-parameters in this study are found.
All of our experiments are performed using a leave-one-patient-out CV (LOPOCV) approach, where one subject is used as the testing split and the remaining subjects as the training split. To train the DL models, we create a validation set by further splitting the training set into training and validation splits with a 70/30 ratio. We tune the DL models using the validation accuracy and loss. With the model trained in this way, we evaluate the testing accuracy on the testing subject, and we repeat this process for every subject. To train the other machine learning models, we evaluate the testing accuracy using the LOPOCV approach alone and repeat this process. Both approaches (DL and non-DL) produce testing accuracies for comparison. Our proposed model produces the best-fit convergence during bowel sound classification (refer to Fig. 3).
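The splitting scheme above can be sketched as follows. The function name and data layout are hypothetical, and the example shuffles and splits the training pool at the segment level for the 70/30 validation split, which is an assumption: the text does not state whether the validation split is drawn by segment or by patient.

```python
import random


def lopocv_splits(segments_by_patient, val_fraction=0.3, seed=0):
    """Yield (held_out_patient, train, val, test) for each leave-one-patient-out round."""
    rng = random.Random(seed)
    for held_out in sorted(segments_by_patient):
        test = list(segments_by_patient[held_out])
        # Pool the segments of every other patient, then carve out a validation split.
        pool = [seg for pid, segs in segments_by_patient.items()
                if pid != held_out for seg in segs]
        rng.shuffle(pool)
        n_val = int(round(val_fraction * len(pool)))
        yield held_out, pool[n_val:], pool[:n_val], test


# Hypothetical toy data: 3 patients with 10 segments each.
data = {f"patient_{i}": [f"patient_{i}_seg_{k}" for k in range(10)] for i in range(3)}
for held_out, train, val, test in lopocv_splits(data):
    print(held_out, len(train), len(val), len(test))
```

The key property is that no segment of the held-out patient appears in the training or validation data of that round.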

D. Laplace Hidden Semi-Markov Model Refinement
Abdominal sound changes over time, but the change is not completely random because peristalsis is a naturally durative process. For example, it is unlikely to find a sudden pause within a continuous peristalsis or to find a transient peristalsis which lasts for just 0.1 s. Therefore, for an abdominal sound recording, if we arrange all segments in time order, the transition between the segment states, i.e., peristalsis or non-peristalsis, should follow a certain probability distribution. Motivated by this fact, an HSMM is applied to model the segment transition process and refine the bowel sound classification result. HSMM is an extension of the Hidden Markov Model (HMM) [25]. Both have been proven useful in modeling repetitive events by considering the timing and the temporal order of events [26]. Compared to HMM, HSMM allows a hidden state to have a duration distribution, which makes it more suitable for abdominal sound with durative peristalsis.
The HSMM modeling process is presented in Fig. 4. Segments from an abdominal sound recording arranged in time order serve as the observation sequence. Labels of the segments are predictions from the CNN model. The HSMM states denote the potential true labels. Duration of a state represents the number of observations produced while in the state. The refinement is achieved by estimation of the HSMM state sequence.
Formally, let $O_{1:T} = O_1, \ldots, O_T$ denote the observation sequence, where $O_t \in [0, 1]$ is the predicted probability of the $t$-th segment belonging to class P. Let $S_{1:N} = (j_1, d_1), \ldots, (j_N, d_N)$ denote the HSMM state sequence, in which $j_n \in \{P, NP\}$ and $d_n$ represent the state label and duration of the $n$-th hidden state, respectively. The objective function is defined as
$$S^{*}_{1:N} = \arg\max_{S_{1:N}} P(S_{1:N} \mid O_{1:T}, \lambda),$$
where $\lambda$ denotes the model parameters including initial probability, transition probability, emission probability, and duration probability.
The initial state distribution is determined based on the percentages of class P and NP being the first state in the training recordings. It is given by
$$\pi_P = \frac{Supp(P_{ini})}{n}, \qquad \pi_{NP} = 1 - \pi_P,$$
where $Supp(P_{ini})$ denotes the number of training recordings with class P as the initial state and $n$ represents the number of training recordings. The types of the HSMM states are assumed to change in every transition, so the state transition probability is 1 between two different states and 0 from a state to itself.
We use the Laplace distribution to provide a continuous modeling of the state emission probabilities. As a result, we are able to capture them for both P and NP events, which are continuous in bowel sound. We have observed experimentally that, on this dataset, the emission probabilities can be properly modelled by the Laplace distribution. For a given state $j \in \{P, NP\}$, the emission probability distribution is a Laplace density with mean $\mu$ evaluated at $x$, where $\sigma$ denotes the standard deviation, $x$ is the output of the corresponding class node in the softmax output layer of the CNN, and $\mu$ is the mean value. We set $\mu = 1$ if $j = P$, otherwise $\mu = 0$. Compared to the conventional discrete emission probabilities, which are learned from the training confusion matrix, the Laplace distribution applied to the network output ($x$) improves the results. Experimental evidence can be found in Section III-F. To distinguish it from the conventional practice, we refer to the modified HSMM as the Laplace HSMM.
The duration probability of each state is learned from the training data. We use $p_j(d)$ to represent the probability of state $j$ having duration $d$, where $j \in \{P, NP\}$. Algorithm 1 presents the learning process, in which the duration frequency counting is performed in lines 10-17. Laplace smoothing is performed in lines 23-27 to avoid discontinuous duration probabilities.
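The model parameters described above can be sketched in a few lines. This is an illustration under stated assumptions, not Algorithm 1 itself: the function names are hypothetical, the Laplace density is parametrized by a scale value (which equals the standard deviation divided by the square root of 2, so it may not match the authors' use of $\sigma$ exactly), durations longer than `max_duration` are clipped, and smoothing adds one pseudo-count per duration.

```python
import math
from collections import Counter


def laplace_emission(x, state, scale):
    """Laplace density at CNN output x, centred at 1 for state P and 0 for NP."""
    mu = 1.0 if state == "P" else 0.0
    return math.exp(-abs(x - mu) / scale) / (2.0 * scale)


def initial_distribution(label_sequences):
    """Fraction of training recordings whose first segment is P (and NP)."""
    n_p = sum(1 for labels in label_sequences if labels[0] == "P")
    pi_p = n_p / len(label_sequences)
    return {"P": pi_p, "NP": 1.0 - pi_p}


def run_lengths(labels):
    """Collapse a label sequence into (state, duration) runs."""
    runs = []
    for lab in labels:
        if runs and runs[-1][0] == lab:
            runs[-1][1] += 1
        else:
            runs.append([lab, 1])
    return [(state, d) for state, d in runs]


def duration_distribution(label_sequences, max_duration):
    """Duration probabilities p_j(d), d = 1..max_duration, with add-one smoothing."""
    counts = {"P": Counter(), "NP": Counter()}
    for labels in label_sequences:
        for state, d in run_lengths(labels):
            counts[state][min(d, max_duration)] += 1  # clip long runs (assumption)
    probs = {}
    for state, c in counts.items():
        total = sum(c.values()) + max_duration  # one pseudo-count per duration
        probs[state] = [(c[d] + 1) / total for d in range(1, max_duration + 1)]
    return probs


# Hypothetical ground-truth label sequences from two training recordings.
train_labels = [["P", "P", "NP"], ["NP", "P"]]
print(initial_distribution(train_labels))
print(duration_distribution(train_labels, max_duration=3))
```

With smoothing, every duration up to `max_duration` keeps a non-zero probability, so a state sequence is never ruled out only because a particular run length was absent from the training data.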
We use the Viterbi algorithm [27] for the maximum likelihood estimation of the hidden state sequence $S_{1:N}$. The algorithm is popularly used in HMM. We modify it as follows to be compatible with the Laplace HSMM. Firstly, we define $\delta_j(t)$ as the highest probability of a hidden state sequence that accounts for the first $t$ observations and ends in state $j$. It is computed recursively as
$$\delta_j(t) = \max_{d} \max_{i \neq j} \Big[ \delta_i(t-d)\, a_{ij}\, p_j(d) \prod_{\tau=t-d+1}^{t} o_\tau \Big],$$
where $a_{ij}$ is the state transition probability, $p_j(d)$ is the duration probability, and $o_\tau$ is the outcome of the Laplace distribution function of state $j$ that took the output of the CNN at timestep $\tau$ as its input. Finally, the most likely HSMM state sequence can be obtained by backtracking from the state $j$ that maximizes $\delta_j(T)$.
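A minimal sketch of a duration-explicit Viterbi decoder in the spirit of the description above is given below. It is a generic explicit-duration HSMM decoder written for illustration, not the authors' implementation: the function names are hypothetical, the computation is done in log space, the Laplace density uses a scale parametrization, and the final state run is assumed to end exactly at the last observation.

```python
import math

STATES = ("P", "NP")
OTHER = {"P": "NP", "NP": "P"}  # the state type changes at every transition


def log_laplace(x, state, scale):
    """Log Laplace density at x, centred at 1 for P and 0 for NP."""
    mu = 1.0 if state == "P" else 0.0
    return -abs(x - mu) / scale - math.log(2.0 * scale)


def safe_log(p):
    return math.log(p) if p > 0.0 else float("-inf")


def hsmm_viterbi(probs, initial, duration, scale=1.0):
    """Return the most likely P/NP label for every CNN probability in `probs`."""
    T = len(probs)
    # cum[j][t]: summed log emissions of state j over the first t observations.
    cum = {j: [0.0] for j in STATES}
    for j in STATES:
        for x in probs:
            cum[j].append(cum[j][-1] + log_laplace(x, j, scale))
    # delta[j][t]: best log-probability of the first t observations when a run
    # of state j ends exactly at t; run[j][t] stores the length of that run.
    delta = {j: [float("-inf")] * (T + 1) for j in STATES}
    run = {j: [0] * (T + 1) for j in STATES}
    for t in range(1, T + 1):
        for j in STATES:
            for d in range(1, min(t, len(duration[j])) + 1):
                start = safe_log(initial[j]) if d == t else delta[OTHER[j]][t - d]
                score = start + safe_log(duration[j][d - 1]) + cum[j][t] - cum[j][t - d]
                if score > delta[j][t]:
                    delta[j][t], run[j][t] = score, d
    j = max(STATES, key=lambda s: delta[s][T])
    if delta[j][T] == float("-inf"):
        raise ValueError("no state sequence with non-zero probability")
    # Backtrack run by run from the end of the recording.
    labels, t = [None] * T, T
    while t > 0:
        d = run[j][t]
        labels[t - d:t] = [j] * d
        t, j = t - d, OTHER[j]
    return labels


# Hypothetical example: one low CNN output inside a stretch of confident P outputs.
cnn_probs = [0.9, 0.9, 0.4, 0.9, 0.9]
durations = {"P": [0.01, 0.01, 0.01, 0.01, 0.96], "NP": [0.2] * 5}
print(hsmm_viterbi(cnn_probs, {"P": 0.5, "NP": 0.5}, durations, scale=1.0))
```

In this toy setting, short P runs are heavily penalised by the duration distribution, so the isolated low-probability segment is absorbed into the surrounding P run rather than being labeled NP.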

E. Performance Evaluation
We utilize leave-one-patient-out cross validation (LOPOCV) to evaluate the proposed model. For each round of LOPOCV, abdominal sound segments from one patient are used as test data while segments from the remaining 48 patients are used for training.
1) Baseline Methods: We compare our method with multiple machine learning algorithms using hand-engineered signal features derived from the related literature, including jitters and shimmers [30]-[33], higher order statistics [12], [34], wavelet subband energies [35], spectral centroids [11], spectral subband energies [11], [36], and MFCCs [1]. A total of 121 features were extracted, as summarized in Table III. The parameters of the feature extraction methods were fine-tuned to suit the experimental abdominal sound recordings, and the extracted features were normalized to zero mean and unit variance. To reduce the risk of over-fitting, feature selection was applied to select the best 20 features for training and evaluation. Specifically, we followed the feature selection method proposed in [37], which implements the following processing steps:
• Remove features with more than 60% missing values.
Table IV.
2) Evaluation Metrics: We used classification accuracy (ACC), area under the receiver operating characteristics (ROC) curve (AUC) [38], macro-averaged F1 (MA F1), and weighted F1 (WT F1) as the evaluation metrics of overall model performance. Specifically, MA F1 is the unweighted average of the F1 score of each class, whereas WT F1 is the F1 average weighted by the number of instances of each class. In addition, recall (REC), precision (PRE), and F1 scores (F1) for each class are also measured to allow inspection of the model performance on individual classes. The metrics used in this study are formally defined with the following equations:
$$ACC = \frac{TP + TN}{N}, \qquad AUC = \int_{0}^{1} ROC(x)\, dx,$$
$$REC = \frac{TP}{TP + FN}, \qquad PRE = \frac{TP}{TP + FP}, \qquad F1 = \frac{2 \cdot PRE \cdot REC}{PRE + REC},$$
$$MA\ F1 = \frac{F1_{P} + F1_{NP}}{2}, \qquad WT\ F1 = \frac{Supp(P) \cdot F1_{P} + Supp(NP) \cdot F1_{NP}}{Supp(P) + Supp(NP)},$$
where $TP$, $TN$, $FP$, and $FN$ denote true positive, true negative, false positive and false negative, respectively, and $N$ represents the number of instances in the data set. $ROC(\cdot)$ denotes the curve plot of true positive rate (y-axis) versus false positive rate (x-axis). $F1_{P}$ and $F1_{NP}$ denote the F1 scores of class P and NP, respectively, while $Supp(P)$ and $Supp(NP)$ denote the number of instances of class P and NP, respectively.
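The per-class and averaged metrics described above can be computed as in the following sketch. The function names and the toy labels are hypothetical, the formulas are the standard definitions of these metrics, and AUC is omitted because it requires the predicted probabilities rather than hard labels.

```python
def class_metrics(y_true, y_pred, positive):
    """Precision, recall and F1 when `positive` is treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return pre, rec, f1


def summary_metrics(y_true, y_pred, classes=("P", "NP")):
    """Accuracy, macro-averaged F1 and support-weighted F1."""
    acc = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    f1 = {c: class_metrics(y_true, y_pred, c)[2] for c in classes}
    supp = {c: sum(1 for t in y_true if t == c) for c in classes}
    macro = sum(f1.values()) / len(classes)
    weighted = sum(supp[c] * f1[c] for c in classes) / len(y_true)
    return acc, macro, weighted


# Hypothetical imbalanced toy example: 8 P segments and 2 NP segments.
y_true = ["P"] * 8 + ["NP"] * 2
y_pred = ["P"] * 7 + ["NP"] + ["NP", "P"]
print(class_metrics(y_true, y_pred, "NP"))
print(summary_metrics(y_true, y_pred))
```

On imbalanced data such as this, the weighted F1 stays close to the accuracy, while the macro-averaged F1 is pulled down by the weaker minority class, which is why both are reported.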

III. EXPERIMENTAL RESULTS
In this section, we present experimental results to demonstrate the effectiveness of our bowel sound detection model. We also show that the proposed Laplace HSMM strategy can introduce general improvements to achieve a more accurate bowel sound detection.

A. Comparative Results
The comparative results are presented in Table V. It can be observed that the proposed CNN + HSMM method achieved the best scores on 7 out of 10 metrics in total, with the remaining 3 metrics being comparable to the best performing baselines. This suggests that the proposed method is effective in neonatal bowel sound detection and has introduced significant improvements to current methods.
The RBF SVM 1 obtained the second best overall classification accuracy and the highest recall rate of peristalsis samples. However, this is at the cost of misclassification of more than half the non-peristalsis samples with the recall value being just 0.4530. By contrast, the corresponding recall of the proposed method is 0.7624, which is significantly better. In practice, accurately recognizing the segments without bowel sound is very important because it helps to avoid unnecessary tests and treatments, and reduce risks for patients. The best recall of non-peristalsis samples was achieved by RBF SVM 2, but meanwhile the model obtained the lowest ACC, AUC, MA F1, WT F1, and peristalsis recall. This means that the model could hardly detect peristalsis samples.
Given that the experimental dataset is very imbalanced, with 88% of samples being peristalsis, the overall accuracy cannot provide enough insight into the true model performance. A decent accuracy value can easily be obtained by classifying all samples as the dominant class. Therefore, the AUC score is more critical in measuring the overall model performance because, unlike accuracy, it is not inflated by predicting the dominant class. It is worth noting that our model is the only one which achieves an AUC score close to 0.84, outperforming the second best by nearly 5%. In addition, the MA F1, WT F1, and the individual class metrics also suggest that the proposed method outperformed the baseline algorithms.

B. Ablative Study of Different Kernels
In this subsection, we studied the efficacy of different kernel sizes (2, 4, 8, and 16) in our proposed CNN model, based on a fixed-size input sequence of 24 and a segment length of 6 s. Detailed experimental results of the classification performance of each kernel size are presented in Table VI. The table shows that the best kernel size is 8 in our work. This result suggests that a smaller kernel is unable to capture the semantics of the signal accurately, whereas a larger kernel could repeat the discriminating information, which, as a result, degrades the classification performance.

C. Comparison with Other DL Methods
Since there are no well-established benchmark datasets and DL models to classify neonatal bowel sound features, we compared our method against two different variants of DL methods: LSTM (Fig. 5) and CNN-LSTM (Fig. 6), using a fixed-size input sequence of 24 and a segment length of 6 s. Experimental results, which are obtained after Laplace HSMM, are presented in Table VII. The table shows that our proposed CNN model outperforms the other two DL models.
Furthermore, we compared our method against the two baseline methods (LSTM and CNN-LSTM) using statistical analysis. First, we performed a normality test using the Jarque-Bera (JB) test, which showed that our data were not normally distributed. Therefore, we used the nonparametric Wilcoxon signed rank test to assess the statistical significance [39] of our results compared to the two baseline methods separately. Comparing the performance of our method against LSTM and CNN-LSTM at subject level (for 49 subjects), we found that our method achieves significantly higher weighted PRE, weighted REC, WT F1 and ACC, with p-values of 1.181e-3, 4.727e-3, 1.56e-3, and 4.727e-3, respectively, compared to LSTM and with p-values of 1.9e-4, 1.516e-05, 7.739e-06, respectively, compared to CNN-LSTM.

D. Ablative Study of Varying Sequence Lengths
In this subsection, we studied the relationship between sequence length and classification performance in our work. For this, we evaluated our model with 5 different sequence lengths (8, 16, 20, 24, and 26) and a fixed segment length of 6 s. The experimental results, which are obtained after Laplace HSMM, are presented in Table VIII. The experimental results show that the best sequence length is 24. Therefore, we posit that a sequence length of 24 is sufficient to capture the discriminating information for bowel sound classification.

E. Ablative Study of Varying Segment Length
In this subsection, we studied the relationship between sound segment length and classification performance. For this, we considered 4 different segment lengths (2, 4, 6, and 8 seconds) with a fixed-size input sequence of 24 and evaluated them on the bowel sound dataset. The evaluation results, which are obtained after Laplace HSMM, are presented in Table IX. The experimental results show that 8 s segments yield a marginal improvement over 6 s segments for some performance metrics such as overall accuracy (0.9036 vs 0.8981). Nevertheless, the 8 s segment length results in a sharp decline in performance for the NP class compared to the 6 s segment length. Thus, throughout the whole work, we prefer the 6 s segment length, which yields balanced performance for both classes (P and NP) without compromising overall accuracy. From this result, we suspect that segments shorter or longer than 6 s are unable to efficiently capture the semantic information representing the bowel sound.

F. Ablative Study of Laplace HSMM Refinement
The Laplace HSMM refinement strategy is one of the major contributions of this work. We have implemented the following experiments to provide more insights into the strategy.
1) Comparison with Conventional HSMM Practice: HMM, HSMM, and their extensions are widely used to improve time series classification [40], [41]. In conventional HMM and HSMM practices, an observation normally refers to a predicted class label and the emission probability matrix is learned from training results. In this work, we define an observation as the probability of the segment belonging to class P, and we propose to use the Laplace distribution to model the emission probability.
We experimentally compared the performances of HSMM with the conventional and the Laplace-based emission matrix. Specifically, the conventional emission matrix E is obtained from C, the confusion matrix of the training data. The improvement introduced by the conventional HSMM is very limited. One explanation is that the emission probabilities, learned from the training confusion matrix, are highly concentrated because the model training performance is usually good. That is, a hidden NP or P state usually has an over 80% chance of producing a corresponding NP or P observation. There will be a great penalty if the HSMM assigns a hidden state to an observation with a different label. In fact, this is a dilemma. If the training stops too early, the classification results from the CNN will be less reliable and the refinement will be meaningless. If the training stops too late, the emission probabilities will be concentrated. It is very difficult to control the training performance to an optimal level.
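The concentration effect described above can be illustrated with a small sketch. The counts are invented for the example, the function name is hypothetical, and row-normalising the confusion matrix (rows taken as true classes) is an assumed convention, since the original formula for E is not reproduced in this text.

```python
def emission_from_confusion(confusion):
    """Row-normalise counts so each hidden state's emission probabilities sum to one."""
    return {
        state: {obs: count / sum(row.values()) for obs, count in row.items()}
        for state, row in confusion.items()
    }


# Hypothetical training confusion matrix: confusion[true_class][predicted_class].
confusion = {"P": {"P": 950, "NP": 50}, "NP": {"P": 20, "NP": 180}}
emission = emission_from_confusion(confusion)
print(emission["P"]["P"], emission["NP"]["NP"])
```

With a well-fitted training model, the diagonal entries dominate, so a hidden state that disagrees with the predicted label pays a large penalty and the HSMM has little room to change any label.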
2) Ablative Study of Laplace HSMM: An ablative analysis was performed to quantify the improvement introduced by the proposed Laplace HSMM strategy. We compared the CNN classification performances with and without HSMM refinement. The results are summarized in Table XI. It can be observed that, after refinement, all 10 performance metrics of the CNN improved. The improvements ranged from 0.54% to 12.89%, where the most significant ones were made on the precision rate of non-peristalsis samples and the recall rate of peristalsis samples. This means that a large portion of misclassified peristalsis samples were corrected. The results strongly suggest that the proposed Laplace HSMM strategy has introduced comprehensive improvements to the CNN. It helps the CNN achieve a more accurate and more reliable automatic neonatal bowel sound classification.
3) The Impact of Laplace Parameter: The Laplace distribution is a parametric function. In this subsection, we investigated how the performance of the proposed HSMM varies with the standard deviation value σ of the Laplace distribution. The result is visualized in Fig. 7. Although there are some fluctuations, general increasing trends can be observed for both ACC and AUC scores when σ increases from 0.1 to 5. Afterward, the model performance remains at a relatively stable level. The reason is that a small σ leads to concentrated emission probabilities, thus limiting the refinement. As σ increases, the Laplace distribution becomes more spread out and various hidden state combinations can be examined.
4) Generalization Performance: Although the Laplace HSMM is initially proposed to improve the performance of the CNN model, it can be applied to other bowel sound classification algorithms for result refinement. In this part, we investigated the generalization performance of the proposed Laplace HSMM strategy by applying it to the baselines. The results are summarized in Fig. 8. It is worth noting that both the ACC and WT F1 scores of all 13 baselines increased to different degrees after refinement by the proposed Laplace HSMM. The most significant one is AdaBoost, with the ACC score rising from 0.8371 to 0.8913 and the WT F1 score rising from 0.8511 to 0.8877. Out of the total 13 baselines, 8 baselines gained improvements on the AUC score, and 11 baselines gained improvements on the MA F1 score. KNN and RF are the only two baselines that did not improve on either the AUC or the MA F1 score. The reason can be found in Table V. The two baselines performed poorly on identifying non-peristalsis samples. This might confuse the HSMM model, which might treat the predicted non-peristalsis samples as noise and then label them as peristalsis.
Despite the occasional cases, the overall result still provides the strong evidences that the proposed Laplace HSMM is able to introduce general improvements to help deliver a better and more reliable neonatal bowel sound classification.
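The effect of σ described in the Laplace-parameter analysis above can be made concrete with a short sketch. It rests on assumptions that are not stated in the paper: the observation is taken to be the CNN label (0 or 1), the two hidden states are given Laplace emission densities centred at 0 and 1, and σ is used directly as the Laplace scale parameter. Under these assumptions, the log-odds that the emission term contributes in favour of the state matching the CNN label is 1/σ, so a small σ makes the CNN label very hard to override.

```python
import math


def laplace_log_pdf(x, mu, scale):
    """Log-density of a Laplace distribution with location mu and the given scale."""
    return -abs(x - mu) / scale - math.log(2.0 * scale)


def emission_log_odds(observation, scale):
    """Log-odds of state 1 versus state 0 for one observation.

    Assumption: state 1 (peristalsis) is centred at 1.0 and state 0
    (non-peristalsis) at 0.0; this parameterisation is hypothetical.
    """
    return (laplace_log_pdf(observation, 1.0, scale)
            - laplace_log_pdf(observation, 0.0, scale))


# How strongly a CNN label of 1 favours the matching hidden state.
for sigma in (0.1, 0.5, 1.0, 5.0):
    print(sigma, round(emission_log_odds(1.0, sigma), 3))
```

With a large σ the emission log-odds shrink toward zero, so the duration and transition terms of the HSMM carry more weight and more alternative state sequences become competitive, which is consistent with the trend reported in Fig. 7.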

IV. DISCUSSION
This paper proposed, for the first time, a methodology based on CNN and Laplace HSMM to detect bowel sounds in infants. The experimental results have demonstrated the effectiveness of the proposed method and its superiority over the baselines. Furthermore, we have shown that the proposed Laplace HSMM refinement strategy can be flexibly applied to existing methods, without changing their original structures, to improve their bowel sound classification performance.
While these results are promising, there is still room for improvement. Firstly, the proposed method learns to distinguish peristalsis from non-peristalsis using MFCC only. Although the effectiveness of MFCC has been proven, whether it is the best choice is still unknown, because it carries little time-domain and subband-specific information, which might also differ between peristalsis and non-peristalsis samples. Therefore, it is worth investigating the advantages of other potential features. In fact, an investigation of the features listed in Table III has been performed. For each round of LOPOCV, feature selection was performed on the training data to find the top 20 most representative features. Although the best feature list exhibits some inter-subject variation, we can still summarize the most frequently selected features, as shown in Table XII. It is clear that MFCC, Skewness, Kurtosis, and Spectral Subband Centroids and Energies are commonly identified as representative features. This finding gives us a good direction to work on. However, designing a practical model that makes the best of each of these features is still challenging and will be our future work.
Secondly, the experimental abdominal sounds were collected in the NICU with a commercial DS without a noise cancellation function. They are very noisy, containing rub noise, alarm noise, monitor noise, and talk noise. This may be one reason why most of the baselines failed to perform well even though the features they use have been proven effective in adult bowel sound detection. We did not specifically handle the noise problem because the neonatal abdominal sound is very weak, and the recording might lose important information if the noise were suppressed. Although the proposed method has been shown to achieve a promising result on these noisy recordings, good quality data is still necessary. It would allow our model to allocate more learning capacity to capturing the key differences between peristalsis and non-peristalsis samples instead of spending resources on reducing the influence of noise. Moreover, the presence of noise can also pose a great challenge for the follow-up analysis, in which we need to further determine whether the peristalsis sound is disease related or not. Therefore, in our next study, we will use a more advanced DS and denoising methods to reduce the noise interference.
Thirdly, the abdominal sound segments are highly imbalanced, with nearly 88% being peristalsis samples. This explains why the proposed method reported a precision of just 0.5589 for non-peristalsis samples, even though the recall rates for both classes were quite good. Even though only a small portion of peristalsis samples were misclassified, this still had a large impact on the precision of non-peristalsis samples. Moreover, the imbalance problem limited the available data because class balancing by downsampling had to be performed to reduce model training bias. The model complexity is thus limited. Our team of neonatologists is currently working to solve the problem by continuously collecting data. Although this will be a very time-consuming process, it will bring many benefits, not only to our current work but also to our future study of the pathology of bowel sound.
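The dependence of minority-class precision on class balance can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name and the recall values of 0.9 are hypothetical and are not the figures reported in the paper; only the 12% share of non-peristalsis segments follows from the roughly 88% peristalsis share stated above.

```python
def minority_precision(minority_share, minority_recall, majority_recall):
    """Precision of the minority class from class shares and per-class recalls.

    True positives:  minority segments correctly labelled minority.
    False positives: majority segments wrongly labelled minority.
    """
    true_pos = minority_share * minority_recall
    false_pos = (1.0 - minority_share) * (1.0 - majority_recall)
    return true_pos / (true_pos + false_pos)


# Same hypothetical recalls (0.9 for both classes), different class balance.
balanced = minority_precision(0.50, 0.9, 0.9)
imbalanced = minority_precision(0.12, 0.9, 0.9)
print(round(balanced, 4), round(imbalanced, 4))
```

With identical recalls, the precision of the minority class drops sharply once it makes up only 12% of the data, because even a small error rate on the large majority class produces many false positives relative to the few true ones.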
Fourthly, the proposed method was developed based only on the abdominal sounds recorded from the right lower quadrant. Whether the abdominal sounds from the other three quadrants (right upper, left upper and left lower) help to deliver a more accurate peristalsis detection, and how to develop a model that effectively analyzes the abdominal sounds from all four quadrants, are still to be explored.
This work has shown how automatic analysis of neonatal bowel sound can be done from DS-collected abdominal sounds. DS auscultation will be a mainstream methodology for the diagnosis of bowel condition in the foreseeable future. Although point-of-care ultrasound can provide more detailed information on bowel peristalsis, bowel wall thickness and bowel vascularity [42], it can only be performed by medical professionals within hospital settings. By contrast, DS auscultation is much more convenient and imposes no constraints on professional knowledge or geographical location. It facilitates the development and application of telehealth. One can use a DS to record bowel sound in a home environment and send the recording, with the help of smartphones and high-speed networks, for remote auscultation. The proposed bowel sound detection model is one of the major contributions of this study and can greatly improve the efficiency of the bowel condition tele-diagnosis process. With this model, clinicians can listen directly to their sounds of interest instead of going over every detail. The current work is a preliminary step towards automated classification of different types of peristalsis sounds and pathological bowel conditions. The applicability of the proposed methodology to other biological sounds, such as heart and lung sounds [43], could also be explored.
Automated detection of peristalsis sounds is an important step towards automated detection of neonatal bowel conditions from abdominal sounds. The next steps will be to differentiate and characterise various types of bowel sounds, validate the results against other modalities such as ultrasound and develop diagnostic models to detect bowel dysfunctions, toward developing a clinical decision support system.

V. CONCLUSION
This paper presented an effective neonatal bowel sound detection method, which uses a CNN to perform an initial classification that is then refined with a Laplace HSMM. The proposed deep learning model enables automatic learning of useful patterns to distinguish peristalsis from non-peristalsis. The Laplace HSMM refinement strategy optimizes the predicted events by considering their hidden time dependency. Abdominal sounds recorded from 49 newborn infants admitted to our tertiary NICU were used as experimental data. The results showed that the proposed method can accurately identify peristalsis and non-peristalsis sounds with an overall accuracy of around 90% and an AUC score of around 84%. In addition, the Laplace HSMM refinement strategy has been shown to introduce general improvements to other bowel sound detection models, helping them to perform a more reliable detection.
This work opens opportunities for further research to develop continuous acoustic monitoring or "mapping" to quantify patterns of bowel acoustic signatures that may allow preclinical detection of NEC and septicaemia.