Classification of Cough Sounds Using Spectrogram Methods and a Parallel-Stream One-Dimensional Deep Convolutional Neural Network

Currently, cough sounds, particularly wet and dry coughs, are diagnosed subjectively, which can lead to incorrect diagnoses. In this study, novel emergent features were extracted using spectrogram methods, and a parallel-stream one-dimensional (1D) deep convolutional neural network (DCNN) was used to classify cough sounds. The data of this study were obtained from two datasets. We employed the Mel spectrogram, chroma constant-$Q$ transform, Mel-frequency cepstral coefficients, constant-$Q$ cepstral coefficients, and linear predictive coding coefficients to conduct feature analysis. The maximum, mean, variance, and standard deviation of the original spectrogram, as well as the maxima of its first and second derivatives, were extracted and fused to create a single-feature vector. We adopted two types of features: single features and combined features. Each design was restructured according to the magnitude of features with high discrimination power. A parallel-stream 1D-DCNN was developed to classify cough sounds accurately, and its results were compared with those obtained using a single-stream 1D-DCNN. We found that the parallel-stream network outperformed the single-stream network for some feature sets. The developed network achieved $F_1$ scores of 98.61% and 82.96% for the first and second datasets, respectively. The concatenation of layers at the flattening level resulted in an $F_1$ score of 99.30% in dataset 1. Moreover, layer merging strategies exhibited better performance at the second convolutional layer level than at the flattening layer level in many cases.

To the best of our knowledge, only a few studies have adopted a 1D DL model for cough detection, and most relevant studies have adopted a single-stream model. For example, Baramulari et al. [18] classified cough sounds by using a bidirectional long short-term memory model. Hassan et al. [19] used a recurrent neural network to detect COVID-19. Amrulloh et al. [20] employed a neural network to classify pneumonia and asthma infections. In the present study, we examined the performance of a 1D-CNN, a gated recurrent unit (GRU) model, and a neural network for cough detection. We found that the 1D-CNN model outperformed the other two models. Therefore, the 1D-CNN model was used for further analysis in this study.

Insufficient data are a challenge encountered in studies on cough sounds [21], [22], as is the class imbalance problem [23], [24]. Similar problems were encountered in this study. Of the two collected datasets, one contained 118 wet cough sounds and 170 dry cough sounds, and the other contained 389 wet cough sounds and 413 dry cough sounds. These datasets exhibited the class imbalance problem. Therefore, we used the weighted F1 score [25] and the Matthews correlation coefficient (MCC) [26] as metrics for assessing model performance. Furthermore, the two datasets contained signals with different dimensions; we used a zero-padding system to address the varying dimensions of the cough signals. The problem of insufficient data was addressed using a data enhancement technique.

The main contributions of this study are as follows:

1) Sounds of wet and dry coughs were classified using novel features extracted using spectrogram methods and a parallel-stream 1D deep CNN (DCNN).

2) The features extracted using spectrogram methods were analyzed using a novel method to classify wet and dry coughs.

3) Feature structures were designed, and two techniques were developed for restructuring the positions of combined features; the techniques were compared to determine the better one.

4) A parallel-stream 1D-DCNN was developed, and its performance was compared with that of a single-stream 1D-CNN. The developed model differs from existing related models [15], [36] in three ways: (1) it does not contain a maximum pooling layer, (2) layer concatenation occurs in its flattening layer, and (3) it contains few layers, because a small network might prevent overfitting [28]. Moreover, the performance benefits of concatenating layers at different levels were examined.

5) Model performance achieved with layer merging strategies at different levels in a parallel-stream network was examined.

The rest of this article is organized as follows. Section II provides an overview of the related research. Section III details the methodology used for constructing the designed system. Section IV describes the proposed DL models. Section V presents the experimental results and a discussion of the results. Finally, Section VI provides the conclusions of this study.

The basic concept of data enhancement in machine learning involves increasing the quantity of training data; however, data enhancement can also be performed to enrich the data in a dataset [27]. Data enhancement can be performed using two approaches: image-based and audio-based. The audio-based approach was used in the present study. Two strategies were used to enhance the quantity of data: time stretch and pitch shift. In the time stretch method, we stretched the duration of cough signals by factors of 1.07 and 0.5; the factor of 1.07 was used to accelerate a cough signal, and the factor of 0.5 was used to decelerate it. Pitch shift was performed using factors similar to those used in [27].

After data enhancement, the numbers of dry and wet cough signals in dataset 1 increased from 170 to 850 and from 118 to 590, respectively. Moreover, the numbers of dry and wet cough signals in dataset 2 increased from 413 to 2065 and from 389 to 1945, respectively. Overall, the total number of cough signals increased from 288 to 1440 for the first dataset and from 802 to 4010 for the second dataset.
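As a minimal illustration of the time-stretch step, the following sketch stretches a signal by simple linear resampling. This is an assumption for clarity only: a real implementation (e.g., librosa's phase-vocoder-based time stretching, as used in this kind of pipeline) changes duration without shifting pitch, whereas naive resampling changes both.

```python
import numpy as np

def time_stretch_resample(signal, factor):
    """Naive time stretch by linear resampling.

    factor > 1 shortens (accelerates) the signal; factor < 1 lengthens
    (decelerates) it. Illustrative only: unlike a phase-vocoder stretch,
    resampling also shifts the pitch.
    """
    n_out = int(round(len(signal) / factor))
    old_idx = np.arange(len(signal))
    new_idx = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(new_idx, old_idx, signal)

def augment(signals):
    """Return the original signals plus an accelerated (1.07) and a
    decelerated (0.5) copy of each, mirroring the factors above."""
    out = list(signals)
    for s in signals:
        out.append(time_stretch_resample(s, 1.07))  # accelerated copy
        out.append(time_stretch_resample(s, 0.5))   # decelerated copy
    return out
```

Each input signal yields two extra copies here; the paper's fivefold increase additionally uses pitch-shifted copies.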

A padding system is used to overcome the problem of variable bit lengths of cough signals in a dataset. In this study, the bit lengths of short signals were increased to the maximum value so that a fixed bit length was achieved for all the cough signals (the bit length is the size of a signal). Inspired by the random padding technique proposed by Dong et al. [36], we created a zero-padding system instead of a random padding system. The procedure for creating the zero-padding system is described in the following text.
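A minimal numpy sketch of such a zero-padding step (trailing zeros are assumed here; the text above does not specify where the zeros are placed):

```python
import numpy as np

def zero_pad_to_max(signals):
    """Zero-pad every cough signal to the length of the longest one.

    Short signals are extended with trailing zeros so that all signals
    share one fixed length, as described above.
    """
    max_len = max(len(s) for s in signals)
    return np.stack([np.pad(s, (0, max_len - len(s))) for s in signals])
```

For example, padding a 2-sample and a 3-sample signal yields a (2, 3) array in which the short signal ends in a zero.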

The following methods were used for feature extraction.

Essentially, a spectrogram is obtained after four steps: pre-emphasis, framing, windowing, and the short-time Fourier transform (STFT). The spectrogram $S(n, k)$ [40] is the squared magnitude of $X(n, k)$, which is expressed as follows:

$$X(n,k)=\sum_{t} x(t)\,w(t-k)\,e^{-j 2\pi n t/N}$$

$$S(n,k)=\left|X(n,k)\right|^{2}$$

where $x(t)$ is the cough signal, $n$ is the Fourier coefficient, $k$ is the time frame, $w(t)$ is the windowing function, $N$ is the frame length, and $X(n, k)$ is the complex-valued STFT.
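The framing/windowing/STFT steps above can be sketched directly in numpy (pre-emphasis is omitted, and the Hann window, frame length, and hop size are illustrative assumptions rather than the paper's settings):

```python
import numpy as np

def spectrogram(x, frame_len=256, hop=128):
    """Compute S(n, k) = |X(n, k)|^2 via a windowed STFT.

    The signal is split into overlapping frames, each frame is windowed,
    and the squared magnitude of its FFT gives one spectrogram column.
    """
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frames.append(x[start:start + frame_len] * window)
    X = np.fft.rfft(np.array(frames), axis=1)   # X(n, k), complex STFT
    return (np.abs(X) ** 2).T                   # S(n, k): frequency x time
```

A 1024-sample signal with these settings yields a spectrogram with 129 frequency bins and 7 frames.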
A Mel spectrogram (MELSPEC) is an auditory representation derived by passing a cough signal through an STFT filter and a Mel filter bank [41]. A Mel spectrogram is expressed as follows [42], [43], [44]:

$$M(m,k)=\sum_{n} S(n,k)\,H_{m}(n)$$

where $M(m,k)$ is the generated Mel spectrogram and $H_{m}(n)$ represents a triangular Mel filter bank with $m$ Mel-frequency bands. The Mel frequency is calculated using the following formula:

$$f_{\text{Mel}}=2595\,\log_{10}\!\left(1+\frac{f}{700}\right)$$

In a Mel spectrogram, cough intensity bands are represented equally across Mel frequencies; thus, capturing different attributes from each frequency band can provide interesting results. Table 3 presents the code procedure used to obtain attributes from a Mel spectrogram.

The MFCC is typically calculated after passing a cough signal through an STFT filter, a Mel filter bank, and a discrete cosine transform filter [45]. The MFCC is calculated using the following equation:

$$c_{i}(k)=\sum_{m=1}^{M}\log\!\left[M(m,k)\right]\cos\!\left[\frac{i\,(m-0.5)\,\pi}{M}\right]$$

For the chroma constant-$Q$ transform (CHROMA-CQT) features, the CHROMA-CQT and its derivatives were first generated. Second, the maximum, mean, variance, and standard deviation of the CHROMA magnitude, as well as the maximum magnitudes of the derivatives of the CHROMA-CQT, were generated. The overall shape of a single-feature vector developed using the CHROMA-CQT was (1440, 120).
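Table 3 lists the paper's exact procedure; the fusion pattern it describes (per-band statistics of a spectrogram plus the maxima of its time derivatives) can be sketched as follows. The choice of `np.diff` for the derivatives and the ordering of the statistics are assumptions for illustration.

```python
import numpy as np

def emergent_features(spec):
    """Fuse per-band statistics of a spectrogram into one feature vector:
    maximum, mean, variance, and standard deviation of each band, plus the
    maxima of the first and second time derivatives of each band."""
    d1 = np.diff(spec, n=1, axis=1)   # first derivative along the time axis
    d2 = np.diff(spec, n=2, axis=1)   # second derivative along the time axis
    stats = [spec.max(axis=1), spec.mean(axis=1), spec.var(axis=1),
             spec.std(axis=1), d1.max(axis=1), d2.max(axis=1)]
    return np.concatenate(stats)      # shape: (6 * n_bands,)
```

A spectrogram with `n_bands` rows thus yields a `6 * n_bands` feature vector per signal; the band counts per method are the paper's own.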
The CQCC was developed for automatic speaker verification. It has also been applied to distinguish between patients with asthma and healthy people [48]. The CQCC is determined in three steps: (1) the CQT is calculated, after which the amplitude of the CQT is converted into decibels; (2) the MFCC procedure is used to obtain a 2D CQCC; and (3) emergent features are extracted. These steps are described in Table 4. The shape of a single-feature vector obtained from the CQCC was (1440, 240) in this study.

The LPC is a vocal tract feature used to characterize the spectral envelope of a speech signal. This coefficient has been used for classifying cough sounds [49], [50], with suitable results. After extracting the LPC [37], [51], its first and second derivatives are calculated. Subsequently, all the computed features are fused to obtain a single LPC feature. The shape of a single-LPC-feature vector generated in this study was (1440, 81). Fig. 3 illustrates the features extracted using some spectrogram methods in this study.

The two types of feature structures are single features and combined features. Two techniques were developed for restructuring combined features: one based on discrimination power (mean absolute deviation) and one based on the mutual information value. The two restructuring techniques are detailed in Table 5. These techniques were analogous; the difference is that the mean absolute deviation was calculated using (13), whereas the mutual information [52], [53] was calculated using (14).
$$\text{MAD}=\frac{1}{n}\sum_{t=1}^{n}\left|x_{t}-x_{\text{av}}\right| \tag{13}$$

$$I(X;Y)=\sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} \tag{14}$$

where $n$ is the number of data points, $x_{t}$ is the value of each data point in a series, $x_{\text{av}}$ is the average value of the data, MAD is the mean absolute deviation of the data, $p(x,y)$ is the joint probability of variables $x$ and $y$, $p(x)$ is the probability of variable $x$, and $p(y)$ is the probability of variable $y$.
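Both quantities are straightforward to compute from data. The sketch below implements the MAD of a series and the discrete mutual information between two label sequences, using empirical joint and marginal probabilities (how the paper discretizes continuous features for (14) is not specified here, so discrete inputs are assumed):

```python
import numpy as np

def mean_abs_deviation(x):
    """MAD per (13): mean of |x_t - x_av| over the series."""
    x = np.asarray(x, dtype=float)
    return np.mean(np.abs(x - np.mean(x)))

def mutual_information(x, y):
    """Discrete mutual information I(X; Y) per (14), in nats,
    from empirical joint and marginal probabilities."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            if pxy > 0:
                mi += pxy * np.log(pxy / (np.mean(x == xv) * np.mean(y == yv)))
    return mi
```

For identical binary sequences, the mutual information equals the entropy of the sequence (ln 2 for a balanced binary variable).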

The main DL architecture used in this study was a parallel-stream 1D-DCNN. The performance of this network was compared with that of a single-stream 1D-DCNN. Both networks exhibited the basic structure of a CNN, which comprises an input layer, hidden layers, and an output layer. The parallel- and single-stream 1D-DCNNs were constructed using the Keras library and executed in TensorFlow-GPU. The networks are described in the following text.

As depicted in Fig. 4, the constructed single-stream 1D-DCNN contained one input layer, three convolutional layers, one flattening layer, one dense layer, and one output layer. The first convolutional layer of this network used an L2 (0.001) kernel regularizer. The rectified linear unit activation function was used in all the layers except the last layer, in which the softmax activation function was used. Each convolutional layer had a stride of 1 and 'same' padding. After the dense layer, a 50% dropout was used.
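A minimal Keras sketch of this single-stream layout is given below. The filter counts, kernel sizes, and dense width are assumptions (the paper does not restate them here); what the sketch reproduces from the description above is the three convolutional layers with stride 1 and 'same' padding, the L2(0.001) regularizer on the first convolution, ReLU throughout, the 50% dropout after the dense layer, and the softmax output.

```python
from tensorflow.keras import layers, models, regularizers

def build_single_stream(input_len, n_classes=2):
    """Sketch of the single-stream 1D-DCNN described above.
    Filter counts and kernel sizes are illustrative assumptions."""
    inp = layers.Input(shape=(input_len, 1))
    x = layers.Conv1D(32, 3, strides=1, padding='same', activation='relu',
                      kernel_regularizer=regularizers.l2(0.001))(inp)
    x = layers.Conv1D(64, 3, strides=1, padding='same', activation='relu')(x)
    x = layers.Conv1D(128, 3, strides=1, padding='same', activation='relu')(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation='relu')(x)
    x = layers.Dropout(0.5)(x)
    out = layers.Dense(n_classes, activation='softmax')(x)
    return models.Model(inp, out)
```

Note the absence of pooling layers, matching contribution 4 above; downsampling is left entirely to the learned convolutions.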

The weighted F1 score [25] is expressed as follows:

$$F1_{\text{weighted}}=\sum_{i=1}^{C} w_{i}\,F1_{i}$$

where $C$ is the number of classes, $w_{i}$ is the proportion of samples belonging to class $i$, and $F1_{i}$ is the F1 score of class $i$. The MCC [54] is expressed as follows:

$$\text{MCC}=\frac{TP\cdot TN-FP\cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

The experiments were conducted on a computer with an Nvidia GeForce GTX 2060 graphics card with 6 GB of VRAM and a 1-TB hard disk drive. Audacity version 3.1.3 was used for signal segmentation in this study; Audacity is a multifunctional tool that enables users to import, edit, export, and record audio files [55]. The Librosa library [37] was used to analyze cough signals through processes such as audio wave loading and spectrogram extraction.

Two methods were used in this study for restructuring combined features, and the results obtained with these methods are presented in Table 7. The restructuring method based on discrimination power outperformed that based on the mutual information value; for two feature sets, the F1 scores obtained with the method based on discrimination power were 1.4% and 1.39% higher than those obtained with the method based on the mutual information value. Therefore, the restructuring method based on discrimination power was selected for further analysis.

Single-feature vectors extracted from all the spectrograms were aggregated, and the method based on discrimination power was used to restructure the combined features. Subsequently, these features were input to a single-stream 1D-DCNN. The classification results of this network are presented in Table 8.

In the parallel-stream 1D-DCNN, every single-feature vector was simultaneously fed to a different input, and the features extracted from the parallel streams were subsequently concatenated to form a merged layer. The resulting features were passed through a dense layer before being classified at the output layer (Fig. 4). The classification results obtained for the parallel-stream network are presented in Table 9. For dataset 1, the best classification results were obtained for the MFCC + LPC and MFCC + LPC + MELSPEC + CHROMA feature sets.
The F1 score and MCC for these feature sets were 98.61% and 0.971, respectively. An F1 score of 97.91% was obtained for the MFCC + LPC + MELSPEC and MFCC + LPC + MELSPEC + CHROMA + CQCC feature sets. Moreover, the worst classification results were observed for the CHROMA + LPC + MELSPEC feature set, with an F1 score of 93.37% and an MCC of 0.863. Fig. 5(a) shows the confusion matrix of the feature set for which the best classification results were obtained. Dataset 1 contained 170 dry cough sounds and 118 wet cough sounds. The classification results of the proposed parallel-stream network contained three false positives and one false negative; the network predicted 168 cough signals as dry coughs and 120 cough signals as wet coughs.
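The two evaluation metrics can be computed directly from binary confusion counts. The sketch below follows the standard formulas for the weighted F1 score and the MCC (which class is treated as positive is an assumption; the metrics are symmetric enough that the weighted F1 and MCC do not depend on that choice):

```python
import numpy as np

def weighted_f1_and_mcc(tp, fn, fp, tn):
    """Weighted F1 score and MCC for a binary problem from confusion counts."""
    def f1(tp_, fp_, fn_):
        p = tp_ / (tp_ + fp_) if tp_ + fp_ else 0.0   # precision
        r = tp_ / (tp_ + fn_) if tp_ + fn_ else 0.0   # recall
        return 2 * p * r / (p + r) if p + r else 0.0
    n_pos, n_neg = tp + fn, tn + fp
    # Weighted F1: per-class F1 weighted by class support.
    f1w = (n_pos * f1(tp, fp, fn) + n_neg * f1(tn, fn, fp)) / (n_pos + n_neg)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return f1w, mcc
```

A perfect classifier on dataset 1's class counts (118 positives, 170 negatives) yields a weighted F1 of 1.0 and an MCC of 1.0, as expected.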

For dataset 2, the best classification results were obtained for the MFCC + LPC + MELSPEC + CHROMA + CQCC feature set. The F1 score and MCC for this feature set were 82.96% and 0.663, respectively.

Overall, the single-stream network required fewer training parameters than did the parallel-stream network (3,866,954 vs. 3,998,042). The results presented in Tables 8 and 9 indicate that (1) training multiple parallel networks concurrently does not guarantee excellent classification results, and the selection of input features is important, and (2) the simultaneous aggregation of many features in a single-stream network does not result in high performance.

We examined whether the classification performance of the constructed parallel-stream network could be improved by concatenating layers at different levels. We trained the proposed parallel-stream network [Fig. 4(b)] and then modified the network by performing concatenation at the second and third convolutional layers. The MFCC + CQCC feature set was selected as the input of this network because of the similarities in the dimensions of these features. Better classification results were obtained when concatenating layers at the flattening level than at other levels; with concatenation at the flattening level, the F1 score was 99.30%, and the MCC was 0.985 (Table 10).

We compared the classification results obtained with the proposed parallel-stream network when using different strategies for merging layers: addition, multiplication, maximization, and concatenation. Layer merging involves combining two or more models or layers. The four layer merging strategies were adopted at different levels in this study (Table 11: classification results obtained when adopting different layer merging strategies at different levels using dataset 1). At the flattening level, the concatenation strategy achieved excellent classification performance but required a long classification time and numerous training parameters.

We also compared the classification performance achieved when concatenating layers at different levels. When the MFCC + CQCC feature set was used, better classification results were obtained when concatenating layers at the flattening level (F1 score of 99.30%) than at other levels.

Finally, the classification performance of the proposed parallel-stream network was examined under four layer merging strategies: addition, multiplication, concatenation, and maximization. These strategies were implemented at two levels: the second convolutional level and the flattening level. Better classification results were obtained for all the layer merging strategies, except the concatenation strategy, at the second convolutional level than at the flattening level. The best classification results were obtained with the maximization and concatenation strategies at the second convolutional and flattening levels, respectively.
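The four merging strategies reduce to simple elementwise or stacking operations on the two streams' activations, analogous to Keras's Add, Multiply, Maximum, and Concatenate merge layers. A numpy illustration:

```python
import numpy as np

def merge(a, b, strategy):
    """Merge two parallel streams' activations. For the elementwise
    strategies, a and b must share a shape; concatenation stacks them
    along the last (feature) axis, doubling its size."""
    if strategy == 'addition':
        return a + b
    if strategy == 'multiplication':
        return a * b
    if strategy == 'maximization':
        return np.maximum(a, b)
    if strategy == 'concatenation':
        return np.concatenate([a, b], axis=-1)
    raise ValueError(f'unknown strategy: {strategy}')
```

The shape behavior explains part of the cost trade-off noted above: concatenation doubles the feature dimension entering the dense layer (more parameters), whereas the other three strategies preserve it.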

In the future, we will develop a transfer learning algorithm that can execute the spectrogram methods used in this study. The performance of this algorithm will then be compared with that of the parallel-stream network designed in this study. Moreover, a novel method will be adopted to increase the quantity of data, and this method will be compared with the data enhancement method adopted in this study.