RS-MSConvNet: A Novel End-to-End Pathological Voice Detection Model

Recent studies have reported the success of the multi-scale convolutional neural network (MSConvNet) model in many classification applications due to its powerful ability to exploit a multi-scale convolution block that extracts multi-scale representations for detection. However, a new MSConvNet-based design for pathological voice detection has not been explored. In this paper, we propose RS-MSConvNet, a novel end-to-end MSConvNet model that uses raw speech for pathological voice detection. The main contribution of the proposed RS-MSConvNet method is to exploit a multi-scale convolution block, followed by a spatial-temporal feature block and a fully connected layer for classification. In addition, to further improve accuracy, we propose a novel hybrid detection model, called the RS-MSConvNet-SVM model, that integrates the feature extraction ability of the RS-MSConvNet model with a support vector machine (SVM) classifier. The effectiveness of our proposed models is investigated using the TORGO database. The experimental results reveal that the RS-MSConvNet model outperforms other baseline methods in the speaker-independent task. Moreover, compared to the RS-MSConvNet model, further improved accuracy is obtained using the RS-MSConvNet-SVM model. These outcomes show that our proposed models are useful for pathological voice detection.


I. INTRODUCTION
Pathological voice detection is the task of determining whether a provided utterance signal is a pathological or a healthy voice. It plays an important role in voice healthcare systems [1] such as voice clinics [2] and telemonitoring applications [3], [4], [5] because the detection of altered speech is a diagnostic tool to identify the onset of disabling physical symptoms [6], where the results are exploited to screen patients at risk of certain diseases. Moreover, pathological voice detection is an essential pre-processing step for automatic speaker recognition for dysphonic voice assessment [7] and dysarthric speech recognition [8]. In this study, we focus on pathological voice detection, which is a pattern recognition task in the field of biomedical and health informatics.
Typical pathological voice detection systems can be divided into two groups: traditional pipeline systems [9] and modern end-to-end systems [10]. In the earlier studies [11], the systems usually consist of front-end feature extraction and a back-end classifier. In traditional pipeline systems, handcrafted feature extraction converts the speech signal into a parametric representation, while the back-end classifier learns the feature representation to predict the pathological/healthy voice class. In modern end-to-end systems, which can extract features without handcrafted feature extraction, a deep learning-based classifier that predicts the target classes is learned from raw speech or its spectrogram. A brief survey of existing traditional pipeline and modern end-to-end approaches to pathological voice detection is given below.
In the traditional pipeline approach, most existing studies in pathological voice detection have focused on combining effective handcrafted feature extraction with effective classifiers. Researchers have introduced various feature extraction methods for pathological voice detection. Mel-frequency cepstral coefficients (MFCC), linear predictive cepstral coefficients (LPCC), linear prediction coefficients (LPC), and Multi-Dimensional Voice Program (MDVP)-based features were proposed in [12]. The harmonics-to-noise ratio [13], jitter [13], shimmer [14], the Kullback-Leibler divergence (KLD) histogram [15], and the KLD higher amplitude suppression spectrum [15] were proposed for pathological voice detection. Autocorrelation and entropy features in different frequency regions were proposed in [16]. In addition to the individual features mentioned above, openSMILE-set or glottal-source-set-based fusion features were introduced to combine acoustic features with statistical function sets, or to combine frequency-domain and time-domain glottal feature sets with statistical function sets, respectively. Moreover, the combination of openSMILE-set and glottal-source-set-based feature fusion was introduced [17] to fuse the merits of the different features. For the classifier, the support vector machine (SVM) has been utilized as a popular classifier in most previous studies [18], [19], [20], [21] because it provides promising results for pathological voice detection. In addition to the SVM, researchers have applied various classifiers such as artificial neural networks [22], [23], linear discriminant analysis [24], Gaussian mixture models [25], and decision trees [26]. In all these traditional pipeline approaches, the ability to distinguish pathological speech from healthy speech depends strongly on the effectiveness of the handcrafted feature extraction.
This implies that achieving good detection performance requires expert knowledge in speech processing to devise relevant features.
Regarding pathological voice detection using end-to-end systems, previous works [11], [27], [28] have shown that they do not require expert feature engineering because deep learning models can be trained on either the raw speech signal or its spectrum. For example, combinations of a convolutional neural network with a multilayer perceptron (CNN-MLP) or with a long short-term memory network (CNN-LSTM) using the raw speech signal were proposed in [17]. The results showed that the CNN-MLP and CNN-LSTM could provide good results for pathological voice detection. However, the raw speech signal without any modification was not efficient enough as an input for training an end-to-end model on small training data. To further improve the end-to-end CNN-MLP and CNN-LSTM, the authors of [11] proposed using the glottal flow signal to replace the raw speech signal as the input. The results showed that the end-to-end CNN-MLP and CNN-LSTM using the glottal flow signal performed better than the conventional CNN-MLP and CNN-LSTM using the raw speech signal. Even though the end-to-end CNN-MLP and CNN-LSTM using either raw speech or glottal source signals could provide encouraging results, designing a new end-to-end model for pathological voice detection remains an open research subject.
In this paper, a new end-to-end multi-scale convolutional neural network architecture using raw speech, RS-MSConvNet, is proposed for pathological voice detection. The main idea of the proposed architecture is to exploit a multi-scale convolution block that maps the input information into differently scaled representations, followed by a spatial-temporal feature block and a fully connected (FC) layer as the classifier block. In addition, to further improve detection accuracy, we propose a hybrid of the RS-MSConvNet and SVM (RS-MSConvNet-SVM) models. Here, an SVM classifier was explored to learn the automatically extracted features derived from the fully trained RS-MSConvNet model. RS-MSConvNet and RS-MSConvNet-SVM provide promising results for speaker-independent pathological voice detection.
The contributions of this article can be summarized as follows:
1) A novel end-to-end model architecture, RS-MSConvNet, is proposed to learn from raw speech. The proposed RS-MSConvNet architecture, being end-to-end, does not require expert knowledge in feature engineering.
2) We investigate our model on TORGO dataset. Here, the proposed RS-MSConvNet model performs comparably to other baseline systems in a speaker-independent approach.
3) The RS-MSConvNet is also modified using the SVM method, which replaces the FC layer as the classifier to learn the automatically extracted features derived from the fully trained RS-MSConvNet model. The modified RS-MSConvNet is referred to as the hybrid RS-MSConvNet-SVM model architecture. The RS-MSConvNet-SVM model provides improved accuracy compared to the RS-MSConvNet classifier.
The rest of this paper is organized as follows: Our proposed methods are introduced in Section II. Section III describes pathological voice detection setup including the details of the database, network training, baseline method, and experimental evaluation. In Section IV, the results and discussions are presented. Section V presents our conclusion.

II. PROPOSED METHOD
A. RS-MSConvNet
The proposed RS-MSConvNet framework for pathological voice detection consists of a pre-processing block, a multi-scale convolution block, a spatial-temporal feature extraction block, and a classifier block, as shown by the flowchart in Fig. 1. The configuration of the proposed RS-MSConvNet model is summarized in Table 1.

1) PRE-PROCESSING BLOCK
This subsection describes how the input data for training the RS-MSConvNet model are formed. Pre-emphasis is initially employed to compensate the high-frequency components of the input speech signal. Next, a framing operation is applied: the raw speech signal is divided into frames of 20 ms length with a 10 ms frameshift, and a Hamming window is then applied to each frame to enhance the harmonics and smooth the edges of the framed speech signal. Finally, the two-dimensional (2D) input data for training the proposed RS-MSConvNet model are formed by stacking the speech frames.
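The steps above can be sketched as follows. The function and parameter names are illustrative, and the pre-emphasis coefficient (0.97 here) is an assumption, since the paper does not state it:

```python
import numpy as np

def preprocess(signal, sr=16000, frame_ms=20, shift_ms=10, alpha=0.97):
    """Form the 2D input described above: pre-emphasis, 20 ms / 10 ms framing,
    Hamming windowing, and frame stacking. alpha is an assumed coefficient."""
    # Pre-emphasis boosts high-frequency components: y[t] = x[t] - alpha * x[t-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    frame_len = int(sr * frame_ms / 1000)    # 320 samples at 16 kHz
    shift = int(sr * shift_ms / 1000)        # 160 samples at 16 kHz
    n_frames = 1 + (len(emphasized) - frame_len) // shift
    window = np.hamming(frame_len)
    # Stack the windowed frames into an (n_frames, frame_len) matrix
    frames = np.stack([
        emphasized[i * shift : i * shift + frame_len] * window
        for i in range(n_frames)
    ])
    return frames

# A 500 ms segment at 16 kHz yields 49 frames of 320 samples each
x = np.random.randn(8000)
print(preprocess(x).shape)  # (49, 320)
```

With these settings a 500 ms segment maps to the 49 × 320 input size reported later in the paper.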

2) MULTI-SCALE CONVOLUTION BLOCK
Motivated by [29], [30], and [31], feature-pyramid networks based on a multi-scale convolution block have been proven to be an effective feature extraction technique for built-up area detection in synthetic aperture radar images [32] and for electroencephalography seizure detection [33], because the multi-scale convolution block can extract multi-scale semantic information and make a more precise prediction by means of gathering more robust semantic information from scaled features. In this paper, a multi-scale convolution block was implemented to map the 2D input data into multi-scale features, making the designed detection model able to learn patterns across a large range of scales. Each layer was designed to extract the input information at half the scale of the previous layer's resolution. The block could automatically learn the weights to effectively distinguish the valuable features of each level while reducing the signal to half its size.
In this block, the input data are shaped as (C, N, T), where C, N, and T are the number of channels, the number of frames, and the number of samples in each frame, respectively. Next, as seen in block (b), each 2D-convolution layer performs a convolution that reduces the input size, with the same kernel size of (2, 2), a stride of 2, and no padding. With this configuration, the numbers of rows (frames) and columns (samples) after the k-th convolution layer become half those of the (k − 1)-th convolution layer, giving an output size of (1, N/2^k, T/2^k) from the k-th layer. Here, the outputs from the second to fourth layers are used as the input for the next block for further extraction of spatial and temporal representations based on different fields of view.
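A minimal PyTorch sketch of this halving behavior is given below, assuming single-channel maps and an input of 64 frames (a power-of-two-friendly stand-in for the paper's 49 frames) of 320 samples; the channel counts are assumptions:

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Sketch of the multi-scale convolution block: four stacked 2D convolutions
    with kernel (2, 2), stride 2, and no padding, each halving both the number
    of frames and the samples per frame. Channel choices are assumptions."""

    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv2d(1, 1, kernel_size=(2, 2), stride=2) for _ in range(4)
        )

    def forward(self, x):
        # Keep the outputs of the second to fourth layers, which feed the
        # spatial-temporal feature extraction block at three different scales.
        outputs = []
        for k, conv in enumerate(self.layers, start=1):
            x = conv(x)
            if k >= 2:
                outputs.append(x)
        return outputs

# Assumed input: a batch of one single-channel (N, T) = (64, 320) map.
scales = MultiScaleBlock()(torch.randn(1, 1, 64, 320))
print([tuple(s.shape) for s in scales])
# [(1, 1, 16, 80), (1, 1, 8, 40), (1, 1, 4, 20)]
```

Each layer's output is (1, N/2^k, T/2^k), matching the formula above.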

3) SPATIAL-TEMPORAL FEATURE EXTRACTION BLOCK
In this block, the objective is to extract spatial-temporal features from each output scale of the multi-scale convolution block. Here, two regular 2D-convolution layers are used to process the last three scaled outputs of that block. For the first 2D-convolution layers, three kernel sizes of (N/4, 1), (N/8, 1), and (N/16, 1), with a stride of 2, no padding, and 32 output channels, are used to process the outputs of the second, third, and fourth layers of the previous block, respectively. For the second 2D-convolution layers, the same kernel size of (1, 4), with a stride of 2, no padding, and 16 output channels, is used to further process the outputs of the first layers. Finally, global average pooling is applied to the outputs of the two regular 2D-convolution layers. With this configuration, a total of 48 features (16 representations per scale) are obtained as the input for the next block.
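One branch of this block can be sketched as follows, again assuming N = 64 frames so that the three scales have heights N/4, N/8, and N/16; dummy tensors stand in for the multi-scale block's outputs:

```python
import torch
import torch.nn as nn

class SpatialTemporalBranch(nn.Module):
    """Sketch of one branch of the spatial-temporal feature extraction block:
    a (height, 1) convolution collapsing the frame axis, a (1, 4) convolution
    along time, then global average pooling to 16 features per scale."""

    def __init__(self, kernel_height):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=(kernel_height, 1), stride=2)
        self.conv2 = nn.Conv2d(32, 16, kernel_size=(1, 4), stride=2)
        self.pool = nn.AdaptiveAvgPool2d(1)  # global average pooling

    def forward(self, x):
        return self.pool(self.conv2(self.conv1(x))).flatten(1)  # (batch, 16)

# Assumed N = 64 frames and T = 320 samples; scales k = 2, 3, 4 of the
# multi-scale block have shapes (1, N / 2^k, T / 2^k).
N, T = 64, 320
scales = [torch.randn(1, 1, N // 2**k, T // 2**k) for k in (2, 3, 4)]
branches = [SpatialTemporalBranch(N // 2**k) for k in (2, 3, 4)]
features = torch.cat([b(s) for b, s in zip(branches, scales)], dim=1)
print(features.shape)  # torch.Size([1, 48]) -- 16 features per scale
```

Concatenating the three pooled branches yields the 48-dimensional feature vector described above.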

4) FULLY CONNECTED LAYER BLOCK
After the two convolutional layers and global average pooling, the outputs of the spatial-temporal feature module at different scales are concatenated and fed to the FC layer. Log softmax is used as our last layer for predicting the binary classes. With log softmax, the logarithm of the prediction probability of the binary classes is computed as follows:

log softmax(x_i) = x_i − log Σ_j exp(x_j),

where x_i is the i-th element of the input vector x and j runs over the classes (possible outcomes).
To calculate the classification loss, the cross entropy is implemented in this study. This loss function measures the discrepancy between the label and the predicted probability values as follows:

L(ŷ, y) = − Σ_i y_i log ŷ_i,

where ŷ and y are the predicted probability and the true label, respectively.
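In PyTorch (the framework used later in this paper), pairing a log-softmax output with the negative log-likelihood loss is equivalent to applying cross entropy directly to the logits, which a short check confirms; the logit values below are illustrative:

```python
import torch
import torch.nn.functional as F

# Illustrative logits for one input and two classes (healthy vs. pathological).
logits = torch.tensor([[1.5, -0.5]])
log_probs = F.log_softmax(logits, dim=1)   # x_i - log(sum_j exp(x_j))
target = torch.tensor([0])                 # true class index

# Cross entropy equals the negative log-likelihood of the log-softmax output.
nll = F.nll_loss(log_probs, target)
ce = F.cross_entropy(logits, target)
print(torch.allclose(nll, ce))  # True
```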

B. RS-MSConvNet-SVM
SVM-based methods have been proven to be effective for both classification and regression problems, so they have been utilized in many applications [34] such as face recognition [35], electroencephalography seizure detection [36], and automatic speech emotion recognition [37]. Moreover, most previous studies have used SVM-based methods as baseline classifiers for pathological voice detection because they can deal with the two-class classification problem. In this paper, the SVM-based method is applied to learn the automatically optimized features produced by the RS-MSConvNet model. This is motivated by [38], which integrated a CNN as a trainable feature representation and an SVM as a classifier; the resulting hybrid CNN and SVM (CNN-SVM) provided better accuracy than the CNN model for tumor detection, which was attributed to the fact that the developed model combined the advantages of the CNN and SVM models. Similarly, the hybrid RS-MSConvNet and SVM (RS-MSConvNet-SVM) shown in Fig. 2 is proposed by using an SVM as the classifier to replace the FC layer after the RS-MSConvNet model has been fully trained, as shown in Fig. 2 (a). In this paper, to construct the SVM in the hybrid model, we adopt the radial basis function (RBF) kernel and determine the penalty parameter C and the optimal kernel parameter γ by evaluating the validation data on the hybrid model learned from the training data. Both the training and validation data are explained in the next section.
The implementation process of the RS-MSConvNet-SVM model is shown in Fig. 2 (b) and can be summarized as follows:
1) For the training process, the samples of the training set were fed to the RS-MSConvNet model.
2) After the RS-MSConvNet classifier was fully trained, the corresponding feature information could be automatically extracted for each input map.
3) The FC layers were replaced with SVM-based classifier to learn the automatically extracted feature vectors derived from the fully trained RS-MSConvNet classifier.
4) For the test process, the samples of test set were fed to the fully trained RS-MSConvNet classifier to obtain the automatically extracted features as the test feature representation.
5) The test feature data was fed to the well-trained SVM for predicting healthy or pathological class.
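Steps 3) to 5) can be sketched with scikit-learn's RBF-kernel SVM, using synthetic clusters as stand-ins for the 48-dimensional features extracted by the fully trained RS-MSConvNet; the C value and the default gamma are assumptions here, since in the paper both are tuned on the validation set:

```python
import numpy as np
from sklearn.svm import SVC

# Stand-in for features extracted by the fully trained RS-MSConvNet (48-D
# vectors, step 2 above); here two synthetic, well-separated clusters.
rng = np.random.default_rng(0)
train_x = np.vstack([rng.normal(0, 1, (100, 48)), rng.normal(4, 1, (100, 48))])
train_y = np.array([0] * 100 + [1] * 100)      # 0 = healthy, 1 = pathological
test_x = np.vstack([rng.normal(0, 1, (20, 48)), rng.normal(4, 1, (20, 48))])
test_y = np.array([0] * 20 + [1] * 20)

# Step 3: replace the FC layer with an RBF-kernel SVM. In the paper, C and
# gamma are chosen on the validation set; the defaults here are placeholders.
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(train_x, train_y)

# Steps 4-5: classify the test-set feature vectors.
accuracy = (svm.predict(test_x) == test_y).mean()
print(accuracy)
```

In the real pipeline, `train_x` and `test_x` would be the feature vectors read out of the trained network rather than synthetic data.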

III. EXPERIMENTAL SETUP
A. DATABASE
TORGO [39] and UA-Speech [40] are commonly used databases for pathological voice detection. In this paper, the TORGO database is used to investigate our RS-MSConvNet and RS-MSConvNet-SVM models. The main reason for using this database is that it is more challenging than the UA-Speech database due to its limited data, which makes it difficult for end-to-end models trained on the TORGO database to achieve accuracy as high as end-to-end models trained on the UA-Speech database, as seen in [17]. Moreover, the results can then be directly compared under the experimental settings of [17]. The publicly available TORGO corpus was produced by three females (F01, F03, F04) with dysarthria, three healthy females, five males (M01, M02, M03, M04, M05) with dysarthria, and four healthy males (MC01, MC02, MC03, MC04). In this database, participants without dysarthria recorded approximately 900 utterances on average, while participants with dysarthria recorded approximately 400 utterances on average. Further details of the TORGO database can be found in [39]. All speech utterances were sampled at 16 kHz. In this study, since a substantial amount of silence was contained in the TORGO database, it needed to be removed before training/testing the classification model. To conduct speaker-independent pathological voice detection as advised in [17] and [41], the database is divided into three sets: a training subset (3,125 healthy and 1,491 pathological utterances; 3.5 h), a validation subset (944 healthy and 795 pathological utterances; 2 h), and a testing subset (2,087 healthy and 861 pathological utterances; 3 h). Table 2 summarizes the three subsets of the TORGO database used in our experiments.

B. NETWORK TRAINING
In this paper, we used the PyTorch v1.10.1 framework to build the proposed method. An NVIDIA RTX 3090 with 24 GB of memory was used to train the networks. The Adam optimizer was exploited to optimize the loss function in each iteration of the training process. The model parameters for training the RS-MSConvNet and RS-MSConvNet-SVM models are listed in Table 3.

C. BASELINE SYSTEMS
Based on the same database and training/testing conditions, the effectiveness of the proposed RS-MSConvNet and RS-MSConvNet-SVM methods is compared with the results of five baseline system groups: OpenSMILE+SVM, Glottal+SVM, OpenSMILE-Glottal+SVM, conventional end-to-end methods, and modified end-to-end methods using the glottal flow signal.
1) OpenSMILE+SVM methods: two acoustic feature sets obtained using the OpenSMILE toolkit were used as the input feature information for the classifier. The first OpenSMILE acoustic feature set (OpenSMILE-1), with a total of 384 dimensions ((16 dimensions of the chosen acoustic features + 16 dimensions) × 12 statistical functions), and the second OpenSMILE acoustic feature set (OpenSMILE-2), with a total of 6552 dimensions ((56 dimensions of the chosen acoustic features + 56 dimensions) × 39 statistical functions), were used as the input for the SVM classifier. The lists of the OpenSMILE-1 and OpenSMILE-2 sets with their statistical information are summarized in Table 4. For this baseline, the SVM-based classifiers using the OpenSMILE-1 and OpenSMILE-2 sets are referred to as OpenSMILE-1+SVM and OpenSMILE-2+SVM, respectively.
2) Glottal+SVM methods: two glottal feature sets (Glottal-1 and Glottal-2) were used as the input features for the SVM-based classifier.
3) OpenSMILE-Glottal+SVM methods: to take advantage of the two types of feature sets mentioned above, the OpenSMILE-1/OpenSMILE-2 and Glottal-1/Glottal-2 sets were joined as the input to further improve the SVM-based classifier. Here, SVM using the joint OpenSMILE-1 and Glottal-1 sets, the joint OpenSMILE-1 and Glottal-2 sets, the joint OpenSMILE-2 and Glottal-1 sets, and the joint OpenSMILE-2 and Glottal-2 sets is referred to as OpenSMILE-1-Glottal-1+SVM, OpenSMILE-1-Glottal-2+SVM, OpenSMILE-2-Glottal-1+SVM, and OpenSMILE-2-Glottal-2+SVM, respectively.
4) Conventional end-to-end methods: the CNN-MLP and CNN-LSTM methods were used as baseline end-to-end methods using raw speech.
5) Modified end-to-end methods using glottal flow: as for the conventional end-to-end methods, the CNN-MLP and CNN-LSTM methods were used; unlike the conventional end-to-end methods, the glottal flow signals replaced the raw speech signals as the input to the CNN-MLP and CNN-LSTM-based classifiers.
Further details of the five baseline methods compared with our proposed methods can be found in [17].

D. EXPERIMENTAL EVALUATION
In order to investigate the effectiveness of our proposed methods, three common evaluation criteria suggested in [42] are used: classification accuracy, sensitivity, and specificity. Accuracy is computed as

Accuracy = (TP + TN) / (TP + TN + FP + FN),

where TP and TN are the true pathological and true healthy voices, i.e., the pathological and healthy voice classes that the fully trained network correctly predicts, while FP and FN are the healthy voices incorrectly classified as pathological and the pathological voices incorrectly classified as healthy, respectively. The sensitivity and specificity are calculated as follows:

Sensitivity = TP / (TP + FN),
Specificity = TN / (TN + FP).
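A minimal sketch of these three criteria, with the pathological class treated as positive; the confusion-matrix counts are hypothetical, for illustration only:

```python
def detection_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts,
    with the pathological class treated as positive."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # fraction of pathological voices detected
    specificity = tn / (tn + fp)   # fraction of healthy voices kept as healthy
    return accuracy, sensitivity, specificity

# Hypothetical counts, not results from the paper.
print(detection_metrics(tp=80, tn=90, fp=10, fn=20))  # (0.85, 0.8, 0.9)
```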

IV. RESULTS AND DISCUSSIONS
A. RESULTS ON RS-MSConvNet
This subsection reports the performance of RS-MSConvNet.
The following conclusions can be drawn from varying the parameter configuration:
• Since the fixed-length segment has an effect on the performance of the end-to-end network, it is important to find a suitable fixed-length segment. In this paper, different fixed-length segments of 240 ms, 250 ms, 500 ms, 1 s, and 3 s were first investigated to find the optimal one. Table 6 reports the results of the RS-MSConvNet model using the different fixed-length segments.
It can be seen that the fixed-length segment of 500 ms provided the best performance compared with the others. The reasons were that the network could not sufficiently learn from overly short raw speech segments (240 or 250 ms), which might not have contained enough pathological information, while a fixed-length segment longer than 500 ms could not be exploited due to the short durations of some of the vowels, as summarized in [47]. Moreover, a fixed-length segment longer than 500 ms might have led to overly little training data, making the fully trained network incompetent at detecting pathological speech from healthy speech. The results indicate that the fixed-length segment of 500 ms (49 × 320 pixels) was the most suitable for the RS-MSConvNet model.
• Many trials, which are not reported in this section, were conducted by adding/removing convolution layers and changing the parameters in the multi-scale convolution block and the spatial-temporal feature extraction block, but they did not achieve better results. Moreover, batch normalization was also applied to the spatial-temporal feature extraction block, and it did not improve the detection performance. This outcome means that changing the multi-scale convolution block, the spatial-temporal feature extraction block, or their parameters was not beneficial for our RS-MSConvNet method.
• Finally, since the number of FC layers has an effect on the detection performance, the number of FC layers was varied from 1 to 5. Table 3 shows the comparison among different numbers of FC layers. It was found that the detection performance decreased when using more than two layers. This is because one FC layer is suitable for classification with two classes and limited training data, as suggested in [43] and [44]. This suggests that using one FC layer as the classifier was suitable for our RS-MSConvNet model.
To visualize the discriminating information in the scaled feature representations for pathological voice detection, a healthy voice and a pathological voice signal were chosen and fed into a fully trained RS-MSConvNet model. The output representations derived from the second to fourth convolution layers were then compared to show the discriminating feature information of the healthy and pathological voices. In this paper, the representation images are displayed using the matplotlib function based on the bilinear interpolation method [45]. Here, since the fixed-length segment of 500 ms, which provided the best results as mentioned above, was used as the input for RS-MSConvNet, the output sizes of the second, third, and fourth layers are 10 × 80 pixels, 5 × 40 pixels, and 2 × 20 pixels, respectively. Fig. 4 shows a comparison of the feature representations of a healthy voice and a pathological voice signal with a similar amplitude signature. We can observe from Fig. 4 that the convolution layers provided different representations for the healthy voice and the pathological voice. This indicates that the proposed multi-scale convolution block could give discriminative features for pathological voice detection.
Next, to observe the discriminating ability of the spatial-temporal features derived from the trained RS-MSConvNet model for detecting pathological voices, t-distributed stochastic neighbor embedding (t-SNE) [46], a commonly used method for dimensionality reduction, was exploited to examine the distributions of the healthy voice and pathological voice categories. Here, 200 pathological and 200 healthy voice samples were selected to show the distributions of the two classes based on the t-SNE analysis. Fig. 5 shows the visual distribution of the spatial-temporal features derived from the trained RS-MSConvNet model. As seen in Fig. 5 (a), the data distributions of the different classes using raw speech signals without any feature extraction overlapped significantly. This caused difficulty in distinguishing the different voices. By comparing Fig. 5 (a) with (b), it can be seen that the data distribution of the proposed spatial-temporal features was better separated than that of the raw speech signals without any processing, as it provided clear contours and small intra-class distances. This suggests that the spatial-temporal features of the RS-MSConvNet could be useful for pathological voice detection.
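A t-SNE projection of this kind can be produced as follows, using synthetic 48-dimensional vectors as stand-ins for the extracted spatial-temporal features (the colored scatter plot itself is omitted here):

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for 200 pathological and 200 healthy 48-D spatial-temporal feature
# vectors extracted by the trained model; here two synthetic clusters.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(0, 1, (200, 48)), rng.normal(3, 1, (200, 48))])
labels = np.array([0] * 200 + [1] * 200)

# Project the 48-D features to 2-D for visual inspection of class separation.
embedding = TSNE(n_components=2, random_state=0).fit_transform(features)
print(embedding.shape)  # (400, 2)
```

The 2-D `embedding` would then be scattered per class (e.g. with matplotlib) to reproduce a figure like Fig. 5.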

B. RESULTS ON RS-MSConvNet-SVM
This subsection presents the results of RS-MSConvNet-SVM. Because the γ value directly affects the SVM-based detection performance, it is important to find the optimal γ. Here, the γ value was varied from 0.1 to 30 with a step size of 0.1; the optimal γ of 0.1, which provided the highest accuracy, was found by evaluating the validation set on the hybrid model trained on the training set. Therefore, the hybrid model based on the fully trained RS-MSConvNet and the optimal γ of 0.1 was used to evaluate the testing set, because its decision, ranked by the distance between the RS-MSConvNet-based automatically extracted features and the hyperplane of the trained SVM model, achieved the highest number of correct predictions. Fig. 6 compares the result of the RS-MSConvNet-SVM model with that of the RS-MSConvNet model.
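The validation-based sweep over γ can be sketched as follows, with synthetic features standing in for those extracted by the fully trained RS-MSConvNet and an assumed C value:

```python
import numpy as np
from sklearn.svm import SVC

# Stand-in feature vectors for the training and validation subsets (the real
# ones come from the fully trained RS-MSConvNet); synthetic clusters here.
rng = np.random.default_rng(1)
train_x = np.vstack([rng.normal(0.0, 0.3, (150, 48)), rng.normal(0.6, 0.3, (150, 48))])
train_y = np.array([0] * 150 + [1] * 150)
val_x = np.vstack([rng.normal(0.0, 0.3, (50, 48)), rng.normal(0.6, 0.3, (50, 48))])
val_y = np.array([0] * 50 + [1] * 50)

# Sweep gamma from 0.1 to 30 in steps of 0.1, keeping the value that gives
# the highest validation accuracy (the paper reports 0.1 as optimal).
best_gamma, best_acc = None, -1.0
for gamma in np.arange(0.1, 30.1, 0.1):
    svm = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(train_x, train_y)
    acc = (svm.predict(val_x) == val_y).mean()
    if acc > best_acc:
        best_gamma, best_acc = gamma, acc
print(best_gamma, best_acc)
```

The SVM refit with `best_gamma` would then be applied to the test-set feature vectors, as in the previous section.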
As seen in Fig. 6, improved accuracy was obtained using the hybrid model. The accuracy improved from 86.46 % with the RS-MSConvNet to 87.61 % with the RS-MSConvNet-SVM. This can be attributed to the decision, ranked by the distance between the RS-MSConvNet-based automatically extracted features and the hyperplane of the trained SVM model, achieving higher specificity than the RS-MSConvNet classifier, which directly improved the detection accuracy. This result indicates that the RS-MSConvNet-SVM seems useful for detecting pathological voice from healthy voice.

C. COMPARISON WITH BASELINE SYSTEMS
In this subsection, the performance of our proposed methods is compared with that of some known systems. As mentioned in the introduction, some systems cannot be discussed because their experiments were based on a speaker-dependent approach or on a database different from ours. Here, only the results based on the TORGO database with a speaker-independent approach, i.e., the same condition as our experiments, were compared. Table 7 shows the results of some known systems compared with our proposed methods. As seen in Table 7, the results obtained with the RS-MSConvNet and RS-MSConvNet-SVM models outperformed all known systems in terms of accuracy and sensitivity. For the specificity result, it was observed that the end-to-end (CNN-MLP) approach using glottal flow information performed better than the proposed methods, because the glottal flow signal gives more discriminative information than the raw speech signal, as summarized in [47], making the detection of pathological voice more specific. However, the accuracy and sensitivity of the CNN-MLP using glottal flow information were worse than those of the proposed systems. This indicates that the proposed methods can give more reliable classification performance without requiring expert knowledge in pre-processing to compute an alternative signal to replace the raw speech signal.

V. CONCLUSION
In this paper, we proposed a new RS-MSConvNet architecture for pathological voice detection. The main contribution of the proposed RS-MSConvNet method is to use a multi-scale convolutional neural network, followed by spatial-temporal features and an FC layer as the classifier. In addition, we proposed a hybrid model that integrates RS-MSConvNet as a trainable feature representation with a support vector machine (SVM) as the classifier, referred to as the RS-MSConvNet-SVM model. The performances of our proposed models were evaluated using the TORGO database. From the experimental results, it was observed that the RS-MSConvNet gave discriminating feature information between healthy and pathological voices, as shown via the t-SNE method, and provided an accuracy of 86.46 %, which outperformed the other baseline systems. In addition, improved accuracy was obtained using the RS-MSConvNet-SVM model: the accuracy improved from 86.46 % with the RS-MSConvNet to 87.61 % with the RS-MSConvNet-SVM. The results indicated that our proposed RS-MSConvNet and RS-MSConvNet-SVM approaches could be useful for pathological voice detection.
In the future, the effectiveness of the attention mechanism will be explored to further improve our proposed RS-MSConvNet and RS-MSConvNet-SVM approaches. We will also use the glottal flow signal [48] to replace the raw speech signal as the input to our proposed methods.