A Spectrogram-Based Deep Feature Assisted Computer-Aided Diagnostic System for Parkinson’s Disease

Parkinson’s disease is a neurodegenerative disease. It progresses slowly from a mild to a severe stage, resulting in the degeneration of dopamine-producing neurons. The resulting dopamine deficiency in the brain leads to motor (tremor, slowness, impaired posture) and non-motor (speech, olfactory) defects. Early detection of Parkinson’s disease is difficult because the symptoms appear over time. However, different diagnostic systems have contributed towards disease detection by considering gait, tremor and speech characteristics. Recent work has shown that speech impairments are a possible predictor for Parkinson’s disease classification, and this remains an open research area. Speech signals of Parkinson’s patients differ markedly from those of healthy speakers. Therefore, these variations should be modeled using acoustic features. In this research, we propose three methods: the first employs a transfer learning-based approach using spectrograms of speech recordings, the second evaluates deep features extracted from speech spectrograms using machine learning classifiers, and the third evaluates simple acoustic features of the recordings using machine learning classifiers. The proposed frameworks are evaluated on the Spanish dataset pc-Gita. The results show that the second framework, based on deep features, gives the most promising results. The highest accuracy, 99.7%, is observed on vowel \o\ and read text using a multilayer perceptron, whereas 99.1% accuracy is observed on vowel \i\ deep features using random forest. The deep feature-based method performs better than the simple acoustic feature and transfer learning approaches. The proposed methodology outperforms the existing techniques on the pc-Gita dataset for Parkinson’s disease detection.

associated with movements and are more perceptible than non-motor symptoms [3]. Among the motor symptoms, the patient suffers from slowness of movement (referred to as bradykinesia), rigidity, postural instability and tremor. Non-motor symptoms become evident at particular intervals and include sleep disorders, speech and swallowing problems, and olfactory disorder (loss of the sense of smell) [1], [3]. The effect of Parkinson's disease on speech is characterized in terms of phonation, articulation, and prosody. Phonation is the use of the vocal folds for speech, articulation refers to the use of special tissues in speech production, and prosody is related to the amplitude, loudness and pitch used to produce sound. Most of the work on Parkinson's disease detection has considered phonation, which includes the pronunciation of the vowels \a\, \e\, \i\, \o\, \u\ [4].
Speech signals are usually considered one of the main means of diagnosing Parkinson's disease (PD). In [5] the authors experimented on speech recordings in three languages and identified speech pronunciation problems in them. They observed that the pronunciation of vowels, sentences and words is affected by this disease. Therefore, speech is considered a major predictor of PD. In [6] the authors reported that articulation, intelligibility and prosody features of the speech signal show promising results in the detection of PD. In [7] the authors showed that age also contributes to the disease. They found that the speech recordings of young speakers show significant defects in speech pronunciation tasks. Researchers also observed that monitoring Skype calls using normal sentences reveals significant errors in the pronunciation of PD patients [8]. Traditionally, acoustic features are considered in most recent works along with SVM for PD detection. Most recent work has performed disease detection using gait [9], handwriting [10], [11] and speech datasets. Furthermore, the literature shows that Gaussian-based models, several machine learning techniques, and convolution neural networks [12] are contributing to PD diagnosis. Parkinson's disease detection using the most suitable speech impairment features is imperative and is still an open research area.
This research work analyzes speech recordings using spectrograms and acoustic features. In our work, all recordings are transformed into spectrograms using the short-time Fourier transform, and these spectrograms are used in the transfer learning method. We propose a simple acoustic feature-based method and also consider a pre-trained convolution neural network architecture [13], [14], the Alexnet model, for deep feature extraction and detection of PD. For a fair comparison, we have also used transfer learning-based classification. To evaluate the performance of the proposed methods, Parkinson's disease speech recordings from the PC-GITA [15] dataset are used. The results show that the deep feature-based technique produces better results. The main contributions of our research for PD detection using speech signals are the following:
1. We propose a spectrogram-based approach to extract deep features to distinguish PD patients from healthy subjects.
2. We propose an acoustic-phonetic based approach for the detection of PD.
3. We conduct a comparison of the proposed deep feature-based approach with simple acoustic feature and transfer learning-based methods.
The rest of the paper is organized as follows. Section 2 presents the literature review, section 3 explains the proposed methodology, section 4 describes the experimental setup and dataset details, and section 5 gives a detailed view of the results and simulations, followed by the conclusion and future work.

II. LITERATURE REVIEW
This section explains the existing techniques on Parkinson's disease detection using the Spanish speech dataset. The techniques presented by different researchers are grouped into two categories that are explained in detail in the following subsections.

A. MACHINE LEARNING BASED METHODS
In the past few years, machine learning-based disease classification has been widely used in the medical field and has acquired remarkable significance [16], [17]. A Gaussian-based density was considered by Moro-Velazquez et al. [18] on four to five different PD corpora. They exploited phonetic text-dependent utterances that require vocal tract features of speech signals. They assessed words, sentences, monologues and vowels in three corpora with more male patients (Czech dataset). They showed better results than on the other (Spanish) datasets and reported 81% accuracy. Rueda et al. [6] used a wrapper feature selection method for vowel \a\ and the words \pa-ta-ka\ (articulation, phonation and diadochokinetic features) from pc-Gita recordings and achieved 70% accuracy. Pérez-Toro et al. [19] considered classical features such as term frequency and bag of words, for monologues only. They used pc-Gita recordings, reported that language pronunciation contains enough information for PD classification, and achieved 72% accuracy.
Karan et al. [20] assessed inherent and decomposition-based features from the vowels \a\ and \o\ only, using two datasets, pc-Gita and the Saarbrucken dataset. They reported 96% accuracy on the Spanish dataset using random forest and support vector machine. Kacha et al. [21] showed that the use of PCA on articulation features from sentence spectrograms (STFT) of the Spanish pc-Gita corpus is efficient for the detection of Parkinson's disease. Parra-Gallego et al. [4] considered articulation and intelligibility features from words such as \pa-ta-ka\ in the Spanish pc-Gita recordings. The authors reported 88% accuracy when classifiers were trained using intelligibility and articulation-based features.
Vasquez-Correa et al. [22] proposed a novel approach that considers the on-off state of the vocal folds (i.e., on when the candidate starts speaking and off when they stop speaking). It includes words, vowels, monologues, and sentences from pc-Gita recordings. They reported 94.9% accuracy on speech signals for PD classification. Garcia et al. [23] assessed the appropriateness of i-vectors for classifying Parkinson's disease using articulation, prosody and phonation features for words, vowels and 10 sentences from the pc-Gita dataset. They computed a cosine distance for disease detection, obtaining 78% accuracy using articulation features. Klumpp et al. [24] developed a model that considered the voice signal during phone calls as well as the syllables \pa-ta-ka\. They evaluated the severity and onset of PD in their work. Arias-Vergara et al. [7] assessed the aging factor in their research work. They considered articulation and prosody features together with age, using vowels from the Spanish pc-Gita speech corpus.
The authors also accounted for gender and reported that age plays a very important role in classifying Parkinson's disease. Their research found that signals from young speakers contribute more to the PD classification process than signals from older speakers. They modeled binary and multi-class support vector machines, compared their results with neural networks, and reported 95% accuracy. Moro-Velázquez et al. [25] presented a new approach for classifying a speech signal as belonging to a Parkinson's patient or a healthy subject. They proposed a phonological feature-based method that uses the speech signals of words, monologues and read text from the Spanish pc-Gita dataset. The authors reported that this approach is quite useful in the assessment of Parkinson's disease patients in clinics. Moro-Velázquez et al. [25] used traditional machine learning methods considering articulation and phonological features for Parkinson's disease detection. They assessed kinetic features for /pa-ta-ka/, two read sentences and a sustained vowel /a/ from the speech signals of Parkinson's disease patients (PDP) in the Spanish pc-Gita dataset. They performed classification using Gaussian mixture modeling and i-vectors and reported 87% accuracy. Orozco-Arroyave et al. [26] proposed open-source software for Parkinson's disease. The authors used the phonation, articulation, prosody, and intelligibility dimensions of the speech signal for vowels from the pc-Gita dataset, with conventional machine learning, to identify Parkinson's disease. They designed a system that can be easily adopted by clinicians to assess different voice diseases. Arias-Vergara et al. [8] proposed a model for the assessment of Parkinson's disease using individual speaker speech signal analysis. They assessed phonation, articulation, and prosody to model recordings of spontaneous speech and a read text from the Spanish pc-Gita dataset captured over different channels (mobile phone calls and online calls such as Skype).
In this work, the authors observed that Skype speech signals were effective for the remote observation of Parkinson's disease patients. They performed the evaluation using Gaussian mixture modeling and i-vectors and obtained a 0.77% correlation. Vásquez-Correa et al. [27] presented an improved version of m-FDA for PD detection. They considered phonation, articulation, prosody, and intelligibility features from the Spanish vowel /a/, sentences and words. Orhan et al. [28] considered vowel recordings from freely available datasets [29]. They incorporated statistical pooling to increase the number of features, used ReliefF to select the best features, and achieved 91% accuracy using SVM. El Maachi et al. [30] evaluated the gait physionet dataset for PD diagnosis. The authors assessed Parkinson's disease from gait using a deep 1-D neural network. They achieved 98.7% accuracy using deep neural networks and 85.3% accuracy in determining the severity of PD. Tuncer and Dogan [31] proposed an octopus-based multiple pooling method (comprising eight pooling methods) for feature extraction. They evaluated a vowels dataset and achieved 99.2% accuracy using SVM for PD and gender classification. Diogo et al. [32] presented an approach for the early diagnosis of PD using three distinct databases consisting of vowel pronunciations in different languages (Portuguese, uci data, etc.). They evaluated acoustic and phonetic characteristics of speech. Their work achieved a highest accuracy of 99.94% using random forest.

B. CONVOLUTION NEURAL NETWORK-BASED METHODS
Recent studies show that neural networks contribute immensely to speech classification [13], [33]. Trinh and Darragh [12] proposed a convolution neural network-based approach for two PD datasets, the Saarbrucken voice database and the pc-Gita dataset, and achieved 96.7% accuracy on the pc-Gita dataset. Naranjo et al. [29] proposed a convolution neural network model for the feature extraction and classification process. They considered articulation from the /pa-ta-ka/ words, sentences and read text from pc-Gita and observed the start and stop signals of speech. They extracted features from spectrograms and achieved 89% accuracy using Gaussian mixture modeling and i-vectors. Teixeira et al. [34] concentrated on discretizing a neural network and its information sources by utilizing quantization and weight scaling. They applied a linear homomorphic encryption batching technique to the Spanish pc-Gita dataset. They observed a prediction time of 1.4 ms, compared with the original approach in which prediction took 4.5 s to compute results. Arias-Vergara et al. [35] considered phonation, articulation, and prosody data from monologue recordings of an extended version of the Spanish language dataset. They used a convolution neural network-based model to extract the most suitable features and a support vector machine for classification. They achieved 84% accuracy and showed that prosody features are more effective than the others.

III. PROPOSED METHODOLOGY
In this work, we propose spectrogram and acoustic feature-based frameworks for Parkinson's disease classification. The first framework employs a transfer learning approach for speech spectrograms. In our second proposed method, we evaluate deep learning-based feature extraction from speech spectrograms, while the third method considers simple acoustic features for Parkinson's disease detection. Deep learning has achieved tremendous results in many fields such as computer vision, image processing and speech signal recognition [14]. In our work, we used the pre-trained convolution neural network architecture Alexnet for feature extraction from speech signals and speech spectrograms. In both models, the first step is signal preprocessing, followed by deep feature extraction using Alexnet and handcrafted feature extraction. We performed classification using transfer learning and machine learning classifiers. All steps of the methodology are explained in detail in the following subsections.

A. SIGNAL PREPROCESSING
To input data into a classifier, the speech signals are first converted to spectrograms. A spectrogram is a visual representation of how the spectrum of a signal changes over time [36]. In our work, we converted each Parkinson's disease speech signal into a spectrogram. In the time domain, the digitally sampled data is divided into overlapping segments, and the Fourier transform of each segment is computed to obtain its spectral amplitude. Each segment corresponds to a vertical line in the image. Table 1 shows the signal processing parameters, Fig 2 shows the waveform of a monologue pronounced by a healthy candidate and Fig 3 shows the waveform of a Parkinson's disease patient.
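The segment-and-transform procedure described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the implementation used in this work: the function names `hann` and `stft_magnitude`, the frame length of 64 samples, the hop of 32 samples and the synthetic 1 kHz test tone are all assumptions made for the example, and the actual parameters are those listed in Table 1.

```python
import cmath
import math


def hann(n):
    """Hann window of length n (reduces spectral leakage at segment edges)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def stft_magnitude(signal, frame_len=64, hop=32):
    """Return one list of spectral magnitudes per overlapping segment.

    frame_len and hop are illustrative values, not the Table 1 settings.
    """
    window = hann(frame_len)
    n_bins = frame_len // 2 + 1  # keep non-negative frequencies only
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + i] * window[i] for i in range(frame_len)]
        # Naive DFT of the windowed segment (O(n^2), adequate for a demo).
        mags = []
        for k in range(n_bins):
            acc = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            mags.append(abs(acc))
        frames.append(mags)  # one frame = one vertical line of the image
    return frames


# Synthetic stand-in for a recording: 1 kHz sine sampled at 8 kHz.
fs = 8000
tone = [math.sin(2 * math.pi * 1000 * t / fs) for t in range(512)]
spec = stft_magnitude(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * fs / 64
print(len(spec), "frames,", len(spec[0]), "bins, peak at", peak_hz, "Hz")
```

Each inner list corresponds to one column of the spectrogram image; in practice a library FFT would replace the naive DFT, and the magnitudes would be mapped to a color scale before being saved as an image.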

B. FEATURE EXTRACTION: DEEP FEATURES USING ALEXNET MODEL
In our work, we used deep learning and machine learning methods for the classification process. In order to input data into our classifiers, deep features were extracted [37], [38] from our speech signal dataset. We used the deep learning convolution model Alexnet to extract deep features from the pc-Gita dataset. Our dataset consists of spectrograms of the vowels \a\, \e\, \i\, \o\, \u\, monologues and read text. Alexnet is a deep neural network-based model that consists of 8 layers. The first five layers are convolution layers, whereas the last three layers are fully connected layers. Fig 4 shows the feature extraction process using Alexnet. To input data into the Alexnet model, we scaled the spectrograms to fit the model. Spectrograms obtained from signal preprocessing are of size 224 × 224, whereas Alexnet accepts input images of size 227 × 227.
Thus, all images are scaled accordingly. The Alexnet model has convolution layers, hidden layers, and a classification layer at the end. The first five convolution layers of the network are trained on ImageNet data, and the last three fully connected layers are replaced and trained with the target Parkinson's disease data. In this work, we extracted features from the first five convolution layers; thus, the fully connected classification layers are not used in this architecture. The model extracts deep features from the individual data subsets: vowels, monologues and read text. Table 2 shows the total number of features extracted from each set of Parkinson's Spanish speech data. The generic transfer learning architecture is presented here:

a: INPUT LAYER
In the transfer learning model, the first layer accepts the input, which consists of images of size 227 × 227. We input RGB spectrograms (vowels, monologues, read text), and each of them is given as a separate input.
In eq. 1, $N_i$ represents the number of images, $W_i$ is the width of input image $i$, $H_i$ is its height and $D$ is the depth. This layer performs all computation and transfers it to the next fully connected layers. This layer generates the convolution feature maps c1, c2, c3, c4, c5 of the input images and passes each feature map of the previous layer to the next layer.

d: POOLING LAYER
This layer reduces the number of computations and parameters, and hence the network complexity. It reduces the dimensionality of the input data by combining the outputs of the previous layer before they are passed to the next layer.
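To make the effect of the convolution and pooling layers on the feature-map size concrete, the sketch below traces a 227 × 227 × 3 input through five convolution layers. The kernel sizes, strides, paddings and channel counts are the commonly cited Alexnet values and are an assumption here, since they are not listed in this paper; `out_size` and `trace_shapes` are hypothetical helper names.

```python
def out_size(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size - kernel + 2 * pad) // stride + 1


# (name, kernel, stride, pad, output channels); pooling keeps the channel count.
# Assumed values: the commonly cited Alexnet layout, not taken from this paper.
LAYERS = [
    ("conv1", 11, 4, 0, 96),
    ("pool1", 3, 2, 0, None),
    ("conv2", 5, 1, 2, 256),
    ("pool2", 3, 2, 0, None),
    ("conv3", 3, 1, 1, 384),
    ("conv4", 3, 1, 1, 384),
    ("conv5", 3, 1, 1, 256),
    ("pool5", 3, 2, 0, None),
]


def trace_shapes(size=227, channels=3):
    """Return (layer name, height, width, channels) after every layer."""
    shapes = []
    for name, kernel, stride, pad, out_ch in LAYERS:
        size = out_size(size, kernel, stride, pad)
        channels = out_ch if out_ch is not None else channels
        shapes.append((name, size, size, channels))
    return shapes


shapes = trace_shapes()
for row in shapes:
    print(row)
```

Under these assumed settings the spatial size shrinks from 227 to 6 while the depth grows to 256, which illustrates how pooling reduces the amount of data handed to later layers. The number of features actually retained per recording in this work is the one reported in Table 2.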

e: FULLY CONNECTED LAYER
A fully connected layer connects neurons in a layer to each neuron in another. In principle, it is identical to the traditional multilayer perceptron. The dashed matrix passes through fully connected layers to classify data.

f: REPLACEMENT OF LAST LAYERS
To perform classification, we started from an Alexnet model trained on the ImageNet images. To test the accuracy of our model, we replaced the last layers and trained them on our target data, the spectrograms of the pc-Gita Spanish speech recordings.

g: NETWORK TRAINING
We trained our network separately for each set of the data, i.e., vowels, monologues and read text. The weighted learn rate is varied from 30 to 70 to assess the accuracy at different learn rates. The bias factor is varied from 40 to 70 and the batch size from 5 to 10. The initial learn rate is set to 1e-4. Table 4 shows the complete details of the parameters.

C. FEATURE EXTRACTION: HANDCRAFTED FEATURE-BASED MODEL
In this feature extraction model, we extracted simple acoustic features [39] from the Spanish speech recordings. Separate sets of features were extracted for Parkinson's disease patients and healthy subjects for each set of data: monologues, vowels, read text and words. These features include spectral features and statistical features. Table 3 shows the details of the features extracted for each recording and the derivative of each feature.

D. PARKINSON DISEASE CLASSIFICATION
After signal preprocessing and feature extraction, we performed classification. To classify PD patients, we utilized a deep learning convolution neural network architecture and three machine learning models: support vector machine, random forest, and multilayer perceptron. The following subsections explain the classification methods in detail.

1) TRANSFER LEARNING APPROACH.
Transfer learning is a widely used technique in deep learning. We trained the model on base data and utilized it to learn features and transfer them to the target data [40], [41] using the Alexnet model [42], [43].

2) MACHINE LEARNING BASED APPROACH
In this work, we utilized machine learning classifiers for Parkinson's disease detection. We used both simple features and deep features, the latter extracted with Alexnet as shown in Fig 5. We performed five-fold cross-validation for each set of our data. The following sections explain the classifiers in detail, along with the parameters used to train and test them.
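The cross-validation protocol can be illustrated with a small index splitter. This is a sketch under stated assumptions: `k_fold_indices` is a hypothetical helper, the shuffle seed is arbitrary, and the split is made per recording without stratification or speaker grouping, which may differ from the tooling actually used in this work.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Distribute the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = idx[start:start + size]
        train_idx = idx[:start] + idx[start + size:]
        yield train_idx, test_idx
        start += size


# Example: 100 feature vectors, one per recording (illustrative count).
folds = list(k_fold_indices(100, k=5))
for fold_no, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {fold_no}: {len(train_idx)} train / {len(test_idx)} test")
```

Every sample appears in exactly one test fold, so the accuracy averaged over the five folds uses each recording once for testing and four times for training.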

a: SUPPORT VECTOR MACHINE
The support vector machine has been widely used in different fields such as computer-aided diagnostic systems, recognition and vision systems [37], [38]. It is one of the most popular machine learning models for binary classification due to its generalizability. It derives support vectors from the given input and identifies linear and non-linear decision surfaces by constructing a hyperplane, which is then used to classify the data. The complexity parameter of this model controls how the hyperplane is built from the class labels. The hyperplane with the largest distance from the training data gives the best classification result. The gamma parameter is set to 0.01, the complexity value is 1.0, the degree is 3 and the coefficient coef0 is 1. We have used the linear kernel in our research. In the linear kernel, the projection of the input is computed as the dot product between the input value $y$ and each support vector $y_i$:
$f(y) = B(0) + \sum_{i} z_i \, (y \cdot y_i)$ (eq. 2)
In eq. 2, the coefficients $B(0)$ and $z_i$ are estimated from the training data by the learning algorithm.
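A direct transcription of eq. 2 into code may help clarify how a trained linear-kernel model scores a new input. The support vectors, coefficients and bias below are hypothetical toy values chosen for illustration, and the sign-based decision rule in `predict` is the usual convention rather than something stated in this paper.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def decision_function(y, support_vectors, z, b0):
    """Evaluate f(y) = B(0) + sum_i z_i * (y . y_i) for a linear kernel."""
    return b0 + sum(z_i * dot(y, y_i) for z_i, y_i in zip(z, support_vectors))


def predict(y, support_vectors, z, b0):
    """Assumed convention: +1 if f(y) >= 0, otherwise -1."""
    return 1 if decision_function(y, support_vectors, z, b0) >= 0 else -1


# Hypothetical toy model with two support vectors in two dimensions.
support_vectors = [[1.0, 1.0], [-1.0, -1.0]]
z = [0.5, -0.5]   # learned coefficients (illustrative values)
b0 = 0.0          # bias term B(0) (illustrative value)

score = decision_function([2.0, 0.0], support_vectors, z, b0)
label_pos = predict([2.0, 0.0], support_vectors, z, b0)
label_neg = predict([-1.0, -3.0], support_vectors, z, b0)
print(score, label_pos, label_neg)
```

In a trained model each $z_i$ typically combines the learned multiplier of a support vector with its class label, so only the support vectors contribute to the sum.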

b: RANDOM FOREST
Random forest is an ensemble method widely used in classification that combines several decision trees to classify data [44]. It draws bootstrap samples from the original data and grows an unpruned classification or regression tree for each bootstrap sample. At each node, instead of choosing the best split among all predictors, it randomly selects a subset of the predictors and chooses the best split among them, as shown in Fig 6. In our research, we utilized the default parameters for random forest.
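The three ingredients of the method (bootstrap sampling, random predictor selection at each split, and voting across trees) can be sketched as follows. Tree growing itself is omitted, the helper names are hypothetical, and the choice of the square root of the number of features as the subset size is a common default that is assumed here rather than taken from this paper.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement from the original data."""
    return [rng.choice(rows) for _ in rows]


def candidate_features(n_features, rng, m=None):
    """Pick m predictors at random for one split (assumed default: sqrt(n))."""
    m = m or max(1, int(n_features ** 0.5))
    return sorted(rng.sample(range(n_features), m))


def majority_vote(predictions):
    """Combine the class labels predicted by the individual trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
rows = list(range(10))                      # stand-in for 10 training samples
sample = bootstrap_sample(rows, rng)        # data for one tree
feats = candidate_features(16, rng)         # predictors tried at one node
label = majority_vote(["PD", "HC", "PD"])   # votes of three hypothetical trees
print(sample, feats, label)
```

Because each tree sees a different bootstrap sample and a different predictor subset at every node, the trees are decorrelated, and the vote over many trees is usually more stable than any single tree.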

c: MULTILAYER PERCEPTRON
A multilayer perceptron is an artificial neural network that is broadly utilized in speech, image, and vision recognition systems. Recent research shows that the multilayer perceptron is extensively used in the medical diagnostic field [43], [44]. It is a feedforward neural network made up of an input layer, an output layer and, in between them, a hidden layer. The input layer accepts the input values, whereas the hidden layer passes information from the input layer to the output layer, as shown in Fig 6. A hidden layer consists of a number of neurons, and each hidden neuron scales its inputs by multiplying them by their connection weights. The output of each neuron is defined by eq. 3, in which f is the activation function applied to the weighted inputs. It is usually a threshold function, a simple sigmoid or a hyperbolic tangent function. The learning rate of the multilayer perceptron ranges from 0 to 1, with 0.3 as the default value.
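The forward pass through such a network can be sketched as below. The neuron output is assumed to take the common form f(sum of weighted inputs plus a bias) with a sigmoid activation; the weights, biases and layer sizes are hypothetical values chosen only to keep the arithmetic easy to follow.

```python
import math


def sigmoid(a):
    """Logistic activation, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


def layer_forward(inputs, weights, biases, activation=sigmoid):
    """Output of one layer; weights[j][i] connects input i to neuron j."""
    return [activation(sum(w * x for w, x in zip(w_row, inputs)) + b)
            for w_row, b in zip(weights, biases)]


def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> one hidden layer -> output layer."""
    hidden = layer_forward(x, hidden_w, hidden_b)
    return layer_forward(hidden, out_w, out_b)


# Hypothetical network: 2 inputs, 2 hidden neurons, 1 output neuron.
hidden_w = [[1.0, -1.0], [0.5, 0.5]]
hidden_b = [0.0, 0.0]
out_w = [[1.0, -1.0]]
out_b = [0.0]

prob = mlp_forward([0.0, 0.0], hidden_w, hidden_b, out_w, out_b)[0]
print(prob)
```

Training adjusts the weights and biases by backpropagation, with the learning rate controlling the step size of each update; that part is not shown here.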

IV. EXPERIMENTAL SETUP

A. DATASET
We utilized the PC-GITA [15] Spanish language dataset. The dataset consists of Spanish speech recordings of 50 PD patients and 50 healthy controls (HC), as shown in Table 5. The dataset includes recordings of 25 male and 25 female persons. The dataset consists of recordings of vowels, monologues and read text. Each recording exhibits different voice features, which are discussed below.

1) PHONATION
The phonation analysis in continuous speech is performed by extracting voiced segments from the utterance. The feature set includes seven descriptors: jitter, shimmer, the first and second derivatives of F0, long-term perturbation features such as the amplitude perturbation quotient and the pitch perturbation quotient, and the energy.
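As an illustration of two of these descriptors, the sketch below computes jitter and shimmer using the common "local" definition: the mean absolute difference between consecutive cycles divided by the mean value. This definition, the helper names and the toy period and amplitude values are assumptions for the example; the exact variants used for the dataset features may differ.

```python
def relative_perturbation(values):
    """Mean absolute difference of consecutive values, relative to their mean."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


def jitter(periods):
    """Cycle-to-cycle variation of the pitch period (assumed local jitter)."""
    return relative_perturbation(periods)


def shimmer(amplitudes):
    """Cycle-to-cycle variation of the peak amplitude (assumed local shimmer)."""
    return relative_perturbation(amplitudes)


# Hypothetical pitch periods (ms) and peak amplitudes of consecutive cycles.
periods_ms = [10.0, 10.2, 9.8, 10.0]
amplitudes = [1.0, 1.0, 1.0, 1.0]

jit = jitter(periods_ms)
shim = shimmer(amplitudes)
print(f"jitter = {jit:.4f}, shimmer = {shim:.4f}")
```

A perfectly periodic voice gives zero for both measures, while unstable vocal fold vibration increases them.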

2) PROSODY
The prosody features are based on duration, the F0 contour, and the energy contour. We computed 13 features per utterance including the average, standard deviation, and maximum value of F0.

3) ARTICULATION
The articulatory capability of the patients is evaluated with information from the onset/offset transitions to model the difficulties of patients to start/stop the movement of the vocal folds. The set of features extracted from the onset and offset includes 12 Mel-Frequency Cepstral Coefficients (MFCCs) with their first and second derivatives.
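The first and second derivatives of the cepstral coefficients are often computed with a regression over neighbouring frames. The sketch below uses that standard formula with a window of two frames on each side and edge padding; the window size, the helper names and the synthetic linearly rising frames are assumptions for illustration, and the MFCC computation itself is not shown.

```python
def delta(frames, n=2):
    """Regression-based derivative of a sequence of feature vectors."""
    denom = 2 * sum(i * i for i in range(1, n + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for d in range(len(frames[0])):
            # Frames beyond the edges are replaced by the first/last frame.
            num = sum(i * (frames[min(t + i, last)][d] - frames[max(t - i, 0)][d])
                      for i in range(1, n + 1))
            row.append(num / denom)
        out.append(row)
    return out


def add_derivatives(frames):
    """Append first and second derivatives to every frame."""
    d1 = delta(frames)
    d2 = delta(d1)
    return [f + a + b for f, a, b in zip(frames, d1, d2)]


# Synthetic stand-in: 6 frames of 12 coefficients that rise by 1 per frame.
frames = [[float(t)] * 12 for t in range(6)]
features = add_derivatives(frames)
print(len(features), "frames of", len(features[0]), "values")
```

With 12 coefficients per frame, appending both derivatives yields 36 values per frame, and for the linearly rising toy input the first derivative in the interior frames equals the slope of 1.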

B. EVALUATION METRICS
The results obtained from the deep learning and machine learning models are evaluated using the metrics explained in the following subsections.

1) ACCURACY
It is defined as the ratio of the number of correctly classified samples, both positive and negative, to the total number of classified samples:
$\text{Accuracy} = (TP + TN) / \text{Total}$
Here TP is the number of true positives, TN is the number of true negatives, and Total represents the total number of class predictions.

2) SENSITIVITY
Sensitivity is the ability of a test to correctly identify those with the disease (true positive rate).

3) SPECIFICITY
Specificity is the ability of the test to correctly identify those without the disease (true negative rate).

4) F1 SCORE
It is defined as the harmonic mean of the precision and recall values obtained from the classification result.
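The four metrics can be computed from the entries of a binary confusion matrix as sketched below. The function name `binary_metrics` and the counts in the example are hypothetical and do not correspond to any result reported in this paper.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and F1 score from confusion counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # recall is the same as sensitivity
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for 50 PD and 50 healthy recordings.
m = binary_metrics(tp=45, tn=40, fp=10, fn=5)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Reporting sensitivity and specificity alongside accuracy matters in a diagnostic setting, since a missed patient and a false alarm have different costs.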

V. RESULTS
This section explains in detail the results obtained using the transfer learning, deep feature-based, and machine learning approaches.

A. TRANSFER LEARNING BASED APPROACH RESULTS
In our first framework, the initial step comprises the conversion of the Spanish speech recordings into spectrograms. All speech recordings are transformed into their corresponding spectrograms using the parameters described in the signal preprocessing section. In the transfer learning approach, the model is trained using a source dataset, which is then replaced by our target dataset [6]. Once our model was trained on the source dataset, we replaced it with the spectrograms of our target speech recordings. The spectrograms of each set of data, i.e., vowels, monologues, read text and words, are considered individually while varying different parameters. The model is trained by varying the parameters discussed earlier in the network training section. The corresponding results are shown in Table 6. For the words, /apto/ achieved 77.2% accuracy with epoch size 7 and weighted learn rate 60, whereas \atelta\ achieved 73.7% accuracy with epoch size 9 and weighted learn rate 40. The highest accuracy, 91%, is observed for read text, which is the main result obtained using transfer learning in this research.

1) DEEP FEATURES BASED RESULTS
In our second framework, we evaluated our Spanish speech signal dataset using machine learning models. The initial step comprises deep feature extraction. The spectrograms of the Spanish speech recordings obtained after the signal preprocessing step are used to extract the speech features. In this approach, the spectrograms are used as input to the Alexnet feature extraction model. The model extracts deep features from the spectrograms as shown in the transfer learning architecture in the above sections. The total number of features extracted for the vowel recordings dataset is 150, and 50 features are extracted for the monologue, read text and word datasets. The extracted features for each set of data are separately input into different machine learning classifiers. The support vector machine, random forest and multilayer perceptron are separately validated on our Spanish speech recordings dataset. All of these classifiers used five-fold cross-validation. The support vector machine results show that the highest accuracy, obtained for vowel \o\, is 93%, whereas the lowest accuracy is 83% on the monologue dataset. The highest accuracy in this model is obtained on vowel \e\, 99.4%, whereas the other vowels also show an average accuracy of 99.1%.

TABLE 6. Accuracy obtained using the transfer learning model by varying parameters; the bias learn rate is 50, the initial learn rate is 1e-4, and the epoch size varies from 6 to 10.
The multilayer perceptron showed the highest accuracy of 99.7% on the vowels \e\, \i\, \o\, and on the other data this model also outperforms the other classifiers. The results of the machine learning models clearly indicate that vowels are sufficient for the classification of Parkinson's disease. Fig 9 depicts the accuracy obtained for each classifier. The minimum accuracy obtained using a machine learning approach is 83%, with the support vector machine. On the monologues dataset, the highest accuracy of 99.7% is obtained using a multilayer perceptron.

2) HANDCRAFTED FEATURE-BASED RESULTS
Different acoustic features are used for Parkinson's disease detection. In this work, we consider handcrafted acoustic features from the Spanish speech dataset. This part of our research provides a comparison with our deep feature-based machine learning model and transfer learning classification. Fig 10 depicts the results obtained using simple acoustic features for each set of data. The highest accuracy, 84.6%, is observed on vowel \e\ using random forest; vowel \o\ shows 83.6% accuracy, vowel \a\ 83% accuracy and vowel \i\ 82.8% accuracy using random forest. The lowest accuracy in the vowel dataset is observed for vowel \u\. For the read text dataset, the highest accuracy, 73%, is observed using random forest. The monologues dataset showed poor accuracy: 37% using a multilayer perceptron and 15% using random forest. Thus, vowels are efficient for Parkinson's disease detection when handcrafted features are utilized. However, this accuracy is far lower than that of deep features, for which the highest accuracy of 99.7% is achieved. Thus, deep feature-based methods outperform the other methods.

C. COMPARATIVE ANALYSIS
This section presents a brief comparison of recent existing techniques with the technique proposed in this research work, as shown in Table 7. In this research work, we evaluated deep learning and machine learning-based approaches. Each of these approaches considered the pc-Gita dataset, which consists of Spanish speech recordings that include the pronunciation of vowels, monologues, words and read text. All of them are separately used as input to both our models, and their corresponding accuracies are recorded. The results clearly show that the deep feature-based method performs best, with an average accuracy of 98.3% for random forest and 99.3% for multilayer perceptron. In comparison, Karan et al. [20], who incorporated inherent-based features from two different sets of data, observed 96% accuracy when classifying using random forest and support vector machine. Our results are better than those of the already published work on the same datasets, which are presented in Table 7.
In comparison to our transfer learning approach, Trinh and Darragh [12] used a convolution neural network in their work. Tuncer and Dogan [31] assessed the extended version of the Spanish recordings corpus using a convolution neural network. Trinh and Darragh [12] used Gaussian mixture modeling and i-vectors in their proposed work. The comparison shows that our proposed machine learning method outperforms existing techniques; however, the transfer learning approach is not as accurate as the convolution neural network research that has already been presented [45], [46].
Parkinson's disease shows several motor and non-motor symptoms. Gait, tremor and handwriting problems appear over time. However, changes in speech and handwriting start appearing early and are more evident. Changes in speech show a clear distinction between a healthy subject and a PD patient. Patients suffer from pauses in pronunciation as well as a jarring sound. Classification of PD is crucial, and speech impairments are considered an important biomarker for PD detection. However, it is essential to collect the speech of PD patients in a noise-free environment. Speech analysis using deep learning-based methods proved to be a very good method because the variation in speaking can be modeled effectively. However, the symptoms of Parkinson's disease can also be observed in handwriting and gait. Therefore, PD identification can be done more efficiently by combining all the above-mentioned biomarkers.

D. CONCLUSION
Parkinson's disease is one of the common diseases among people worldwide. Early diagnosis of the disease is an open research problem, and many researchers have presented significant work aimed at achieving the highest accuracy for its detection and diagnosis. In our work, we used the AlexNet model for deep feature extraction, together with handcrafted feature extraction, on the Spanish speech recordings dataset. For classification, we used transfer learning, deep feature and acoustic-phonetic feature-based methods. We found that deep features extracted using AlexNet are effective in distinguishing Parkinson patients from healthy subjects. In our proposed models, random forest, multilayer perceptron and transfer learning achieved highest accuracies of 99%, 99.7% and 72%, respectively. For vowel \o\, the same classifiers achieved 99%, 99.6% and 76%. For read text, the same classifiers achieved 97.8%, 99% and 91%, whereas for monologues they achieved 97%, 99.3% and 86.36%. Thus, our work shows that speech analysis using deep features is effective in distinguishing between healthy subjects and Parkinson patients with high accuracy. Further, we conclude that recordings of vowel pronunciation are sufficient to distinguish patients from healthy subjects. In future, gait, tremor and other symptom data can be assessed together using a deep feature method to identify the extent to which this approach is suitable for other scenarios. Furthermore, a feature selection process can be applied to the handcrafted feature method.
MAHEEN BAKHTYAR received the master's and Ph.D. degrees from the Asian Institute of Technology (AIT), Thailand, with one year of research experience from the National Institute of Informatics, Tokyo, Japan. She is currently working as an Assistant Professor with the Department of Computer Science and Information Technology, University of Balochistan, Pakistan.
Her research interests mainly include information/knowledge management and retrieval, natural language processing, sentiment analysis, text processing, language understanding, question answering systems, and ontology processing.
JUNAID BABER received the M.S. and Ph.D. degrees in computer science from the Asian Institute of Technology, Thailand. He has spent one year as a Research Scientist with the National Institute of Informatics, Tokyo. He is currently working as a faculty member with the University of Balochistan, Quetta. His research interests lie in machine learning, high performance computing, and data analytics.
HABIBULLAH JAMAL received the B.Sc. degree in EE from the University of Engineering and Technology, Lahore, Pakistan, in 1974, and the M.A.Sc. and Ph.D. degrees in electrical engineering from the University of Toronto, Canada, in 1979 and 1982, respectively.
He is currently a Professor of engineering sciences with the Ghulam Ishaq Khan Institute, Topi, Pakistan. He is the author of two textbooks and 132 research articles. His research interests include, but are not limited to, signal processing, the design of microelectronic circuits, and the development of novel computer architectures for telecommunication, national defense and other applications.
Dr. Jamal was a recipient of prestigious national-level awards, including the 8th TERADATA National IT Excellence Awards for Excellence in IT Education.
IRFAN MEHMOOD is currently a Senior Lecturer with the University of Bradford, U.K. His sustained contribution at various research and industry-collaborative projects gives him an extra edge to meet the current challenges faced in the field of multimedia analytics. Specifically, he has made significant contributions in the areas of video summarization, medical image analysis, visual surveillance, information mining, deep learning in industrial applications, and data encryption.
OH-YOUNG SONG received the B.S., M.S., and Ph.D. degrees from the School of Electrical Engineering and Computer Science, Seoul National University, South Korea, in 1998, 2000, and 2004, respectively. He was a Postdoctoral Fellow at the School of Electrical Engineering and Computer Science, Seoul National University, from 2004 to 2006. He is currently an Associate Professor with the Department of Software, Sejong University, South Korea. His research interests include computer graphics, simulation, and machine learning. In particular, he has contributed to the areas of physics-based animation, human motion, numerical algorithms, VR/AR, medical image analysis, and deep learning.
VOLUME 8, 2020