Introduction
Parkinson’s disease (PD) is the second most common neurodegenerative disorder [1], and its prevalence is expected to increase with an ageing population. There are no biomarkers for diagnosing the disease, which instead requires the observation of a complex set of symptoms in the patients. Acoustic speech abnormalities have been reported even in early-stage PD patients, including when there is no perceptible dysarthria [2]. Several investigators have found impaired speech parameters in early-stage PD using objective acoustic measures [3], [4], and several studies have investigated the difference between the voice of PD and healthy control (HC) participants using different approaches [4], [5], [6], [7], [8], [9], [10], [11], [12].
Human speech requires fine motor control, cognitive abilities, auditory feedback, and muscle strength. Parkinsonian dysarthria can be characterized by reduced vocal loudness, reduced speech prosody, imprecise articulation, a significantly narrower pitch range, longer pauses, vocal tremor, breathy vocal quality, harsh voice quality, and disfluency [4]. Differences in the voice parameters of sustained phonemes have been examined for detecting and monitoring PD [4], [13], [14]. A number of works have considered signal features previously used in other speech applications, such as speaker recognition [15], [16]. The investigation of sustained-phoneme and text-dependent speech modalities for PD screening is reported in [13]. However, such analysis has confounding factors such as language skills, vision, and hearing [17]. Tsanas et al. [18] extended this work by associating these features with the motor disability score of PD patients.
The use of non-linear and hybrid features such as the fractal dimension (FD), entropy [19], deep multivariate features [20], and linear predictive models [21], [22] has been proposed. Godino-Llorente et al. [12] proposed an articulatory biomarker based on the kinetic envelope trace of voice that had an accuracy of 85%. In [6], 132 features were extracted from phonemes recorded in a sound-treated booth with a head-mounted microphone to train a support vector machine (SVM) and a random forest classifier, which achieved accuracies of 97.7% and 90.2%, respectively, in identifying PD from HC.
Signal features have often been selected based on the understanding of the disease [23], [24]. Differences between the voice of healthy people and those with PD have been observed in the pitch frequency, jitter, shimmer, and harmonics-to-noise ratio [25]. The pitch frequency, or fundamental frequency, is the rate at which the vocal folds vibrate during phonation; jitter and shimmer quantify its cycle-to-cycle perturbations in period and amplitude, respectively.
The above studies have shown that several signal features differ significantly between the voice of PD and HC. However, most studies have not considered real-world conditions, where there is background noise and where recording devices and conditions vary [26], [27], [28]. Only a few studies have used data recorded in a real-life clinical setup [19], [29], [30]. Therefore, further work is required to validate these methods for real-life scenarios, especially for remote monitoring of patients and other telehealth applications.
The aim of this study was to identify the most suitable signal classification method for differentiating between PD and HC when the recordings are made in real-world conditions. We investigated the phonatory parameters of three sustained phonemes and compared people with PD with HC. The data were recorded with smartphones in a typical clinical setting to check for real-world suitability [31], [32]. Besides the statistical analysis, an SVM classifier was used to classify the voice into two classes: PD and HC. The proposed model provides the following advantages over existing alternatives:
Data were recorded in a normal clinical setting with background noise present.
The recordings were made using a commercially available smartphone with default settings.
Only three phonemes were recorded, so the method does not depend on language skills.
The performance was perfect, with 100% sensitivity and specificity, outperforming the state-of-the-art methods.
Materials and Methods
A. Participants
Seventy-two volunteers, comprising 36 people with PD and 36 age-matched healthy participants as the HC group, participated in this study. The data can be found in our previously reported work [30]. All the people with PD had been diagnosed with PD within the last ten years based on procedures complying with the Queen Square Brain Bank criteria for idiopathic PD [33]. The presence of any advanced PD clinical symptoms such as visual hallucinations, frequent falling, cognitive disability, or need for institutional care was an exclusion criterion [34]. People with PD were recruited from the movement disorder clinic at Monash Medical Centre and the Dandenong Neurology clinic, while the HC group participants were recruited from several retirement centers. Table 1 presents the participants’ demographics, cognitive status, and health history. The UPDRS-III scores [35] of all the participants show a clear difference between the groups, while the MoCA scores confirm that neither the PD nor the HC group had cognitive impairment.
The study protocol was approved by the ethics committee of Monash Health, Melbourne, Australia (LNR/16/MonH/319) and RMIT University Human Research Ethics Committee, Melbourne, Australia (BSEHAPP22-15KUMAR). Before the experiments, written consent was obtained from all the participants.
B. Methods
Figure 1 illustrates the block diagram of the proposed method for classifying PD from HC. As shown in Figure 1, three phonemes were recorded from the PD and HC participants using a smartphone. Each phoneme was segmented before features were extracted from it, and machine learning based classification was then applied to identify PD from HC. The details of each stage are described below:
The block diagram of identifying PD from HC using sustained phonemes. The model is trained and tested using 72 PD and HC participants.
1) Voice Recording
Three sustained phonemes, /a/, /o/, and /m/, were recorded from each participant. The phonemes were selected to examine a range of voice production models [36]. The vowel /a/, as in “car”, is an open-back, or low, vowel, produced with the jaw wide open and the tongue relaxed and low in the mouth; the vibration of the vocal folds dominates the sound of this vowel. The vowel /o/, as in “oh”, is a close-mid back vowel: the back of the tongue is positioned mid-high towards the palate and the lips are rounded. The phoneme /m/ is a nasal consonant produced by the vibration of the vocal folds with the air flowing through the nasal cavity. Although all three phonemes require control of the respiratory and laryngeal vocal fold muscles, there are considerable differences in the patterns of activation of the rostral muscles of articulation (of the pharynx, tongue, jaw, and lips).
The participants were asked to utter each phoneme for as long as was comfortable, at their natural pitch and loudness. During the recording, they held the smartphone as if they were taking a phone call. The voices of the 72 participants (36 PD and 36 HC) were recorded using an iOS-based smartphone (iPhone 6S Plus) with its built-in microphone and default settings, while the participants were located in typical Australian clinic or office settings. The recordings were saved in a single-channel uncompressed WAV format at the device’s default sampling frequency ($f_{s}$).
2) Automated Segmentation and Feature Extraction
All computations, including pre-processing, automated segmentation, and statistical analysis, were performed using Matlab 2018b (MathWorks) and Python. All the recorded phonemes were segmented using an envelope detection and thresholding approach, and the signal features were computed from each segment. Recordings containing the voice of the instructor were removed. In the original recordings, the signal-to-noise ratio was 16–24 dB (average 19.26 dB), similar to typical Australian clinical conditions. The first step of feature extraction was to locate the time instances of consecutive glottal cycles, from which the pitch periods ($T_{i}$) and the corresponding peak amplitudes ($A_{i}$) were obtained.
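As an illustration, a minimal Python sketch of such an envelope-and-threshold segmentation is given below. The Hilbert envelope, the 50 ms smoothing window, and the 10% relative threshold are illustrative assumptions rather than the exact settings used in this study.

import numpy as np
from scipy.io import wavfile
from scipy.signal import hilbert

def segment_phoneme(wav_path, smooth_ms=50, rel_threshold=0.1):
    fs, x = wavfile.read(wav_path)               # single-channel WAV recording
    x = x.astype(np.float64)
    x /= (np.max(np.abs(x)) + 1e-12)             # normalise amplitude
    env = np.abs(hilbert(x))                     # amplitude envelope
    win = max(1, int(smooth_ms * 1e-3 * fs))
    env = np.convolve(env, np.ones(win) / win, mode="same")  # smooth the envelope
    voiced = env > rel_threshold * env.max()     # threshold the envelope
    idx = np.flatnonzero(voiced)
    if idx.size == 0:
        return fs, np.array([])
    return fs, x[idx[0]:idx[-1] + 1]             # bounding span of voiced samples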
The first set of features comprised six jitter parameters: absolute jitter (jitter abs), relative jitter (jitter rel), period perturbation quotient-3 (jitter ppq3), period perturbation quotient-5 (jitter ppq5), period perturbation quotient-11 (jitter ppq11), and frequency modulation (jitter FM). Here, $T_{i}$ is the $i$-th pitch period and $N$ is the number of periods in the segment; ppq3, ppq5, and ppq11 are the perturbations of the difference between each period and the average of its three, five, and eleven surrounding periods, respectively, normalized by the mean period. \begin{align*} Jitter\, \left ({abs }\right)&=\frac {1}{N-1}\sum \nolimits _{i=1}^{N-1} \left |{ T_{i+1}-T_{i} }\right | \tag{1}\\ Jitter\, \left ({rel }\right)&=\frac {\frac {1}{N-1}\sum \nolimits _{i=1}^{N-1} \left |{ T_{i+1}-T_{i} }\right |}{\frac {1}{N}\sum \nolimits _{i=1}^{N} T_{i}} \tag{2}\\ Jitter\, (ppq3)&=\frac {\frac {1}{N-2}\sum \nolimits _{i=2}^{N-1} \left |{ T_{i}-\left ({\frac {1}{3}\sum \nolimits _{n=i-1}^{i+1} T_{n} }\right) }\right | }{\frac {1}{N}\sum \nolimits _{i=1}^{N} T_{i}} \tag{3}\\ Jitter\, (ppq5)&=\frac {\frac {1}{N-4}\sum \nolimits _{i=3}^{N-2} \left |{ T_{i}-\left ({\frac {1}{5}\sum \nolimits _{n=i-2}^{i+2} T_{n} }\right) }\right | }{\frac {1}{N}\sum \nolimits _{i=1}^{N} T_{i}} \tag{4}\\ Jitter\, (ppq11)&=\frac {\frac {1}{N-10}\sum \nolimits _{i=6}^{N-5} \left |{ T_{i}-\left ({\frac {1}{11}\sum \nolimits _{n=i-5}^{i+5} T_{n} }\right) }\right |}{\frac {1}{N}\sum \nolimits _{i=1}^{N} T_{i}} \tag{5}\\ Jitter\, (FM)&=\frac {\max _{i}(T_{i})-\min _{i}(T_{i})}{\max _{i}(T_{i})+\min _{i}(T_{i})}\tag{6}\end{align*}
Six shimmer parameters were extracted from the segments: absolute shimmer (shimmer abs, in dB), relative shimmer (shimmer rel), amplitude perturbation quotient-3 (apq3), amplitude perturbation quotient-5 (apq5), amplitude perturbation quotient-11 (apq11), and amplitude modulation (shimmer AM). Here, apq3, apq5, and apq11 are the perturbations of the difference between each peak amplitude $A_{i}$ and the average of its three, five, and eleven surrounding amplitudes, respectively, normalized by the mean amplitude. \begin{align*} Shimmer\, \left ({abs,dB }\right)&=\frac {1}{N-1}\sum \nolimits _{i=1}^{N-1} \left |{ 20\log _{10}\left ({\frac {A_{i+1}}{A_{i}} }\right) }\right | \tag{7}\\ Shimmer\, \left ({rel }\right)&=\frac {\frac {1}{N-1}\sum \nolimits _{i=1}^{N-1} \left |{ A_{i+1}-A_{i} }\right |}{\frac {1}{N}\sum \nolimits _{i=1}^{N} A_{i}} \tag{8}\\ Shimmer\, (apq3)&=\frac {\frac {1}{N-2}\sum \nolimits _{i=2}^{N-1} \left |{ A_{i}-\left ({\frac {1}{3}\sum \nolimits _{n=i-1}^{i+1} A_{n} }\right) }\right | }{\frac {1}{N}\sum \nolimits _{i=1}^{N} A_{i}} \tag{9}\\ Shimmer\, (apq5)&=\frac {\frac {1}{N-4}\sum \nolimits _{i=3}^{N-2} \left |{ A_{i}-\left ({\frac {1}{5}\sum \nolimits _{n=i-2}^{i+2} A_{n} }\right) }\right | }{\frac {1}{N}\sum \nolimits _{i=1}^{N} A_{i}} \tag{10}\\ Shimmer\, (apq11)&=\frac {\frac {1}{N-10}\sum \nolimits _{i=6}^{N-5} \left |{ A_{i}-\left ({\frac {1}{11}\sum \nolimits _{n=i-5}^{i+5} A_{n} }\right) }\right |}{\frac {1}{N}\sum \nolimits _{i=1}^{N} A_{i}} \tag{11}\\ Shimmer\, (AM)&=\frac {\max _{i}(A_{i})-\min _{i}(A_{i})}{\max _{i}(A_{i})+\min _{i}(A_{i})}\tag{12}\end{align*}
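For illustration, a sketch of the jitter and shimmer measures in Eqs. (1)–(12) is given below, assuming the pitch periods $T_i$ and peak amplitudes $A_i$ of $N$ consecutive glottal cycles have already been extracted; the function and variable names are ours, not those of the toolbox used in the study.

import numpy as np

def perturbation_quotient(v, k):
    # Mean absolute deviation of v_i from the k-point local average,
    # normalised by the overall mean (ppq-k for periods, apq-k for amplitudes).
    v = np.asarray(v, dtype=float)
    half = k // 2
    local_mean = np.array([v[i - half:i + half + 1].mean()
                           for i in range(half, len(v) - half)])
    return np.mean(np.abs(v[half:len(v) - half] - local_mean)) / v.mean()

def jitter_shimmer(T, A):
    T, A = np.asarray(T, float), np.asarray(A, float)
    return {
        "jitter_abs": np.mean(np.abs(np.diff(T))),                      # Eq. (1)
        "jitter_rel": np.mean(np.abs(np.diff(T))) / T.mean(),           # Eq. (2)
        "jitter_ppq3": perturbation_quotient(T, 3),                     # Eq. (3)
        "jitter_ppq5": perturbation_quotient(T, 5),                     # Eq. (4)
        "jitter_ppq11": perturbation_quotient(T, 11),                   # Eq. (5)
        "jitter_FM": (T.max() - T.min()) / (T.max() + T.min()),         # Eq. (6)
        "shimmer_dB": np.mean(np.abs(20 * np.log10(A[1:] / A[:-1]))),   # Eq. (7)
        "shimmer_rel": np.mean(np.abs(np.diff(A))) / A.mean(),          # Eq. (8)
        "shimmer_apq3": perturbation_quotient(A, 3),                    # Eq. (9)
        "shimmer_apq5": perturbation_quotient(A, 5),                    # Eq. (10)
        "shimmer_apq11": perturbation_quotient(A, 11),                  # Eq. (11)
        "shimmer_AM": (A.max() - A.min()) / (A.max() + A.min()),        # Eq. (12)
    }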
The Teager-Kaiser energy operator (TKEO) measures the energy of a time-varying signal. It detects the amplitude and frequency modulation of a signal by estimating the product of the time-varying amplitude and frequency. The mean, standard deviation, and percentile values of the TKEO contour were computed as features.
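A minimal sketch of the discrete TKEO, $\psi[n]=x[n]^{2}-x[n-1]\,x[n+1]$, and its summary statistics follows; the specific percentiles shown are an illustrative assumption.

import numpy as np

def tkeo_features(x, percentiles=(5, 25, 50, 75, 95)):
    x = np.asarray(x, dtype=float)
    psi = x[1:-1] ** 2 - x[:-2] * x[2:]           # TKEO contour of the signal
    stats = {"tkeo_mean": psi.mean(), "tkeo_std": psi.std()}
    for p in percentiles:                         # percentile values of the contour
        stats[f"tkeo_p{p}"] = np.percentile(psi, p)
    return stats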
The harmonics-to-noise ratio (HNR) and noise-to-harmonics ratio (NHR) quantify the noise in the speech signal caused by incomplete closure of the vocal folds. The standard deviation of pitch was computed from the instantaneous pitch frequency. HNR and NHR were computed from the normalized autocorrelation $R_{xx}$ evaluated at the pitch period $T_{0}$: \begin{align*} HNR&=10\log _{10}\frac {R_{xx}[T_{0}]}{1-R_{xx}[T_{0}]} \tag{13}\\ NHR&=1-R_{xx}[T_{0}]\tag{14}\end{align*}
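A sketch of Eqs. (13)–(14) is given below; the pitch-lag search range of 60–400 Hz is an illustrative assumption rather than a value specified in this study.

import numpy as np

def hnr_nhr(frame, fs, fmin=60.0, fmax=400.0):
    frame = np.asarray(frame, dtype=float) - np.mean(frame)
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    acf /= acf[0]                                  # normalised autocorrelation R_xx
    lo, hi = int(fs / fmax), int(fs / fmin)        # candidate pitch lags
    t0 = lo + np.argmax(acf[lo:hi])                # lag corresponding to T0
    r = np.clip(acf[t0], 1e-6, 1 - 1e-6)
    hnr = 10.0 * np.log10(r / (1.0 - r))           # Eq. (13)
    nhr = 1.0 - r                                  # Eq. (14)
    return hnr, nhr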
The glottal quotient (GQ) measures the time required to open or close the glottis. The mean and standard deviation of the durations for which the vocal folds were apart (glottis open) or in collision (glottis closed) were also computed. The voice analysis toolbox [7], [8], [39], which uses the DYPSA algorithm [40], was used to compute GQ.
The glottal-to-noise excitation ratio (GNE) measures the noise in the signal; the turbulent noise created by incomplete closure of the vocal folds can be captured by GNE features [41]. GNE was computed using the following steps proposed by Michaelis et al. [42] (a code sketch of these steps is given after the list).
Downsampling the phoneme recordings to 10 kHz and applying inverse filtering to detect each glottal cycle.
Computing the Hilbert envelopes of the inverse-filtered signal in different frequency bands.
Obtaining the maximum value of the cross-correlations between pairs of envelopes whose central frequencies differ by more than half the bandwidth.
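The sketch below follows these three steps. The band centres, bandwidth, FIR filter length, and LPC order are illustrative assumptions (they are not specified above), and a zero-lag correlation is used in place of a lag-searched cross-correlation for brevity.

import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import firwin, hilbert, lfilter, resample_poly

def gne(x, fs, bw=1000.0, lpc_order=13):
    # Step 1: downsample to 10 kHz and inverse filter with an LPC whitening filter.
    x = resample_poly(np.asarray(x, dtype=float), 10000, int(fs))
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(lpc_order + 1)])
    a = solve_toeplitz((r[:lpc_order], r[:lpc_order]), r[1:lpc_order + 1])
    resid = lfilter(np.concatenate(([1.0], -a)), [1.0], x)
    # Step 2: Hilbert envelopes of the residual in half-overlapping frequency bands.
    centres = np.arange(1000.0, 4500.0, bw / 2)
    envs = []
    for fc in centres:
        h = firwin(301, [fc - bw / 2, fc + bw / 2], fs=10000, pass_zero=False)
        envs.append(np.abs(hilbert(lfilter(h, [1.0], resid))))
    # Step 3: maximum correlation between envelope pairs whose centre
    # frequencies differ by more than half the bandwidth.
    best = 0.0
    for i in range(len(envs)):
        for j in range(i + 1, len(envs)):
            if abs(centres[j] - centres[i]) > bw / 2:
                best = max(best, np.corrcoef(envs[i], envs[j])[0, 1])
    return best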
The vocal fold excitation ratio (VFER) is a measure for detecting dysphonia. A proper glottal cycle results in synchronous excitation across different frequency bands; when this is impaired, turbulence produces asynchronous and uncorrelated excitation across frequency bands and thus a reduced VFER.
The above-mentioned features mainly characterize vocal fold dynamics, which are affected in PD patients. Since the articulators of the vocal tract, such as the tongue, jaw, and lips, are also affected by PD [43], we incorporated features that characterize the vocal tract, such as the mel-frequency cepstral coefficients (MFCCs).
MFCCs measure the energy of the speech signal in each mel-scaled frequency band (equation 15), where $E_{k}$ is the log energy of the $k$-th of $K$ filter bands. Since the articulators of the vocal tract such as the tongue, jaw, and lips are affected by PD [43], it is hypothesized that the MFCCs will differ between PD and HC. \begin{equation*} {MFCC}_{n}=\sum \nolimits _{k=1}^{K} {E_{k}\cos \left[{n(k-0.5)\frac {\pi }{K}}\right]}\tag{15}\end{equation*}
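For illustration, MFCCs and their first and second derivatives (delta and delta-delta) can be extracted as sketched below using librosa; the choice of 13 coefficients and the mean/standard-deviation summary are illustrative assumptions, not necessarily the settings used in this study.

import numpy as np
import librosa

def mfcc_features(x, fs, n_mfcc=13):
    mfcc = librosa.feature.mfcc(y=np.asarray(x, dtype=float), sr=fs, n_mfcc=n_mfcc)
    d1 = librosa.feature.delta(mfcc)              # first derivative of the MFCCs
    d2 = librosa.feature.delta(mfcc, order=2)     # second derivative of the MFCCs
    stack = np.vstack([mfcc, d1, d2])
    # summarise each coefficient trajectory by its mean and standard deviation
    return np.concatenate([stack.mean(axis=1), stack.std(axis=1)])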
Spectral analysis reveals the oscillatory content of a signal but does not preserve temporal information. The wavelet transform (WT) is based on time-limited waves, referred to as wavelets, and performs multi-resolution time-frequency analysis. It converts the one-dimensional time-domain signal into a two-dimensional time-frequency representation without losing temporal information. The discrete WT (DWT) decomposes the signal into approximation and detail coefficients in different frequency bands, with each scale corresponding to a halving of the frequency range. In this study, the recordings were decomposed at level 10, which covers the entire audible range of the recordings. Daubechies 10 (db10), which has ten vanishing moments, was chosen as the mother wavelet. Energy, entropy, and TKEO features were computed from the approximation and detail coefficients at each DWT level.
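A sketch of this decomposition with PyWavelets is given below; the Shannon-entropy definition and the use of the mean TKEO as the per-band summary are our assumptions for illustration.

import numpy as np
import pywt

def tkeo(x):
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def dwt_features(x, wavelet="db10", level=10):
    coeffs = pywt.wavedec(np.asarray(x, dtype=float), wavelet, level=level)
    feats = []
    for c in coeffs:                               # approximation + detail coefficients
        energy = np.sum(c ** 2)
        p = c ** 2 / (energy + 1e-12)
        entropy = -np.sum(p * np.log2(p + 1e-12))  # Shannon entropy of the energies
        feats.extend([energy, entropy, np.mean(tkeo(c))])
    return np.array(feats)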
C. Feature Selection
A large number of features increases the risk of overfitting, can lead to higher error, and increases the computational complexity [45], [46]; hence, redundant features should be excluded [46]. During feature selection, the first step was to identify the features that were statistically different (p < 0.0001) between the two groups using the Mann-Whitney U test. Next, feature selection algorithms were applied to identify the best features. To avoid bias from any single algorithm, four different feature selection algorithms were compared: i) infinite latent feature selection (ILFS), ii) least absolute shrinkage and selection operator (LASSO), iii) Relief-F, and iv) unsupervised discriminative feature selection (UDFS).
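As one possible realization of this two-stage process, the sketch below applies the Mann-Whitney U filter followed by ReliefF ranking; the skrebate implementation of ReliefF and its neighbourhood size are our assumptions, as the toolbox used in this study is not specified here.

import numpy as np
from scipy.stats import mannwhitneyu
from skrebate import ReliefF

def select_features(X, y, p_threshold=1e-4, n_keep=15):
    keep = []
    for j in range(X.shape[1]):                    # stage 1: univariate significance filter
        _, p = mannwhitneyu(X[y == 0, j], X[y == 1, j], alternative="two-sided")
        if p < p_threshold:
            keep.append(j)
    relief = ReliefF(n_neighbors=10)               # stage 2: rank the remaining features
    relief.fit(X[:, keep], y)
    order = np.argsort(relief.feature_importances_)[::-1][:n_keep]
    return [keep[i] for i in order]                # indices of the top-ranked features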
D. Model Training and Classification
A support vector machine (SVM)-based machine learning classifier was deployed to label the selected features into two classes: PD and HC. The details of the SVM classifier and the cross-validation procedure are described below.
1) Support Vector Machine
Support vector machine (SVM) is a widely used supervised machine learning technique for classification. The decision boundaries or hyperplanes are developed based on the support vectors during training.
Let $x_{i}$ be the feature vector of the $i$-th training sample and $y_{i}\in \{-1,+1\}$ its class label. The soft-margin SVM finds the weight vector $w$, bias $b$, and slack variables $\beta _{i}$ by minimizing the cost function \begin{equation*} J\left ({w,\beta }\right)=\frac {1}{2}w^{T}w+C\sum \nolimits _{i}\beta _{i}\end{equation*}
subject to the constraint \begin{equation*} y_{i}\left ({w^{T}x_{i}+b }\right)\ge 1-\beta _{i},\quad \beta _{i}\ge 0\end{equation*} where $C$ is the regularization parameter controlling the trade-off between margin width and misclassification.
Solving this optimization in its dual form expresses the weight vector in terms of the Lagrange multipliers $\alpha _{i}$ and gives the decision function \begin{align*} w&=\sum \nolimits _{i}{\alpha _{i}y_{i}x_{i}} \\ f(x)&=\sum \nolimits _{i}{\alpha _{i}y_{i}x_{i}^{T}x}+b\end{align*}
For data that are not linearly separable, the inner product is replaced by a kernel function $K$, such as the radial basis function (RBF) kernel used in this study: \begin{equation*} f(x)=\sum \nolimits _{j}\alpha _{j}y_{j}K\left ({x_{j},x }\right)+ b\end{equation*}
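A minimal sketch of such an RBF-kernel SVM is shown below using scikit-learn; the regularization constant C, kernel width gamma, and the feature standardisation step are illustrative defaults rather than the values tuned in this study.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def build_classifier(C=1.0, gamma="scale"):
    # feature standardisation followed by a soft-margin SVM with an RBF kernel
    return make_pipeline(StandardScaler(), SVC(kernel="rbf", C=C, gamma=gamma))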
2) Cross Validation
We evaluated the model performance using the leave-one-out cross-validation (LOOCV) technique [48]. LOOCV uses N-1 subjects for model training and 1 for testing, and is repeated N times so that each subject is tested once. The final result is the mean of the individual evaluations. The model training and testing using LOOCV is illustrated in Figure 2. Accuracy, sensitivity, specificity, and F1-score were computed as performance metrics.
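The sketch below illustrates this leave-one-out evaluation and the reported metrics, assuming X holds the selected features and y the labels (1 = PD, 0 = HC); it is an illustration of the procedure, not the study's exact implementation.

import numpy as np
from sklearn.model_selection import LeaveOneOut

def loocv_evaluate(model, X, y):
    y_pred = np.empty_like(y)
    for train_idx, test_idx in LeaveOneOut().split(X):   # N folds, one subject held out each time
        model.fit(X[train_idx], y[train_idx])
        y_pred[test_idx] = model.predict(X[test_idx])
    tp = np.sum((y == 1) & (y_pred == 1))
    tn = np.sum((y == 0) & (y_pred == 0))
    fp = np.sum((y == 0) & (y_pred == 1))
    fn = np.sum((y == 1) & (y_pred == 0))
    return {
        "accuracy": (tp + tn) / len(y),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }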
Performance evaluation of the proposed model. The dataset consists of 72 PD and HC participants. The model performance is evaluated using the leave one out cross validation.
Results
A. Statistical Analysis and PD Classification
The Anderson-Darling test confirmed that the voice parameters of the three sustained phonemes were not normally distributed in either group and were therefore unsuitable for parametric tests. The group differences and the significance of each feature for PD vs. HC were instead computed using the Mann-Whitney U test [49]. Features with p < 0.0001 were considered statistically significant and were passed to the feature selection stage.
Classification accuracy using Relief-F based feature selection for the phonemes /a/, /m/, /o/, and /a/+/m/+/o/, respectively. These results were computed using leave-one-subject-out cross-validation.
The accuracy of the proposed model with the top 15 ranked features extracted from the individual phonemes /a/, /m/, and /o/ using the SVM with RBF kernel was 97.22%, 95.83%, and 98.66%, respectively. Based on the combined features extracted from two phonemes, the PD classification accuracy was 97.22%, 98.66%, and 100% for /a/+/m/, /m/+/o/, and /a/+/o/, respectively. The accuracy of the proposed model reached 100% when the features obtained from the three phonemes /a/+/m/+/o/ were combined. The detailed performance of the proposed model using the different combinations of phonemes is shown in Table 3. Features extracted from phoneme /o/ identified PD from HC with higher accuracy than the other individual phonemes, and the inclusion of features from phonemes /a/ and /m/ improved the performance further; the highest performance was obtained when features from all three phonemes were combined to train the model. The confusion matrix is shown in Fig. 4; it summarizes the predicted and actual classes, providing an assessment of performance in terms of true positives, true negatives, false positives, and false negatives.
Confusion matrices for PD vs. HC classification. The confusion matrices for the individual phonemes and the combinations of phonemes are shown in the top and bottom rows of the figure, respectively.
B. Computing the Effect Size and Spearman Correlation of Each Significant Feature
The statistically significant features of each phoneme were sorted and ranked by the ReliefF-based feature selection technique. The effect size, computed by Cohen’s d, and the Spearman correlation coefficient of each selected feature are shown in Table 4. Based on the Mann-Whitney U test, each feature was assessed for statistical significance, and the corresponding p-value is listed in Table 4. A two-dimensional representation of the top two features of each phoneme is shown in Fig. 5.
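For illustration, these effect-size and correlation measures can be computed as sketched below; Cohen's d with a pooled standard deviation and the Spearman rank correlation between a feature and the class label are standard definitions, and the function names are ours.

import numpy as np
from scipy.stats import spearmanr

def cohens_d(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                        / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled_sd

def feature_stats(feature, labels):
    d = cohens_d(feature[labels == 1], feature[labels == 0])   # PD vs. HC effect size
    rho, p = spearmanr(feature, labels)                        # rank correlation with the label
    return d, rho, p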
Selected pairs of features from the smartphone-recorded phonemes plotted in two-dimensional space with the optimal decision boundary (black line) between PD and HC for phonemes /a/ (left), /m/ (middle), and /o/ (right).
C. Robustness of the Model
A large sample size is necessary for the training data to represent the modelled phenomenon. However, with limited labelled data, which is often the case with medical data, the resulting model needs to be tested for robustness. Hence, the system performance was evaluated as a function of the number of training participants, and the results are presented in Fig. 6. The number of training participants was increased from 8 to 50 in increments of 6; for each size, the training set was drawn from the complete dataset by stratified random sampling, which ensured that class balance was maintained, and the remaining participants were used for testing. Each step was iterated ten times and the results were averaged. The figure shows that accuracy improved with an increasing number of training subjects and plateaued at 14 subjects, with classification accuracy reaching above 95.00%.
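A sketch of this robustness experiment is given below, assuming the same X and y arrays as before; the use of scikit-learn's stratified train_test_split and the fixed random seed are illustrative choices.

import numpy as np
from sklearn.model_selection import train_test_split

def learning_curve(model, X, y, sizes=range(8, 51, 6), n_iter=10, seed=0):
    rng = np.random.RandomState(seed)
    results = {}
    for n_train in sizes:
        accs = []
        for _ in range(n_iter):                          # ten repetitions per training size
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, train_size=n_train, stratify=y,    # class-balanced training subset
                random_state=rng.randint(1 << 30))
            model.fit(X_tr, y_tr)
            accs.append(model.score(X_te, y_te))         # accuracy on the held-out subjects
        results[n_train] = (np.mean(accs), np.std(accs))
    return results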
Evaluation of model performance with different numbers of training subjects. The boxplots represent the distribution of the model’s accuracy as the number of training subjects varies from 8 to 50. Each box shows the first quartile, median, and third quartile of the accuracy over ten iterations of randomly drawing that number of subjects from the training pool. The average accuracy over the ten iterations is shown as a circle in each box.
Discussion
People with PD often have dysarthria or speech impairment which may appear in phonatory, articulatory, prosodic, and linguistic aspects. The change is complex and characterized by reduced loudness, reduced speech prosody, imprecise articulation, significantly narrower pitch range, longer pauses, vocal tremors, breathy vocal quality, harsh voice quality, and dysfluency [4]. Speech disorders are related to several factors such as inability to perform habitual tasks, loss of fine control, weakness, tremor, and rigidity of the speech production muscles.
This study investigated the use of the utterance of the phonemes /a/, /o/, and /m/ for differentiating the voice of people with PD from HC. The classification results confirm that identifying PD from HC improves when the combination of phonemes /a/+/m/+/o/ is used. The results also indicate that, among the single phonemes, /o/ is more effective in differentiating the two groups than /a/ and /m/. The phoneme /a/ is produced with the tongue low towards the jaw and the mouth wide open. Similarly, the production of the phoneme /m/ does not require precise positioning of the tongue or lips, because the lips are simply closed and the air passes through the nasal cavity. In contrast, the production of the phoneme /o/ requires more precise positioning of the tongue at a mid-height position and a small, rounded position of the lips [50] than /a/ and /m/. Since the production of /a/ and /m/ does not require precise control of the tongue and lips, tremor or weakness in tongue or lip positioning should be more prominent in the production of /o/ than of /a/ and /m/. This supports our finding that PD and HC are better distinguished with /o/ than with /a/ and /m/. However, these are only logical deductions at this stage, and further research needs to be conducted to confirm them.
It was also found that the MFCCs and the features from the first and second derivatives of the MFCCs of phonemes /a/, /m/, and /o/ were significantly different between PD and HC. Cepstral analysis captures changes in both the glottal source and the vocal tract, and this observation confirms that Parkinsonian dysarthria is associated with such changes. The average log energy of phoneme /a/ was also found to be significantly different, which further indicates the reduced source strength in PD.
The significant difference between PD and HC in the HNR and GNE of phoneme /o/ indicates weakened vocal folds, due to which the noise relative to the resonant, voiced sound is higher in the voice of PD. The classification results show that the inclusion of these features improves the model performance, and the classification accuracy was 100% when these features from the three phonemes /a/, /m/, and /o/ were used. PD is a multi-symptom disease with a complex display of symptoms; while the analysis of each phoneme captures some of these symptoms, it is the combination of all three that appears to capture them collectively. The study also investigated the effect of the sampling frequency on differentiating between PD and HC. For sampling frequencies of fs = 48.1 kHz and 8 kHz, the model showed identical results, indicating that the relevant frequency content lies below 4 kHz.
Further, this work explored the performance of the four feature selection algorithms for phoneme-based PD classification. Although ReliefF and ILFS performed slightly better than LASSO and UDFS, similar performance was observed when a larger number of features was used. It was also observed that the top twenty features selected by any of the four feature selection algorithms yielded above 95% classification accuracy.
The performance of our approach is compared with existing state-of-the-art techniques in the literature in Table 5. As shown in the table, the performance of models based on phonemes recorded in a noise-free, soundproof environment with a microphone varies from 89.5% to 97.7%, whereas the performance of models based on phonemes recorded in a normal clinical setting varies from 81% to 93.1%. While ambient noise resulted in a fall of 5.6% to 8.4% in the performance of the models in the literature, our proposed model was less affected by ambient noise and was capable of identifying PD from HC with 100% accuracy.
There are four major achievements of this study. Firstly, it has been found that people with PD and healthy age-matched controls differ most significantly in the production of the phoneme /o/, and that this difference is detectable even with background noise and recordings made with a handheld smartphone. The statistical analysis and classification results confirm that the voice features of phoneme /o/ can discriminate people with PD from HC participants more accurately than those of /a/ and /m/, but the combination of phonemes /a/, /m/, and /o/ is the most accurate. Secondly, it has been shown that computerized assessment of the voice of people with PD is suitable for real-world, regular clinical settings with background noise and using a smartphone with a low sampling rate. Thirdly, the model requires only phonemes and is therefore language independent. Finally, the model was trained and tested without hyperparameters tailored to a specific gender, so it is a gender-independent model.
A limitation of this study is that we did not consider factors such as accent, because all participants were from suburban Melbourne. There is also a need to test individuals multiple times to check the repeatability of the results and to use multiple devices, whereas this study used only one phone. Another weakness of this study is that the people with PD were more than two years post-diagnosis and thus not in the very early stage of the disease.
Conclusion
This study has investigated the use of sustained phonemes for computerized diagnosis of PD, based on the utterance of the three phonemes /a/, /o/, and /m/ recorded using a handheld smartphone in real-world clinical conditions with a signal-to-noise ratio of about 20 dB. A number of features showed significant differences between PD and HC. After feature selection from the three phonemes, /a/+/m/+/o/, the classifier differentiated between HC and PD with 100% accuracy. Two prominent differences between PD and HC based on the selected features are a decrease in voice energy and an increase in relative voice noise. The novelty of this study is the selection of acoustic features that are suitable for differentiating between PD and HC using a handheld smartphone and that are not sensitive to clinical ambient noise conditions. This study shows the potential of phoneme-based computerized diagnosis of PD that can be performed remotely using a smartphone, with applications for assisting in the clinic or for telehealth.
ACKNOWLEDGMENT
The authors acknowledge the team at Dandenong Neurology and RMIT University who collected the data and made it available online. Special thanks to Dr. Susmit Bhowmik and Dr. Sumaiya Kabir for their support and helpful discussions.