Acoustic Analysis of the Speakers’ Variability for Regional Accent-Affected Pronunciation in Bangladeshi Bangla: A Study on Sylheti Accent

Accented pronunciation variability is one of the key factors that degrade the accuracy of automatic speech recognition (ASR). This article reports an acoustic analysis of the variability caused by regional accent in two groups of speakers of Bangladeshi Bangla. The analysis considers seven monophthongal and four diphthongal vowels of Bangla to investigate the acoustic characteristics of two groups of single-accent speakers and how these characteristics relate to the articulation of Standard Colloquial Bangladeshi Bangla (SCBB). An accent is the speaker's regional signature and is shaped by his or her community and educational background. This study examines both male and female speakers from the Sylhet region, which has one of the extremely deviant dialects of Bangla, and comparatively less deviant speakers from different districts of the North-West and middle part of Bangladesh. Accent-related acoustic features such as pitch slope, formant frequencies, and vowel duration are considered in order to examine the prominent characteristics of the accents and to classify the accents from these features. The two gender groups are analyzed separately. Significant deviations in formant frequencies, and differences in the steepness of the rise and fall of the pitch slope, are found between the accents in both gender groups. The study also observes that accent-related changes in speech affect ASR performance. This emphasizes the need for accent-specific acoustic models to handle speakers of highly deviant dialects, and the need to account for accent-affected speaker variability in corpus development for robust ASR systems in Bangladeshi Bangla.


I. INTRODUCTION
In the Bengali, or Bangla, language there are many different accents among native speakers [2]. Geographically, the speakers can be divided into two major regions: Bangladesh and West Bengal (a part of India) [3]. Some native Bangla speakers also live in other countries of the world. Bangla can therefore be broadly classified into two main accent groups: Bangladeshi Standard Bangla and Kolkata (capital of West Bengal) Standard Bangla. Every language has standard accents; English, for example, has Received Pronunciation (RP) [43], which is a Standard British English accent, and General American (GenAm) [44], which is a Standard American English accent. For Bangladeshi Bangla, Standard Colloquial Bangladeshi Bangla (SCBB) is the standard accent of the educated people of Bangladesh. It reflects the standard variety of the language spoken in Dhaka and other cities of Bangladesh. SCBB also differs from the Kolkata Standard in phonetic and some other linguistic respects [1].
The associate editor coordinating the review of this manuscript and approving it for publication was Jenny Mahoney.
There are also significant differences within each of these two regions. According to P. Sloka Ray et al. (1966), there are some highly deviant regional dialects in Bangladesh, for example in Chittagong, the Chittagong Hill Tracts, Sylhet, Rangpur, and Mymensingh [2]. A deviant dialect is one that departs from the accepted standard dialect of a specific language. In this article, we report the acoustic analysis of the accent-related features of eleven (11) vowels, of which seven are monophthongal and four are diphthongal. The analysis examines how accent-affected pronunciation and inter-speaker variability influence the acoustic features for two groups of single-accent speakers of Bangladeshi Bangla. The corpus is formed from two groups of single-accent speakers in Bangladesh: one group from the Sylhet region, which has an extremely deviant dialect, and the other from different districts of the North-West and middle part of Bangladesh, which have less deviant dialects. Furthermore, we examine how the acoustic features of the chosen vowels within each of these accents relate to the pronunciation of SCBB. Four machine learning (ML) classification methods are used to classify the speakers' accent groups based on the accent-affected acoustic features. At the end of the article, we report the observed performance of two automatic speech recognition (ASR) systems on the accent groups.
The term dialect refers to differences in pronunciation, vocabulary, and grammar among varieties of the same language that form a particular speech pattern, whereas the term accent refers to a distinct pattern of pronunciation [4], [5]. The ''accent'' of a language reflects the geographical region and/or socio-economic class to which the speakers belong [4]. It also reflects the speakers' educational background [4]. Researchers have carried out several kinds of accent-related acoustic analysis for various languages across the world [6]-[11]. There are no research findings on Bangladeshi Standard Bangla except our own analysis of the accent-affected acoustic features of four (4) monophthongal vowels [12]. Accent analysis research in other languages has reported different accent-affected acoustic features that reveal the effect of regional accent on speech for a particular language community. The reported acoustic features are the first three formant frequencies, phone duration, intensity, and pitch slope of vowel sounds [6]-[11]. The formant frequencies F1 (first formant), F2 (second formant), and F3 (third formant) depend on the configuration of the vocal tract during the utterance of different types of vowels. Research on speakers from six different regions of America has shown that formants F1 and F2 are affected, and that vowel category differs significantly, by regional dialect [11]. Other studies have shown that phone durations vary for different vowels across regional accents of the same language [9], [10]. Formants F1, F2, and F3 differed significantly in some vowels between two well-known regional accents of British English [10]. Similar studies have shown that regional accents also have a significant effect on the pitch slope within the same language [10], [13].
Therefore, the acoustic characteristics of vowels are essential for accent analysis in a particular language community [9]-[11], [13]. Because vowels carry significantly more feature detail for accent analysis, this study restricts itself to the acoustic features of the Bangla vowel phonemes when investigating the regional accent effect.
Formants represent the resonant frequencies of the vocal tract during articulation, so the formant frequencies can be analyzed over time to investigate the effect of accents on vowel acoustics. Previous research analyzed the formant frequencies of a specific vowel by averaging the frequencies over time [10]; in this study, linear regression is used to generate the formant contour, which gives a better generalized representation (see Sections IV-A1 and IV-A2). Previous research on accent classification shows that the following acoustic features differ significantly between regional accents: (i) formant frequencies F1, F2, and F3; (ii) phone duration; (iii) intensity; (iv) Mel-Frequency Cepstral Coefficients (MFCCs); and (v) prosodic features such as the pitch contour and the rise/fall in the pitch slope [6], [7], [10], [14]. In this study, we consider the formant frequencies (F1, F2, F3), the phone duration, and the rise/fall in the pitch slope for regional accent classification using four (4) ML methods (see Sections II-B and IV-D). The results of the classification methods are also compared and analyzed.
Deep learning techniques have driven major advances in large-vocabulary ASR systems [15]. However, the quality of an ASR system depends on the quality of the speech corpus, and Bangladeshi Bangla lacks adequate annotated speech corpora for large vocabulary continuous speech recognition (LVCSR). The quality of a corpus depends on the number of hours of speech collected as well as on the speaker variability [7]. This study shows the necessity of investigating regional accent variation in Bangladeshi Bangla to categorize speaker variability and to build a quality speech corpus for a robust LVCSR system.
These are:  ,  ,  ,  ,  ,  ,  ,  ,  , , , , , [16], [17], [21]. In contrast, Abdul Hai (1967) and Daniul Huq (2002) reported the following 8 (eight) monophthongal vowels:  ,  ,  ,  , , , , and . They ignored the nasality of these phonemes because of their less frequent use in Bangladeshi Bangla [18], [19]. Contrary to their claim, most research studies have reported   as a diphthongal phoneme. C. A. Ferguson and M. Chowdhury (1960) and S. Dowla Khan (2010) reduced the number of monophthongal vowel phonemes to 7 (seven): three front unrounded, three back rounded, and a low neutral vowel (see Figure 1). Their corresponding nasal vowels are much less frequent than the oral vowels in practice. The vowel phonemes are:  ,  ,  or , , , , and [1], [20]. Previous research has not agreed on the actual number of diphthongal vowels in Bangla. According to Abdul Hai (1967), there are about 31 diphthongal vowel phonemes [18]; on the other hand, Suniti Kumar Chatterji (1921) [16], [17]. Alam et al. (2007) reported the union of all the distinct findings and studied a total of 38 diphthongs [21], whereas S. Dowla Khan (2010) listed only 16 of them [1].
In our study of accented speech data annotation, we consider 11 (eleven) vowels: 7 (seven) monophthongs, /i/, /e/, /E/, /a/, /O/, /o/, and /u/, and 4 (four) diphthongs, /ey/, /aW/, /OY/, and /oy/. vowel is denoted by or in IPA, however we followed the Bangladeshi standard Bangla IPA [1] and used throughout this paper. In this study, the acoustic and prosodic features of these vowels are explored and investigated. The script for recording the accented speech corpus contains SCBB sentences (see Section II for details).

B. WHY SYLHETI ACCENT
Sloka Ray et al. (1966) reported that Sylheti (the dialect of the Sylhet region) is one of the extremely deviant dialects of Bangladeshi Bangla [2]. Some linguists also recognize Sylheti as a language in its own right [22]. Within the Bangla language, it diverges strongly in pronunciation and vocabulary and, to a lesser extent, in grammar. It also has an alternative script called ''Sylheti Nagri'', which uses more Arabic and Persian words than Sanskrit ones because it was mostly used by the Muslim writers of the Sylhet region [23]. The ''Sylheti Nagri'' script has 5 (five) vowels and 27 (twenty-seven) consonants. Its vowel phonemes are: (i) i, (ii) e, (iii) a, (iv) o, and (v) u [23]. The absence of the /E/ and /O/ vowels in the ''Sylheti Nagri'' script is evidence of a significant difference in the pronunciation of SCBB in the Sylheti accent.
The outline of this article is as follows. Section II discusses the preparation of the accent database and describes the experimental setup for accent analysis. Section III describes the extraction of acoustic features from the accented speech. Section IV presents the most important parts: the experimental results, the accent classification, and the ASR performance on the accents. Section V discusses the results, explains how accent analysis can be used for robust Bangla speech recognition, and concludes the article.

II. ACCENT DATABASE AND EXPERIMENTAL SETUP
Based on the findings of Sloka Ray et al. (1966) [2], we chose one group of speakers from a highly deviant dialect and another group from moderately deviant dialects with a neutral accent of Bangladeshi Bangla. The first group is from the Sylhet region, and the second group is from the North-West and middle part of Bangladesh. Our study hypothesizes that speakers of a highly deviant dialect (i.e., Sylheti) show a stronger accent effect in their pronunciation of SCBB sentences than speakers with a neutral accent. People who have moderately deviant dialects and have spent a notable part of their life in Dhaka or its suburbs usually speak SCBB with a neutral accent. Based on this hypothesis, we prepared our accent database (see Section II-A) and set up the experiments for acoustic analysis and accent classification of the accented speech (see Section II-B).

A. ACCENT DATABASE
Four (4) male and three (3) female subjects were chosen from the Sylhet region, and five (5) male and three (3) female subjects from different districts of the North-West and middle part of Bangladesh. The speakers were chosen based on their distinguishable accents. For the Sylheti (SYL) accented group, the effect of the regional accent is easily heard in their speech. For the neutral (NEU) accented group, the neutral accent of SCBB is easily recognized in their speech data. The subjects chosen from the Sylhet region have lived all their life in the Sylhet region. The speakers with the neutral accent were raised in their home districts and have lived and been educated in Dhaka or its suburbs. The speakers of both groups are undergraduate students from the Shahjalal University of Science and Technology (SUST) community. Our previous study on the Sylheti and neutral accents found that the steepness of the rise and fall in the pitch contour of the /E/ vowel differed significantly between the two accent groups [12]. This finding helped us evaluate the accent groups at the preliminary stage of speaker selection.

1) AUDIO RECORDING
The corpus was recorded using ''Audacity'' (freeware for digital audio recording and processing) with an ''M-Track 2 × 2'' USB Audio/MIDI interface and a dynamic microphone in a studio acoustic environment. The recording script consists of SCBB sentences: a story of nine sentences that has been used in several research papers to study the phonetics of standard Bangladeshi Bangla [1], [19]. All recorded speech was sampled at 16 kHz. To maintain the recording and voice quality, the recorded speech was checked manually twice, once during the recording and again during the annotation. The script was recorded in a single session, with each sentence recorded as a separate utterance. The recordings were perceptually evaluated by a number of native Bangla speakers who are familiar with both the Sylheti and the neutral Bangladeshi Bangla accents, because research on accent perception has shown that a listener's own accent fluency and a posteriori knowledge of a specific accent affect his or her accent perception [37]. Furthermore, we evaluated the steepness of the rise and fall in the vowels' pitch contours after collecting the speech [10], [12], [13] (example pitch contours are shown in Figure 2). The details of the recorded corpus are given in Table 1.

2) SEGMENTATION AND ANNOTATION
The recorded data were segmented and annotated as follows. Two phonetic transcription systems were used for annotation: the Bangladeshi Standard IPA transcription [1] and the B-ToBI (Bengali Tones and Break Indices System) transcription [24]. Both were used for phoneme-level and syllable-level annotation. ''Praat'' (Version 6.0.19) [31], a well-known speech analysis and processing software, was used for segmentation and annotation. Vowel phonemes carry more temporal detail in their acoustic features than consonants [7], and accent mainly affects the vowel-related acoustic and prosodic features [7]. The vowels' features therefore have to be extracted and analyzed to evaluate the effect of regional accent variation on the standard accent [6]-[11], so the vowel phonemes were manually segmented using ''Praat'' [31]. Figures 3 and 4 show examples of phoneme-level annotation in ''Praat'' of the Bangla word '' '' ( , /bErtho/) for two male speakers, one from each accent. The chosen paragraph covers all the monophthongs and 5 (five) diphthongs of Bangladeshi Bangla. In the script, the monophthongs are more frequent than the diphthongs. The script contains 101 words, of which 53 were chosen for segmentation, labeling, and extraction of vowel acoustic features. Tokenization was done level by level. First we tokenized the 9 recorded sentences, and then the chosen 53 words. After that, these 53 words were tokenized into 108 syllables and 92 vowels. The monophthongs occurred from 4 to 18 times, while the diphthongs occurred from 1 to 3 times. Only the diphthongs that occurred more than twice, i.e., , , , and were considered here for accent analysis. The analyzed vowels and the corresponding words are listed in Table 2.
VOLUME 8, 2020

B. EXPERIMENTAL SETUP
For statistical analysis, the means of F1, F2, and F3, the pitch slope, and the duration of the vowels were arranged word-wise, speaker-wise, accent-wise, and then vowel-wise. The analysis was done using the statistical toolbox of MS Excel 2016. For both accents, we calculated the mean, the standard deviation, and the variance of each vowel's acoustic features (i.e., formant frequencies and phone durations). We also calculated the vowel-wise p-values of the one-tailed and two-tailed t-tests across the two accents. Likewise, the vowel-wise mean of the pitch slope was calculated for each accent.
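The descriptive statistics and the t statistic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say which t-test variant was selected in MS Excel, so Welch's unequal-variance form is assumed here, and the F1 values are hypothetical, not measurements from the corpus. Turning the statistic into a p-value additionally requires the cumulative distribution of Student's t (available in most statistics packages); because that distribution is symmetric, the one-tailed p-value is half the two-tailed one when the difference lies in the hypothesized direction.

```python
import math
import statistics


def describe(values):
    """Return the mean, sample standard deviation, and sample variance."""
    return (statistics.mean(values),
            statistics.stdev(values),
            statistics.variance(values))


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom.

    Assumption: unequal variances (Welch); the paper does not state the variant.
    """
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    se2 = var_a / n_a + var_b / n_b  # squared standard error of the difference
    t_stat = (mean_a - mean_b) / math.sqrt(se2)
    dof = se2 ** 2 / ((var_a / n_a) ** 2 / (n_a - 1)
                      + (var_b / n_b) ** 2 / (n_b - 1))
    return t_stat, dof


# Hypothetical F1 values (Hz) of one vowel for the two accent groups.
neu_f1 = [610.0, 640.0, 655.0, 630.0, 620.0]
syl_f1 = [480.0, 505.0, 470.0, 495.0, 500.0]

print("NEU mean/std/var:", describe(neu_f1))
print("SYL mean/std/var:", describe(syl_f1))
print("Welch t, dof:", welch_t(neu_f1, syl_f1))
```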
Furthermore, the temporal details of the extracted vowel features were saved as CSV (''comma-separated values'') files for ML-based analysis. ''GraphLab Create v2.1'', a machine learning platform based on Python 2.7, was used to apply the ML methods, i.e., Linear Classification, Support Vector Machine (SVM), Decision Tree (DT), and Nearest Neighbor Classifier (NNC), for accent classification [25]. Similarly, the Linear Regression method of GraphLab Create was used to generate the formant contours from the temporal detail of the formants.
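As a rough illustration of the regression-based contour, the sketch below pools frame-wise formant values from several tokens of one vowel on a normalized time axis and fits an ordinary least-squares line. It does not use the GraphLab Create API; the helper names, the straight-line model (no polynomial terms), and the sample values are assumptions made only for this example.

```python
def normalize_time(frame_times):
    """Map the frame times of one vowel token onto the interval [0, 1]."""
    start, end = frame_times[0], frame_times[-1]
    return [(t - start) / (end - start) for t in frame_times]


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys ~ intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Two hypothetical tokens of the same vowel: frame times (s) and F1 (Hz).
tokens = [
    ([0.10, 0.12, 0.14, 0.16], [520.0, 540.0, 555.0, 570.0]),
    ([0.30, 0.33, 0.36, 0.39], [510.0, 530.0, 560.0, 575.0]),
]

pooled_time, pooled_f1 = [], []
for times, f1_values in tokens:
    pooled_time.extend(normalize_time(times))
    pooled_f1.extend(f1_values)

intercept, slope = fit_line(pooled_time, pooled_f1)
# Evaluate the fitted contour at normalized times 0.0, 0.1, ..., 1.0.
contour = [intercept + slope * (i / 10) for i in range(11)]
print(contour)
```

Normalizing time before pooling lets vowels of different durations contribute to one contour, which matches the normalized time axis used when the contours are discussed in Section IV.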

III. ACCENTED ACOUSTIC FEATURE EXTRACTION
Researchers have found a significant effect of accent on the vowels' formant frequencies F1, F2, and F3, the pitch slope, and the phone duration [6]-[11]. In this study, these vowel features were extracted using ''Praat'' [31]. To track the formants, Praat performs a frame-wise short-term spectral analysis by applying a Gaussian-like window and computing the LPC (Linear Predictive Coding) coefficients with Burg's algorithm [26]. We used two sets of Praat settings, one for male and one for female speakers, as shown in Table 3.
To differentiate silence, voiced, and unvoiced frames, the voicing and silence thresholds of Praat were set. For each vowel utterance, the mean value of each formant frequency was measured over several successive frames so that the accents could be compared. Depending on the vowel duration, 6-12 successive frames were considered, carefully avoiding the effect of the adjacent consonants on the vowel utterance.
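One way to picture the frame selection is to average a formant track over a central window of 6 to 12 frames and skip the frames nearest the neighboring consonants. The sketch below is a hypothetical helper, not the authors' Praat procedure: the `edge` margin of two frames and the rule that ties the window size to the number of available frames are assumptions.

```python
def steady_state_mean(formant_track, edge=2, min_frames=6, max_frames=12):
    """Mean of a centered window of a frame-wise formant track (Hz).

    `edge` frames are skipped on each side where possible (assumed margin),
    and the window holds between `min_frames` and `max_frames` frames.
    """
    usable = len(formant_track) - 2 * edge
    keep = max(min_frames, min(max_frames, usable))
    keep = min(keep, len(formant_track))  # never ask for more than exists
    start = (len(formant_track) - keep) // 2
    window = formant_track[start:start + keep]
    return sum(window) / len(window)


# Hypothetical F1 track of a short vowel; the edges show consonant transitions.
short_vowel_f1 = [300.0, 400.0, 500.0, 500.0, 500.0,
                  500.0, 500.0, 500.0, 400.0, 300.0]
print(steady_state_mean(short_vowel_f1))
```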

IV. ACCENTED FEATURES ANALYSIS
The distinguishing features of vowel sounds can be associated with differences in their first three formant frequencies [28]. The manner of articulation of vowels differs from accent to accent, so the vowel duration also differs among accents. A prosodic feature, the pitch contour, reflects the regional accent background. These acoustic and prosodic parameters have been used in several studies for accent classification [6], [7], [10], [14]. In this study, we use these extracted features of the accented speech for both accent analysis and accent discrimination.
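To make the idea of discriminating accents from these features concrete, the following sketch classifies a vowel token with a small k-nearest-neighbor vote over standardized features. It is a plain-Python illustration rather than the GraphLab Create classifier used in the study; the feature order (F1, F2, F3, duration, pitch slope), the value of k, and every number in the training data are hypothetical.

```python
import math
from collections import Counter


def zscore_params(rows):
    """Column-wise means and population standard deviations."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [math.sqrt(sum((x - m) ** 2 for x in col) / len(col)) or 1.0
            for col, m in zip(columns, means)]
    return means, stds


def standardize(row, means, stds):
    """Scale one feature vector so that Hz and ms features are comparable."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]


def knn_predict(train_rows, train_labels, query, k=3):
    """Majority vote among the k nearest standardized training vectors."""
    means, stds = zscore_params(train_rows)
    scaled = [standardize(row, means, stds) for row in train_rows]
    target = standardize(query, means, stds)
    ranked = sorted(zip((math.dist(row, target) for row in scaled),
                        train_labels))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical tokens: [F1 Hz, F2 Hz, F3 Hz, duration ms, pitch slope Hz/s].
features = [
    [650.0, 1800.0, 2500.0, 120.0, 300.0],
    [640.0, 1750.0, 2550.0, 115.0, 280.0],
    [660.0, 1820.0, 2480.0, 125.0, 320.0],
    [500.0, 1900.0, 2600.0, 95.0, 600.0],
    [510.0, 1950.0, 2620.0, 100.0, 650.0],
    [490.0, 1880.0, 2580.0, 90.0, 580.0],
]
labels = ["NEU", "NEU", "NEU", "SYL", "SYL", "SYL"]

print(knn_predict(features, labels, [645.0, 1790.0, 2510.0, 118.0, 310.0]))
```

Standardizing each feature before measuring distance keeps the formants, whose values are in the hundreds or thousands of hertz, from dominating the duration and pitch-slope features.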
For the accent analysis, we used the mean, standard deviation, variance, and the one-tailed and two-tailed t-tests of all the extracted formant frequencies for the chosen eleven vowels. The t-tests measure the statistical significance of the difference between the two accents for F1, F2, F3, and phone duration. To make our formant analysis trustworthy, we also cross-checked the formant frequencies (F1-F3) calculated by Praat against another publicly available formant tracker, the DPPT (Differential-Phase Peak Tracking) algorithm [40]. For each extracted vowel, the pitch slope of the pitch contour was calculated from the maximum change in rise and fall over the minimum elapsed time. The means of the pitch slopes were then calculated to analyze the steepness of the rise and fall in the vowel pitch contour across the two accents. For accent discrimination, some feature engineering was applied to the extracted acoustic and prosodic features.
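The pitch slope definition above can be read in more than one way. One possible reading, sketched below, takes the steepest rise and the steepest fall between successive pitch samples of a vowel. The function name, the frame spacing, and the F0 values are hypothetical, and the authors' exact procedure may differ.

```python
def steepest_slopes(times, f0_values):
    """Return (steepest rise, steepest fall) in Hz/s between successive samples.

    The rise is reported as 0.0 if the contour never rises, and the fall as
    0.0 if it never falls.
    """
    if len(times) != len(f0_values) or len(times) < 2:
        raise ValueError("need at least two aligned (time, F0) samples")
    slopes = [(f0_values[i + 1] - f0_values[i]) / (times[i + 1] - times[i])
              for i in range(len(times) - 1)]
    return max(max(slopes), 0.0), min(min(slopes), 0.0)


# Hypothetical pitch samples of one vowel: time in seconds and F0 in Hz.
sample_times = [0.00, 0.01, 0.02, 0.03]
sample_f0 = [120.0, 125.0, 140.0, 130.0]

rise, fall = steepest_slopes(sample_times, sample_f0)
print(f"steepest rise: {rise:.0f} Hz/s, steepest fall: {fall:.0f} Hz/s")
```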

A. FORMANT FREQUENCIES ANALYSIS
Previous researchers, such as Zheng et al. [10], Adank et al. [9], Clopper et al. [11], and Ladefoged [27], have shown that the F1, F2, and F3 formant frequencies carry the most noteworthy information about vowels for accent analysis and detection. As the Sylheti dialect is one of the most deviant dialects in Bangladesh [2], we investigated the effect of the Sylheti accent on the pronunciation of SCBB dialogs by speakers from the Sylhet region. We analyzed the F1, F2, and F3 frequencies to explore the correlation between SYL and NEU accented Bangladeshi Bangla speech.
Every speaker configures the vocal tract into various shapes to articulate different sounds in different accents. During an utterance, the resonant frequencies of the vocal tract are modulated by the articulators, such as the palate, various parts of the tongue, the lips, the cheeks, and the teeth. The manner of articulation governs the formant frequencies. The first three formant frequencies (F1, F2, and F3) are important for understanding the sound [10], [28].
The first formant frequency (F1) is associated with the jaw opening; F1 rises as the jaw opens wider. This formant is therefore related to the vowel height, which is the distance between the tongue and the roof of the mouth. A higher F1 represents an open vowel, that is, a lower vowel height. Conversely, a lower F1 represents a close vowel and a higher vowel height [10], [28].
The second formant frequency (F2) is roughly correlated with the shape of the body of the tongue and with the tongue advancement. This formant is mostly governed by the frontness or backness of the tongue. A higher F2 represents a front vowel, that is, the tongue body is in the front of the mouth and the oral cavity is short. Back vowels have a lower F2 because the tongue body is in the back of the mouth, the mouth is elongated, and the pharynx is lowered [10], [28].
The third formant frequency (F3) varies with the shape of the lip rounding and also depends on the position of the vowel production. A higher F3 is related to the rounded shape of the lips. Furthermore, both F2 and F3 are associated with the lip rounding and the position of the vowel constriction [10], [28].
Vowel formant frequencies differ significantly between male and female speakers, depending on the vowel, the language, and the formant number [29]. We therefore present the results of the formant analysis for male and female speakers in two separate subsections (see Section IV-A1 and Section IV-A2). Section IV-A3 presents the cross-check of the Praat and DPPT formant frequencies (F1-F3), which makes the formant analysis more reliable.

1) MALE SPEAKERS' FORMANTS ANALYSIS
Figure 5a shows the distribution of the eleven vowels in the F2-F1 formant space for the male speakers across the NEU and SYL accents. Figure 5b shows a bar chart of the distance between the NEU accented and SYL accented vowels in the F2-F1 formant space. From Figure 5a, it can be seen that the NEU accented /E/ sound is well separated from the SYL accented /E/ sound. This is consistent with the fact that the /E/ phoneme does not exist in the Sylheti dialect and that the speakers tend to substitute the /e/ sound in its place. Figure 5b shows that the /E/ vowel has a notable distance between the accent groups. From Table 4, it can be seen that the vowel /E/ has a higher F1 value in the NEU accent. The observed p-value (<0.0001) from the 1-tailed and 2-tailed t-tests also indicates a significant difference in F1 for the /E/ vowel. The SYL accented /e/ sound also differs from the NEU accented /e/ sound. Figure 5a shows that the SYL accented /e/ and /E/ sounds are almost the same. Furthermore, in Table 4, the p-values (<0.0001) from the 1-tailed and 2-tailed t-tests suggest that both of these sounds differ significantly from their NEU accented counterparts, and that the /e/ vowel has a higher F1 value in the SYL accent.
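The kind of between-accent distance plotted in a bar chart like Figure 5b can be sketched as the distance between the accent-wise mean positions of each vowel in the F2-F1 plane. The Euclidean metric in Hz is an assumption, since the distance measure is not specified in the text, and the mean values below are hypothetical rather than taken from Tables 4 and 5.

```python
import math

# Hypothetical accent-wise mean formants in Hz, stored as (F2, F1).
neu_means = {"E": (1800.0, 650.0), "e": (2000.0, 450.0), "o": (900.0, 450.0)}
syl_means = {"E": (1860.0, 570.0), "e": (1970.0, 490.0), "o": (906.0, 458.0)}


def formant_distance(point_a, point_b):
    """Euclidean distance between two (F2, F1) points in Hz (assumed metric)."""
    return math.hypot(point_a[0] - point_b[0], point_a[1] - point_b[1])


def rank_by_distance(means_a, means_b):
    """Vowels sorted from the largest to the smallest between-accent distance."""
    distances = {vowel: formant_distance(means_a[vowel], means_b[vowel])
                 for vowel in means_a}
    return sorted(distances.items(), key=lambda item: item[1], reverse=True)


for vowel, distance in rank_by_distance(neu_means, syl_means):
    print(f"/{vowel}/: {distance:.1f} Hz")
```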
According to the scatter plot of the vowels in Figure 5a and the bar chart in Figure 5b, the NEU accented diphthongs /ey/ and /OY/ are also at a distinguishable distance along the F2 axis from the SYL accented diphthongs. However, the p-values from the 1-tailed and 2-tailed t-tests in Tables 4-6 suggest that the differences in the /ey/ and /OY/ vowels for the F1, F2, and F3 frequencies are not statistically significant. In this study, we have fewer samples for the diphthongs (Figure 6 shows the accent-wise number of samples of the phonemes considered in the acoustic feature analysis). The p-values from the two t-tests suggest that there is not sufficient evidence to draw a conclusion about the differences in the F2 frequency for these diphthongal vowels.
The scatter plot (see Figure 5a) of the vowel distribution for the male speakers also shows that the SYL accented /o/ and /O/ vowels are positioned close together in the F2-F1 formant space. This indicates that speakers with the SYL accent do not differentiate these sounds properly. This is not surprising: as noted in Section I-B, ''Sylheti Nagri'' has the /o/ vowel, but the /O/ vowel is missing. In addition, Figures 5a and 5b and Tables 4-6 show that the SYL accented /o/ and /O/ sounds differ from the NEU accented /o/ and /O/ sounds. The p-value (<0.0001) from the 1-tailed and 2-tailed t-tests also confirms that the NEU accented /o/ vowel differs significantly in F1 from the SYL accented one.
For the other three vowels, /i/, /u/, and /a/, F1 has higher values in accent SYL. The p-values (<0.0001) from the two types of t-test also suggest that there is a significant difference in F1 for the /i/ sound (see Table 4). On the other hand, F2 has higher values for /u/ and /a/ in accent SYL, while /i/ has a higher F2 value in accent NEU. However, these differences are not statistically significant (see Table 5). Furthermore, the /i/ and /u/ vowels have a higher F3 in accent SYL, and the /a/ vowel has a higher F3 in accent NEU. The 1-tailed t-test indicates that the F3 difference for the /a/ vowel is statistically significant (p-value <0.001), but the 2-tailed t-test gives a p-value≈0.001, which suggests that there is not sufficient evidence to conclude that F3 differs between the accents for the /a/ sound (see Table 6).
Figures 7a and 7b compare the formant contours of the /E/ vowel generated by the averaging method and by the linear regression method. It can be seen that the linear regression method gives a better generalized representation of the formant contour than the averaged one. Figure 7 shows the variation of the formant frequencies of the eleven vowels in the accented speech of the male speakers. There is not much variation of F1 along the time dimension for the sounds /i/, /u/, /ey/, and /oy/ between the accents. During the articulation of the sound /E/ (see Figure 7b), the tongue was raised in accent SYL and lowered in accent NEU (F1 was higher). The /o/ sound shows the opposite trend (see Figure 7f): the tongue was lowered in accent SYL (F1 was higher) and raised in accent NEU. For the /e/ and /a/ vowels (see Figure 7g and 7h), the tongue had approximately the same position at the beginning of the articulation; in the SYL accented speech it was then lowered in the middle, raised again, and reached a similar position at the end.
In contrast, the /O/ sound shows the opposite trend for F1 (see Figure 7d): in accent NEU, F1 was higher in the middle (the tongue was lowered) and then became the same for both accents at the end. For the /aW/ sound (see Figure 7i), F1 in the NEU accented speech was higher (the tongue was lowered) from the beginning to normalized time 0.6 and then the same for both accents up to the end. Furthermore, for the /OY/ sound (see Figure 7k), the tongue was lowered from normalized time 0.1 to 0.6 and then started to rise in accent NEU relative to accent SYL.
From Figure 7, it can be seen that there is not much variation of F2 along the time dimension up to normalized time 0.4 for the sounds /E/, /O/, /u/, /a/, and /o/ in both accents. After that, the tongue was more advanced in accent SYL than in accent NEU during the articulation of these sounds. In contrast, from normalized time 0.4 to 1.0, F2 was higher for the /i/ sound in accent NEU (see Figure 7c). Similarly, from normalized time 0.2 to 1.0, F2 was higher for the /OY/ sound in accent NEU (see Figure 7k). Here, the increased F2 indicates that the tongue is further advanced at its maximum point in the mouth in accent NEU. Furthermore, the tongue has a similar position in both accents for the /oy/ sound (see Figure 7l). For the /ey/ sound, F2 was higher at the beginning of the articulation in accent NEU and, from normalized time 0.7, fell below its value in accent SYL (see Figure 7j). On the other hand, for the /aW/ sound, F2 was higher at the beginning of the articulation in accent SYL and was the same for both accents from normalized time 0.8 (see Figure 7i). For the /e/ sound (see Figure 7h), F2 was lower at the beginning of the articulation, the same for both accents from normalized time 0.3 to 0.7, and lower again in accent SYL from normalized time 0.7 to 1.0. Here, a lower F2 means that the tongue is less advanced.

2) FEMALE SPEAKERS' FORMANTS ANALYSIS
Figures 8a and 8b and Tables 7-9 present the formant analysis of the accented speech of the female speakers in detail. Figure 8a shows a scatter plot of the distribution of the eleven vowels in the F2-F1 formant space across the two accents of Bangladeshi Bangla. Figure 8b shows a horizontal bar graph of the distance between the SYL and NEU accented vowels in the F2-F1 formant space. From Figures 8a and 8b and Tables 7-9, it can be seen that the accent effect on the female speakers' vowel distribution in the formant space for the NEU and SYL accented speech follows almost the same trend as the male speakers' results in the previous section (see Section IV-A1).
From the vowel distribution in the F2-F1 formant space (Figure 8a), it can be seen that the NEU accented /E/ and /e/ sounds are well separated from their SYL accented counterparts. Between the NEU and SYL accents, the /E/ vowel changes in F1, and the /e/ vowel changes primarily in F2. The difference of the means between the accents in F1 for the /E/ vowel is also statistically significant (the p-values of the 1-tailed and 2-tailed t-tests are <0.001; see Table 7). A similar trend was seen in the male speakers' accent analysis. Figure 8b shows that the /E/ vowel has a significant distance between the accent groups. This also supports the fact that no /E/ phoneme exists in the Sylheti dialect. For the SYL accent, the /E/ and /e/ sounds are close together in the F2-F1 formant space, which indicates that these speakers tend to substitute the /e/ sound for the /E/ sound. Furthermore, the p-values of the two types of t-test (both <0.001; see Table 8) indicate that the difference of the means between the accents in F2 for the /e/ vowel is statistically significant. The bar chart in Figure 8b shows the distance between the accents for the /e/ sound.
From Figures 8a and 8b, it can also be seen that the SYL accented /o/ and /O/ vowels are well separated from the NEU accented ones. According to the literature review in Section I-B, the Sylheti accent has only the /o/ vowel of these two. In Figure 8a, these vowels lie close together in the F2-F1 formant space for the SYL accent, which is consistent with the two vowels being articulated in a similar manner in the SYL accent. In contrast, Figures 8a and 8b and Table 7 show that the NEU accented /o/ and /O/ sounds differ from their SYL accented counterparts. The p-value (<0.001) from the 1-tailed and 2-tailed t-tests confirms that the difference of means between the NEU and SYL accented /o/ vowel is statistically significant in F1 (see Table 7). Moreover, the SYL accented /o/ sound has a higher F1.
Figures 8a and 8b suggest that the NEU accented diphthongs /ey/ and /OY/ have higher values on the F2 axis than their SYL accented counterparts. They also have a notable distance in the F2-F1 space. However, the p-values of the 1-tailed and 2-tailed t-tests in Tables 7-9 show that the differences of means for /ey/ and /OY/ in the F1, F2, and F3 frequencies are not statistically significant. Moreover, these diphthongs have fewer samples in this study (Figure 9 shows the number of samples of each phoneme considered in the accent analysis from the accented corpus). There is therefore not sufficient evidence to draw a conclusion about the differences of means in F2 for these diphthongs. The other two diphthongs, /aW/ and /oy/, are closely positioned in the F2-F1 formant space (see Figure 8a), and Tables 7-9 give no significant evidence of a difference of means in F1, F2, or F3 between the accents for them either.
For the remaining three vowels, /i/, /u/, and /a/, there is no significant difference in the means of F1 between the accents. The /i/ and /u/ vowels have close F1 values across the accents. The /a/ vowel has a higher F1 in accent NEU, but the difference is not statistically significant (see Table 7). However, Figure 8a shows that the /i/ sound has a higher value on the F2 axis in accent NEU. The p-values (<0.001) from the two types of t-test also indicate a significant difference in F2 for this sound (see Table 8). The other two sounds, /u/ and /a/, have close F2 values across the accents. On the other hand, F3 is higher for /i/ and /u/ in accent NEU and higher for /a/ in accent SYL, but these differences are not statistically significant (see Table 9).
Figure 10 shows the variation of the formant frequency contours of the eleven vowels between the accents for female speakers. F1 shows little variation along the time dimension for the sounds /i/, /u/, and /e/ between these accents. For the sound /E/ (see Figure 10a), the tongue was raised in accent SYL and lowered in accent NEU (F1 was higher) for almost the whole articulation, and the tongue was then raised at the end of the articulation in accent NEU. The /o/ sound has the opposite trend (see Figure 10e): the tongue was lowered in accent SYL (F1 was higher) and raised in accent NEU. For the /a/ vowel (see Figures 10f and 10g), the tongue has approximately the same position (the same value of F1) in both accents at the beginning of the articulation, but it was raised from the middle to the end of the articulation in the SYL accented speech. For the /O/ sound (see Figure 10c), F1 in accent NEU was higher from normalized time 0.1 to 0.6 (the tongue was lowered) and then decreased towards the end, whereas accent SYL had a lower F1 from the beginning to the end of the articulation. For the /aW/ sound (see Figure 10h), F1 in the NEU accented speech was higher (the tongue was lowered) from normalized time 0.1 to 0.6 and then decreased up to normalized time 0.9. Furthermore, for the /OY/ sound (see Figure 10j), the tongue was lowered from normalized time 0.1 to 0.6 and then started to rise in accent NEU relative to accent SYL.
From Figure 10, it can be seen that F2 shows little variation along the time dimension for the sounds /O/, /a/, and /o/ in both accents. For the /E/ sound, the tongue was more advanced in accent NEU than in accent SYL during articulation. For the /i/ sound, F2 was higher in accent NEU from the beginning to the end of the articulation (see Figure 10b). For the /OY/ sound, F2 was the same for both accents at the beginning of the articulation and then became higher in accent NEU (see Figure 10j). Here, the increased F2 indicates that the tongue is further advanced at its maximum point in the mouth in accent NEU. For the /oy/ sound, the tongue has a similar position in both accents up to normalized time 0.3; in accent SYL, F2 then rose up to normalized time 0.8 and decreased from there to the end (see Figure 10k). For the /ey/ sound, F2 in accent SYL was higher at the beginning of the articulation and then started to decrease relative to accent NEU (see Figure 10i). On the other hand, for the /aW/ sound, F2 was the same from the beginning of the articulation and then became lower in accent NEU from normalized time 0.8 (see Figure 10h). Here, a lower F2 means that the tongue is less advanced. For the /e/ sound (see Figure 10g), F2 was higher in accent NEU during the whole articulation. Moreover, for the /u/ vowel (see Figure 10d), F2 in accent NEU was lower at the beginning of the articulation, then started to increase, and followed the same pattern in both accents from normalized time 0.5 to the end.

3) EXTRACTED FORMANT FREQUENCIES VERIFICATION
Several formant trackers are available for extracting formant frequencies. Publicly available formant trackers include Praat [31], Wavesurfer [41], Winsnoori [42], and DPPT [39], [40], [45]. These formant trackers are popular with, and considered reliable by, speech clinicians, phoneticians, speech scientists, and linguists. Most of them use LPC-based formant estimation algorithms; Praat, Wavesurfer, and Winsnoori are examples of LPC-based formant trackers [31], [41], [42]. The DPPT algorithm, on the other hand, uses differential phase spectrum processing for formant tracking [40], [45]. Therefore, to validate the Praat formant frequencies, we have used the DPPT implementation from COVAREP (Cooperative Voice Analysis Repository) [39], a publicly available repository for speech technologies. DPPT is an efficient algorithm for formant tracking. Its main advantage is that it can track high-order formants effectively [40], [45], because the differential phase spectrum is free of spectral tilt [40], [45]. In a comparison of DPPT with the Praat formant tracker on synthetic speech, it has been reported that "the Praat's robustness on analysis of synthetic speech is lowest except for the F1 track" [40]. The same authors have also compared DPPT with the formant trackers of Praat and Winsnoori on four real speech examples. They found that DPPT was the best of the three methods for three of the four examples and the worst of the three for the fourth example [40], [45], whereas the Praat formant tracker was the most consistent of the three algorithms across all four real speech examples [40], [45]. Bozkurt [45] has reported another test comparing DPPT with the Praat and Wavesurfer formant trackers on five male and five female real speech examples containing Japanese, French, English, and Danish sentences. That study reported that the three formant trackers provided equivalent, high-quality formant tracks on this test set [45].
We have validated the Praat formant frequencies (F1-F3) against DPPT for 4 randomly selected male speakers (2 NEU and 2 SYL accented) and 4 randomly selected female speakers (2 NEU and 2 SYL accented). For the accented speech of these 8 speakers, we have extracted F1-F3 using DPPT and compared them with the F1-F3 previously extracted with Praat. We have compared only those formant frequencies for which we found a statistically significant difference between the accents (see Sections IV-A1 and IV-A2), that is, the results for the /E/, /i/, /o/, and /e/ vowels. With the Praat formant frequencies, we found a statistically significant difference in F1 between the accents for all four of these vowels for male speakers (see Section IV-A1). For female speakers, the /E/ and /o/ vowels have a significant difference in F1, whereas the /e/ and /i/ vowels have a significant difference in F2 (see Section IV-A2).
The cross-validation results between Praat and DPPT are presented in Table 10 for male speakers and in Tables 11 and 12 for female speakers. From Tables 10 and 11, it can be seen that the means of F1 from Praat and DPPT have close values within the same accent group for both male and female speakers. Furthermore, in Table 10, the means of the DPPT F1 show a notable distance between the accent groups for the /E/, /i/, /o/, and /e/ vowels for male speakers. The p-values (<0.001) of the two types of t-test confirm that the differences of means between the NEU and SYL accented versions of these vowels are statistically significant in the DPPT F1. The same holds for the Praat F1 for male speakers. From Table 11, it can be seen that the means of the DPPT F1 for female speakers have a significant distance between the accent groups for the /E/ and /o/ vowels. The p-values (<0.001) of the two types of t-test confirm that the differences of means between the NEU and SYL accented versions of these vowels are statistically significant in the DPPT F1 for female speakers. The same holds for the Praat F1 for female speakers. In contrast, the means of F2 from Praat and DPPT for female speakers have notable differences for the /i/ and /e/ vowels within the accent groups (see Table 12). The DPPT F2 also has a significant distance between the accent groups for those vowels (see Table 12). The p-values (<0.001) of the two types of t-test confirm that the differences of means between the NEU and SYL accented versions of these vowels are statistically significant in the DPPT F2 for female speakers. The means of the Praat F2 for female speakers also support statistically significant differences between the accents for the /i/ and /e/ vowels. Tables 4-9 show the analysis of the formant frequencies extracted with Praat. From the comparison, it can be summarized that the formant frequencies obtained with DPPT retain the statistically significant differences between the accents that were found with the Praat formants for these four vowels.
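The comparisons of means reported in this section rely on two-sample t-tests. As an illustration only, the following minimal sketch computes the t statistic and degrees of freedom of Welch's unequal-variance t-test. The choice of Welch's variant is an assumption, since the text does not state which form of the t-test was used, and the F1 values are invented placeholders, not data from the accented corpus. For a symmetric test statistic, the 1-tailed p-value is half the 2-tailed p-value when the difference lies in the hypothesized direction.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each group mean (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Hypothetical F1 values in Hz for one vowel (illustration only).
f1_neu = [700.0, 720.0, 690.0, 710.0, 705.0]
f1_syl = [560.0, 580.0, 575.0, 590.0, 570.0]

t_stat, dof = welch_t(f1_neu, f1_syl)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

The p-value itself would be obtained from the t distribution with `dof` degrees of freedom, for example with a statistics package.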

B. VOWEL PITCH SLOPE ANALYSIS
The contour of the prosodic feature pitch, or fundamental frequency (F0), is shaped by several known factors, such as the speaker's regional accent, language background, educational background, socio-economic class, anatomy, and emotional state [4], [10], [13]. Every language and accent has distinct intonation patterns, which are associated with the steepness of the rise and fall in the vowel pitch contour during articulation. Previous research [6], [7], [10], [13] has shown that intonation plays an important role in differentiating and investigating the influence of foreign and regional accents, because foreign or regional accented speech has a different pitch slope from native or standard accented speech. The pitch slope, which represents the steepness of the rise and fall in the vowel pitch contour, can be computed by dividing the maximum change in the pitch contour by the minimum time elapsed for the target vowel [13]. Figure 2 shows an example of the vowel pitch contour patterns for male speakers from the two accent groups.
Figures 11 and 12 show the results of the mean pitch slope analysis of the eleven vowels across the SYL and NEU accents for male and female speakers. For male speakers (see Figure 11), the /E/ sound has a negative pitch slope in accent SYL and a positive one in accent NEU. The /ey/ sound has the opposite trend: a positive pitch slope in accent SYL and a negative one in accent NEU. The pitch slopes of the other vowels have a similar trend in both accent groups of the male speakers, but the SYL accent group has a steeper fall for the /i/, /O/, /u/, /e/, and /oy/ vowels, and the NEU accent has a steeper fall for the /o/ and /a/ vowels. Of the remaining two vowels, /aW/ has a steeper rise in accent NEU and /OY/ has a steeper rise in accent SYL.
For the female speakers (see Figure 12), the /aW/ sound has a positive pitch slope in accent SYL and a negative one in accent NEU. The /OY/ sound has the opposite trend: a positive pitch slope with a steeper rise in accent NEU and a negative one in accent SYL. The pitch slopes of the other vowels have a similar trend across the accents for the female speakers. Among them, most of the vowels (/i/, /u/, /e/, /ey/, and /oy/) have a steeper fall in accent SYL, while accent NEU has a steeper fall for the /E/, /O/, and /o/ vowels. For the /a/ vowel, both accents have a similarly steep fall in the pitch slope.
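The pitch slope definition used in this section can be illustrated with a short sketch. It takes one reading of that definition as an assumption: the slope is the F0 change between the global minimum and maximum of the vowel's pitch contour, divided by the time between them, with a negative sign when the maximum comes first (a fall). The function name and the contour values are hypothetical and do not come from the accented corpus.

```python
def pitch_slope(times, f0):
    """Signed slope (Hz/s) between the F0 minimum and maximum of one vowel.

    times: strictly increasing sample times in seconds
    f0:    voiced F0 values in Hz, aligned with `times`
    """
    if len(times) != len(f0) or len(f0) < 2:
        raise ValueError("need two or more aligned samples")
    i_max = max(range(len(f0)), key=f0.__getitem__)
    i_min = min(range(len(f0)), key=f0.__getitem__)
    if i_max == i_min:  # perfectly flat contour
        return 0.0
    # Order the two extremes in time so the sign encodes rise vs. fall.
    first, last = sorted((i_min, i_max))
    return (f0[last] - f0[first]) / (times[last] - times[first])


# Hypothetical contours (illustration only).
falling = pitch_slope([0.00, 0.02, 0.04, 0.06, 0.08],
                      [120.0, 118.0, 112.0, 105.0, 101.0])
rising = pitch_slope([0.00, 0.01, 0.02], [100.0, 104.0, 110.0])
print(falling, rising)  # negative value = fall, positive value = rise
```

Averaging such per-token slopes over all tokens of a vowel within an accent group would give a mean pitch slope of the kind compared in Figures 11 and 12.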

C. VOWEL DURATION ANALYSIS
Vowel duration depends on several factors: the manner of articulation, stress, speaking style, rhythm, the endpoints of the word and syllable, the pause location in the utterance, and whether the vowel is articulated before a voiced or a voiceless consonant. For each vowel, every accent has a unique manner of articulation. During articulation, the shape of the vocal tract, which can be modified by the articulators, causes variation in the phone duration [10], [28]. Tables 13 and 14 show the mean, standard deviation, and t-test p-values of the duration of the eleven vowels across the SYL and NEU accents for male and female speakers. For male speakers (see Table 13), the /E/, /O/, and /OY/ sounds are shorter in accent SYL by 3 ms to ≈7 ms. The other eight vowels (/i/, /u/, /e/, /a/, /o/, /ey/, /aW/, and /oy/) are longer in accent SYL by 1 ms to ≈25 ms. On average, accent NEU has the shorter vowel duration: the average durations over all eleven vowels are 113 ms and 110 ms for accents SYL and NEU, respectively. For the SYL accent, the /E/ and /O/ sounds are shortened by a similar margin of ≈7 ms. The SYL accented vowels /i/, /u/, /e/, /a/, /o/, and /ey/ are lengthened by a smaller margin, in the range of 1.2 to 5.9 ms, whereas the /aW/ and /oy/ vowels are lengthened by a larger margin, in the range of 17.6 to 25.5 ms, relative to the NEU accent. The remaining vowel, /OY/, is shortened by a smaller margin of ≈3.6 ms in the SYL accent.
In contrast, for female speakers (see Table 14), the /u/ and /oy/ sounds are longer in accent SYL by about 2 ms, and the /E/, /O/, /i/, /e/, /a/, /o/, /ey/, /aW/, and /OY/ sounds are shorter in accent SYL by 1 ms to ≈21 ms. On average, accent NEU has the longer vowel duration: the average durations over all eleven vowels are 101 ms and 108 ms for accents SYL and NEU, respectively. For the SYL female accent, the /E/ and /O/ sounds are shortened by a similar margin of more than 14 ms. The SYL accented vowels /i/, /e/, /a/, /o/, and /OY/ are shortened by a smaller margin, in the range of 1.0 to 6.7 ms, whereas the /aW/ and /ey/ vowels are shortened by a larger margin, in the range of 10.6 to 20.7 ms, relative to the NEU accent. The remaining two vowels, /u/ and /oy/, are lengthened by a smaller margin of ≈2 ms in the SYL accent.

D. ACCENT CLASSIFICATION
In this section, the problem of accent discrimination for Bangladeshi Bangla is considered. We have investigated the discrimination between neutral accented speech and regional accented speech from a highly deviant dialect of Bangladeshi Bangla. The data analysis and discussion in the previous sections show that the acoustic features (formant frequencies and phone duration) and the prosodic feature (pitch slope) vary to different degrees between the accents. The contributions of these acoustic and prosodic features to accent classification were therefore investigated using four machine learning (ML) algorithms.
For the accent classification experiment, we have considered only the male speakers' data samples. Since 92 data points of the eleven vowels' features were extracted from each of the 9 speakers, there was a total of 828 data points for accented vowels from the male speakers. Of these, 55% are NEU accented vowels. The training (train) set contains ≈69% of the data (i.e., 571 data points), the cross-validation (cv) set contains ≈15% of the data (i.e., 125 data points), and the test set contains ≈16% of the data (i.e., 132 data points). Three sets of features (see Table 15) have been examined using four different ML methods from the GraphLab-Create toolkit: Linear Classification, SVM, Decision Tree (DT), and Nearest Neighbor Classifier (NNC). Several hyper-parameter settings have been examined for these ML methods, and only those settings that gave better accuracy on the train, cv, and test sets and better F1 scores on the cv set were retained.
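The data set sizes quoted above can be checked with a few lines of arithmetic. The sketch below uses only the counts stated in the text; the variable names are illustrative.

```python
# Counts taken from the text: 92 data points per speaker, 9 male speakers.
points_per_speaker = 92
speakers = 9
total = points_per_speaker * speakers

# Reported split sizes for the train, cross-validation, and test sets.
split_sizes = {"train": 571, "cv": 125, "test": 132}

# Share of each split in percent, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in split_sizes.items()}

print(total)   # 828
print(shares)  # {'train': 69.0, 'cv': 15.1, 'test': 15.9}
```

The three split sizes add up to the 828 data points, and the rounded shares agree with the approximate percentages given in the text.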
Table 16 shows that the linear (logistic) classifier achieves its best classification and accent detection with Feature Set-II, with a decent F1 score of 0.68 on the test set. Although the SVM has a better F1 score of 0.67 on the test set with Feature Sets I and II, it achieves better classification and more balanced detection of the two accents with Feature Set-III, with an F1 score of 0.63 on the test data. The Nearest Neighbor Classifier gives balanced accent detection with all of the feature sets, but it has its best classification performance on the test data with Feature Set-I. The DT method, in contrast, has both better classification and balanced accent detection on the test set with all of the feature sets. Furthermore, DT has the best classification and accent detection performance on all of the data sets with Feature Set-II, with an F1 score of 0.72 on the test data.
From Tables 15 and 16, it can also be seen that Feature Set-I contains the principal features and that all the ML methods obtain most of their classification and accent detection accuracy from these features, while the additional features in Sets II and III help the methods to tune their accuracy further. Feature Set-II contains an additional vector of 11 elements (1 × 11), one per vowel, that represents the pitch slope feature (rise = +1 or fall = −1) for the particular vowel corresponding to the data point; the other elements are set to zero. The data analysis in Section IV-B shows that the direction of the pitch slope (rise or fall) of the accented vowels differs between the accents. From Table 16, it can be seen that, with Feature Set-II, all these ML methods have balanced accent detection with better F1 scores on the test data. In Feature Set-III, we have added a further feature: the distance of the accented vowel from the NEU accented centroid of the corresponding vowel in the F2-F1 formant space. With Feature Set-III, we have achieved better accuracies on the train set for most of the ML methods, but the performance has decreased on the cv and test sets.
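The two additional features described for Feature Sets II and III can be sketched as follows. This is a minimal illustration under stated assumptions: the vowel ordering in `VOWELS`, the function names, the treatment of a zero slope as a fall, the use of Euclidean distance in Hz, and all numeric values are hypothetical and are not taken from Table 15.

```python
import math

# The eleven vowels analyzed in this study; the ordering is an assumption.
VOWELS = ["i", "e", "E", "a", "O", "o", "u", "ey", "aW", "OY", "oy"]


def pitch_slope_vector(vowel, slope):
    """1 x 11 vector: +1 (rise) or -1 (fall) at the vowel's slot, 0 elsewhere.

    A slope of exactly zero is treated as a fall here (assumption).
    """
    vec = [0] * len(VOWELS)
    vec[VOWELS.index(vowel)] = 1 if slope > 0 else -1
    return vec


def centroid_distance(f1, f2, neu_centroid):
    """Euclidean distance (Hz) from a token to the NEU centroid in F2-F1 space."""
    c_f1, c_f2 = neu_centroid
    return math.hypot(f1 - c_f1, f2 - c_f2)


# Hypothetical token of /E/ with a falling pitch slope (illustration only).
set2_extra = pitch_slope_vector("E", -12.0)
set3_extra = centroid_distance(600.0, 1800.0, (603.0, 1804.0))
print(set2_extra)  # [0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0]
print(set3_extra)  # 5.0
```

Since the centroid distance is measured from the NEU centroid, it could become less informative on unseen speakers, which would be in line with the drop in cv and test performance reported for Feature Set-III.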

E. ASR PERFORMANCE ON ACCENTS
We have developed an ASR system trained with the "Open SLR - Large Bengali ASR training data" [30] using the starter code of Deep Speech 2 (DS2) [36], which is provided by Baidu Research. DS2 is an end-to-end deep learning system. Its model architecture is based on a Recurrent Neural Network and is usually trained with the Connectionist Temporal Classification loss function (known as RNN-CTC). We have used an improved MFCC method [38] for speech feature extraction, trained the RNN-CTC model, and then tested its performance with our accented speech corpus. The Open SLR data set contains ≈196k utterances (i.e., ≈250 hours) of speech. This end-to-end ASR system has been released as a web app named "Sukothon" (v0.1 Beta) [34] by the Department of CSE, SUST under the HEQEP-CP3888 project. Figures 13a and 13b show the performance of the RNN-CTC based ASR system with and without a language model (LM), respectively, on our accented speech corpus. The performance of the Google ASR system (with the language option set to Bangladeshi Bangla) has also been tested with our accented speech corpus. To evaluate our accented speech corpus with our RNN-CTC based ASR and the Google ASR, we have divided the corpus into six datasets:
(a) Male SYL: speech data of the SYL accented male speakers
(b) Male NEU: speech data of the NEU accented male speakers
(c) Female SYL: speech data of the SYL accented female speakers
(d) Female NEU: speech data of the NEU accented female speakers
(e) Mixed SYL: speech data of the SYL accented speakers from both genders
(f) Mixed NEU: speech data of the NEU accented speakers from both genders
From Figures 13a and 13b, it can be seen that the WER (Word Error Rate) and CER (Character Error Rate) of our ASR system on the three SYL accented datasets (i.e., Male SYL, Female SYL, and Mixed SYL) are higher than on the three NEU accented datasets (i.e., Male NEU, Female NEU, and Mixed NEU). In addition, our ASR system with the LM has lower WERs and higher CERs than our ASR system without the LM on the respective datasets.
The performance of our ASR system with the LM (see Figure 13a) on the six accented datasets shows that the WERs on the Male SYL and Mixed SYL datasets are ≈10% higher than the WERs on the Male NEU and Mixed NEU datasets, respectively, and that the WER on the Female SYL dataset is ≈7% higher than the WER on the Female NEU dataset. Figure 13a also shows that the CERs on the Male SYL and Mixed SYL datasets are 4.5% and 4.38% higher than the CERs on the Male NEU and Mixed NEU datasets, respectively, and that the CER on the Female SYL dataset is 3.81% higher than the CER on the Female NEU dataset. For our ASR system without the LM (see Figure 13b), the WERs on the Male SYL, Female SYL, and Mixed SYL datasets are 9.34%, 4.54%, and 7.53% higher than the WERs on the Male NEU, Female NEU, and Mixed NEU datasets, respectively, and the CERs on the Male SYL, Female SYL, and Mixed SYL datasets are 5.06%, 2.24%, and 4.01% higher than the CERs on the Male NEU, Female NEU, and Mixed NEU datasets, respectively. The performance of the Google ASR has been evaluated on four accented datasets, i.e., Male SYL, Female SYL, Male NEU, and Female NEU (see Figure 16). The WERs of the Google ASR system on the Male SYL and Female SYL datasets are 6.15% and 0.12% higher than the WERs on the Male NEU and Female NEU datasets, respectively (see Figure 16).
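For reference, WER and CER are conventionally computed as the Levenshtein (edit) distance between the reference and the recognized text, divided by the reference length, over word tokens for WER and characters for CER. The following minimal sketch implements this standard definition; it is not the scoring script used in this study. Counting spaces as characters in the CER is an assumption, and the romanized example strings are hypothetical placeholders rather than corpus sentences.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions, and deletions.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate; the reference must contain at least one word."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, counting spaces as characters (assumption)."""
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical romanized reference and ASR output (illustration only).
reference = "ami bhat khai"
hypothesis = "ami bat khai"
print(f"WER = {wer(reference, hypothesis):.3f}")
print(f"CER = {cer(reference, hypothesis):.3f}")
```

In this example one of three words is wrong, while only one of thirteen characters is wrong, which illustrates why the WER is typically larger than the CER for the same output.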
We have also examined the character-wise error rate (%) of our ASR system without the LM on the four accented datasets, i.e., Male SYL, Female SYL, Male NEU, and Female NEU (see Figures 14 and 15). Each of these four datasets contains parallel text (the text aligned with the speech data), which has 43 unique Bangla characters. We have investigated only the output of our ASR without the LM because it gives the character recognition performance of the RNN-CTC part of the ASR system. Through this examination, we can identify the deficits of the corpus that need to be addressed for better RNN training. Figures 14 and 15 show the character-wise error rate (%) of our ASR system without the LM for these 43 characters. From Figure 15, it can be seen that, for the male speakers, [...]. On the other hand, [...] have a lower RER, in the range from 5% to 10%, in accent NEU, but they have a slightly higher RER in [...]. Our ASR system with the LM (see Figure 13a) has ≈7% higher WER and ≈4% higher CER in SYL accented speech for female speakers. Similarly, our ASR system without the LM (see Figure 13b) has approx. 4.5% higher WER and approx. 2% higher CER in accent SYL for female speakers. From Figures 14 and 15, it can be concluded that most of the characters have a higher RER for SYL accented speech. These trends imply that a different ASR system is needed for SYL accented speakers, or for speakers of a highly deviant dialect of Bangladeshi Bangla. The RNN-CTC performance on the SYL accent (see Figures 14 and 15) also suggests that the speakers' variability caused by a highly deviant regional dialect should be considered when building a quality speech corpus for a robust LVCSR system in Bangladeshi Bangla.

V. CONCLUSION
In this study, the correlation between two accents of the Bangladeshi Bangla language has been examined. The seven monophthongal and four diphthongal vowels of Bangla have been analyzed using accent-related acoustic features, i.e., formant frequencies and vowel durations, and prosodic features, i.e., pitch and pitch slope. The problem of accent classification for Bangladeshi Bangla has also been studied. The Neutral accent of Bangla and the deviant Sylheti accent were chosen for this study. The results of the formant frequency analysis show that the formant frequencies of four vowels, /E/, /i/, /o/, and /e/, differ significantly between these two accents for both genders. The mean pitch slopes and the mean vowel durations of these vowels also differ between the two accents. The results show that the NEU accented /E/ sound is well separated from the SYL accented /E/, which is consistent with the fact that the /E/ phoneme does not exist in the Sylheti dialect. The Sylheti dialect has the /e/ sound in its vowel phoneme inventory, so the SYL accented speakers tend to substitute the /e/ sound in its place. The paper has also reported that the /E/ sound has a significant difference in F1 for both genders, while the /e/ sound has a significant difference in F1 for male speakers and in F2 for female speakers. The results also show that the NEU accented /o/ sound is well separated from the SYL accented one, and that the SYL accented /O/ and /o/ sounds are placed closely in the F2-F1 space. This observation is consistent with the fact that SYL accented speakers cannot distinguish these sounds properly. The Sylheti dialect has the /o/ sound in its vowel inventory, so the SYL accented speakers tend to substitute the /O/ sound with the /o/ sound. The paper has also shown that the /o/ sound has a significant difference in F1 for both genders. Besides these findings, a new approach has been used to analyze the vocal tract shape in accented speech more precisely. Instead of simple averaging, linear regression has been used to generate the average contour of the formant frequencies F1, F2, and F3 for each vowel. Linear regression gave a better generalized representation of the formant contours than the averaged one.
The pitch slope analysis shows many differences in the vowels' pitch slopes between these two accents. The vowel duration analysis has shown that, on average, accent NEU has a shorter vowel duration than accent SYL for male speakers and a longer vowel duration for female speakers. The classification results show that the acoustic and prosodic features play a significant role in accent classification. The performance of the ASR systems suggests the necessity of an accent-based ASR system for robust speech recognition in Bangladeshi Bangla. We have investigated the accent-based speakers' variability between the NEU and SYL accents on a small accent database. Nevertheless, in the F2-F1 space we have found significant differences between the accents in the same four vowels for both genders, and the other investigation results have also shown that SYL accented speech deviates noticeably from NEU accented speech. This small dataset therefore supports the assumption that these results are also valid for other speakers of these dialects. Similarly, having examined the correlation between the two accents of Bangladeshi Bangla, it can be concluded that speakers of a highly deviant dialect (i.e., Sylheti) show a stronger accent effect in their pronunciation of SCBB sentences than speakers with a neutral accent. Therefore, the hypothesis of our study is supported (see the hypothesis in Section II). Regional accent-based speakers' variability has already been investigated in many studies for many languages (e.g., British English, American English, and French), and these studies have helped to build accent-based robust speech recognizers and robust LVCSR systems. This study has established the need to investigate the speech of people from a deviant dialect in Bangladesh, so that the speakers' variability in Bangladeshi Bangla can be considered before developing the speech corpus for a robust LVCSR system. This paper has also reported the necessity of an accent-based Bangla ASR system for speakers of an extremely deviant dialect in Bangladesh.