Intra-Native Accent Shared Features for Improving Neural Network-Based Accent Classification and Accent Similarity Evaluation

Accent similarity evaluation and accent identification are complex and challenging tasks for various applications due to the wide variety of native and non-native languages in the world. The lack of prior research on evaluating similarities between non-native and native English accents, together with the limitations of individual feature extraction methods for accent classification, prompted us to propose a new model, the intra-native accent feature shared-based native accent identification (NAI) framework, using an English accent archive speech dataset. The NAI network was employed for non-native English accent classification, native English accent classification, and identification of native versus non-native English accents. Finally, the accent similarity of native and non-native English accents was evaluated based on a pre-trained NAI model. Moreover, the proposed approach introduces an innovative training data augmentation idea to overcome the challenge of the huge amount of training data required for deep learning. Ordinary individual voice feature extraction with data augmentation and regularization techniques was the baseline for our work. The proposed approach boosted the accuracy of the baseline method by an average of 3.7%-7.5% across different vigorous deep learning algorithms. The Quade test for the performance comparison gave a significance level (p-value) of 0.01, showing that the proposed approach performed significantly better than the baseline. The model ranks non-native English accents based on their similarity to native English accents; the proximity rank is Mandarin, Italian, German, French, Amharic, and Hindi.


I. INTRODUCTION
Accent recognition is a special type of speech recognition, where an accent is a specific mode of pronunciation of a language. Native language, nationality, social class, age, disease, emotion, and gender are factors that affect a person's accent. Even people who share the same mother tongue, nationality, and social class may have variations in their accents. In fact, no two people in the world have exactly the same accent, including identical twins. Moreover, a person's accent may vary from day to day due to disease, environmental change, and other factors. Although there are many distinct parameters for recognizing the accent of an individual, this paper covers accent recognition based on speakers' nationality and dialect. Dialects are the major internal varieties of languages and represent internalized linguistic knowledge. A dialect is a particular form of a language for a specific region or social group. Every speaker has a particular dialect of their language, which is mostly determined by the region of birth and early life. Overall, identifying which nationalities' accents have more similarity is a very challenging issue.

(The associate editor coordinating the review of this manuscript and approving it for publication was Anandakumar Haldorai.)
Accent recognition is useful for linguistics (the science of language), speech-to-text transcription systems [1], forensic accent identification, speaker verification for secured voice-based systems [2], etc. Further, recognizing a non-native accent will help improve communication in various applications, including eLearning systems [3], automatic international phone call services [4], international trade, and tourism. Similarly, similarity evaluation between native and non-native speakers of a specific language is valuable for measuring the fluency of non-native speakers and is also applicable in linguistics. However, accent similarity evaluation and accent identification are complex and challenging tasks due to the existence of many types of native and non-native languages, ethnic groups, and the other aforementioned factors. Although it is a daunting task, accent recognition is highly significant, as the existence of various accents is a major factor that hinders the performance of speech processing systems. Therefore, performing accent recognition prior to speech recognition will help improve performance by minimizing the variation in accent.
The variety of dialects within languages makes NAI very challenging. In our accent detection tasks, we group speakers into the same class if they speak the same native language, even if they have their own sub-dialects. The lack of prior research on accent similarity proximity and the limitations of individual feature extraction methods motivated us to propose a new approach, the intra-native accent feature shared-based NAI framework. This paper presents this framework for accent classification and evaluates the accent similarity of native and non-native English accents based on a pre-trained NAI model. We also assess the proposed feature extraction method using different robust deep learning algorithms.
The main contribution of this paper is enhancing accent classification and accent similarity evaluation using ingenious intra-native accent shared features and a deep learning-based native accent identification (NAI) framework. The proposed model is employed for three different accent classification problem domains: the classification of six non-native English accents, the recognition of three native English accents, and the identification of non-native English accents versus native English accents. Furthermore, the proposed model also compares the accent similarity between native and non-native English accents using the NAI as a pre-trained model and native English accent voices as test data. To the best of our knowledge, there is no prior study of intra-native accent shared features-based NAI, nor any existing non-native and native English accent similarity evaluation study. Our accent similarity evaluation model could serve as the baseline for further accent similarity evaluation improvements. The proposed accent similarity evaluation method is tremendously helpful for linguistics, which studies the scientific properties of particular languages as well as the characteristics of language in general. Moreover, we noted that the spectrogram segmentation's frame size for the proposed neural network-based NAI model needs a deep experimental investigation to achieve desirable model performance with optimized computational time.
The most significant contributions of this study are summarized as follows:
• This paper proposes a novel intra-native accent shared features-based NAI framework. To our knowledge, the proposed feature extraction approach has never been applied in existing works.
• We conducted a deep investigation to obtain a spectrogram frame size value that achieves exemplary neural network model performance with optimized computational time.
• We introduced and investigated a previously unexplored task: native and non-native English accent proximity evaluation using the NAI pre-trained model.
• The performance of the proposed feature extraction is compared with the individual feature extraction (baseline) method.

The rest of the paper is organized as follows. Section II describes the related works. Section III discusses the dataset and the proposed NAI framework. Section IV provides our detailed experimental comparison and analysis. Section V discusses conclusions and future work.

II. RELATED WORKS
In recent years, several works have been carried out on accent recognition based on different machine learning algorithms, including deep learning algorithms. For instance, an automatic identification system for the accents of six different European countries (including Danish, German, British, Spanish, and Italian) based on hidden Markov models (HMMs) was investigated [5]. Arslan and Hansen [6] showed that performing speakers' accent recognition prior to speech recognition improves system performance by using a distinct training model for each accent of non-native American English speakers, and they achieved excellent results on a dataset of isolated words and phrases using an HMM model. Peng et al. [7] developed a multilingual approach for performing speech and accent recognition simultaneously using a deep neural network (DNN)-HMM framework. The authors concluded that transfer learning has a positive impact on performance for both speech and accent recognition tasks. Rizwan and Anderson [8] applied weighted accent classification using extreme learning machines (ELMs) and support vector machines (SVMs) to classify North American accents into seven groups using the TIMIT dataset. The ELM method achieved 77.88% accuracy on the specified dataset. Further, Jiao et al. [9] proposed an accent classification system using a hybrid DNN and recurrent neural network (RNN) algorithm. The RNN and DNN models were trained on long-term and short-term features, respectively. They showed that their proposed system outperformed baseline systems on the INTERSPEECH 2016 native language sub-challenge dataset for identifying the native language of non-native English speakers from seven countries. Wang et al. [10] upgraded the state-of-the-art speaker recognition paradigm to deep accent recognition.
They applied a deep framework with a discriminative feature learning method on the accent classification track of the Accented English Speech Recognition Challenge 2020 (AESRC 2020). A convolutional RNN was employed as a front-end encoder, and the local features were fused using an RNN to form an utterance-level accent representation. The performance was better than that of the baseline systems. A deep belief network classifier on Mel-frequency cepstral coefficient (MFCC) features was employed for foreign English accent classification and achieved good performance [11].
An accented spoken English corpus was created from 30 speakers from six different countries (including China, France, Germany, Turkey, and Spain). Their method achieved 90.2% accuracy on the 2-accent dataset and 71.9% on the 6-accent dataset. The authors showed that the deep belief network learning algorithm performs much better than the SVM, K-nearest neighbor (K-NN), and random forest algorithms. Weninger et al. [12] presented a classification of regional accents of Mandarin speakers for robust automatic speech recognition (ASR). The dataset for accent classification was collected from 15 different geographical regions of China. Bidirectional long short-term memory (Bi-LSTM) and i-vector methods were proposed for the accent identification system. Finally, they evaluated ASR on the accented data. A 1-D convolutional neural network (CNN)-long short-term memory (LSTM) method achieved 94.9% accuracy on MFCC features for classifying the accents of three major Nigerian indigenous languages on a dataset of 6,000 utterances of 20 different words spoken by 300 speakers [13]. The CNN is better than the classical SVM model for keyword and accent recognition in speech and, surprisingly, a hybrid CNN-SVM outperformed both pure CNN and pure SVM [14]. The CNN-LSTM model was the most accurate and fastest compared to CNN, LSTM, and gated recurrent units (GRU) [15]. Likewise, in our investigation, the CNN-LSTM model achieved the highest accuracy for NAI compared to the CNN, LSTM, Bi-LSTM, and GRU deep learning algorithms. In the existing works, authors obtained a single MFCC, spectrogram, or other extracted feature from each speaker. In our study, intra-native accent shared features are used to improve deep learning-based accent classification and accent similarity evaluation. A single spectrogram feature is generated from a mixture of different speakers who share the same native language.
The performance of the proposed intra-native shared features technique is examined using robust deep neural network algorithms. The proposed approach enhanced the accuracy of the individual feature extraction baseline method on different deep learning algorithms.

III. DATASETS AND METHODS
The dataset for this work was prepared for experimental investigation, and the essential dataset preprocessing was performed prior to feature extraction. After the dataset was prepared and preprocessed, the feature extraction techniques were applied. Finally, the processed data was fed to the proposed NAI networks. The dataset preparation and proposed method are discussed in the following sub-sections.

A. DATASET PREPARATION AND PREPROCESSING
In this paper, the accent archive dataset hosted by George Mason University on Kaggle was used for experimental investigation for all aforementioned tasks. To avoid the class imbalance problem, we selected a balanced number of speakers for each accent classification task. The speech accent archive uniformly presents a large set of speech samples from a variety of language backgrounds [16]. Native and non-native speakers of English read the same paragraph. The archive is used by people who wish to compare and analyze the accents of different English speakers. It contains native and non-native speakers of English, 2,140 speech samples, and participants from 177 countries with 214 different native languages.
Each speaker had read the paragraph: ''Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.'' The paragraph contains 69 words and it also contains most of the English consonants, vowels, and clusters.
The sample voices were recorded at a sampling frequency of 44.1 kHz with a bit depth of 16 bits. Since the voices were recorded and stored as MP3 files, we converted the audio files and stored them as WAV files. The paragraph can typically be read within 50 seconds, although each speaker's reading time differs. These long utterances were split into 2-second voices, each of which was used as a sample for the training and test datasets. We reduced the noise before converting the voice signal to a spectrogram by applying the minimum mean-square error log-spectral amplitude estimator algorithm [17]. This noise reduction method is very effective at enhancing noisy speech and significantly improves speech quality. We selected non-native English accent speakers who are native speakers of Mandarin, Hindi, Amharic, French, German, and Italian. The native English accent speakers were selected from the USA, Canada, and the UK. Some utterances were recorded in environments with background noise. We applied cross-validation for the training and test dataset separation in all accent classification tasks except the accent similarity evaluation. For the non-native and native English accent similarity evaluation, the whole native English dataset was assigned to the test data, whereas the whole non-native accent dataset was used as training data. A summary of the dataset statistics is presented in Table 1.
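As a concrete illustration of the preprocessing step, the sketch below splits a long utterance into fixed two-second clips, assuming the audio is already loaded as a NumPy array of samples. The MP3-to-WAV conversion and the log-spectral amplitude noise reduction are omitted; the function name and the synthetic signal are our own.

```python
import numpy as np

def split_into_clips(signal, sr=44100, clip_seconds=2):
    """Split a long utterance into fixed-length clips, dropping any remainder."""
    clip_len = sr * clip_seconds
    n_clips = len(signal) // clip_len
    return [signal[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]

# A synthetic ~50-second "recording" at 44.1 kHz yields 25 two-second clips.
recording = np.random.randn(44100 * 50)
clips = split_into_clips(recording)
```

Each returned clip then serves as one sample voice for the training or test dataset.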

B. PROPOSED NAI FRAMEWORK 1) INTRA-NATIVE ACCENT SHARED FEATURE EXTRACTION
In the proposed intra-native accent shared feature extraction, the original training and test voices are split into one-second and two-second voices, respectively. Each one-second training voice is joined, in turn, with five randomly selected one-second voices from other speakers of the same native language. A single spectrogram feature is thus generated from a mixture of different speakers who share the same native language. The two-second intra-native accent shared voices are created and transformed into spectrograms. Fig. 1 depicts how the proposed intra-native accent shared feature extraction differs from the individual feature extraction approach of ordinary speech recognition. Furthermore, an algorithm is presented to detail the proposed feature extraction method.
Finally, the deep learning algorithms take in these shared features during the training phase. Since the new features are combinations of voices from speakers of the same native language, the training data becomes highly learnable. The amount of intra-native accent shared data is five times the original training data. This increase in training data also helps reduce deep learning's demand for a huge amount of training data. This feature extraction method is applied only to the training dataset and never touches the test dataset. In the individual feature extraction baseline case, the training and test voices are split into 2-second voices, which are converted into spectrograms. Hence, each spectrogram contains an individual feature, being obtained from a single speaker.
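A minimal sketch of the shared-voice augmentation described above, assuming every clip in the input list is a one-second NumPy array from a speaker of the same native language. The helper name and the use of `numpy.random.default_rng` are our own choices, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def intra_native_shared_voices(clips, n_partners=5, rng=rng):
    """For each one-second clip, concatenate it with n_partners randomly
    chosen one-second clips from OTHER speakers of the same native language,
    yielding n_partners two-second shared voices per original clip."""
    shared = []
    for i, clip in enumerate(clips):
        others = [j for j in range(len(clips)) if j != i]
        partners = rng.choice(others, size=n_partners, replace=False)
        for j in partners:
            shared.append(np.concatenate([clip, clips[j]]))
    return shared

# 10 one-second clips (22050 samples at 22.05 kHz) from one native-language class
one_sec = [np.random.randn(22050) for _ in range(10)]
shared = intra_native_shared_voices(one_sec)   # 10 clips x 5 partners = 50 voices
```

This reproduces the five-fold increase in training data: each two-second shared voice is subsequently converted into one spectrogram.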
During feature extraction for speech recognition, the audio signal is converted into frames with a Hamming window to obtain a spectrogram. The speech signal is converted into the frequency domain by applying the FFT to the voice signal with a 23-millisecond frame size at a default sample rate of 22,050 Hz. Detecting accents based on very long-term feature dependencies of the voice is a complex task. To mitigate this challenge, we divide the feature dependency into 1-second intervals and apply the intra-native accent shared feature extraction. However, a state-of-the-art feature extraction approach is still required to tackle the issue of long-term feature dependency. To handle this, we utilize a combination of a CNN for short-term features and LSTM networks for processing the long-term sequences resulting from the CNN. The NAI framework is employed for non-native English accent classification (Mandarin, Hindi, Amharic, French, German, and Italian natives), native English accent classification (USA, Canadian, and UK), and identification of native versus non-native English accents using deep learning, as depicted in Fig. 2.
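The spectrogram step can be sketched with SciPy's short-time Fourier transform; `nperseg=512` at 22,050 Hz gives a ~23.2 ms Hamming-windowed frame, close to the 23 ms frame size used here. The exact STFT parameters (hop length, log compression) of the original implementation are not stated, so this is an approximation.

```python
import numpy as np
from scipy.signal import stft

sr = 22050
frame = 512                      # 512 / 22050 Hz ≈ 23.2 ms per frame
voice = np.random.randn(sr * 2)  # one two-second shared voice

f, t, Z = stft(voice, fs=sr, window='hamming', nperseg=frame)
log_spec = np.log1p(np.abs(Z))   # log-magnitude spectrogram fed to the network
```

The resulting two-dimensional log-magnitude array is what gets resized into the fixed spectrogram image consumed by the NAI networks.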

2) DEEP LEARNING ALGORITHMS FOR NAI
In this paper, the proposed intra-native accent shared feature extraction technique is evaluated with the most popular and vigorous deep learning algorithms, including CNN, LSTM, CNN-LSTM, Bi-LSTM, and GRU. Considering its better performance, an intelligent hybrid CNN-LSTM-based NAI model is proposed for accent classification and accent similarity evaluation, as shown in Fig. 3. Hybrid CNN-LSTM models have been proposed in recent years to improve classification performance. The performance of most deep learning models for time series classification suffers greatly from overfitting. Dropout reduces overfitting significantly and performs better than other regularization methods [18]. Besides these solutions, we have found that the fusion of the CNN-LSTM model reduces overfitting [19]. Moreover, batch normalization and dropout regularization were applied to all the mentioned deep learning algorithms to reduce overfitting and to make the comparison fair. In this study, an LSTM deep learning algorithm replaces the fully connected layer of the CNN model. Originally, the LSTM was employed for deep feature extraction and classification. It has shown advanced performance for sequential data prediction and classification in many applications [20], [21], [22], [23], [24], such as time series trend forecasting, image classification, speech classification, hate speech detection, and sentiment analysis. Similarly, convolutional long short-term memory (ConvLSTM) [25] and hybrid CNN-LSTM models [26], [27], [28], [29] have improved on pure CNN and pure LSTM models. To overcome the vanishing gradient problem, the LSTM was fused onto the CNN, and the ReLU activation function was also applied in our NAI model. Since the LSTM controls the exploding gradient problem [30], we noted that replacing the fully connected layer of the CNN with an LSTM reduced the vanishing gradient problem.
The LSTM consists of operations, activation functions, and states for receiving inputs over time. The CNN's flattened vector output $V_{(t=i)}$ for frame $i$ is fed to the interconnected LSTM network as input $x_i$ at time step $t = i$. The LSTM consists of an input gate, a forget gate, and an output gate, represented by $i_t$, $f_t$, and $o_t$, respectively. It has a cell state ($c$) and a hidden state ($h$), which serve as the long-term memory and short-term memory, respectively. In the LSTM gates, $\sigma$ is the element-wise sigmoid function and $\tanh$ is the element-wise hyperbolic tangent activation function. The LSTM gates process the flattened input vector ($x_t \in \mathbb{R}^{N \times 1}$) at time step $t$ together with the previous short-term memory ($h_{t-1}$), where $N \times 1$ is the size of the vectors.
Finally, the new cell state and the new short-term memory are computed according to:

$$c_t = f_t \odot c_{t-1} + i_t \odot \hat{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where $c_{t-1}$ is the previous cell state, $\hat{c}_t$ is the candidate cell state, $c_t$ is the new cell state, and $\odot$ is the element-wise product operator. The CNN-LSTM model requires careful adjustment of the frame size to design a worthy model. We simplified this complication by selecting an optimal and generalized number of frames per spectrogram, $2^i$, with frame width $L / 2^i$ for a $64 \times 64 \times 3$ spectrogram image, where $i = 0, 1, 2, \ldots, 6$ and $L$ is the length of the Mel-spectrogram. We compared the performance of all $2^i$ frame settings experimentally and found that a frame size of $16 \times 64 \times 4 \times 3$ per spectrogram is surprisingly the best CNN-LSTM input among the tested frame sizes [19]. In the selected frame size, 16 is the number of frames, 64 is the height of the frames, 4 is the width of the frames, and 3 is the number of channels.
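To make the data flow concrete, the sketch below slices a 64×64×3 spectrogram into the selected 16×64×4×3 frame layout and runs the LSTM recurrence in plain NumPy. The CNN encoder is omitted and the flattened frames are fed to the cell directly with toy random weights, so all sizes and names here are illustrative, not the paper's trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b stack the parameters of the
    input, forget, and output gates and the candidate cell state."""
    z = W @ x_t + U @ h_prev + b           # pre-activations, shape (4H, 1)
    H = h_prev.shape[0]
    i_t = sigmoid(z[0:H])                  # input gate
    f_t = sigmoid(z[H:2*H])                # forget gate
    o_t = sigmoid(z[2*H:3*H])              # output gate
    c_hat = np.tanh(z[3*H:4*H])            # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat       # new long-term memory
    h_t = o_t * np.tanh(c_t)               # new short-term memory
    return h_t, c_t

# Slice a 64x64x3 spectrogram into 16 frames of width 4: (16, 64, 4, 3).
spec = np.random.randn(64, 64, 3)
frames = spec.reshape(64, 16, 4, 3).transpose(1, 0, 2, 3)

H, N = 8, frames[0].size                   # toy hidden size; N = 64*4*3 = 768
rng = np.random.default_rng(1)
W = rng.standard_normal((4 * H, N)) * 0.01
U = rng.standard_normal((4 * H, H)) * 0.01
b = np.zeros((4 * H, 1))
h, c = np.zeros((H, 1)), np.zeros((H, 1))
for fr in frames:                          # 16 time steps, one per frame
    h, c = lstm_step(fr.reshape(-1, 1), h, c, W, U, b)
```

Since $h_t = o_t \odot \tanh(c_t)$ with $o_t \in (0,1)$, every hidden-state entry stays inside $(-1, 1)$, which the recurrence above reproduces.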

3) MODEL CONFIGURATION AND OPTIMIZATION
We optimized and configured the proposed neural network-based NAI, as shown in Table 2. For the comparison to be persuasive, well-optimized and well-adjusted parameter settings were also designed for all deep learning models. Similarly, the model optimization settings of the CNN-LSTM model are fixed, as shown in Table 3.

VOLUME 11, 2023
The binary cross-entropy loss function was employed for native versus non-native English accent classification, whereas categorical cross-entropy was applied for the remaining multi-class classification tasks. The cost function of the categorical cross-entropy is calculated as the average over the losses of the individual training samples, and the cost function for the binary cross-entropy is likewise calculated for binary classification as follows:

$$J_{\mathrm{cat}}(\theta) = -\frac{1}{N}\sum_{n=1}^{N}\sum_{m=1}^{M} y_{nm}\log \hat{y}_{nm}$$

$$J_{\mathrm{bin}}(\theta) = -\frac{1}{N}\sum_{n=1}^{N}\left[\, y_n\log \hat{y}_n + (1-y_n)\log(1-\hat{y}_n)\,\right]$$

where $M$ is the number of classes, $N$ is the number of training samples, $y$ is the one-hot encoded label (the ground truth of each class), and $\hat{y} = f(x;\theta)$ is the softmax prediction probability at the output layer of the LSTM for each training sample, where $x$ is the feature space and $\theta$ is the set of parameters of the classifier.
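The two cost functions can be written directly in NumPy; a small epsilon guards the logarithm, and the toy labels and probabilities below are our own illustration.

```python
import numpy as np

def categorical_ce(y, y_hat, eps=1e-12):
    """Average categorical cross-entropy over N samples and M classes.
    y is one-hot encoded (N, M); y_hat holds softmax probabilities (N, M)."""
    return -np.mean(np.sum(y * np.log(y_hat + eps), axis=1))

def binary_ce(y, y_hat, eps=1e-12):
    """Average binary cross-entropy for the native vs. non-native task."""
    return -np.mean(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

y = np.array([[1, 0, 0], [0, 1, 0]])                 # two one-hot samples
y_hat = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]) # softmax outputs
loss = categorical_ce(y, y_hat)   # -(log 0.7 + log 0.8) / 2 ≈ 0.290
```

In practice these correspond to the `categorical_crossentropy` and `binary_crossentropy` losses available in Keras.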

4) NAI FRAMEWORK-BASED ACCENT SIMILARITY EVALUATION METHOD
As an exceptional case, evaluating the accent proximity between non-native and native English accents is an extremely arduous task because there is no adequate way to separate training data from test data. To address this complication, we used the NAI as a pre-trained model, and the native English accent voices were used as test data, as shown in Fig. 4. Although the model was trained on non-native English accents, it is also able to classify the native accent voices according to their similarity. The relative accent similarity percentage (%) between native and non-native English speakers is calculated as given below:

$$\mathrm{RAS} = \frac{\text{Number of utterances predicted as } X\text{-native}}{\text{Total number of native English utterances}} \times 100 \qquad (7)$$

where RAS is the relative accent similarity between the X-native and native English accents, and X is one of the non-native English accents (Mandarin, Amharic, Hindi, German, Italian, and French).
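Equation (7) can be sketched as follows; the predictions below are hypothetical stand-ins for the class labels the NAI model would output for native English test utterances, and the helper name is our own.

```python
import numpy as np

def relative_accent_similarity(predictions, accents):
    """Eq. (7): share of native-English test utterances that the
    non-native-trained NAI model assigns to each non-native accent class."""
    predictions = np.asarray(predictions)
    total = len(predictions)
    return {a: 100.0 * np.sum(predictions == a) / total for a in accents}

accents = ["Mandarin", "Italian", "German", "French", "Amharic", "Hindi"]
# Hypothetical predictions for 10 native-English utterances
preds = ["Mandarin"] * 4 + ["Italian"] * 3 + ["German"] * 2 + ["Hindi"]
ras = relative_accent_similarity(preds, accents)   # e.g. Mandarin -> 40.0
```

Because every native utterance is assigned to exactly one non-native class, the RAS values sum to 100%.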

IV. EXPERIMENT TOOLS AND RESULTS ANALYSIS

A. EXPERIMENT SETUP
The experimental results were obtained using the Keras framework on the frontend and the TensorFlow framework on the backend for the deep learning classification models, with the Python programming language on a graphics processing unit (GPU). We used an NVIDIA GeForce RTX 2080 Ti GPU with 11 gigabytes (GB) of dedicated memory. The compute unified device architecture (CUDA) toolkit for GPU-accelerated applications and the NVIDIA CUDA deep neural network (cuDNN) GPU-accelerated libraries for deep neural networks were installed and configured on a Windows 10 Intel 64-bit operating system.

B. RESULTS AND DISCUSSION
The dataset was split randomly into training data (70%) and test data (30%). We used early stopping to save the model with the best validation accuracy. The performance of the model is calculated as follows:

$$\mathrm{Recall}(\%) = \frac{TP}{TP + FN} \times 100 \qquad (10)$$

where TP is the true positive, TN is the true negative, FP is the false positive, and FN is the false negative. We examined the performance of the proposed intra-native shared features technique using robust deep neural network algorithms. A summary of the performance of the proposed intra-native accent shared and ordinary feature extraction with the CNN, LSTM, CNN-LSTM, Bi-LSTM, and GRU models for NAI is shown in Table 4. Furthermore, we evaluated the overall performance of the model and conducted further analysis using the weighted averages of the precision and F1-score metrics [31] in Table 5. The confusion matrices of the test sample predictions for non-native and native English accents are shown in Table 6 and Table 7, respectively. Similarly, the model accuracy graphs for non-native English accent recognition, native English accent classification, and native versus non-native English accent recognition using CNN-LSTM are shown in Fig. 5a, Fig. 5b, and Fig. 6, respectively. Overall, the results show that the intra-native accent shared feature extraction-based CNN-LSTM NAI model is much more powerful than the pure deep learning algorithms.
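The metrics follow directly from the confusion-matrix counts; recall matches Eq. (10), and the remaining metrics use their standard definitions. The example counts are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard metrics (in %) from confusion-matrix counts;
    recall matches Eq. (10)."""
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    precision = 100.0 * tp / (tp + fp)
    recall = 100.0 * tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for one accent class
acc, prec, rec, f1 = classification_metrics(tp=80, tn=90, fp=10, fn=20)
# recall = 80 / (80 + 20) * 100 = 80.0
```

For the multi-class tasks, per-class precision and F1 scores are combined into the weighted averages reported in Table 5.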
Further, for a sound result analysis, we used popular non-parametric statistical techniques for comparing model performance. The Friedman test, Friedman aligned ranks test, and Quade test help to compare the performance of multiple models statistically [32], [33]. A significance level (p-value) < 0.05 indicates that one model is better than the other. We chose the Quade test for the performance comparison of the three algorithms, and the result gives a p-value of 0.02. The Quade test result indicates that the proposed intra-native shared features of the training data are more robust for deep learning-based NAI than the individual features of the training data, as Table 8 shows. The average similarity percentage of non-native to native English accents is depicted in Fig. 7. For the model to be robust, the non-native accent classification models were trained using the whole non-native accent dataset. Since the hybrid CNN-LSTM and Bi-LSTM models achieved higher accuracy for NAI than the other aforementioned deep learning algorithms, the native and non-native English accent similarity evaluation was performed on the CNN-LSTM and Bi-LSTM-based NAI pre-trained models. To make the evaluation convincing, the CNN-LSTM and Bi-LSTM models were each trained twice for 500 epochs. Finally, the trained NAI models were tested four times on native English accent voices for the accent similarity evaluations. All native accent voices were given to the pre-trained model. The accent similarity percentage is the average of the Bi-LSTM and CNN-LSTM model results, as shown in Fig. 7. Since the NAI performance is improved, the accent similarity evaluation method is also enhanced. According to the proposed model's results, the Mandarin native speakers' accent is the closest to the native English speakers' accent, and the Hindi native speakers' accent is the farthest, compared to the others.
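SciPy does not ship a Quade test, but the closely related Friedman test mentioned above is available and illustrates the comparison procedure. The accuracy values below are invented for demonstration only and are not the paper's results.

```python
from scipy.stats import friedmanchisquare

# Hypothetical accuracy (%) of three configurations on five classification tasks
baseline   = [85.1, 78.4, 90.2, 74.0, 81.3]
proposed_a = [89.0, 83.2, 93.5, 79.8, 86.1]
proposed_b = [88.2, 82.5, 94.0, 78.9, 85.7]

stat, p = friedmanchisquare(baseline, proposed_a, proposed_b)
# p < 0.05 would indicate a statistically significant performance difference
```

Like the Quade test, the Friedman test ranks the methods within each task before aggregating, so it makes no normality assumption about the accuracy scores.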
Generally speaking, we observed that Mandarin, Italian, German, French, Amharic, and Hindi native speakers rank first through sixth, respectively, based on their accent similarity to native English speakers. The lighter accents of native Mandarin speakers in English compared to other non-native speakers can be explained by three observations: (i) Mandarin uses a logographic writing system, while most languages, including English, use phonographic systems. Hence, the linguistic system of Mandarin is more different from English than those of French, Italian, German, Amharic, and Hindi. Consequently, Mandarin native speakers approach English as a completely new language to learn and are less likely to carry their native accent into English. (ii) Moreover, many English pronunciations have similar counterparts in Mandarin, which makes it easier for native Mandarin speakers to imitate native English speakers' accents. (iii) The variety of tones in Mandarin makes it easier for native Mandarin speakers to learn a new language with a less complex tone system.

V. CONCLUSION
In this paper, intra-native accent shared feature extraction is proposed for a deep learning-based NAI framework to enhance accent classification. Furthermore, the framework was used as a pre-trained model for native and non-native English accent proximity evaluation. The proposed model achieved high accuracy compared to individual features of the training data, and the overfitting problem was reduced in the proposed system. Based on the given dataset and the proposed model, the Mandarin native speakers' accent is the closest to the native English speakers' accent, while the Hindi native speakers' accent is the farthest. Extending this work with speakers from many more non-native English accent countries and a much larger dataset, together with some technical improvements to the current methodology, is under consideration for future work.