Introduction
Controlling smart devices using voice commands has now become a reality. In the past, Automatic Speech Recognition (ASR) was not a preferred method for human-to-machine communication. This was partly because the technology was not yet mature enough and partly because other interaction methods, such as keyboard, mouse, and touch, were preferable. The use of ASR-based systems has recently increased considerably as the technology has matured enough to be integrated into intelligent devices. Mobile applications such as Google Assistant, Amazon’s Alexa, Apple’s Siri, etc., [1], [2], [3], [4], [5] are redefining how we interact with our devices. ASR has proved to be a fast and convenient communication mode between humans and digital devices. Despite these advancements, most ASR research is performed on English language datasets due to easy availability and extensive community support. Also, most available ASR systems are trained using native English accents. These systems work best for native speakers, but their performance degrades drastically for non-native English speakers [6]. Urdu is the lingua franca of Pakistan and Northern India. According to recent estimates, Urdu has over 230 million speakers worldwide and is considered the 10th-most widely spoken language [7]. With Pakistan having a literacy rate of around 62%, a considerable portion of the population does not speak or understand English [8]. Many people face a language barrier while accessing information in English. There is a need to develop application-specific speech recognition systems for Urdu. Such systems can improve the way these people interact with their smart devices. Code-mixing is one of the significant challenges in developing machine-learning-based speech recognition systems for real-time applications such as navigation [9]. Code-mixing in speech processing refers to speech that contains words from multiple languages [10]. This is a common phenomenon in multilingual communities, where individuals switch between languages or mix elements of different languages in their speech. As English is commonly used for official communication in the subcontinent region, it is common to have code-mixing of English and Urdu while writing and speaking. Table 1 shows sample addresses in Roman and Unicode Urdu format, where code-mixing can be observed in the Roman Urdu version of addresses. English words mixed with Urdu are underlined in Roman Urdu for better visualization. Code-mixed speech usually has many challenges, including:
Lack of labeled data and resources, as gathering large amounts of code-mixed speech data is difficult.
High variability and complexity, involving multiple languages with different scripts, grammar, and vocabulary, leading to mixing at various levels (lexical, phonological, syntactic).
Context-specific words and phrases in regional accents make it challenging to develop models that perform well universally.
Urdu-English code-mixed speech is particularly difficult due to Urdu’s complex morphology and varying scripts.
Although many multilingual ASR solutions are available, application-specific systems are rare, especially for under-resourced languages such as Urdu. This study aims to develop an ASR system for Urdu-English code-mixed street address recognition by overcoming the associated challenges. The system is developed using Kaldi, an open-source speech recognition framework that supports statistical and deep learning-based ASR system development. This work compares speech modeling between deep learning techniques and classical statistical approaches. The developed system is a speaker-independent large vocabulary continuous speech recognition (LVCSR) system for efficiently recognizing Urdu-English code-mixed addresses. The main contributions of this work are as follows:
A novel approach for real-time Urdu-English code-mixed street address recognition and accent adaptation.
High-performance specialized ASR system developed using limited task-specific data by leveraging the phonetic coverage of colloquial Urdu.
Extensive testing (speaker-independent and accent-based), to highlight the need for customized systems in low-resource languages.
Step-by-step system evolution with analysis of error and real decoded examples to showcase development challenges and final performance.
The proposed system is developed for TPL Maps, Pakistan’s first digital mapping solution provider serving millions of real-time customers. The developed ASR system has enabled Urdu voice-activated services in their Maps application. This work is a first step towards developing Urdu-based speech recognition systems for specialized applications (like navigation). The rest of the paper is organized as follows: section II describes the latest trends and developments in Urdu speech recognition, section III presents the dataset used in the development of the proposed ASR system, section IV briefly describes how a typical speech recognition system works and what are its core components, section V presents the experimental setup describing various acoustic models trained and tested during the study, section VI describes the results of different test scenarios used to test the developed system, and finally, we conclude our findings in section VII.
Related Work
This section reviews related works, examining their strengths, limitations, and significant advancements. It is organized into several subsections, each dedicated to a specific area of focus within the literature.
A. Urdu-Based ASR Systems and Task-Specific Applications
Although plenty of literature is available on the design and development of ASR systems, there is still a massive research gap in under-resourced languages such as Urdu [11], [12]. This section presents the work on Urdu ASR and task-specific ASR applications. Urdu speech recognition comes with different challenges for the research community, ranging from the availability of speech corpora to the development of a phonetic lexicon. According to the literature, Chandio et al. developed a dataset of 25,518 speech samples for spoken Urdu digits ranging from 0 to 9. They also applied various classification approaches, including Support Vector Machine (SVM), Multilayer Perceptron (MLP), EfficientNet, and Convolutional Neural Network (CNN), for audio digit classification [13]. Nadimpalli et al. developed resources and benchmarks for Keyword Search (KWS) targeting six low-resource Indic languages (Gujarati, Hindi, Marathi, Odia, Tamil, and Telugu). The authors created keyword resources by reprocessing existing speech datasets, considering factors such as frequency, length, and potential confusion. The performance of the developed KWS system is analyzed by comparing various models (GMM, DNN, and TDNN) [14]. Adeeba et al. developed an ASR system for Native Language Identification (NLI). The study utilizes spectrogram and cochleagram-based features from short speech utterances (0.8 s on average) to identify the native language of Urdu speakers. Bidirectional Long Short-Term Memory (BLSTM) is employed to classify utterances among the native languages [15]. Wubet et al. developed a CNN-LSTM model for accent classification into native and non-native. The study bridges the research gap by evaluating similarities between non-native and native English accents using the proposed model. The study also ranked non-native accents (Mandarin, Italian, German, French, Amharic, Hindi) based on their similarity to native English accents [16]. In [17], the authors used SVM to develop an ASR system for spoken Urdu character classification. Khan et al. collected a multi-genre Urdu Broadcast (BC) corpus of 98 hours of speech data from 453 speakers. The dataset is then used to develop an Urdu LVCSR system using a TDNN acoustic model [18]. Mehreen et al. developed a large-scale publicly available corpus, “Roman-Urdu-Parl”, consisting of 6.37 million parallel sentence pairs [19].
Many Urdu-based ASR systems are available, but task-specific ASR systems are still rare in the literature. Vekkot et al. developed a dementia speech dataset to address the lack of custom datasets for Indic languages. Due to the unavailability of clinical dementia datasets for Indic languages, the dataset is developed using translated recordings from an existing English dataset (DementiaBank). The study compared LSTM, BLSTM, and GRU (Gated Recurrent Unit) models to detect dementia in Indic populations with an accuracy of up to 78% [20]. Raza et al. developed a single-speaker, medium-vocabulary spontaneous speech recognition system for Urdu using the Sphinx toolkit [21]. The study demonstrated that including read speech in the spontaneous training data could decrease WER. Similarly, Asraf et al. presented a speaker-independent Urdu ASR system using the Sphinx toolkit with a limited vocabulary of 52 isolated words [22]. The work presented in [23] proposed a digit recognizer in an entirely Arabic-script environment using the Sphinx toolkit, i.e., it did not include Romanized scripts. Another isolated digit recognizer is presented in [24], while [25] uses a Multilayer Perceptron (MLP) to develop a similar model that recognizes Urdu digits. The work presented in [26] developed a continuous Urdu speech recognizer based on pattern matching and acoustic-phonetic modeling, achieving 55 to 60% accuracy. Sarfraz et al. discussed approaches for improving recognition rates for Urdu speech and presented acoustic models for robust Urdu speech recognition using CMUSphinx [27]. Another study led to the development of MoH (Map Only Hindi) for detecting hate speech in Hindi-English code-switched language [9].
B. Code-Switching and Multilingual Systems
Code-switching is another challenge in low-resource languages such as Urdu, where speakers combine elements from two or more languages within a single utterance or conversation. Sreeram et al. addressed the challenge of code-switching by exploiting acoustic similarity to reduce the target set and proposing a novel context-dependent transduction scheme. The proposed approach is tested on a Hindi-English code-switching corpus, with considerable improvement in Target Error Rate (TER) and WER [28]. Farooq et al. developed a DNN-HMM-based Large Vocabulary Continuous Speech Recognition (LVCSR) system for Urdu-English code-switched conversations [29]. They used a spontaneous Urdu speech corpus (25 hours) for system development, compensated with 10 hours of Urdu BC data. Manjunath et al. developed a Multilingual Phone Recognition System (Multi-PRS). The study compared and evaluated strategies for multilingual phone recognition in code-switched and non-code-switched scenarios using the Kannada and Urdu languages [30].
Ashraf et al. developed a transfer learning-based approach, Tran-Switch, for author profiling in code-switched English and Roman Urdu text [10]. The proposed method trains on a specialized mixed language model to provide code-switching coverage. In [31], the authors proposed a speech recognition system for the Sichuan dialect using a combination of a Hidden Markov Model (HMM) and a deep LSTM network. Dutta et al. improved the performance of the ASR system using a hybrid architecture combining a Deep Neural Network (DNN) and HMM [32]. In another study, Qasim et al. proposed an Urdu speech recognition system designed explicitly for the district names of Pakistan [33]. The authors discussed development challenges and solutions and concluded that an accent-independent system performs better for isolated words. Naeem et al. developed an Urdu ASR system using a subspace Gaussian mixture in which all HMM states share the same Gaussian Mixture Model (GMM) structure with the same number of Gaussians in each state [34]. The developed system shows promising results compared to similar statistical approaches. Emond et al. proposed a new metric for performance assessment of code-switched mixed speech called transliteration-optimized WER. They also proposed a connectionist temporal classification (CTC) acoustic model, along with Maximum Entropy (MaxEnt) and LSTM language models, for bilingual code-switched Indic languages, including Urdu [35]. Similarly, in [36], the authors investigated the effect of data augmentation on code-mixed Bengali-English speech. After data augmentation, they developed and tested MaxEnt and LSTM language models using transliterated Bengali and English corpora.
Ambili et al. proposed a methodology for spoken language identification for Indic languages, which is challenging due to the similarities between them. The study compared the performance of feature extraction using three different pre-trained vision models: VGG16, RESNET50, and Inception-v3. These features are then used for spoken language identification using various classification algorithms. The analysis showed that features generated through VGG16 and Inception-v3 resulted in the best accuracy when classified using an Artificial Neural Network (ANN) [37]. Rangan et al. proposed a methodology for developing a spoken Language Identification (LID) system using code-mixed languages (Gujarati, Telugu, and Tamil) with English [38]. The proposed method improved LID accuracy by 3-5%. Similarly, in [39], Jain et al. investigated the effect of graphemic features on code-mixed Indic-English languages for spoken language identification.
C. Hybrid vs. End-to-End ASR Systems
In speech recognition systems, two primary approaches are commonly considered: hybrid systems and end-to-end (E2E) systems. Each approach employs distinct methodologies for recognizing spoken language, and their effectiveness can vary based on factors such as available training data, computational resources, and language characteristics. Hybrid systems consist of multiple components, allowing for the integration of external language models and phonetic resources. E2E systems, on the other hand, utilize a single neural network to directly map speech input to text output, offering a streamlined architecture that simplifies the training process. E2E systems have achieved state-of-the-art performance on numerous benchmarks, particularly when trained on large datasets [40]. Their ability to leverage deep learning techniques can result in impressive accuracy in well-resourced languages and applications. However, their performance can diminish in scenarios with limited annotated data or complex linguistic environments, such as code-switching. In such cases, hybrid systems can excel, particularly by leveraging robust external language models and adapting to low-resource languages [41].
Arif et al. presented a benchmark of End-to-End (E2E) Urdu ASR models through a comprehensive evaluation of various models. The study uses WER as the primary metric for evaluating different models [42]. Mohiuddin et al. developed an end-to-end ASR model for Urdu, leveraging the XLS-R model based on the Wav2Vec2.0 architecture. The study shows that the fine-tuned XLS-R-300M model outperforms other E2E models, achieving a WER of 49%, compared to 57% for wav2vec2-large-xlsr-53 and even higher WERs for Whisper models [43]. However, the reported WER is still high compared to the WER achieved by hybrid systems on even smaller resources. These recent studies suggest that while E2E systems are rapidly improving and becoming more common in the realm of low-resource languages, hybrid models still hold an edge in WER performance, specifically in task-specific applications. Hybrid speech recognition systems have distinct advantages in environments where data is scarce or unevenly distributed across languages. Their modular design separates acoustic, language, and pronunciation modeling, which facilitates the integration of domain-specific resources [44]. This is particularly beneficial in code-switching scenarios, where speakers frequently alternate between languages. Studies have demonstrated that hybrid systems can outperform E2E models in these contexts, as E2E systems often struggle to capture the intricate linguistic patterns involved without extensive annotated data [45]. However, it is essential to acknowledge that E2E models are continuously improving. Advances in self-supervised learning and data augmentation techniques are helping E2E systems become more effective in low-resource scenarios. Moreover, E2E systems can simplify deployment and reduce latency in certain applications due to their unified architecture, which eliminates the need for multiple independent components.
In industrial applications, both hybrid and E2E systems have found success in various sectors such as customer service, healthcare, and legal documentation. For instance, in automated customer support, callers may switch between languages, challenging single E2E models to maintain accuracy across multiple language domains, especially in the case of low-resource languages. Khan et al. developed an ASR system for code-switched Urdu in noisy telephonic environments using a hybrid ASR system combining HMM with a CNN-TDNN model [44]. The study demonstrated that for custom use cases, especially in a low-resource language with code-switching, hybrid systems achieve better performance. Hybrid models can utilize language-specific language models to ensure accurate recognition, while E2E models excel in environments where extensive training data is available to achieve high accuracy [46].
While E2E models may achieve higher accuracy in certain ASR benchmarks, hybrid models are still widely employed commercially due to their adaptability and efficiency. Factors such as streaming capability, latency, and the ability to integrate external resources play a significant role in the decision-making process for deploying ASR solutions [47]. In conclusion, while E2E systems continue to make significant strides in speech recognition, hybrid approaches remain valuable for tackling challenges associated with low-resource settings, industrial applications, and code-switching. To this end, since this paper focuses on an industrial use case with limited task-specific data and the complexity of code-switching, a hybrid approach is selected for the development of the ASR system. This choice is further justified in the next section, which describes the dataset used to develop the ASR system.
Dataset
The proposed ASR system is developed using two distinct datasets: a large-scale Unicode Urdu dataset and a specialized Roman code-mixed address dataset. The first dataset consists of 17,855 recordings from 144 speakers, containing Unicode Urdu transcriptions with a vocabulary size of 28,391 words. This dataset encompasses approximately 61.82 hours of speech data and occupies 7.1 GB of storage space. It is prepared using recordings from various sources, including Urdu news bulletins, talk shows, radio programs, and recordings of Urdu literature. Almost half of this dataset is prepared by transcribing recordings from online sources. The other half is developed by recording various Urdu manuscripts from different sources by speech and language technology research group members, students, and volunteers using the Urdu ASR recording portal developed by “Speech and Language Technology Group”. The extensive nature of this corpus provides a robust foundation for learning the fundamental phonetic patterns and acoustic characteristics of Urdu speech.
The second dataset is more specialized, focusing specifically on code-mixed Urdu-English addresses. It comprises 12,918 recordings from 20 speakers, with transcriptions in Roman Urdu script. This task-specific corpus has a more concentrated vocabulary of 6,194 words, spanning 16.89 hours of audio data and requiring 2.1 GB of storage. The smaller vocabulary size reflects its focused domain of address-related terminology. Both datasets are developed by the “Speech and Language Technology Research Group” at NUST. Recordings in both datasets are sampled at 16,000 Hz. The regional diversity of the speech dataset is justified by the university’s diverse student body, representing all major regions of Pakistan. This ensures the dataset captures a wide range of accents, dialects, and linguistic nuances. The involvement of students and volunteers allows for continuous updates with new recordings, making the dataset a comprehensive resource for diverse speech patterns across the country. Statistical details of both datasets are described in Table 2.
The unique challenge in this work stems from the different transcription schemes used in the two datasets - Unicode Urdu in the first corpus and Roman Urdu in the address-specific dataset. This diversity in transcription formats necessitated careful consideration of the ASR architecture due to the limited task-specific data. We opted for a hybrid ASR approach over an E2E architecture specifically because of its inherent flexibility in handling multiple language modeling paradigms. The hybrid architecture’s modular nature allows independent optimization of acoustic and language models, enabling us to effectively utilize training data transcribed in different writing systems for the same language. This architectural choice was crucial, as E2E ASR systems typically maintain a rigid relationship between input acoustics and output transcription format, making it challenging to incorporate training data with varying transcription schemes. Also, the E2E approach normally requires a large amount of task-specific data (Roman Urdu transcribed addresses in this case), which is rare for a low-resource language such as Urdu, where even general Urdu datasets are scarce. To this end, the available task-specific data, consisting of only 16.89 hours of recordings, was not sufficient to develop an efficient E2E model. The hybrid approach’s flexibility enabled us to leverage the phonetic richness of the larger Unicode Urdu dataset for acoustic model training while adapting the language model to handle Roman code-mixed transcriptions for the target address recognition task.
The dual-dataset approach, facilitated by the hybrid architecture, strategically addresses the challenge of limited task-specific data in code-mixed address recognition. This training strategy, made possible by the hybrid ASR architecture’s flexibility, effectively leverages the complementary strengths of both datasets: the broad phonetic coverage from the larger corpus and the domain-specific features from the address dataset. The approach is particularly valuable in scenarios where collecting large amounts of task-specific data is challenging or resource-intensive, and where output format requirements differ from available training data transcription schemes. Using the above-described datasets, different versions of ASR systems are developed in this study (system SU and system SM). SU is developed using only general Unicode Urdu data, while the updated version SM is developed using mixed data (general Unicode Urdu + English-Urdu code-mixed addresses). Table 3 shows the different versions of the developed ASR systems and the dataset/s used for their development. The results of SU inspired the development of system SM, an improvement over the initial system SU. After developing and testing SU, we found some issues, and the system SM and the English-Urdu code-mixed address dataset were developed to address these issues. We further explain the reason behind the update of SU to SM in section VI (results and discussion). For the rest of the paper, models trained for a specific system in Table 3 will be referenced with their respective system names (SU or SM). The next section describes the various components of a hybrid ASR system, before a deep dive into the development of the ASR systems in this study.
ASR System Components
This section briefly explains a typical ASR system and its various components. An ASR system estimates the most likely sequence of words for a given speech input. The automatic speech recognition process starts with feature extraction from the input speech. Feature extraction involves applying signal processing to enhance the quality of the input signal and transform the input audio from the time domain to the frequency domain. Based on the extracted features, a set of acoustic observations O is obtained, and the recognizer selects the word sequence that maximizes the product of the acoustic likelihood P(O|w_i) and the language model probability P(w_i):\begin{equation*} \boldsymbol {w^{*}} = \underset {i}{argmax} \left \{{{P(\boldsymbol {O}|w_{i})\:P(w_{i}) }}\right \} \tag {1}\end{equation*}
A. Feature Extraction
Feature extraction, also known as speech parameterization, is used to characterize the spectral features of an input audio signal to facilitate speech decoding. Mel-Frequency Cepstral Coefficients (MFCC), introduced by [48], are one of the most popular feature extraction techniques in speech recognition systems. The reason behind the popularity of MFCC is its ability to mimic the behavior of the human ear. MFCC features can be used directly for speech recognition, but to get better performance, various transforms are applied to the MFCC output. One of these transformations is Cepstral Mean and Variance Normalization (CMVN) [49]. CMVN is a computationally efficient normalization technique that reduces the effects of noise. Similarly, to add dynamic information to MFCC features, first and second-order deltas can be calculated. Given a feature vector of input observations O, first-order deltas are calculated as\begin{equation*} \Delta O_{t} = \frac {\sum _{i=1}^{n} w_{i}(O_{t+i} - O_{t-i})}{2 \,\sum _{i=1}^{n} w_{i}^{2}} \tag {2}\end{equation*}
Second-order deltas are calculated from the first-order deltas in the same way:\begin{equation*} \Delta ^{2} O_{t} = \frac {\sum _{i=1}^{n} w_{i}(\Delta O_{t+i} - \Delta O_{t-i})}{2 \,\sum _{i=1}^{n} w_{i}^{2}} \tag {3}\end{equation*}
After the first and second-order delta calculation, the combined feature vector becomes\begin{equation*} \hat {O}_{t} = [O_{t} \quad \Delta O_{t} \quad \Delta ^{2} O_{t}] \tag {4}\end{equation*}
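To make the pipeline of equations (2)-(4) concrete, the following is a minimal sketch in Python using librosa (an assumption for illustration; the actual systems in this work use Kaldi's feature extraction). It computes 13 MFCCs per 25 ms frame, applies per-utterance CMVN, and appends first- and second-order deltas; the file name is hypothetical.

```python
# Sketch only: librosa-based MFCC + CMVN + delta features (not the Kaldi recipe used in the paper).
import librosa
import numpy as np

audio, sr = librosa.load("sample.wav", sr=16000)         # hypothetical 16 kHz recording
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr),        # 25 ms analysis window
                            hop_length=int(0.010 * sr))   # 10 ms frame shift

# Cepstral mean and variance normalization, per utterance and per coefficient
cmvn = (mfcc - mfcc.mean(axis=1, keepdims=True)) / (mfcc.std(axis=1, keepdims=True) + 1e-8)

delta1 = librosa.feature.delta(cmvn, order=1)             # first-order deltas, cf. equation (2)
delta2 = librosa.feature.delta(cmvn, order=2)             # second-order deltas, cf. equation (3)
features = np.vstack([cmvn, delta1, delta2])              # stacked static + dynamic features, cf. equation (4)
print(features.shape)                                     # (39, num_frames)
```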
Other feature transformation techniques used in speech recognition are Linear Discriminant Analysis (LDA) [50], Heteroscedastic Linear Discriminant Analysis (HLDA) [51], Maximum Likelihood Linear Transform (MLLT) [52], and Feature-space Maximum Likelihood Linear Regression (fMLLR) [53]. fMLLR can also be used as a feature transform for Speaker Adaptive Training (SAT) [54]. These transforms can be applied individually as well as in combination and can significantly enhance the performance of a speech recognition system. The implementation details of these techniques can be found in the respective papers.
B. Language Model
A language model used in ASR systems helps capture the structure of the language. A corpus is prepared to generate a language model based on the desired transcripts. The language model contains the likelihood of the co-occurrence of words in the vocabulary. It determines the prior probability P(w*) of a word sequence by factoring it into conditional word probabilities, which in practice are approximated with an n-gram model:\begin{equation*} P(\boldsymbol {w^{*}}) = \prod _{i=1}^{n} P(w_{i}|w_{1}, \ldots, w_{i-1}) \tag {5}\end{equation*}
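As a toy illustration of equation (5) with an n-gram (n = 2) approximation, the sketch below estimates bigram probabilities from a tiny corpus with add-one smoothing. The corpus and vocabulary are illustrative only, not the address corpus used in this work, which was built with standard n-gram LM tooling.

```python
# Toy bigram language model: P(w_i | w_{i-1}) with add-one smoothing.
from collections import Counter

corpus = [["house", "number", "five", "street", "nine"],
          ["street", "nine", "house", "number", "two"]]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = ["<s>"] + sent + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

vocab_size = len(unigrams)

def p_bigram(w_prev, w):
    # conditional probability with add-one smoothing
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

def sentence_prob(sent):
    # product of conditional probabilities, cf. equation (5)
    tokens = ["<s>"] + sent + ["</s>"]
    prob = 1.0
    for w_prev, w in zip(tokens, tokens[1:]):
        prob *= p_bigram(w_prev, w)
    return prob

print(sentence_prob(["house", "number", "nine"]))
```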
C. Acoustic Model
Acoustic modeling in the ASR system estimates the likelihood term P(O|w_i) in equation (1), i.e., the probability of the observed acoustic features given a hypothesized word (or phone) sequence.
GMM is a statistical generative model that is a common choice to estimate the distribution of output observations, while HMM is used to model the temporal variability of speech. A combination of both models creates a joint acoustic model capable of describing the temporal and spectral dynamics of the speech. Figure 2 shows the architecture of an arbitrary GMM-HMM model for speech recognition, where k denotes the number of HMM states.
Architecture of left to right GMM-HMM acoustic model with k states for speech recognition.
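The sketch below shows a single left-to-right GMM-HMM fitted on dummy MFCC-like features using the hmmlearn library; this is an assumption for illustration only (the actual acoustic models in this work are trained with Kaldi), and the feature dimensions, state count, and mixture count are arbitrary.

```python
# Illustrative left-to-right GMM-HMM (hmmlearn), not the Kaldi training used in the paper.
import numpy as np
from hmmlearn.hmm import GMMHMM

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 39))           # 200 frames of 39-dim dummy features
lengths = [120, 80]                          # two training "utterances"

model = GMMHMM(n_components=3, n_mix=4,      # 3 HMM states, 4 Gaussians per state
               covariance_type="diag", n_iter=20,
               init_params="mcw")            # keep our start/transition matrices below
# Left-to-right topology: start in state 0, allow only self-loops and forward moves.
model.startprob_ = np.array([1.0, 0.0, 0.0])
model.transmat_ = np.array([[0.5, 0.5, 0.0],
                            [0.0, 0.5, 0.5],
                            [0.0, 0.0, 1.0]])
model.fit(feats, lengths)

test = rng.normal(size=(60, 39))
print("log-likelihood of test utterance:", model.score(test))
```

Because zero transition probabilities remain zero under Baum-Welch re-estimation, the left-to-right structure of Figure 2 is preserved during training.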
Deep Neural Networks (DNNs) are an alternative to Gaussian mixture models [55]. A DNN is a neural network with more than one hidden layer between the input and output layers. In a DNN, each hidden unit or neuron typically uses a logistic function, a closely related hyperbolic tangent, or any other activation function with a well-behaved derivative. In the case of multi-class classification problems like speech recognition, the input observation is converted into class probabilities using the softmax activation function. An essential feature behind the success of DNNs in speech recognition is their ability to be trained discriminatively using back-propagation of the derivatives of a cost function measuring the difference between the actual and predicted output for each training case. The natural cost function for a softmax output is the cross-entropy between the target probabilities and the softmax outputs. Figure 3 shows an arbitrary DNN-HMM architecture used for speech recognition tasks. The figure presents the model’s components and data flow, from input features to output probabilities. At the bottom of the figure, observations are presented as a series of feature frames that are propagated through the hidden layers to produce posterior probabilities over the acoustic states at the output.
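The following minimal PyTorch sketch illustrates the DNN-HMM idea described above: a stack of hidden layers with a softmax output over tied HMM states, trained with cross-entropy via back-propagation. PyTorch, the layer sizes, and the state count are assumptions for illustration, not the Kaldi nnet configuration used in this work.

```python
# Sketch of a frame-level DNN acoustic model with softmax outputs over HMM states.
import torch
import torch.nn as nn

num_states = 2000                     # hypothetical number of tied triphone states
dnn = nn.Sequential(
    nn.Linear(140, 1024), nn.ReLU(),  # input: one 140-dim spliced feature frame
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, num_states),      # logits; softmax is applied inside the loss
)
criterion = nn.CrossEntropyLoss()     # cross-entropy against frame-level state targets
optimizer = torch.optim.SGD(dnn.parameters(), lr=0.01)

feats = torch.randn(32, 140)                      # a mini-batch of feature frames
targets = torch.randint(0, num_states, (32,))     # state labels from forced alignment
optimizer.zero_grad()
loss = criterion(dnn(feats), targets)
loss.backward()                                   # back-propagation of the cost derivatives
optimizer.step()
# At decode time, softmax(dnn(feats)) gives state posteriors, which are converted
# to scaled likelihoods for the HMM search.
```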
D. Phonetic Dictionary
The phonetic dictionary contains the mapping of words to their respective phones. Phonetic dictionaries for both datasets are prepared separately. Around 20% of the vocabulary for both datasets is manually converted to phones, and the rest is converted using the Sequitur Grapheme-to-Phoneme (G2P) model [56]. Sequitur G2P is a data-driven technique for solving monotonous sequence translation problems (like word-to-phone conversion). Sequitur G2P has no built-in language specification and can be used for any language, provided example pronunciations are available for training the G2P model. Training the grapheme-to-phoneme model makes the conversion process very fast. To prepare a G2P model, manually converted examples (a pronunciation dictionary) of words to phones are used. Each line in the training dictionary has one word followed by its pronunciation. After training, the G2P model can generate pronunciations for the remaining words in the vocabulary. For this study, we developed two G2P models, one for the general Unicode Urdu data and the other for the code-mixed Urdu-English address data. The conversion accuracy was manually inspected, and minor corrections were performed where required.
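The sketch below illustrates the lexicon format described above (one word per line followed by its phone sequence) and the separation of in-vocabulary words from OOV words that would be passed to the trained Sequitur G2P model. The words, phone symbols, and file name are hypothetical examples, not entries from the actual dictionaries.

```python
# Sketch: writing a G2P training lexicon and listing OOV words that need generated pronunciations.
seed_lexicon = {
    "masjid": "m a s j i d".split(),     # hypothetical phone set and pronunciations
    "street": "s t r ii t".split(),
}

with open("train.lexicon", "w", encoding="utf-8") as f:   # one word per line, then its phones
    for word, phones in seed_lexicon.items():
        f.write(f"{word}\t{' '.join(phones)}\n")

vocabulary = ["masjid", "street", "chowk", "plaza"]
oov_words = [w for w in vocabulary if w not in seed_lexicon]
print("words needing G2P pronunciations:", oov_words)      # -> ['chowk', 'plaza']
```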
Experimental Setup
Various acoustic models are developed and tested during the proposed ASR system development. All the models described in the following sections are trained for both systems (SU and SM). We trained seven (M1 to M7) different types of models to identify the best model for Urdu-English code-mixed address recognition. After completing the training, both systems (SU and SM) are extensively tested using different test sets and language models for performance assessment.
A. GMM-HMM
A phone is the smallest unit of speech. Each word consists of a sequence of phones. The number of phones in a language is far less than the number of unique words. If we use words as the training unit, the model must know each word in a language, making the problem’s dimensionality too high to handle. Words in the speech transcripts are converted to phones using a phonetic dictionary, which contains phones against the words from the vocabulary. In addition to the dictionary, Out-of-Vocabulary (OOV) words are converted to phones using the G2P model, which is trained using manually converted words. A monophone acoustic model is a model trained on individual phones. A better approach compared to monophone modeling is triphone modeling. A triphone is a sequence of three phones, and it captures the context of the phone in the middle very efficiently. If there are N base phones, there are N^3 possible triphones; in practice, acoustically similar triphone states are tied (clustered into a smaller number of leaves) to keep the model tractable.
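For illustration, the sketch below expands a phone sequence into context-dependent triphones of the form left-center+right, with word boundaries marked by a silence symbol. The word and phone symbols are assumptions for illustration only.

```python
# Sketch: expanding a monophone sequence into triphones (left-center+right).
def to_triphones(phones):
    padded = ["sil"] + phones + ["sil"]          # pad with silence at word boundaries
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

print(to_triphones(["m", "a", "s", "j", "i", "d"]))
# ['sil-m+a', 'm-a+s', 'a-s+j', 's-j+i', 'j-i+d', 'i-d+sil']
```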
Table 5 shows the various GMM-HMM models trained for this study with their respective feature transforms. Further details on these techniques and transformations can be found in the respective papers referenced in section IV-A. For all triphone models, the total number of leaves is set to 2000 and the total number of Gaussians to 11000 while training the model. The main drawback of using GMM-HMM is the assumptions we make when modeling speech. Discriminative training methods, on the other hand, do not make any assumptions about the distribution of the training data. This is one of the primary reasons behind the success of discriminative training algorithms, making them the principal training algorithms in speech modeling.
B. DNN-HMM
Different deep neural network (DNN) architectures are designed and tested in the literature on speech recognition tasks. In this paper, the developments focus on M6 (FNN) and M7 (TDNN-LSTM).
1) Feedforward Neural Network (FNN)
The first DNN trained for this study is an FNN model consisting of an input layer, six hidden layers, and an output layer. Figures 4 and 5 show the architecture of the trained FNN. To train an FNN model, we first need to transform our speech data into features. The length of the feature vector is 140, which is achieved after applying various transforms on the raw speech signal. Figure 5 shows the step-by-step process to transform raw speech into a compressed representation of size 140. This transformation requires various signal-processing steps. The feature conversion pipeline processes the input signal into frames of 25 milliseconds. Each frame is converted to an MFCC feature vector of size 13 and spliced with a context of ±4 frames to generate a 117-dimensional (13 × 9) supervector; the subsequent transforms shown in Figure 5 compress and augment this representation to produce the final 140-dimensional feature vector.
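The splicing step is illustrated below as a minimal NumPy sketch: each 13-dimensional MFCC frame is concatenated with its four left and four right neighbours, giving 13 × 9 = 117 dimensions per frame (edge frames are handled by repeating the boundary frames, an assumption made here for simplicity).

```python
# Sketch: frame splicing with a +/-4 frame context, as in the FNN feature pipeline.
import numpy as np

def splice(frames, context=4):
    """frames: (num_frames, feat_dim) -> (num_frames, feat_dim * (2*context + 1))."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(frames)] for i in range(2 * context + 1)])

mfcc = np.random.randn(300, 13)      # 300 frames of 13-dim MFCCs (dummy data)
spliced = splice(mfcc)
print(spliced.shape)                 # (300, 117)
```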
2) Time Delay Neural Network - Long Short-Term Memory (TDNN-LSTM)
The second DNN trained for ASR development is TDNN-LSTM, which combines TDNN and LSTM layers to form a hybrid neural network architecture. Figures 7 and 6 illustrate the architecture and feature pipeline, respectively, of the TDNN-LSTM developed in this study. The input features are processed through a sophisticated pipeline before being fed into the network. Speech signals are processed in overlapping windows of 25 ms with a 10 ms shift to extract 40-dimensional MFCC features. These MFCC frames are concatenated in groups of N=5 frames. Additionally, a 100-dimensional i-vector is computed over the N concatenated MFCC frames and transformed using Linear Discriminant Analysis (LDA) to capture speaker and channel characteristics. Combining these features results in a 300-dimensional final feature vector that serves as input to the network. The architecture of TDNN-LSTM consists of 6 TDNN blocks interleaved with 3 LSTM blocks, arranged in a hierarchical structure to capture both temporal and sequential patterns. Each TDNN block follows a consistent structure comprising three layers: an affine transformation layer for linear projection, followed by a Rectified Linear Unit (ReLU) activation function for non-linearity, and a batch normalization layer to stabilize training. The network processes the input features as follows:
The first TDNN block linearly transforms the 300-dimensional input through its standard three-layer structure.
The second TDNN block operates with a context length of ±1, effectively processing information from adjacent frames (t-1, t, t+1), and is followed by a unidirectional recurrent LSTM block that maintains hidden states between time steps.
The third and fourth consecutive TDNN blocks follow the first LSTM layer. While structurally identical to the second TDNN block, these blocks expand their temporal context to ±3 frames, allowing them to model longer-range temporal dependencies.
A second LSTM block, identical in structure to the first LSTM block, processes the output of the fourth TDNN block, further enhancing the network’s ability to capture sequential patterns.
The fifth and sixth TDNN blocks maintain the expanded context length of ±3 and process the output from the second LSTM block.
The final (third) LSTM block concludes the main processing pipeline.
The output layer of the network operates on 256-dimensional input provided by the third LSTM block. It consists of an affine transformation followed by logarithmic scaling and a softmax operation to generate posterior probabilities for the acoustic states. This final stage maps the network’s internal representations to phonetic state probabilities used in the hybrid ASR system. TDNN-LSTM is trained for six epochs [58], allowing sufficient time for the network to learn both short-term acoustic patterns through its TDNN components and long-term dependencies through its LSTM layers.
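A compact PyTorch sketch of this layout is given below, with TDNN blocks realized as dilated 1-D convolutions over time (affine + ReLU + batch normalization) interleaved with unidirectional LSTM layers and ending in a log-softmax over acoustic states. PyTorch, the hidden sizes, padding choices, and state count are assumptions for illustration; the deployed model was trained with Kaldi's chain recipes.

```python
# Illustrative TDNN-LSTM layout (6 TDNN blocks interleaved with 3 LSTM blocks).
import torch
import torch.nn as nn

class TDNNBlock(nn.Module):
    """Affine-over-context (dilated 1-D conv) + ReLU + batch norm."""
    def __init__(self, in_dim, out_dim, context):          # context = 0, 1, or 3
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim,
                              kernel_size=3 if context else 1,
                              dilation=max(context, 1),
                              padding=context)              # keeps the time length unchanged
        self.act = nn.ReLU()
        self.norm = nn.BatchNorm1d(out_dim)

    def forward(self, x):                                   # x: (batch, dim, time)
        return self.norm(self.act(self.conv(x)))

class TDNNLSTM(nn.Module):
    def __init__(self, in_dim=300, hidden=512, lstm_dim=256, num_states=2000):
        super().__init__()
        self.t1 = TDNNBlock(in_dim, hidden, context=0)      # block 1: no context
        self.t2 = TDNNBlock(hidden, hidden, context=1)      # block 2: +/-1 frames
        self.l1 = nn.LSTM(hidden, lstm_dim, batch_first=True)
        self.t3 = TDNNBlock(lstm_dim, hidden, context=3)    # blocks 3-4: +/-3 frames
        self.t4 = TDNNBlock(hidden, hidden, context=3)
        self.l2 = nn.LSTM(hidden, lstm_dim, batch_first=True)
        self.t5 = TDNNBlock(lstm_dim, hidden, context=3)    # blocks 5-6: +/-3 frames
        self.t6 = TDNNBlock(hidden, hidden, context=3)
        self.l3 = nn.LSTM(hidden, lstm_dim, batch_first=True)
        self.out = nn.Linear(lstm_dim, num_states)          # affine, then log-softmax

    def forward(self, feats):                               # feats: (batch, time, 300)
        x = feats.transpose(1, 2)                           # -> (batch, dim, time) for conv
        x = self.t2(self.t1(x)).transpose(1, 2)             # -> (batch, time, dim) for LSTM
        x, _ = self.l1(x)
        x = self.t4(self.t3(x.transpose(1, 2))).transpose(1, 2)
        x, _ = self.l2(x)
        x = self.t6(self.t5(x.transpose(1, 2))).transpose(1, 2)
        x, _ = self.l3(x)
        return torch.log_softmax(self.out(x), dim=-1)       # log-posteriors over states

posteriors = TDNNLSTM()(torch.randn(4, 150, 300))           # 4 utterances, 150 frames each
print(posteriors.shape)                                     # torch.Size([4, 150, 2000])
```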
C. Computation Time
The evaluation of various acoustic models in Table 6 highlights key trends in computational requirements across traditional GMM-based systems (M1 to M5) and neural networks (FNN and TDNN-LSTM) for system SM, trained on 78.7 hours of speech data. As models progress from monophone (M1, M2) to more complex triphone configurations (M3 to M5), training times increase significantly, from 1.7-1.8 hours for monophone models to 5.8-9.3 hours for basic triphones, and up to 24.6 hours when using SAT. DNNs, despite GPU acceleration (Titan X with 12 GB of 10 Gbps GDDR5X VRAM), require much longer training times (30.2 hours for FNN and 68.5 hours for TDNN-LSTM) due to their complexity and the amount of training data.
The Real-Time Factor (RTF) measures the ratio of processing time to actual audio duration. It is defined in equation 6 below.\begin{equation*} \text { RTF} = \frac {\text {Time taken for processing}}{\text {Duration of the audio}} \tag {6}\end{equation*}
An RTF of less than 1 indicates that the audio is processed faster than real time.
RTF is an important factor to consider during the real-time deployment phase. In this study, RTF is only measured for the final deployed model (TDNN-LSTM), which shows promising inference efficiency. The TDNN-LSTM achieves an RTF of 0.3-0.5 (2-3.3 times faster than real time), indicating that despite the longer training times, neural models are practical for real-time applications.
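Equation (6) can be measured with a simple timing wrapper around the decoder call, as sketched below; the decoder function here is a placeholder, not the deployed system's API.

```python
# Sketch: measuring the real-time factor of equation (6) for a decoder call.
import time

def real_time_factor(decode_fn, audio, audio_seconds):
    """Processing time divided by audio duration; RTF < 1 means faster than real time."""
    start = time.perf_counter()
    decode_fn(audio)                              # placeholder for the actual decoding call
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds

# Dummy example: a "decoder" taking 1.5 s for a 5 s utterance gives RTF = 0.3.
print(real_time_factor(lambda a: time.sleep(1.5), audio=None, audio_seconds=5.0))
```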
D. Performance Assessment
A typical performance measure used to compare different ASR models is the percentage Word Error Rate (WER). WER is calculated using the Levenshtein distance between words [59]. In the Levenshtein distance, we count the number of insertions, substitutions, and deletions required to make two word sequences equal. Depending upon the problem, the costs of insertions, substitutions, and deletions can be set individually. By default, this cost is the same for all operations and is set to 1. When the reference (ref) transcript is matched with the hypothesis (hyp), each word in the hypothesis is assigned a respective label based on whether it is an insertion (I), substitution (S), deletion (D), or correct (C). Equation 7 presents the formula for calculating WER. WER is calculated after each alignment during the decoding process.\begin{equation*} \text { WER }(\%) = \frac {S_{t} + D_{t} + I_{t}}{N}\times 100 \tag {7}\end{equation*}
In equation 7, S_t, D_t, and I_t denote the total numbers of substitutions, deletions, and insertions, respectively, and N is the number of words in the reference transcript. For the sample in Table 7, with one substitution, no deletions or insertions, and four correct words, the WER is\begin{equation*} \text { WER }(\%) = \frac {1 + 0 + 0}{1 + 0 + 4}\times 100 = 20\,\%\end{equation*}
Other metrics that complement the widely used WER to evaluate the performance of ASR systems are the Character Error Rate (CER) and the Sentence Error Rate (SER). CER measures the edit distance between the recognized text and the reference text at the character level. It is particularly useful for languages with logographic writing systems or when dealing with continuous speech without clear word boundaries. CER is expressed as a percentage, with lower values indicating better performance. Equation 8 shows the formula to calculate CER, where S_t, D_t, and I_t now count character-level substitutions, deletions, and insertions, and N is the number of characters in the reference:\begin{equation*} \text { CER }(\%) = \frac {S_{t} + D_{t} + I_{t}}{N}\times 100 \tag {8}\end{equation*}
Using equation 8, the CER for the sample in Table 7 can be calculated as\begin{equation*} \text { CER }(\%) = \frac {1 + 0 + 0}{1 + 0 + 34}\times 100 = 2.86\,\%\end{equation*}
SER, on the other hand, quantifies the proportion of sentences that contain at least one error in the ASR output compared to the reference transcription. This metric is valuable for assessing the overall accuracy of the system at the sentence level, which is crucial for many applications requiring precise sentence-level understanding.\begin{equation*} \text { SER} = \frac {\text {Number of sentences with at least one error}}{\text {Total number of sentences}} \tag {9}\end{equation*}
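The sketch below computes these three metrics with a plain dynamic-programming Levenshtein alignment using unit costs; the reference/hypothesis pair reuses the "madina masjid" example discussed later in the error analysis, and the word segmentation shown is illustrative.

```python
# Sketch: WER (eq. 7), CER (eq. 8), and SER (eq. 9) from Levenshtein edit operations.
def edit_ops(ref, hyp):
    """Return (substitutions, deletions, insertions) aligning hyp to ref with unit costs."""
    d = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    d[0][0] = (0, 0, 0)
    for i in range(1, len(ref) + 1):
        d[i][0] = (0, i, 0)                      # only deletions
    for j in range(1, len(hyp) + 1):
        d[0][j] = (0, 0, j)                      # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                d[i][j] = d[i - 1][j - 1]        # correct token, no edit
            else:
                sub = (d[i - 1][j - 1][0] + 1, d[i - 1][j - 1][1], d[i - 1][j - 1][2])
                dele = (d[i - 1][j][0], d[i - 1][j][1] + 1, d[i - 1][j][2])
                ins = (d[i][j - 1][0], d[i][j - 1][1], d[i][j - 1][2] + 1)
                d[i][j] = min(sub, dele, ins, key=sum)
    return d[len(ref)][len(hyp)]

def error_rate(ref_tokens, hyp_tokens):
    s, dl, i = edit_ops(ref_tokens, hyp_tokens)
    return 100.0 * (s + dl + i) / len(ref_tokens)

ref = "madina masjid gulberg lahore".split()
hyp = "dina masjid gulberg lahore".split()
print(f"WER {error_rate(ref, hyp):.1f}%")                     # one substitution in four words -> 25.0%
print(f"CER {error_rate(list(' '.join(ref)), list(' '.join(hyp))):.1f}%")

sentences = [(ref, hyp)]                                      # one test sentence in this toy example
ser = 100.0 * sum(1 for r, h in sentences if r != h) / len(sentences)
print(f"SER {ser:.1f}%")
```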
The metrics explained above are intuitive and straightforward measures for comparing different ASR models. In this paper, they are used as performance measures to analyze and compare the efficiency of the various ASR models trained and tested during the development of the ASR system. In addition to these performance metrics, a manual analysis of the decoded transcripts is performed to analyze the model’s capabilities and limitations. This analysis also helped identify some areas of improvement as future work of the study.
Results and Discussion
During the development phase, various acoustic models for ASR systems (SU and SM) are trained. After training, extensive testing is performed to investigate the efficiency of these systems. The following sections explain the testing results and the reason behind the development of different systems.
A. System SU Testing
Initially, system SU is developed using Unicode Urdu data. After the development, we tested all models (described in section V) using the Addresses LM, since the system was developed for code-mixed address recognition. Even though the alphabets of the Unicode Urdu data and the code-mixed addresses differ, the same acoustic model can be used as long as both datasets contain similar phones. This is because acoustic models are trained on phones instead of textual representations, and words can be mapped to their phone counterparts regardless of script as long as they represent the same phones. Table 8 shows the system SU results for the different models described in the experimental setup using the Addresses LM on 900 random Urdu-English code-mixed test addresses. Results in Table 8 show that there is a clear performance improvement as we move from simpler models (monophone) to more sophisticated ones (TDNN-LSTM). This trend is consistent across all three evaluation metrics (WER, CER, and SER). The transition from monophone (M1) to triphone models (M2-M5) shows a substantial reduction in error rates. The WER decreases from 49.42% to 19.64%, demonstrating the advantage of context-dependent acoustic modeling in capturing coarticulation effects prevalent in code-mixed speech. The FNN (M6) and TDNN-LSTM (M7) models significantly outperform traditional GMM-based models. The TDNN-LSTM model achieves the best results with a WER of 12.29%, CER of 2.46%, and SER of 40.82%. This substantial improvement underscores the importance of advanced modeling techniques for this challenging task. This also highlights the effectiveness of deep learning techniques in modeling the intricate patterns of code-mixed speech.
1) Real-Time System Testing
The performance of system SU was satisfactory, and after initial testing, the best model was hosted on the TPL server. After deployment, the model was tested on real-time voice queries from consumers using the TPL Maps mobile app. The decoded outputs for the recorded queries were reviewed by TPL staff for a few months. This analysis revealed a critical problem in system SU. While SU could decode most addresses correctly, we noticed that it did not correctly decode English numbers. This was because SU was trained on general Unicode Urdu data with little coverage of code-mixing. As numbers are expected in addresses, this was causing performance issues. We updated system SU to SM to resolve this problem. The development and testing process for system SM is described in the following sub-section VI-B.
B. System SM Testing
To resolve the English number recognition problem in system SU, a new dataset of code-mixed addresses is developed, as described in section III. After dataset development, we used the combined data (Unicode Urdu + code-mixed addresses) to train all acoustic models described in section V to develop an updated ASR system (SM). After training, the acoustic models are tested using the Addresses LM on the same test set as system SU. After this testing, it was observed that system SM started recognizing English digits properly, and the WER was also considerably improved. Table 9 shows the performance of models M1 to M7 developed for system SM on the same 900 test addresses. These results demonstrate a significant improvement in performance across all acoustic models compared to the previous system SU. This enhancement can be attributed to the incorporation of code-mixed addresses in the training data, addressing the English number recognition problem previously encountered. The most striking observation is the substantial reduction in error rates across all models. Even the simplest model (M1, monophone) achieves a WER of 12.12%, which is notably better than the best-performing model in the previous system SU (12.29% for TDNN-LSTM, as shown in Table 8).
The model progression is similar to the previous system: there is a clear trend of improvement as we move from simpler models to more complex ones. The progression from monophone (M1) to triphone (M2-M5) and then to neural network-based models (M6-M7) shows consistent error rate reduction. The triphone models (M2-M5) show significant improvements over the monophone model, with WERs ranging from 6.46% to 7.20%. This underscores the importance of contextual information in modeling code-mixed speech. The FNN (M6) and TDNN-LSTM (M7) models again demonstrate superior performance. The TDNN-LSTM model achieves the best results with a WER of 4.02%, CER of 0.80%, and SER of 15.14%. This represents a substantial improvement over the previous system’s best performance. The substantial improvement across all models emphasizes the critical role of diverse and representative training data. By combining Unicode Urdu and code-mixed addresses, the system gained robustness in handling the variability present in real-world code-mixed speech. After identifying the best model for system SM based on the testing results, the new model is deployed on the TPL server for future queries. We also tested system SM on additional test scenarios to ensure the developed system is fully equipped to handle speech in the target accent.
1) New Speakers Testing
All the testing/decoding of system SM described so far is performed on unseen test utterances from speakers whose speech data is included in the training data. To test the system’s robustness and performance on new (out-of-training) speakers, the SM models are tested on 50 random address transcripts from 4 unknown speakers. Performance results for models M1 to M7 on these speakers’ transcripts are presented in Table 10.
These results provide valuable insights into the generalization capabilities of system SM when tested on new, unseen speakers. This evaluation is crucial for assessing the robustness and real-world applicability of the speech recognition system. Overall performance on new speakers remains good, with the best model (M7, TDNN-LSTM) achieving a WER of 2.13%, CER of 0.43%, and SER of 8.25%, indicating the strong generalization capabilities of the system. The model progression is similar to the results on known speakers: there is a clear trend of improvement from simpler models to more complex ones. The progression from monophone (M1) to triphone (M2-M5) and then to neural network-based models (M6-M7) shows consistent error rate reductions. The FNN (M6) and TDNN-LSTM (M7) models again demonstrate superior performance. The TDNN-LSTM model achieves the best results across all metrics, showing its ability to capture complex speech patterns and generalize well to unseen speakers. The superior performance of neural network-based models, especially the TDNN-LSTM, on new speakers underscores their ability to learn more generalizable representations of speech compared to traditional HMM-based approaches. The system’s ability to maintain high accuracy on unseen speakers also indicates its readiness for real-world deployment, where it will inevitably encounter speech from unknown individuals. Results indicate that the acoustic models of system SM are speaker-independent, and there is no noticeable decrease in performance when tested on new speakers.
2) Accent-Based System Testing
After testing on speakers not in the training set, we focused on accent-based testing of our final system (SM). Usually, speech recognition systems trained for a language with a specific accent do not perform well on different accents of the same language. We tested the system on two accent-based test scenarios to justify the need for a specialized model for the target accent and to test our system. First, we used LibriSpeech, a well-known speech dataset recorded in the American accent. We recorded random transcripts from the LibriSpeech dataset in the target accent with different speakers. The same transcripts in American and Pakistani accents are tested on our models and the pre-trained LibriSpeech counterpart models using the LibriSpeech LM. The comparative analysis presented in Tables 11 and 12 offers compelling evidence for the efficacy of accent-specific speech recognition models. The proposed system SM, developed for Pakistani-accented English, demonstrates superior performance on its target accent compared to the LibriSpeech pre-trained models. For the triphone model M5, SM achieved a WER of 12.46% on Pakistani-accented speech, significantly outperforming the corresponding LibriSpeech M5 model (65.19%). Similarly, for the FNN models (M6), SM attained a 7.26% WER on Pakistani-accented speech, compared to 25.32% using the LibriSpeech M6 model. Conversely, on American-accented speech, the LibriSpeech models maintained their superiority, with the FNN model achieving an 8.39% WER versus SM’s 47%. The other performance metrics (CER and SER) show a similar trend to the WER performance.
These results underscore the critical importance of accent-specific training in automatic speech recognition systems. The performance disparities observed between accents for both systems highlight the challenges of cross-accent generalization and justify the development of specialized models for different accents, even within the same language. Furthermore, the consistent outperformance of FNN models over their triphone counterparts suggests that more advanced neural network architectures can better capture accent-specific characteristics. This study not only validates the approach taken in developing the SM system but also emphasizes the need for diverse, accent-specific corpora in building robust speech recognition systems for varied linguistic communities. Future research directions may include exploring more sophisticated model architectures, investigating transfer learning techniques for efficient accent adaptation, and developing multilingual and multi-accent systems to address the complexities of diverse linguistic environments.
Secondly, this accent testing is further extended by testing system SM and the LibriSpeech models on data from YouTube recordings. The results presented in Tables 13 and 14 comprehensively evaluate the proposed system SM and the LibriSpeech pre-trained models on real-world Pakistani-accented English speech collected from YouTube. This analysis, conducted across four speakers with 363 test audios, provides valuable insights into the system’s robustness and generalization capabilities in handling naturally occurring Pakistani-accented English speech. For the triphone models (M5), as shown in Table 13, the LibriSpeech model struggled significantly with Pakistani-accented speech, yielding high error rates across all speakers (ranging from 58.57% to 81.91% WER) with an average WER of 66.62%. In contrast, the proposed SM model demonstrated substantially better performance, achieving an average WER of 21.18%, with individual speaker WERs ranging from 9.30% to 29.15%. The corresponding CER and SER metrics show similar improvements, with the SM model (M5) achieving an average CER of 4.24% compared to LibriSpeech’s 13.32%. The FNN models (M6), presented in Table 13, show even more promising results. While the LibriSpeech FNN model performed better than its triphone counterpart, achieving an average WER of 38.22%, it still fell short of the SM FNN model’s performance. The proposed SM FNN model maintained consistent performance across all speakers, achieving an average WER of 20.03%, with individual speaker WERs ranging from 17.05% to 23.83%. This represents a significant improvement over the LibriSpeech model’s performance, which showed higher variability across speakers (WERs ranging from 30.95% to 49.61%).
These results are particularly noteworthy given that the test data was collected from YouTube content featuring Pakistani celebrities and politicians, representing natural, uncontrolled speech conditions. The consistent superior performance of both SM models (triphone and FNN) across all speakers demonstrates the system’s robustness in handling real-world Pakistani-accented English speech. Furthermore, the reduced variability in error rates across speakers for the SM models suggests better speaker independence compared to the LibriSpeech models. The performance gap between the LibriSpeech and SM models is more pronounced in the triphone architecture compared to the FNN architecture, suggesting that while more advanced neural architectures can better handle accent variations, the benefits of accent-specific training remain substantial regardless of model architecture. These findings further reinforce the importance of developing specialized acoustic models for specific accent groups and validate the effectiveness of the proposed approach in addressing the challenges of accent-specific speech recognition. Results indicate that system SM models performed better than LibriSpeech models on English data recorded in the Pakistani accent. This shows the significance and the need for ASR systems based on Pakistani/South Asian accents.
3) Manual Decoding Analysis
An examination of the recognition errors reveals several distinct patterns that provide insights into the developed ASR system’s strengths and areas requiring improvement. We categorize and analyze these patterns to understand the underlying challenges in code-mixed address recognition in Table 15. The most prevalent category of errors stems from phonetic similarities between words, particularly in cases where the system substitutes words with similar pronunciations. In the case of “madina masjid” being recognized as “dina masjid”, the system fails to distinguish the subtle phonetic difference in the initial syllable. Similarly, “nirala sweets” being recognized as “rana sweets” shows confusion in words with similar phonetic patterns but different initial consonants. The substitution of “zeeshan” with “shan” indicates a challenge in recognizing word-initial phonemes, particularly in cases where the dropped syllable creates another valid word. The substitution of “do” with “2” indicates a challenge in distinguishing between numeric and word representations of numbers. Confusion between similar-sounding words like “main/mian” and “maul/north” suggests a need for better modeling of subtle phonetic distinctions.
The system shows good performance in handling alphanumeric components in most cases. Strong performance is observed in recognizing pure numeric sequences, as evidenced by the accurate recognition of “chak 33 3 r” in the rural health center address. The system successfully recognizes alphanumeric combinations like “a 1” in “a 1 pathology lab”, indicating robust handling of alphanumeric patterns. The system demonstrates interesting patterns in handling code-mixed elements. It successfully recognized code-mixed phrases like “milk point” in “bilal milk point” and shows consistency in maintaining English terms in their correct form when they are common in addresses (e.g., “center”, “workshop”, “system”). However, the system struggles with less frequent English terms, as seen in the misrecognition of “transports” in the “goods transports” address. Analysis of these examples provides valuable insights into the ASR system’s strengths and weaknesses. While the model demonstrates good performance in many areas, these specific misrecognitions highlight potential avenues for improvement, particularly in handling phonetic similarities, context awareness, and less frequent terms. Addressing these issues through targeted training and dataset augmentation can enhance the robustness and accuracy of the speech recognition system. The future work section describes in more detail how the developed ASR system can be improved further, given the insights from this performance analysis.
This study is a joint effort of the National University of Science and Technology (NUST) and TPL Maps (a part of TPL Corp) to develop Pakistan’s first voice-enabled navigation service for code-mixed Urdu-English addresses. The developed speech recognition systems can have several implications, including but not limited to (i) Specialized application: Most speech recognition systems are developed for general speech with little focus on specialized tasks. This study presents application-specific speech recognition systems. The results could be used as a reference to improve code-mixed street address recognition for under-resourced languages, (ii) Language-specific solution: This study presents a language-specific solution, which could be used as a reference for developing similar solutions for code-switched speech data, (iii) Multi-script speech corpora: A unique way of combining speech data transcribed in different scripts to develop a better-performing system is proposed and tested in this study. This subject is not well explored, as many languages have such data that are not utilized efficiently, and (iv) Social impact: The availability of the developed solution as part of the free Maps application could have a positive social impact by serving millions of consumers and creating a more inclusive online environment. Future work will explore extending this system to include other regional accents and languages, as well as comparisons with other foreign accents, such as British English, to enhance its generalizability.
Conclusion and Future Work
In this study, we developed a novel ASR system for Urdu-English code-mixed street address recognition and accent adaptation for a real-time application. The literature analysis shows that for low-resource, task-specific, and industrial applications, hybrid ASR systems often outperform E2E models by leveraging domain-specific acoustic and language models, ensuring greater accuracy and reliability in challenging environments. To this end, the hybrid ASR approach is selected over the E2E approach, enabling the ASR system to handle datasets transcribed in two different Urdu writing styles and expanding its versatility. The system was developed using the Kaldi ASR toolkit and employed two distinct acoustic modeling techniques: GMM and DNN. Both GMM-HMM and DNN-HMM acoustic models were rigorously tested, and it was observed that models trained using deep neural networks significantly outperformed traditional Gaussian mixture models in all testing scenarios. Besides regular testing, additional out-of-training speaker testing was performed on the developed ASR system. Furthermore, extensive accent-based testing revealed the limitations of widely used pre-trained models on Pakistani-accented English data, underscoring the necessity of specialized accent-specific ASR models for low-resource languages. Results clearly indicate that accurate code-mixed address recognition is more efficiently achieved when using audio data recorded in local accents as opposed to foreign-accented audio. This highlights the critical need for such systems in real-world applications. This work represents the first large vocabulary continuous speech recognition system developed for code-mixed Urdu-English voice-activated navigation, marking an important step toward enhancing Urdu speech recognition in practical settings.
For future work, we aim to further optimize our DNN-HMM acoustic models by training them exclusively on Unicode Urdu data, to enhance performance in large vocabulary speech recognition tasks. Our research group also plans to expand and diversify our dataset to ensure it is more representative of the varied linguistic landscape. Once we have gathered sufficient data, we intend to explore end-to-end (E2E) ASR approaches, conducting a comparative analysis with the hybrid ASR method utilized in this study. This comparison will provide insights into the advantages and limitations of each approach in our specific context. Additionally, we see significant potential in adapting the developed ASR system for specialized applications, such as real-time Urdu speech transcription and automated mailing address recognition. This adaptation will facilitate seamless input in mailing systems and similar use cases, ultimately improving user experience and accessibility. By pursuing these avenues, we aim to contribute further to the advancement of ASR technologies tailored for under-resourced languages and applications.