Introduction
Controlling smart devices using voice commands has now become a reality. In the past, Automatic Speech Recognition (ASR) was not a preferred method for human-to-machine communication. This was partly because the technology was not yet mature enough and partly because other interaction methods, such as keyboard, mouse, and touch, were preferable. The use of ASR-based systems has recently increased considerably as the technology has matured enough to be integrated into intelligent devices. Mobile applications such as Google Assistant, Amazon’s Alexa, Apple’s Siri, etc., [1], [2], [3], [4], [5] are redefining how we interact with our devices. ASR has proved to be a fast and convenient communication mode between humans and digital devices. Despite these advancements, most ASR research is performed on English language datasets due to easy availability and extensive community support. Also, most available ASR systems are trained using native English accents. These systems work best for native speakers, but their performance degrades drastically for non-native English speakers [6]. Urdu is the lingua franca of Pakistan and Northern India. According to recent estimates, Urdu has over 230 million speakers worldwide and is considered the 10th-most widely spoken language [7]. With Pakistan having a literacy rate of around 62%, a considerable portion of the population does not speak or understand English [8]. Many people face a language barrier while accessing information in English. There is a need to develop application-specific speech recognition systems for Urdu. Such systems can improve the way these people interact with their smart devices. Code-mixing is one of the significant challenges in developing machine-learning-based speech recognition systems for real-time applications such as navigation [9]. Code-mixing in speech processing refers to speech that contains words from multiple languages [10]. This is a common phenomenon in multilingual communities, where individuals switch between languages or mix elements of different languages in their speech. As English is commonly used for official communication in the subcontinent region, it is common to have code-mixing of English and Urdu while writing and speaking. Table 1 shows sample addresses in Roman and Unicode Urdu format, where code-mixing can be observed in the Roman Urdu version of addresses. English words mixed with Urdu are underlined in Roman Urdu for better visualization. Code-mixed speech usually has many challenges, including:
Lack of labeled data and resources, as gathering large amounts of code-mixed speech data is difficult.
High variability and complexity, involving multiple languages with different scripts, grammar, and vocabulary, leading to mixing at various levels (lexical, phonological, syntactic).
Context-specific words and phrases in regional accents make it challenging to develop models that perform well universally.
Urdu-English code-mixed speech is particularly difficult due to Urdu’s complex morphology and varying scripts.
Although many multilingual ASR solutions are available, application-specific systems are rare, especially for under-resourced languages such as Urdu. This study aims to develop an ASR system for Urdu-English code-mixed street address recognition by overcoming the associated challenges. The system is developed using Kaldi, an open-source speech recognition framework that supports statistical and deep learning-based ASR system development. This work compares speech modeling between deep learning techniques and classical statistical approaches. The developed system is a speaker-independent large vocabulary continuous speech recognition (LVCSR) system for efficiently recognizing Urdu-English code-mixed addresses. The main contributions of this work are as follows:
A novel approach for real-time Urdu-English code-mixed street address recognition and accent adaptation.
High-performance specialized ASR system developed using limited task-specific data by leveraging the phonetic coverage of colloquial Urdu.
Extensive testing (speaker-independent and accent-based), to highlight the need for customized systems in low-resource languages.
Step-by-step system evolution with analysis of error and real decoded examples to showcase development challenges and final performance.
The proposed system is developed for TPL Maps, Pakistan’s first digital mapping solution provider serving millions of real-time customers. The developed ASR system has enabled Urdu voice-activated services in their Maps application. This work is a first step towards developing Urdu-based speech recognition systems for specialized applications (like navigation). The rest of the paper is organized as follows: section II describes the latest trends and developments in Urdu speech recognition, section III presents the dataset used in the development of the proposed ASR system, section IV briefly describes how a typical speech recognition system works and what are its core components, section V presents the experimental setup describing various acoustic models trained and tested during the study, section VI describes the results of different test scenarios used to test the developed system, and finally, we conclude our findings in section VII.
Related Work
This section reviews related works, examining their strengths, limitations, and significant advancements. It is organized into several subsections, each dedicated to a specific area of focus within the literature.
A. Urdu-Based ASR Systems and Task-Specific Applications
Although plenty of literature is available on the design and development of ASR systems, there is still a massive research gap in under-resourced languages such as Urdu [11], [12]. This section presents the work on Urdu ASR and task-specific ASR applications. Urdu speech recognition comes with different challenges for the research community, ranging from the availability of speech corpora to the development of a phonetic lexicon. According to the literature, Chandio et al. developed a dataset of 25,518 speech samples for spoken Urdu digits ranging from 0 to 9. They also applied various classification approaches, including Support Vector Machine (SVM), Multilayer Perceptron (MLP), EfficientNet, and Convolutional Neural Network (CNN), for audio digit classification [13]. Nadimpalli et al. developed resources and benchmarks for Keyword Search (KWS) targeting six low-resource Indic languages (Gujarati, Hindi, Marathi, Odia, Tamil, and Telugu). The authors created keyword resources by reprocessing existing speech datasets, considering factors such as frequency, length, and potential confusion. The performance of the developed KWS system is analyzed by comparing various models (GMM, DNN, and TDNN) [14]. Adeeba et al. developed an ASR system for Native Language Identification (NLI). The study utilizes spectrogram and cochleagram-based features from short speech utterances (0.8 s on average) to identify the native language of Urdu speakers. Bidirectional Long Short-Term Memory (BLSTM) is employed to classify utterances among the native languages [15]. Wubet et al. developed a CNN-LSTM model for accent classification into native and non-native. The study bridges the research gap by evaluating similarities between non-native and native English accents using the proposed model. The study also ranked non-native accents (Mandarin, Italian, German, French, Amharic, Hindi) based on their similarity to native English accents [16]. In [17], the authors used SVM to develop an ASR system for spoken Urdu character classification. Khan et al. collected a multi-genre Urdu Broadcast (BC) corpus of 98 hours of speech data from 453 speakers. The dataset is then used to develop an Urdu LVCSR system using a TDNN acoustic model [18]. Mehreen et al. developed a large-scale publicly available corpus, “Roman-Urdu-Parl”, consisting of 6.37 million parallel sentence pairs [19].
Many Urdu-based ASR systems are available, but task-specific ASR systems are still rare in the literature. Vekkot et al. developed a dementia speech dataset to address the lack of custom datasets for Indic languages. Due to the unavailability of clinical dementia datasets for Indic languages, the dataset is developed using translated recordings from an existing English dataset (DementiaBank). The study compared LSTM, BLSTM, and GRU (Gated Recurrent Unit) models to detect dementia in Indic populations with an accuracy of up to 78% [20]. Raza et al. developed a single-speaker, medium-vocabulary spontaneous speech recognition system for Urdu using the Sphinx toolkit [21]. The study demonstrated that including read speech in the spontaneous training data could decrease WER. Similarly, Asraf et al. presented a speaker-independent Urdu ASR system using the Sphinx toolkit with a limited vocabulary of 52 isolated words [22]. The work presented in [23] proposed a digit recognizer in an entirely Arabic-script environment using the Sphinx toolkit, i.e., it did not include Romanized scripts. Another isolated digit recognizer is presented in [24], while [25] uses a Multilayer Perceptron (MLP) to develop a similar model that recognizes Urdu digits. The work presented in [26] developed a continuous Urdu speech recognizer based on pattern matching and acoustic-phonetic modeling, achieving 55 to 60% accuracy. Sarfraz et al. discussed approaches for improving recognition rates for Urdu speech and presented acoustic models for robust Urdu speech recognition using CMUSphinx [27]. Another study led to the development of MoH (Map Only Hindi) for detecting hate speech in Hindi-English code-switched language [9].
B. Code-Switching and Multilingual Systems
Code-switching is another challenge in low-resource languages such as Urdu, where speakers combine elements from two or more languages within a single utterance or conversation. Sreeram et al. addressed the challenge of code-switching by exploiting acoustic similarity to reduce the target set and proposing a novel context-dependent transduction scheme. The proposed approach is tested on a Hindi-English code-switching corpus, with considerable improvement in Target Error Rate (TER) and WER [28]. Farooq et al. developed a DNN-HMM-based Large Vocabulary Continuous Speech Recognition (LVCSR) system for Urdu-English code-switched conversations [29]. They used a spontaneous Urdu speech corpus (25 hours) for system development, compensated with 10 hours of Urdu BC data. Manjunath et al. developed a Multilingual Phone Recognition System (Multi-PRS). The study compared and evaluated strategies for multilingual phone recognition in code-switched and non-code-switched scenarios using the Kannada and Urdu languages [30].
Ashraf et al. developed a transfer learning-based approach, Tran-Switch, for author profiling in code-switched English and Roman Urdu text [10]. The proposed method trains on a specialized mixed language model to provide code-switching coverage. In [31], the authors proposed a speech recognition system for the Sichuan dialect using a combination of a Hidden Markov Model (HMM) and a deep LSTM network. Dutta et al. improved the performance of the ASR system using a hybrid architecture combining a Deep Neural Network (DNN) and HMM [32]. In another study, Qasim et al. proposed an Urdu speech recognition system designed explicitly for the district names of Pakistan [33]. The authors discussed development challenges and solutions and concluded that an accent-independent system performs better for isolated words. Naeem et al. developed an Urdu ASR system using a subspace Gaussian mixture in which all HMM states share the same Gaussian Mixture Model (GMM) structure with the same number of Gaussians in each state [34]. The developed system shows promising results compared to similar statistical approaches. Emond et al. proposed a new metric for performance assessment of code-switched mixed speech called transliteration-optimized WER. They also proposed a connectionist temporal classification (CTC) acoustic model, along with Maximum Entropy (MaxEnt) and LSTM language models, for bilingual code-switched Indic languages, including Urdu [35]. Similarly, in [36], the authors investigated the effect of data augmentation on code-mixed Bengali-English speech. After data augmentation, they developed and tested MaxEnt and LSTM language models using transliterated Bengali and English corpora.
Ambili et al. proposed a methodology for spoken language identification for Indic languages, which is challenging due to the similarities between them. The study compared the performance of feature extraction using three different pre-trained vision models: VGG16, RESNET50, and Inception-v3. These features are then used for spoken language identification using various classification algorithms. The analysis showed that features generated through VGG16 and Inception-v3 resulted in the best accuracy when classified using an Artificial Neural Network (ANN) [37]. Rangan et al. proposed a methodology for developing a spoken Language Identification (LID) system using code-mixed languages (Gujarati, Telugu, and Tamil) with English [38]. The proposed method improved LID accuracy by 3-5%. Similarly, in [39], Jain et al. investigated the effect of graphemic features on code-mixed Indic-English languages for spoken language identification.
C. Hybrid vs. End-to-End ASR Systems
In speech recognition systems, two primary approaches are commonly considered: hybrid systems and end-to-end (E2E) systems. Each approach employs distinct methodologies for recognizing spoken language, and their effectiveness can vary based on factors such as available training data, computational resources, and language characteristics. Hybrid systems consist of multiple components, allowing for the integration of external language models and phonetic resources. E2E systems, on the other hand, utilize a single neural network to directly map speech input to text output, offering a streamlined architecture that simplifies the training process. E2E systems have achieved state-of-the-art performance on numerous benchmarks, particularly when trained on large datasets [40]. Their ability to leverage deep learning techniques can result in impressive accuracy in well-resourced languages and applications. However, their performance can diminish in scenarios with limited annotated data or complex linguistic environments, such as code-switching. In such cases, hybrid systems can excel, particularly by leveraging robust external language models and adapting to low-resource languages [41].
Arif et al. presented a benchmark of End-to-End (E2E) Urdu ASR models through a comprehensive evaluation of various models. The study uses WER as the primary metric for evaluating different models [42]. Mohiuddin et al. developed an end-to-end ASR model for Urdu, leveraging the XLS-R model based on the Wav2Vec2.0 architecture. The study shows that the fine-tuned XLS-R-300M model outperforms other E2E models, achieving a WER of 49%, compared to 57% for wav2vec2-large-xlsr-53 and even higher WERs for Whisper models [43]. However, the reported WER is still high compared to the WER achieved by hybrid systems on even smaller resources. These recent studies suggest that while E2E systems are rapidly improving and becoming more common in the realm of low-resource languages, hybrid models still hold an edge in WER performance, specifically in task-specific applications. Hybrid speech recognition systems have distinct advantages in environments where data is scarce or unevenly distributed across languages. Their modular design separates acoustic, language, and pronunciation modeling, which facilitates the integration of domain-specific resources [44]. This is particularly beneficial in code-switching scenarios, where speakers frequently alternate between languages. Studies have demonstrated that hybrid systems can outperform E2E models in these contexts, as E2E systems often struggle to capture the intricate linguistic patterns involved without extensive annotated data [45]. However, it is essential to acknowledge that E2E models are continuously improving. Advances in self-supervised learning and data augmentation techniques are helping E2E systems become more effective in low-resource scenarios. Moreover, E2E systems can simplify deployment and reduce latency in certain applications due to their unified architecture, which eliminates the need for multiple independent components.
In industrial applications, both hybrid and E2E systems have found success in various sectors such as customer service, healthcare, and legal documentation. For instance, in automated customer support, callers may switch between languages, challenging single E2E models to maintain accuracy across multiple language domains, especially in the case of low-resource languages. Khan et al. developed an ASR system for code-switched Urdu in noisy telephonic environments using a hybrid ASR system combining HMM with a CNN-TDNN model [44]. The study demonstrated that for custom use cases, especially in a low-resource language with code-switching, hybrid systems achieve better performance. Hybrid models can utilize language-specific language models to ensure accurate recognition, while E2E models excel in environments where extensive training data is available to achieve high accuracy [46].
While E2E models may achieve higher accuracy in certain ASR benchmarks, hybrid models are still widely employed commercially due to their adaptability and efficiency. Factors such as streaming capability, latency, and the ability to integrate external resources play a significant role in the decision-making process for deploying ASR solutions [47]. In conclusion, while E2E systems continue to make significant strides in speech recognition, hybrid approaches remain valuable for tackling challenges associated with low-resource settings, industrial applications, and code-switching. To this end, since this paper focuses on an industrial use case with limited task-specific data and the complexity of code-switching, a hybrid approach is selected for the development of the ASR system. This choice is further justified in the next section, which describes the dataset used to develop the ASR system.
Dataset
The proposed ASR system is developed using two distinct datasets: a large-scale Unicode Urdu dataset and a specialized Roman code-mixed address dataset. The first dataset consists of 17,855 recordings from 144 speakers, containing Unicode Urdu transcriptions with a vocabulary size of 28,391 words. This dataset encompasses approximately 61.82 hours of speech data and occupies 7.1 GB of storage space. It is prepared using recordings from various sources, including Urdu news bulletins, talk shows, radio programs, and recordings of Urdu literature. Almost half of this dataset is prepared by transcribing recordings from online sources. The other half is developed by recording various Urdu manuscripts from different sources by speech and language technology research group members, students, and volunteers using the Urdu ASR recording portal developed by “Speech and Language Technology Group”. The extensive nature of this corpus provides a robust foundation for learning the fundamental phonetic patterns and acoustic characteristics of Urdu speech.
The second dataset is more specialized, focusing specifically on code-mixed Urdu-English addresses. It comprises 12,918 recordings from 20 speakers, with transcriptions in Roman Urdu script. This task-specific corpus has a more concentrated vocabulary of 6,194 words, spanning 16.89 hours of audio data and requiring 2.1 GB of storage. The smaller vocabulary size reflects its focused domain of address-related terminology. Both datasets are developed by the “Speech and Language Technology Research Group” at NUST. Recordings in both datasets are sampled at 16,000 Hz. The regional diversity of the speech dataset is justified by the university’s diverse student body, representing all major regions of Pakistan. This ensures the dataset captures a wide range of accents, dialects, and linguistic nuances. The involvement of students and volunteers allows for continuous updates with new recordings, making the dataset a comprehensive resource for diverse speech patterns across the country. Statistical details of both datasets are described in Table 2.
The unique challenge in this work stems from the different transcription schemes used in the two datasets - Unicode Urdu in the first corpus and Roman Urdu in the address-specific dataset. This diversity in transcription formats necessitated careful consideration of the ASR architecture due to the limited task-specific data. We opted for a hybrid ASR approach over an E2E architecture specifically because of its inherent flexibility in handling multiple language modeling paradigms. The hybrid architecture’s modular nature allows independent optimization of acoustic and language models, enabling us to effectively utilize training data transcribed in different writing systems for the same language. This architectural choice was crucial, as E2E ASR systems typically maintain a rigid relationship between input acoustics and output transcription format, making it challenging to incorporate training data with varying transcription schemes. Also, the E2E approach normally requires a large amount of task-specific data (Roman Urdu transcribed addresses in this case), which is rare for a low-resource language such as Urdu, where even general Urdu datasets are scarce. To this end, the available task-specific data, consisting of only 16.89 hours of recordings, was not sufficient to develop an efficient E2E model. The hybrid approach’s flexibility enabled us to leverage the phonetic richness of the larger Unicode Urdu dataset for acoustic model training while adapting the language model to handle Roman code-mixed transcriptions for the target address recognition task.
The dual-dataset approach, facilitated by the hybrid architecture, strategically addresses the challenge of limited task-specific data in code-mixed address recognition. This training strategy, made possible by the hybrid ASR architecture’s flexibility, effectively leverages the complementary strengths of both datasets: the broad phonetic coverage from the larger corpus and the domain-specific features from the address dataset. The approach is particularly valuable in scenarios where collecting large amounts of task-specific data is challenging or resource-intensive, and where output format requirements differ from available training data transcription schemes. Using the above-described datasets, different versions of ASR systems are developed in this study (system SU and system SM). SU is developed using only general Unicode Urdu data, while the updated version SM is developed using mixed data (general Unicode Urdu + English-Urdu code-mixed addresses). Table 3 shows the different versions of the developed ASR systems and the dataset/s used for their development. The results of SU inspired the development of system SM, an improvement over the initial system SU. After developing and testing SU, we found some issues, and the system SM and the English-Urdu code-mixed address dataset were developed to address these issues. We further explain the reason behind the update of SU to SM in section VI (results and discussion). For the rest of the paper, models trained for a specific system in Table 3 will be referenced with their respective system names (SU or SM). The next section describes the various components of a hybrid ASR system, before a deep dive into the development of the ASR systems in this study.
ASR System Components
This section briefly explains a typical ASR system and its various components. An ASR system estimates the most likely sequence of words for a given speech input. The automatic speech recognition process starts with feature extraction from the input speech. Feature extraction involves applying signal processing to enhance the quality of the input signal and transform the input audio from the time domain to the frequency domain. Based on the extracted features, a set of acoustic observations O is obtained, and the recognizer selects the word sequence that maximizes the product of the acoustic likelihood P(O|w_i) and the language model probability P(w_i):\begin{equation*} \boldsymbol {w^{*}} = \underset {i}{argmax} \left \{{{P(\boldsymbol {O}|w_{i})\:P(w_{i}) }}\right \} \tag {1}\end{equation*}
A. Feature Extraction
Feature extraction, also known as speech parameterization, is used to characterize the spectral features of an input audio signal to facilitate speech decoding. Mel-Frequency Cepstral Coefficients (MFCC), introduced by [48], are one of the most popular feature extraction techniques in speech recognition systems. The reason behind the popularity of MFCC is its ability to mimic the behavior of the human ear. MFCC features can be used directly for speech recognition, but to get better performance, various transforms are applied to the MFCC output. One of these transformations is Cepstral Mean and Variance Normalization (CMVN) [49]. CMVN is a computationally efficient normalization technique that reduces the effects of noise. Similarly, to add dynamic information to MFCC features, first and second-order deltas can be calculated. Given a feature vector of input observations O, first-order deltas are calculated as\begin{equation*} \Delta O_{t} = \frac {\sum _{i=1}^{n} w_{i}(O_{t+i} - O_{t-i})}{2 \,\sum _{i=1}^{n} w_{i}^{2}} \tag {2}\end{equation*}
Second-order deltas are calculated from the first-order deltas in the same way:\begin{equation*} \Delta ^{2} O_{t} = \frac {\sum _{i=1}^{n} w_{i}(\Delta O_{t+i} - \Delta O_{t-i})}{2 \,\sum _{i=1}^{n} w_{i}^{2}} \tag {3}\end{equation*}
After the first and second-order delta calculation, the combined feature vector becomes\begin{equation*} \hat {O}_{t} = [O_{t} \quad \Delta O_{t} \quad \Delta ^{2} O_{t}] \tag {4}\end{equation*}
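To make the pipeline of equations (2)-(4) concrete, the following is a minimal sketch in Python using librosa (an assumption for illustration; the actual systems in this work use Kaldi's feature extraction). It computes 13 MFCCs per 25 ms frame, applies per-utterance CMVN, and appends first- and second-order deltas; the file name is hypothetical.

```python
# Sketch only: librosa-based MFCC + CMVN + delta features (not the Kaldi recipe used in the paper).
import librosa
import numpy as np

audio, sr = librosa.load("sample.wav", sr=16000)         # hypothetical 16 kHz recording
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr),        # 25 ms analysis window
                            hop_length=int(0.010 * sr))   # 10 ms frame shift

# Cepstral mean and variance normalization, per utterance and per coefficient
cmvn = (mfcc - mfcc.mean(axis=1, keepdims=True)) / (mfcc.std(axis=1, keepdims=True) + 1e-8)

delta1 = librosa.feature.delta(cmvn, order=1)             # first-order deltas, cf. equation (2)
delta2 = librosa.feature.delta(cmvn, order=2)             # second-order deltas, cf. equation (3)
features = np.vstack([cmvn, delta1, delta2])              # stacked static + dynamic features, cf. equation (4)
print(features.shape)                                     # (39, num_frames)
```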
Other feature transformation techniques used in speech recognition are Linear Discriminant Analysis (LDA) [50], Heteroscedastic Linear Discriminant Analysis (HLDA) [51], Maximum Likelihood Linear Transform (MLLT) [52], and Feature-space Maximum Likelihood Linear Regression (fMLLR) [53]. fMLLR can also be used as a feature transform for Speaker Adaptive Training (SAT) [54]. These transforms can be applied individually as well as in combination and can significantly enhance the performance of a speech recognition system. The implementation details of these techniques can be found in the respective papers.
B. Language Model
A language model used in ASR systems helps capture the structure of the language. A corpus is prepared to generate a language model based on the desired transcripts. The language model contains the likelihood of the co-occurrence of words in the vocabulary. It determines the prior probability P(w*) of a word sequence by factoring it into conditional word probabilities, which in practice are approximated with an n-gram model:\begin{equation*} P(\boldsymbol {w^{*}}) = \prod _{i=1}^{n} P(w_{i}|w_{1}, \ldots, w_{i-1}) \tag {5}\end{equation*}
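As a toy illustration of equation (5) with an n-gram (n = 2) approximation, the sketch below estimates bigram probabilities from a tiny corpus with add-one smoothing. The corpus and vocabulary are illustrative only, not the address corpus used in this work, which was built with standard n-gram LM tooling.

```python
# Toy bigram language model: P(w_i | w_{i-1}) with add-one smoothing.
from collections import Counter

corpus = [["house", "number", "five", "street", "nine"],
          ["street", "nine", "house", "number", "two"]]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = ["<s>"] + sent + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

vocab_size = len(unigrams)

def p_bigram(w_prev, w):
    # conditional probability with add-one smoothing
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

def sentence_prob(sent):
    # product of conditional probabilities, cf. equation (5)
    tokens = ["<s>"] + sent + ["</s>"]
    prob = 1.0
    for w_prev, w in zip(tokens, tokens[1:]):
        prob *= p_bigram(w_prev, w)
    return prob

print(sentence_prob(["house", "number", "nine"]))
```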
C. Acoustic Model
Acoustic modeling in the ASR system estimates the likelihood term P(O|w_i) in equation (1), i.e., the probability of the observed acoustic features given a hypothesized word (or phone) sequence.
GMM is a statistical generative model that is a common choice to estimate the distribution of output observations, while HMM is used to model the temporal variability of speech. A combination of both models creates a joint acoustic model capable of describing the temporal and spectral dynamics of the speech. Figure 2 shows the architecture of an arbitrary GMM-HMM model for speech recognition, where k denotes the number of HMM states.
Architecture of left to right GMM-HMM acoustic model with k states for speech recognition.
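The sketch below shows a single left-to-right GMM-HMM fitted on dummy MFCC-like features using the hmmlearn library; this is an assumption for illustration only (the actual acoustic models in this work are trained with Kaldi), and the feature dimensions, state count, and mixture count are arbitrary.

```python
# Illustrative left-to-right GMM-HMM (hmmlearn), not the Kaldi training used in the paper.
import numpy as np
from hmmlearn.hmm import GMMHMM

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 39))           # 200 frames of 39-dim dummy features
lengths = [120, 80]                          # two training "utterances"

model = GMMHMM(n_components=3, n_mix=4,      # 3 HMM states, 4 Gaussians per state
               covariance_type="diag", n_iter=20,
               init_params="mcw")            # keep our start/transition matrices below
# Left-to-right topology: start in state 0, allow only self-loops and forward moves.
model.startprob_ = np.array([1.0, 0.0, 0.0])
model.transmat_ = np.array([[0.5, 0.5, 0.0],
                            [0.0, 0.5, 0.5],
                            [0.0, 0.0, 1.0]])
model.fit(feats, lengths)

test = rng.normal(size=(60, 39))
print("log-likelihood of test utterance:", model.score(test))
```

Because zero transition probabilities remain zero under Baum-Welch re-estimation, the left-to-right structure of Figure 2 is preserved during training.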
Deep Neural Networks (DNNs) are an alternative to Gaussian mixture models [55]. A DNN is a neural network with more than one hidden layer between the input and output layers. In a DNN, each hidden unit or neuron typically uses a logistic function, a closely related hyperbolic tangent, or any other activation function with a well-behaved derivative. In the case of multi-class classification problems like speech recognition, the input observation is converted into class probabilities using the softmax activation function. An essential feature behind the success of DNNs in speech recognition is their ability to be trained discriminatively using back-propagation of the derivatives of a cost function measuring the difference between the actual and predicted output for each training case. The natural cost function for a softmax output is the cross-entropy between the target probabilities and the softmax outputs. Figure 3 shows an arbitrary DNN-HMM architecture used for speech recognition tasks. The figure presents the model’s components and data flow, from input features to output probabilities. At the bottom of the figure, observations are presented as a series of feature frames that are propagated through the hidden layers to produce posterior probabilities over the acoustic states at the output.
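The following minimal PyTorch sketch illustrates the DNN-HMM idea described above: a stack of hidden layers with a softmax output over tied HMM states, trained with cross-entropy via back-propagation. PyTorch, the layer sizes, and the state count are assumptions for illustration, not the Kaldi nnet configuration used in this work.

```python
# Sketch of a frame-level DNN acoustic model with softmax outputs over HMM states.
import torch
import torch.nn as nn

num_states = 2000                     # hypothetical number of tied triphone states
dnn = nn.Sequential(
    nn.Linear(140, 1024), nn.ReLU(),  # input: one 140-dim spliced feature frame
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, num_states),      # logits; softmax is applied inside the loss
)
criterion = nn.CrossEntropyLoss()     # cross-entropy against frame-level state targets
optimizer = torch.optim.SGD(dnn.parameters(), lr=0.01)

feats = torch.randn(32, 140)                      # a mini-batch of feature frames
targets = torch.randint(0, num_states, (32,))     # state labels from forced alignment
optimizer.zero_grad()
loss = criterion(dnn(feats), targets)
loss.backward()                                   # back-propagation of the cost derivatives
optimizer.step()
# At decode time, softmax(dnn(feats)) gives state posteriors, which are converted
# to scaled likelihoods for the HMM search.
```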
D. Phonetic Dictionary
The phonetic dictionary contains the mapping of words to their respective phones. Phonetic dictionaries for both datasets are prepared separately. Around 20% of the vocabulary for both datasets is manually converted to phones, and the rest is converted using the Sequitur Grapheme-to-Phoneme (G2P) model [56]. Sequitur G2P is a data-driven technique for solving monotonous sequence translation problems (like word-to-phone conversion). Sequitur G2P has no built-in language specification and can be used for any language, provided example pronunciations are available for training the G2P model. Training the grapheme-to-phoneme model makes the conversion process very fast. To prepare a G2P model, manually converted examples (a pronunciation dictionary) of words to phones are used. Each line in the training dictionary has one word followed by its pronunciation. After training, the G2P model can generate pronunciations for the remaining words in the vocabulary. For this study, we developed two G2P models, one for the general Unicode Urdu data and the other for the code-mixed Urdu-English address data. The conversion accuracy was manually inspected, and minor corrections were performed where required.
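The sketch below illustrates the lexicon format described above (one word per line followed by its phone sequence) and the separation of in-vocabulary words from OOV words that would be passed to the trained Sequitur G2P model. The words, phone symbols, and file name are hypothetical examples, not entries from the actual dictionaries.

```python
# Sketch: writing a G2P training lexicon and listing OOV words that need generated pronunciations.
seed_lexicon = {
    "masjid": "m a s j i d".split(),     # hypothetical phone set and pronunciations
    "street": "s t r ii t".split(),
}

with open("train.lexicon", "w", encoding="utf-8") as f:   # one word per line, then its phones
    for word, phones in seed_lexicon.items():
        f.write(f"{word}\t{' '.join(phones)}\n")

vocabulary = ["masjid", "street", "chowk", "plaza"]
oov_words = [w for w in vocabulary if w not in seed_lexicon]
print("words needing G2P pronunciations:", oov_words)      # -> ['chowk', 'plaza']
```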
Experimental Setup
Various acoustic models are developed and tested during the proposed ASR system development. All the models described in the following sections are trained for both systems (SU and SM). We trained seven (M1 to M7) different types of models to identify the best model for Urdu-English code-mixed address recognition. After completing the training, both systems (SU and SM) are extensively tested using different test sets and language models for performance assessment.
A. GMM-HMM
A phone is the smallest unit of speech. Each word consists of a sequence of phones. The number of phones in a language is far less than the number of unique words. If we use words as the training unit, the model must know each word in a language, making the problem’s dimensionality too high to handle. Words in the speech transcripts are converted to phones using a phonetic dictionary, which contains phones against the words from the vocabulary. In addition to the dictionary, Out-of-Vocabulary (OOV) words are converted to phones using the G2P model, which is trained using manually converted words. A monophone acoustic model is a model trained on individual phones. A better approach compared to monophone modeling is triphone modeling. A triphone is a sequence of three phones, and it captures the context of the phone in the middle very efficiently. If there are N base phones, there are N^3 possible triphones; in practice, acoustically similar triphone states are tied (clustered into a smaller number of leaves) to keep the model tractable.
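For illustration, the sketch below expands a phone sequence into context-dependent triphones of the form left-center+right, with word boundaries marked by a silence symbol. The word and phone symbols are assumptions for illustration only.

```python
# Sketch: expanding a monophone sequence into triphones (left-center+right).
def to_triphones(phones):
    padded = ["sil"] + phones + ["sil"]          # pad with silence at word boundaries
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

print(to_triphones(["m", "a", "s", "j", "i", "d"]))
# ['sil-m+a', 'm-a+s', 'a-s+j', 's-j+i', 'j-i+d', 'i-d+sil']
```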
Table 5 shows the various GMM-HMM models trained for this study with their respective feature transforms. Further details on these techniques and transformations can be found in the respective papers referenced in section IV-A. For all triphone models, the total number of leaves is set to 2000 and the total number of Gaussians to 11000 while training the model. The main drawback of using GMM-HMM is the assumptions we make when modeling speech. Discriminative training methods, on the other hand, do not make any assumptions about the distribution of the training data. This is one of the primary reasons behind the success of discriminative training algorithms, making them the principal training algorithms in speech modeling.
B. DNN-HMM
Different deep neural network (DNN) architectures are designed and tested in the literature on speech recognition tasks. In this paper, the developments focus on M6 (FNN) and M7 (TDNN-LSTM).
1) Feedforward Neural Network (FNN)
The first DNN trained for this study is an FNN model consisting of an input layer, six hidden layers, and an output layer. Figures 4 and 5 show the architecture of the trained FNN. To train an FNN model, we first need to transform our speech data into features. The length of the feature vector is 140, which is achieved after applying various transforms on the raw speech signal. Figure 5 shows the step-by-step process to transform raw speech into a compressed representation of size 140. This transformation requires various signal-processing steps. The feature conversion pipeline processes the input signal into frames of 25 milliseconds. Each frame is converted to an MFCC feature vector of size 13 and spliced with a context of ±4 frames to generate a 117-dimensional (13 × 9) supervector; the subsequent transforms shown in Figure 5 compress and augment this representation to produce the final 140-dimensional feature vector.
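The splicing step is illustrated below as a minimal NumPy sketch: each 13-dimensional MFCC frame is concatenated with its four left and four right neighbours, giving 13 × 9 = 117 dimensions per frame (edge frames are handled by repeating the boundary frames, an assumption made here for simplicity).

```python
# Sketch: frame splicing with a +/-4 frame context, as in the FNN feature pipeline.
import numpy as np

def splice(frames, context=4):
    """frames: (num_frames, feat_dim) -> (num_frames, feat_dim * (2*context + 1))."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(frames)] for i in range(2 * context + 1)])

mfcc = np.random.randn(300, 13)      # 300 frames of 13-dim MFCCs (dummy data)
spliced = splice(mfcc)
print(spliced.shape)                 # (300, 117)
```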
2) Time Delay Neural Network - Long Short-Term Memory (TDNN-LSTM)
The second DNN trained for ASR development is TDNN-LSTM, which combines TDNN and LSTM layers to form a hybrid neural network architecture. Figures 7 and 6 illustrate the architecture and feature pipeline, respectively, of the TDNN-LSTM developed in this study. The input features are processed through a sophisticated pipeline before being fed into the network. Speech signals are processed in overlapping windows of 25 ms with a 10 ms shift to extract 40-dimensional MFCC features. These MFCC frames are concatenated in groups of N=5 frames. Additionally, a 100-dimensional i-vector is computed over the N concatenated MFCC frames and transformed using Linear Discriminant Analysis (LDA) to capture speaker and channel characteristics. Combining these features results in a 300-dimensional final feature vector that serves as input to the network. The architecture of TDNN-LSTM consists of 6 TDNN blocks interleaved with 3 LSTM blocks, arranged in a hierarchical structure to capture both temporal and sequential patterns. Each TDNN block follows a consistent structure comprising three layers: an affine transformation layer for linear projection, followed by a Rectified Linear Unit (ReLU) activation function for non-linearity, and a batch normalization layer to stabilize training. The network processes the input features as follows:
The first TDNN block linearly transforms the 300-dimensional input through its standard three-layer structure.
The second TDNN block operates with a context length of ±1, effectively processing information from adjacent frames (t-1, t, t+1), and is followed by a unidirectional recurrent LSTM block that maintains hidden states between time steps.
The third and fourth consecutive TDNN blocks follow the first LSTM layer. While structurally identical to the second TDNN block, these blocks expand their temporal context to ±3 frames, allowing them to model longer-range temporal dependencies.
A second LSTM block, identical in structure to the first LSTM block, processes the output of the fourth TDNN block, further enhancing the network’s ability to capture sequential patterns.
The fifth and sixth TDNN blocks maintain the expanded context length of ±3 and process the output from the second LSTM block.
The final (third) LSTM block concludes the main processing pipeline.
The output layer of the network operates on 256-dimensional input provided by the third LSTM block. It consists of an affine transformation followed by logarithmic scaling and a softmax operation to generate posterior probabilities for the acoustic states. This final stage maps the network’s internal representations to phonetic state probabilities used in the hybrid ASR system. TDNN-LSTM is trained for six epochs [58], allowing sufficient time for the network to learn both short-term acoustic patterns through its TDNN components and long-term dependencies through its LSTM layers.
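A compact PyTorch sketch of this layout is given below, with TDNN blocks realized as dilated 1-D convolutions over time (affine + ReLU + batch normalization) interleaved with unidirectional LSTM layers and ending in a log-softmax over acoustic states. PyTorch, the hidden sizes, padding choices, and state count are assumptions for illustration; the deployed model was trained with Kaldi's chain recipes.

```python
# Illustrative TDNN-LSTM layout (6 TDNN blocks interleaved with 3 LSTM blocks).
import torch
import torch.nn as nn

class TDNNBlock(nn.Module):
    """Affine-over-context (dilated 1-D conv) + ReLU + batch norm."""
    def __init__(self, in_dim, out_dim, context):          # context = 0, 1, or 3
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim,
                              kernel_size=3 if context else 1,
                              dilation=max(context, 1),
                              padding=context)              # keeps the time length unchanged
        self.act = nn.ReLU()
        self.norm = nn.BatchNorm1d(out_dim)

    def forward(self, x):                                   # x: (batch, dim, time)
        return self.norm(self.act(self.conv(x)))

class TDNNLSTM(nn.Module):
    def __init__(self, in_dim=300, hidden=512, lstm_dim=256, num_states=2000):
        super().__init__()
        self.t1 = TDNNBlock(in_dim, hidden, context=0)      # block 1: no context
        self.t2 = TDNNBlock(hidden, hidden, context=1)      # block 2: +/-1 frames
        self.l1 = nn.LSTM(hidden, lstm_dim, batch_first=True)
        self.t3 = TDNNBlock(lstm_dim, hidden, context=3)    # blocks 3-4: +/-3 frames
        self.t4 = TDNNBlock(hidden, hidden, context=3)
        self.l2 = nn.LSTM(hidden, lstm_dim, batch_first=True)
        self.t5 = TDNNBlock(lstm_dim, hidden, context=3)    # blocks 5-6: +/-3 frames
        self.t6 = TDNNBlock(hidden, hidden, context=3)
        self.l3 = nn.LSTM(hidden, lstm_dim, batch_first=True)
        self.out = nn.Linear(lstm_dim, num_states)          # affine, then log-softmax

    def forward(self, feats):                               # feats: (batch, time, 300)
        x = feats.transpose(1, 2)                           # -> (batch, dim, time) for conv
        x = self.t2(self.t1(x)).transpose(1, 2)             # -> (batch, time, dim) for LSTM
        x, _ = self.l1(x)
        x = self.t4(self.t3(x.transpose(1, 2))).transpose(1, 2)
        x, _ = self.l2(x)
        x = self.t6(self.t5(x.transpose(1, 2))).transpose(1, 2)
        x, _ = self.l3(x)
        return torch.log_softmax(self.out(x), dim=-1)       # log-posteriors over states

posteriors = TDNNLSTM()(torch.randn(4, 150, 300))           # 4 utterances, 150 frames each
print(posteriors.shape)                                     # torch.Size([4, 150, 2000])
```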
C. Computation Time
The evaluation of various acoustic models in Table 6 highlights key trends in computational requirements across traditional GMM-based systems (M1 to M5) and neural networks (FNN and TDNN-LSTM) for system SM, trained on 78.7 hours of speech data. As models progress from monophone (M1, M2) to more complex triphone configurations (M3 to M5), training times increase significantly, from 1.7-1.8 hours for monophone models to 5.8-9.3 hours for basic triphones, and up to 24.6 hours when using SAT. DNNs, despite GPU acceleration (Titan X with 12 GB of 10 Gbps GDDR5X VRAM), require much longer training times (30.2 hours for FNN and 68.5 hours for TDNN-LSTM) due to their complexity and the amount of training data.
The Real-Time Factor (RTF) measures the ratio of processing time to actual audio duration. It is defined in equation 6 below.\begin{equation*} \text { RTF} = \frac {\text {Time taken for processing}}{\text {Duration of the audio}} \tag {6}\end{equation*}
An RTF of less than 1 indicates that the audio is processed faster than real time.
RTF is an important factor to consider during the real-time deployment phase. In this study, RTF is only measured for the final deployed model (TDNN-LSTM), which shows promising inference efficiency. The TDNN-LSTM achieves an RTF of 0.3-0.5 (2-3.3 times faster than real time), indicating that despite the longer training times, neural models are practical for real-time applications.
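Equation (6) can be measured with a simple timing wrapper around the decoder call, as sketched below; the decoder function here is a placeholder, not the deployed system's API.

```python
# Sketch: measuring the real-time factor of equation (6) for a decoder call.
import time

def real_time_factor(decode_fn, audio, audio_seconds):
    """Processing time divided by audio duration; RTF < 1 means faster than real time."""
    start = time.perf_counter()
    decode_fn(audio)                              # placeholder for the actual decoding call
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds

# Dummy example: a "decoder" taking 1.5 s for a 5 s utterance gives RTF = 0.3.
print(real_time_factor(lambda a: time.sleep(1.5), audio=None, audio_seconds=5.0))
```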
D. Performance Assessment
A typical performance measure used to compare different ASR models is the percentage Word Error Rate (WER). WER is calculated using the Levenshtein distance between words [59]. In the Levenshtein distance, we count the number of insertions, substitutions, and deletions required to make two word sequences equal. Depending upon the problem, the costs of insertions, substitutions, and deletions can be set individually. By default, this cost is the same for all operations and is set to 1. When the reference (ref) transcript is matched with the hypothesis (hyp), each word in the hypothesis is assigned a respective label based on whether it is an insertion (I), substitution (S), deletion (D), or correct (C). Equation 7 presents the formula for calculating WER. WER is calculated after each alignment during the decoding process.\begin{equation*} \text { WER }(\%) = \frac {S_{t} + D_{t} + I_{t}}{N}\times 100 \tag {7}\end{equation*}
In equation 7, S_t, D_t, and I_t denote the total numbers of substitutions, deletions, and insertions, respectively, and N is the number of words in the reference transcript. For the sample in Table 7, with one substitution, no deletions or insertions, and four correct words, the WER is\begin{equation*} \text { WER }(\%) = \frac {1 + 0 + 0}{1 + 0 + 4}\times 100 = 20\,\%\end{equation*}
Other metrics that complement the widely used WER to evaluate the performance of ASR systems are the Character Error Rate (CER) and the Sentence Error Rate (SER). CER measures the edit distance between the recognized text and the reference text at the character level. It is particularly useful for languages with logographic writing systems or when dealing with continuous speech without clear word boundaries. CER is expressed as a percentage, with lower values indicating better performance. Equation 8 shows the formula to calculate CER, where S_t, D_t, and I_t now count character-level substitutions, deletions, and insertions, and N is the number of characters in the reference:\begin{equation*} \text { CER }(\%) = \frac {S_{t} + D_{t} + I_{t}}{N}\times 100 \tag {8}\end{equation*}
Using equation 8, the CER for the sample in Table 7 can be calculated as\begin{equation*} \text { CER }(\%) = \frac {1 + 0 + 0}{1 + 0 + 34}\times 100 = 2.86\,\%\end{equation*}
SER, on the other hand, quantifies the proportion of sentences that contain at least one error in the ASR output compared to the reference transcription. This metric is valuable for assessing the overall accuracy of the system at the sentence level, which is crucial for many applications requiring precise sentence-level understanding.\begin{equation*} \text { SER} = \frac {\text {Number of sentences with at least one error}}{\text {Total number of sentences}} \tag {9}\end{equation*}
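The sketch below computes these three metrics with a plain dynamic-programming Levenshtein alignment using unit costs; the reference/hypothesis pair reuses the "madina masjid" example discussed later in the error analysis, and the word segmentation shown is illustrative.

```python
# Sketch: WER (eq. 7), CER (eq. 8), and SER (eq. 9) from Levenshtein edit operations.
def edit_ops(ref, hyp):
    """Return (substitutions, deletions, insertions) aligning hyp to ref with unit costs."""
    d = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    d[0][0] = (0, 0, 0)
    for i in range(1, len(ref) + 1):
        d[i][0] = (0, i, 0)                      # only deletions
    for j in range(1, len(hyp) + 1):
        d[0][j] = (0, 0, j)                      # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                d[i][j] = d[i - 1][j - 1]        # correct token, no edit
            else:
                sub = (d[i - 1][j - 1][0] + 1, d[i - 1][j - 1][1], d[i - 1][j - 1][2])
                dele = (d[i - 1][j][0], d[i - 1][j][1] + 1, d[i - 1][j][2])
                ins = (d[i][j - 1][0], d[i][j - 1][1], d[i][j - 1][2] + 1)
                d[i][j] = min(sub, dele, ins, key=sum)
    return d[len(ref)][len(hyp)]

def error_rate(ref_tokens, hyp_tokens):
    s, dl, i = edit_ops(ref_tokens, hyp_tokens)
    return 100.0 * (s + dl + i) / len(ref_tokens)

ref = "madina masjid gulberg lahore".split()
hyp = "dina masjid gulberg lahore".split()
print(f"WER {error_rate(ref, hyp):.1f}%")                     # one substitution in four words -> 25.0%
print(f"CER {error_rate(list(' '.join(ref)), list(' '.join(hyp))):.1f}%")

sentences = [(ref, hyp)]                                      # one test sentence in this toy example
ser = 100.0 * sum(1 for r, h in sentences if r != h) / len(sentences)
print(f"SER {ser:.1f}%")
```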
The metrics explained above are intuitive and straightforward measures for comparing different ASR models. In this paper, they are used as performance measures to analyze and compare the efficiency of the various ASR models trained and tested during the development of the ASR system. In addition to these performance metrics, a manual analysis of the decoded transcripts is performed to analyze the model’s capabilities and limitations. This analysis also helped identify some areas of improvement as future work of the study.
Results and Discussion
During the development phase, various acoustic models for ASR systems (SU and SM) are trained. After training, extensive testing is performed to investigate the efficiency of these systems. The following sections explain the testing results and the reason behind the development of different systems.
A. System SU Testing
Initially, system SU is developed using Unicode Urdu data. After the development, we tested all models (described in section V) using the Addresses LM, since the system was developed for code-mixed address recognition. Even though the alphabets of the Unicode Urdu data and the code-mixed addresses differ, the same acoustic model can be used as long as both datasets contain similar phones. This is because acoustic models are trained on phones instead of textual representations, and words can be mapped to their phone counterparts regardless of script as long as they represent the same phones. Table 8 shows the system SU results for the different models described in the experimental setup using the Addresses LM on 900 random Urdu-English code-mixed test addresses. Results in Table 8 show that there is a clear performance improvement as we move from simpler models (monophone) to more sophisticated ones (TDNN-LSTM). This trend is consistent across all three evaluation metrics (WER, CER, and SER). The transition from monophone (M1) to triphone models (M2-M5) shows a substantial reduction in error rates. The WER decreases from 49.42% to 19.64%, demonstrating the advantage of context-dependent acoustic modeling in capturing coarticulation effects prevalent in code-mixed speech. The FNN (M6) and TDNN-LSTM (M7) models significantly outperform traditional GMM-based models. The TDNN-LSTM model achieves the best results with a WER of 12.29%, CER of 2.46%, and SER of 40.82%. This substantial improvement underscores the importance of advanced modeling techniques for this challenging task. This also highlights the effectiveness of deep learning techniques in modeling the intricate patterns of code-mixed speech.
1) Real-Time System Testing
The performance of system SU was satisfactory, and after initial testing, the best model was hosted on the TPL server. After deployment, the model was tested on real-time voice queries from consumers using the TPL Maps mobile app. The decoded outputs for the recorded queries were reviewed by TPL staff for a few months. This analysis revealed a critical problem in system SU. While SU could decode most addresses correctly, we noticed that it did not correctly decode English numbers. This was because SU was trained on general Unicode Urdu data with little coverage of code-mixing. As numbers are expected in addresses, this was causing performance issues. We updated system SU to SM to resolve this problem. The development and testing process for system SM is described in the following sub-section VI-B.
B. System SM Testing
To resolve the English number recognition problem in system SU, a new dataset of code-mixed addresses is developed, as described in section III. After dataset development, we used the combined data (Unicode Urdu + code-mixed addresses) to train all acoustic models described in section V to develop an updated ASR system (SM). After training, the acoustic models are tested using the Addresses LM on the same test set as system SU. After this testing, it was observed that system SM started recognizing English digits properly, and the WER was also considerably improved. Table 9 shows the performance of models M1 to M7 developed for system SM on the same 900 test addresses. These results demonstrate a significant improvement in performance across all acoustic models compared to the previous system SU. This enhancement can be attributed to the incorporation of code-mixed addresses in the training data, addressing the English number recognition problem previously encountered. The most striking observation is the substantial reduction in error rates across all models. Even the simplest model (M1, monophone) achieves a WER of 12.12%, which is notably better than the best-performing model in the previous system SU (12.29% for TDNN-LSTM, as shown in Table 8).
The model progression is similar to the previous system: there is a clear trend of improvement as we move from simpler models to more complex ones. The progression from monophone (M1) to triphone (M2-M5) and then to neural network-based models (M6-M7) shows consistent error rate reduction. The triphone models (M2-M5) show significant improvements over the monophone model, with WERs ranging from 6.46% to 7.20%. This underscores the importance of contextual information in modeling code-mixed speech. The FNN (M6) and TDNN-LSTM (M7) models again demonstrate superior performance. The TDNN-LSTM model achieves the best results with a WER of 4.02%, CER of 0.80%, and SER of 15.14%. This represents a substantial improvement over the previous system’s best performance. The substantial improvement across all models emphasizes the critical role of diverse and representative training data. By combining Unicode Urdu and code-mixed addresses, the system gained robustness in handling the variability present in real-world code-mixed speech. After identifying the best model for system SM based on the testing results, the new model is deployed on the TPL server for future queries. We also tested system SM on additional test scenarios to ensure the developed system is fully equipped to handle speech in the target accent.
1) New Speakers Testing
All the testing/decoding of system SM described so far is performed on unseen test utterances from speakers whose speech data is included in the training data. To test the system’s robustness and performance on new (out-of-training) speakers, the SM models are tested on 50 random address transcripts from 4 unknown speakers. Performance results for models M1 to M7 on these speakers’ transcripts are presented in Table 10.
These results provide valuable insights into the generalization capabilities of system SM when tested on new, unseen speakers. This evaluation is crucial for assessing the robustness and real-world applicability of the speech recognition system. Overall performance on new speakers remains good, with the best model (M7, TDNN-LSTM) achieving a WER of 2.13%, CER of 0.43%, and SER of 8.25%, indicating the strong generalization capabilities of the system. The model progression is similar to the results on known speakers: there is a clear trend of improvement from simpler models to more complex ones. The progression from monophone (M1) to triphone (M2-M5) and then to neural network-based models (M6-M7) shows consistent error rate reductions. The FNN (M6) and TDNN-LSTM (M7) models again demonstrate superior performance. The TDNN-LSTM model achieves the best results across all metrics, showing its ability to capture complex speech patterns and generalize well to unseen speakers. The superior performance of neural network-based models, especially the TDNN-LSTM, on new speakers underscores their ability to learn more generalizable representations of speech compared to traditional HMM-based approaches. The system’s ability to maintain high accuracy on unseen speakers also indicates its readiness for real-world deployment, where it will inevitably encounter speech from unknown individuals. Results indicate that the acoustic models of system SM are speaker-independent, and there is no noticeable decrease in performance when tested on new speakers.
2) Accent-Based System Testing
After testing on speakers not in the training set, we focused on accent-based testing of our final system (SM). Usually, speech recognition systems trained for a language with a specific accent do not perform well on different accents of the same language. We tested the system on two accent-based test scenarios to justify the need for a specialized model for the target accent and to test our system. First, we used LibriSpeech, a well-known speech dataset recorded in the American accent. We recorded random transcripts from the LibriSpeech dataset in the target accent with different speakers. The same transcripts in American and Pakistani accents are tested on our models and the pre-trained LibriSpeech counterpart models using the LibriSpeech LM. The comparative analysis presented in Tables 11 and 12 offers compelling evidence for the efficacy of accent-specific speech recognition models. The proposed system SM, developed for Pakistani-accented English, demonstrates superior performance on its target accent compared to the LibriSpeech pre-trained models. For the triphone model M5, SM achieved a WER of 12.46% on Pakistani-accented speech, significantly outperforming the corresponding LibriSpeech M5 model (65.19%). Similarly, for the FNN models (M6), SM attained a 7.26% WER on Pakistani-accented speech, compared to 25.32% using the LibriSpeech M6 model. Conversely, on American-accented speech, the LibriSpeech models maintained their superiority, with the FNN model achieving an 8.39% WER versus SM’s 47%. The other performance metrics (CER and SER) show a similar trend to the WER performance.
These results underscore the critical importance of accent-specific training in automatic speech recognition systems. The performance disparities observed between accents for both systems highlight the challenges of cross-accent generalization and justify the development of specialized models for different accents, even within the same language. Furthermore, the consistent outperformance of FNN models over their triphone counterparts suggests that more advanced neural network architectures can better capture accent-specific characteristics. This study not only validates the approach taken in developing the SM system but also emphasizes the need for diverse, accent-specific corpora in building robust speech recognition systems for varied linguistic communities. Future research directions may include exploring more sophisticated model architectures, investigating transfer learning techniques for efficient accent adaptation, and developing multilingual and multi-accent systems to address the complexities of diverse linguistic environments.
Secondly, this accent testing is further extended by testing system SM and the LibriSpeech models on data from YouTube recordings. The results presented in Tables 13 and 14 comprehensively evaluate the proposed system SM and the LibriSpeech pre-trained models on real-world Pakistani-accented English speech collected from YouTube. This analysis, conducted across four speakers with 363 test audios, provides valuable insights into the system’s robustness and generalization capabilities in handling naturally occurring Pakistani-accented English speech. For the triphone models (M5), as shown in Table 13, the LibriSpeech model struggled significantly with Pakistani-accented speech, yielding high error rates across all speakers (ranging from 58.57% to 81.91% WER) with an average WER of 66.62%. In contrast, the proposed SM model demonstrated substantially better performance, achieving an average WER of 21.18%, with individual speaker WERs ranging from 9.30% to 29.15%. The corresponding CER and SER metrics show similar improvements, with the SM model (M5) achieving an average CER of 4.24% compared to LibriSpeech’s 13.32%. The FNN models (M6), presented in Table 13, show even more promising results. While the LibriSpeech FNN model performed better than its triphone counterpart, achieving an average WER of 38.22%, it still fell short of the SM FNN model’s performance. The proposed SM FNN model maintained consistent performance across all speakers, achieving an average WER of 20.03%, with individual speaker WERs ranging from 17.05% to 23.83%. This represents a significant improvement over the LibriSpeech model’s performance, which showed higher variability across speakers (WERs ranging from 30.95% to 49.61%).
These results are particularly noteworthy given that the test data was collected from YouTube content featuring Pakistani celebrities and politicians, representing natural, uncontrolled speech conditions. The consistent superior performance of both SM models (triphone and FNN) across all speakers demonstrates the system’s robustness in handling real-world Pakistani-accented English speech. Furthermore, the reduced variability in error rates across speakers for the SM models suggests better speaker independence compared to the LibriSpeech models. The performance gap between the LibriSpeech and SM models is more pronounced in the triphone architecture compared to the FNN architecture, suggesting that while more advanced neural architectures can better handle accent variations, the benefits of accent-specific training remain substantial regardless of model architecture. These findings further reinforce the importance of developing specialized acoustic models for specific accent groups and validate the effectiveness of the proposed approach in addressing the challenges of accent-specific speech recognition. Results indicate that system SM models performed better than LibriSpeech models on English data recorded in the Pakistani accent. This shows the significance and the need for ASR systems based on Pakistani/South Asian accents.
3) Manual Decoding Analysis
An examination of the recognition errors reveals several distinct patterns that provide insights into the developed ASR system’s strengths and areas requiring improvement. We categorize and analyze these patterns to understand the underlying challenges in code-mixed address recognition in Table 15. The most prevalent category of errors stems from phonetic similarities between words, particularly in cases where the system substitutes words with similar pronunciations. In the case of “madina masjid” being recognized as “dina masjid”, the system fails to distinguish the subtle phonetic difference in the initial syllable. Similarly, “nirala sweets” being recognized as “rana sweets” shows confusion in words with similar phonetic patterns but different initial consonants. The substitution of “zeeshan” with “shan” indicates a challenge in recognizing word-initial phonemes, particularly in cases where the dropped syllable creates another valid word. The substitution of “do” with “2” indicates a challenge in distinguishing between numeric and word representations of numbers. Confusion between similar-sounding words like “main/mian” and “maul/north” suggests a need for better modeling of subtle phonetic distinctions.
The system shows good performance in handling alphanumeric components in most cases. Strong performance is observed in recognizing pure numeric sequences, as evidenced by the accurate recognition of “chak 33 3 r” in the rural health center address. The system successfully recognizes alphanumeric combinations like “a 1” in “a 1 pathology lab”, indicating robust handling of alphanumeric patterns. The system demonstrates interesting patterns in handling code-mixed elements. It successfully recognized code-mixed phrases like “milk point” in “bilal milk point” and shows consistency in maintaining English terms in their correct form when they are common in addresses (e.g., “center”, “workshop”, “system”). However, the system struggles with less frequent English terms, as seen in the misrecognition of “transports” in the “goods transports” address. Analysis of these examples provides valuable insights into the ASR system’s strengths and weaknesses. While the model demonstrates good performance in many areas, these specific misrecognitions highlight potential avenues for improvement, particularly in handling phonetic similarities, context awareness, and less frequent terms. Addressing these issues through targeted training and dataset augmentation can enhance the robustness and accuracy of the speech recognition system. The future work section describes in more detail how the developed ASR system can be improved further, given the insights from this performance analysis.
This study is a joint effort of the National University of Science and Technology (NUST) and TPL Maps (a part of TPL Corp) to develop Pakistan’s first voice-enabled navigation service for code-mixed Urdu-English addresses. The developed speech recognition systems can have several implications, including but not limited to (i) Specialized application: Most speech recognition systems are developed for general speech with little focus on specialized tasks. This study presents application-specific speech recognition systems. The results could be used as a reference to improve code-mixed street address recognition for under-resourced languages, (ii) Language-specific solution: This study presents a language-specific solution, which could be used as a reference for developing similar solutions for code-switched speech data, (iii) Multi-script speech corpora: A unique way of combining speech data transcribed in different scripts to develop a better-performing system is proposed and tested in this study. This subject is not well explored, as many languages have such data that are not utilized efficiently, and (iv) Social impact: The availability of the developed solution as part of the free Maps application could have a positive social impact by serving millions of consumers and creating a more inclusive online environment. Future work will explore extending this system to include other regional accents and languages, as well as comparisons with other foreign accents, such as British English, to enhance its generalizability.
Conclusion and Future Work
In this study, we developed a novel ASR system for Urdu-English code-mixed street address recognition and accent adaptation for a real-time application. The literature analysis shows that for low-resource, task-specific, and industrial applications, hybrid ASR systems often outperform E2E models by leveraging domain-specific acoustic and language models, ensuring greater accuracy and reliability in challenging environments. To this end, the hybrid ASR approach is selected over the E2E approach, enabling the ASR system to handle datasets transcribed in two different Urdu writing styles and expanding its versatility. The system was developed using the Kaldi ASR toolkit and employed two distinct acoustic modeling techniques: GMM and DNN. Both GMM-HMM and DNN-HMM acoustic models were rigorously tested, and it was observed that models trained using deep neural networks significantly outperformed traditional Gaussian mixture models in all testing scenarios. Besides regular testing, additional out-of-training speaker testing was performed on the developed ASR system. Furthermore, extensive accent-based testing revealed the limitations of widely used pre-trained models on Pakistani-accented English data, underscoring the necessity of specialized accent-specific ASR models for low-resource languages. Results clearly indicate that accurate code-mixed address recognition is more efficiently achieved when using audio data recorded in local accents as opposed to foreign-accented audio. This highlights the critical need for such systems in real-world applications. This work represents the first large vocabulary continuous speech recognition system developed for code-mixed Urdu-English voice-activated navigation, marking an important step toward enhancing Urdu speech recognition in practical settings.
For future work, we aim to further optimize our DNN-HMM acoustic models by training them exclusively on Unicode Urdu data, to enhance performance in large vocabulary speech recognition tasks. Our research group also plans to expand and diversify our dataset to ensure it is more representative of the varied linguistic landscape. Once we have gathered sufficient data, we intend to explore end-to-end (E2E) ASR approaches, conducting a comparative analysis with the hybrid ASR method utilized in this study. This comparison will provide insights into the advantages and limitations of each approach in our specific context. Additionally, we see significant potential in adapting the developed ASR system for specialized applications, such as real-time Urdu speech transcription and automated mailing address recognition. This adaptation will facilitate seamless input in mailing systems and similar use cases, ultimately improving user experience and accessibility. By pursuing these avenues, we aim to contribute further to the advancement of ASR technologies tailored for under-resourced languages and applications.