Bilingual Automatic Speech Recognition: A Review, Taxonomy and Open Challenges

In this technological era, smart and intelligent systems integrated with artificial intelligence (AI) techniques, algorithms, tools, and technologies have an impact on various aspects of our daily life. Communication and interaction between humans and machines using speech have become increasingly important, since speech is an obvious substitute for keyboards and screens in the communication process. Therefore, numerous technologies take advantage of speech, such as Automatic Speech Recognition (ASR), where natural human speech in many languages is used as the means to interact with machines. The majority of related works on ASR concentrate on the development and evaluation of ASR systems that serve a single language only, such as Arabic, English, Chinese, or French. However, research attempts that combine multiple languages (bilingual and multilingual) during the development and evaluation of ASR systems are very limited. This paper aims to provide a comprehensive research background and the fundamentals of bilingual ASR, and to review related works that have combined two languages for ASR tasks from 2010 to 2021. It also formulates a research taxonomy and discusses open challenges for bilingual ASR research. Based on our literature investigation, it is clear that bilingual ASR using deep learning approaches is in high demand and is able to provide acceptable performance. In addition, many combinations of two languages, such as Arabic-English and Arabic-Malay, are still under-explored, which opens new research opportunities. Finally, it is clear that ASR research is moving towards not only bilingual ASR, but also multilingual ASR.


I. INTRODUCTION
Automatic Speech Recognition (ASR) applications can be monolingual, serving one language such as Arabic; bilingual, serving a combination of two languages such as Arabic-English; or multilingual, serving a combination of three or more languages such as Arabic-English-Malay. Obviously, the more languages an ASR application supports, the harder and more complex the recognition task becomes.

(The associate editor coordinating the review of this manuscript and approving it for publication was Md. Moinul Hossain.)
In this advanced technological era, bilingual ASR tasks have become essential due to their importance and high demand in our daily life. Speech is the most efficient means of human communication and interaction. There are approximately 6,900 distinct languages spoken in the world [1]. This huge number of spoken languages can be intertwined, and must be supported by ASR technology by allowing more than one input language and more than one output language in a single ASR engine [1], [2], [3], [4], [5].
Bilingual refers to people who can speak two languages fluently. Bilingualism typically arises from bilinguals' exposure to two languages from birth and their continued use of both languages throughout their lives. From an ASR perspective, a bilingual system is capable of identifying two different languages from a single audio source [6]. Furthermore, bilingualism is now a widespread phenomenon in several bilingual and multilingual societies. Some minority languages are influenced by majority language(s), which themselves have been influenced by globally influential languages such as English, Mandarin, and French [1], [7].
Bilingual ASR engines with either code-switching or code-mixing support are useful in recognising a spoken sentence/utterance that contains multiple languages. Code-mixing occurs when two languages are mixed within a single sentence. For example, the Arabic-English sentence ''mall '', which can be translated into the sentence ''I want to go to the mall'' in English. Code-switching is defined as switching languages from one sentence to another. For example, the Arabic-English sentence ' ' , make sure to be on time'', which can be translated into the sentence ''Your appointment is at four o'clock in the afternoon, make sure to be on time'' [1], [11], [12], [13], [19].
Based on our literature investigation, it is found that the majority of related works focus on the development and evaluation of ASR systems that serve only a single language (monolingual). Furthermore, research attempts that combine multiple languages (bilingual and multilingual) during the development and evaluation of ASR systems are very limited. Therefore, the major contribution of this paper is to provide a comprehensive research background and the fundamentals of bilingual ASR, and to summarise bilingual ASR related works and initiatives published from 2010 through 2021. This paper also formulates a research taxonomy and discusses open challenges for bilingual ASR research, which together function as a roadmap and a guide for the research community.
The rest of the paper is organised as follows: Section 2 highlights the research background on bilingual ASR, including its architecture, requirements, components, and dominating toolkits. Section 3 presents the data collection process and its distribution results. Section 4 presents the results and discussion of the related works on bilingual ASR. Section 5 highlights the open issues and challenges for bilingual ASR, and Section 6 presents the conclusion.

II. BACKGROUND
This section provides the necessary research background and fundamentals of bilingual ASR. It addresses the architectural design of bilingual ASR systems, including their development and evaluation requirements and components: speech and text resources, feature extraction, acoustic modelling, language modelling, the pronunciation dictionary, and the decoder.

A. ARCHITECTURAL DESIGN OF BILINGUAL ASR SYSTEMS
The design, development, and evaluation of contemporary bilingual ASR systems undergo multiple phases and processes, including feature extraction, acoustic modelling, language modelling, pronunciation dictionary construction, and decoding, as shown in Fig. 1. A bilingual ASR system can be implemented by extending each component or model of an ASR system from monolingual to bilingual [19], [20], [21]. All models used for implementing monolingual ASR remain the same for bilingual ASR.
Speech signals and feature extraction are common to all bilingual ASR systems [3], [7]. The recorded speech signal is an analog signal, which needs to be transformed into a digital signal by the ASR engine. During the feature extraction phase, a set of parameters, better known as features, that correlates acoustically with the speech signal is identified for each utterance. Features are computed by processing the acoustic waveform; the feature extraction process keeps relevant information from the waveform and discards irrelevant information. The most frequently used feature extraction techniques for bilingual ASR systems are Mel-Frequency Cepstral Coefficients (MFCC), Perceptual Linear Prediction (PLP), and Linear Discriminant Analysis (LDA) [3], [7], [22], [23].
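As a concrete illustration of the MFCC pipeline just described (pre-emphasis, framing, windowing, mel filterbank, and cepstral decorrelation), the following is a minimal NumPy sketch. It is an educational toy, not the implementation used by any of the surveyed systems; the frame length, hop size, and filterbank parameters are typical defaults assumed for the example.

```python
import numpy as np

def mfcc(signal, sr=16000, n_mfcc=13, frame_len=400, hop=160, n_mels=26):
    """Toy MFCC extraction: pre-emphasis, framing, windowing,
    power spectrum, mel filterbank, log, and DCT-II."""
    # Pre-emphasis boosts high frequencies
    emph = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Slice into overlapping frames and apply a Hamming window
    n_frames = 1 + (len(emph) - frame_len) // hop
    frames = np.stack([emph[i * hop:i * hop + frame_len] for i in range(n_frames)])
    frames *= np.hamming(frame_len)
    # Power spectrum of each frame
    nfft = 512
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft
    # Triangular mel filterbank between 0 Hz and sr/2
    def hz2mel(f): return 2595 * np.log10(1 + f / 700)
    def mel2hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2)
    bins = np.floor((nfft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, nfft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log-mel energies; keep the first n_mfcc
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * n + 1) / (2 * n_mels))
    return logmel @ dct.T

# One second of a 440 Hz tone at 16 kHz yields a (98, 13) feature matrix
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feats = mfcc(tone)
print(feats.shape)
```

Production systems typically add energy terms and delta coefficients on top of these static features.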
During the acoustic modelling phase, the connections between the acoustic information and the phonetics are established, forming the acoustic model (AM). The AM is trained to establish correlations between the basic speech units and the acoustic observations. Training the system requires creating a pattern that represents the features of a class, using one or more patterns that correspond to speech sounds of the same class. The dominant techniques for acoustic modelling are the Hidden Markov Model (HMM) and the Deep Neural Network (DNN) [3], [7], [21], [22], [23], [24].
A language model (LM) is an important component of any bilingual ASR system; it encodes the structural constraints of the language to generate probabilities of occurrence. It induces the probability of a word occurring after a given word sequence. For language modelling, bilingual ASR systems normally use the N-gram approach, including uni-grams, bi-grams, and tri-grams, which help find the correct word sequence by predicting the likelihood of the n-th word from the n-1 prior words. The SRI Language Modeling (SRILM) toolkit is the most popular toolkit for language modelling tasks [6], [7], [13], [16], [17], [23], [25].
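To make the N-gram idea concrete, the sketch below estimates bigram probabilities with add-one smoothing over a tiny invented Malay-English code-mixed corpus. The corpus sentences are fabricated for illustration; a toolkit such as SRILM does the same estimation at scale with far better smoothing schemes (e.g. Kneser-Ney).

```python
from collections import Counter

def train_bigram(corpus):
    """Estimate bigram probabilities P(w_n | w_{n-1}) with add-one smoothing."""
    unigrams, bigrams = Counter(), Counter()
    vocab = set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        unigrams.update(tokens[:-1])          # context counts
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    V = len(vocab)
    def prob(prev, word):
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)
    return prob

# Hypothetical code-mixed corpus: Malay carrier sentences with English nouns
corpus = ["saya mahu pergi ke mall", "saya mahu beli computer"]
p = train_bigram(corpus)
# "mahu" follows "saya" in both sentences, so it is the likeliest successor
assert p("saya", "mahu") > p("saya", "mall")
print(round(p("saya", "mahu"), 3))
```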
A lexicon, also known as a pronunciation dictionary, is another essential requirement for bilingual ASR systems. It includes the pronunciation variants of each word in a language. Finally, a decoder is used to evaluate the bilingual ASR engine by comparing a new speech waveform against three essential components, namely the acoustic model, the language model, and the pronunciation dictionary. The output of the decoder is normally a word string. Popular ASR toolkits include KALDI, HTK, and CMU Sphinx [3], [6], [7], [8], [14], [16], [17]; the most popular toolkit for bilingual ASR systems is KALDI.

B. SPEECH AND TEXT RESOURCES
In any bilingual ASR system, speech and text data are mandatory for implementing the needed models. Although a larger amount of data is better for implementing bilingual ASR models, it requires a pre-processing phase to adjust the data and make it compatible with the models. The pre-processing phase is time-consuming: sound filters are applied to the speech data, and normalisation is applied to the text data (data cleaning). Moreover, each speech file must have a corresponding text file containing the content of the spoken file in text format.
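The text-normalisation step mentioned above can be sketched as follows. This is a deliberately simple, hypothetical cleaning function assumed for illustration (Unicode normalisation, lowercasing, punctuation removal, whitespace collapsing); real bilingual pipelines add language-specific steps such as number expansion and diacritic handling.

```python
import re
import unicodedata

def normalise(text):
    """Toy transcript normalisation: Unicode-normalise, lowercase,
    drop punctuation, and collapse whitespace. Letters of any script
    (including Arabic) are preserved."""
    text = unicodedata.normalize("NFC", text)
    text = text.lower()
    # Keep letters (any script), digits, apostrophes, and spaces
    text = re.sub(r"[^\w\s']", " ", text, flags=re.UNICODE)
    return " ".join(text.split())

print(normalise("I want to go to the MALL!!"))   # i want to go to the mall
```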
There are many spoken languages in the world that can be tackled and served through bilingual ASR systems. A few large data centres and distributors focus on language resource (LR) creation and distribution, such as the Linguistic Data Consortium (LDC), the European Language Resources Association (ELRA), the Open Language Archives Community (OLAC), and the National Institute of Information and Communications Technology (NICT). There are also dedicated research projects [5], [26], [27], [28], [29], [30], [31].

III. METHOD
This review aims at summarising and analysing the related literature on ASR in general and on bilingual ASR in particular, consisting of conference and journal articles. It also aims to apply prior knowledge and ideas on ASR and its components to discuss the difficulties in bilingual ASR development. The taxonomy of bilingual ASR, the dominating techniques, and the most popular ASR toolkits are also highlighted. This paper provides a comprehensive research background and the fundamentals of bilingual ASR, together with the related works that combined two languages for ASR tasks.
It also discusses open challenges, highlights future research trends and directions, and draws a roadmap to bilingual ASR research.
Based on our literature investigation and to the best of our knowledge, this is the first review on bilingual ASR, which constitutes its contribution to the research community. The paper carries out an extensive literature review to bridge this research gap, focusing on the related literature published from 2010 to 2021. In addition, it provides a comprehensive discussion of the fundamentals of bilingual ASR, including its architecture, challenges, the languages covered, and their available databases.

A. DATA COLLECTION PROCESS
In this study, we used specific keywords to search the related literature: bilingual automatic speech recognition (ASR), cross-lingual, code-switching, and code-mixing.
To acquire a wide scope of relevant journal articles, the search covered popular databases in the field such as Springer Link, IEEE Xplore Digital Library, Science Direct, Association for Computing Machinery (ACM) and ISI Web of Science.
The inclusion criteria covered journal papers written in English from the engineering and computer science disciplines. Sub-disciplines such as speech communication, artificial intelligence, computer science, and computational linguistics were taken into consideration. The processes involved in the collection of articles are shown in Fig. 2, and the search results are presented in Table 1. The duplication across databases is summarised in Table 2. From Table 1, 39 articles were relevant to our final review; however, from Table 2, 11 articles appeared in more than one database. Thus, 25 articles were included in the final review. Fig. 3 summarises the inclusion and exclusion criteria for the selection of articles, where 1757 articles were excluded since they are not related to the Engineering and Information Technology fields. Other references were the papers included in the introduction and the speech and text resources sections, or articles cited by the main references. References [3], [20], [24], [31], [33], [34], [35] appeared only in ISI Web of Science.

IV. RESULTS AND DISCUSSION
Based on our literature investigation, it is found that the majority of related works focus on the development and evaluation of ASR systems that serve only a single language (monolingual), such as Arabic, English, Chinese, French, Russian, and Spanish. Furthermore, research attempts that combine multiple languages (bilingual and multilingual) during the development and evaluation of ASR systems are very limited compared to monolingual ASR. Nevertheless, enough work has been done on bilingual ASR to describe its taxonomy, the software and toolkits used, and the techniques for implementing the related bilingual ASR models with notable recognition accuracy.

A. TAXONOMY OF BILINGUAL ASR
Bilingual ASR is classified into code-switching (inter-sentential) and code-mixing (intra-sentential). Code-mixing occurs when the speech contains both languages within the same sentence. In code-switching, on the other hand, the speaker switches between languages from one sentence to the next [1], [19], [20], [22], [31]. The taxonomy of bilingual ASR, as presented in Fig. 4, is based on language dependency, phone sets, and interacting languages.

Bilingual ASR systems can be language dependent or language independent. Language-dependent recognition is a multi-pass approach that uses language boundary detection (LBD) to find where the language switch occurs. The LBD divides the input utterance into segments that are language-homogeneous, and each segment is then identified using a language identification (LID) algorithm. Language dependence may also be achieved over the phone set by using a direct phone mapping for each language. The corresponding language-dependent ASR system is then used. Language-independent recognition is a one-pass approach and is considered a more holistic way to build a bilingual ASR system. It involves building an acoustic model, a language model, and a pronunciation dictionary that encompass all the languages in the mixed-language speech. Recognition is then done in one pass, and language independence can be achieved over the phone set by using a universal phone set for both languages [1], [7].

The phone set in bilingual ASR can be developed using three techniques [22]. First, the phone sets of the different languages are mapped directly to develop a bilingual phone set. Second, bilingual phone sets are developed by mapping the phone sets of the different languages into a universal phone set such as the International Phonetic Alphabet (IPA) or the Speech Assessment Methods Phonetic Alphabet (SAMPA). Third, some methods merge several similar phone units of different languages into one phone unit according to their spectral characteristics [10], [22].
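The second technique above, mapping language-specific phone sets into a universal set, can be sketched as a simple pooling of IPA targets. The inventories and mappings below are invented fragments for illustration only, not taken from any surveyed system.

```python
# Hypothetical fragments of two phone inventories, each mapped to an
# IPA-like universal symbol; symbols and mappings are illustrative
arabic_to_ipa = {"b": "b", "t": "t", "k": "k", "q": "q"}
english_to_ipa = {"B": "b", "T": "t", "K": "k", "P": "p"}

def universal_phone_set(*mappings):
    """Merge language-specific phone sets into one universal set by
    pooling their IPA targets: shared sounds collapse to one symbol,
    language-unique sounds survive."""
    return sorted(set().union(*(m.values() for m in mappings)))

shared = universal_phone_set(arabic_to_ipa, english_to_ipa)
print(shared)  # /b t k/ appear once; Arabic /q/ and English /p/ survive
```

The same pooling idea underlies real universal phone sets, where the mapping itself (deciding which sounds are "the same") is the hard linguistic work.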
Among the phone-set techniques, the IPA approach achieves the best results [22]. Table 3 describes some related works on the taxonomy highlighted in Fig. 4. Language dependence (multi-pass) is usually used to improve a specific model, while language independence (one-pass) is mostly used for implementing end-to-end bilingual ASR. Researchers implement code-switching ASR more often than code-mixing ASR, due to the availability of speech corpora and speaking styles.
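The language boundary detection step of the multi-pass approach can be crudely approximated for script-distinct pairs such as Arabic-English by labelling each token with the Unicode script of its letters and grouping consecutive same-language tokens. This toy sketch only works when the two languages use different scripts; real LBD/LID operates on the audio using acoustic, phonotactic, and prosodic cues.

```python
import unicodedata

def token_language(token):
    """Toy LID: label a token by the Unicode script of its first letter."""
    for ch in token:
        if ch.isalpha():
            return "AR" if "ARABIC" in unicodedata.name(ch) else "EN"
    return "OTHER"

def segment(utterance):
    """Toy LBD: group consecutive same-language tokens into homogeneous
    segments, as a multi-pass recogniser would before routing each
    segment to its language-dependent ASR."""
    segments = []
    for tok in utterance.split():
        lang = token_language(tok)
        if segments and segments[-1][0] == lang:
            segments[-1][1].append(tok)
        else:
            segments.append((lang, [tok]))
    return [(lang, " ".join(toks)) for lang, toks in segments]

print(segment("اذهب الى mall الآن"))
```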

B. SOFTWARES AND TOOLKITS FOR BILINGUAL ASR DEVELOPMENT AND EVALUATION
Several open-source and freely available toolkits for building ASR systems exist. The Hidden Markov Model Toolkit (HTK) is written in the C programming language. The Carnegie Mellon University (CMU) Sphinx-4 toolkit is written in Java. KALDI is a free and open-source toolkit for ASR research, which provides ASR systems based on finite-state transducers using OpenFst [3], [8], [23], [37], [38]. The KALDI toolkit, which supports DNNs, is widely used by many researchers [7], [10], [20], [32], [39]. In comparison, KALDI and HTK are more flexible and allow users to specify the number of states for each unit, whereas CMU Sphinx-4 fixes the number of states to 5-state models. For language modelling, HTK supports bi-gram models, whereas KALDI and CMU Sphinx-4 support both bi-gram and tri-gram language models, and more. In addition, the KALDI toolkit is considered state-of-the-art since it introduced DNNs to open-source ASR technology. According to [38], KALDI is an open-source toolkit for speech recognition, written in C++ and licensed under the Apache License v2.0. The KALDI toolkit is easy to use, can be redistributed thanks to its license, and supports speaker recognition and various other recognition tasks based on the i-vector approach. It also supports the use of graphics processing units (GPUs) for faster processing of large volumes of data.
The performance of a bilingual ASR system is usually measured in terms of accuracy and speed. The accuracy of the ASR system can be measured using the Word Error Rate (WER), while the speed of the system is measured using the real-time factor. Other measures of performance include the Character Error Rate (CER), Word Accuracy (WA), Mixed Error Rate (MER), Phrase Error Rate (PER), and Token Error Rate (TER). Both PER and TER build on word and character error rates [5], [11], [21], [24], [35]. The performance of most ASR engines is measured in terms of WER [7], [17], [20], [21], [24], [32], [34], [36]. The WER is computed as follows:

WER = ((S + D + I) / N) x 100%

where I is the number of insertion errors, S is the number of substitution errors, D is the number of deletion errors, and N is the total number of words in the testing transcriptions.
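The WER formula above is computed in practice via a Levenshtein alignment between the reference and hypothesis word sequences. The following sketch returns the combined (S + D + I) / N ratio without separating the three error types.

```python
def wer(reference, hypothesis):
    """Word error rate via Levenshtein alignment: (S + D + I) / N."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution in a four-word reference gives WER = 0.25
print(wer("i want the mall", "i want the hall"))
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why accuracy (1 - WER) is sometimes reported instead.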

C. BILINGUAL ASR SYSTEMS
Several research works have been conducted on bilingual ASR. For recognising imbalanced bilingual code-switched lectures in Mandarin and English, cross-language acoustic modelling with language identification and unit-merging approaches for bilingual acoustic data sharing were used, in speaker-dependent, speaker-independent, and speaker-adaptation configurations. The database used for this work consisted of three course lectures recorded at National Taiwan University. HMM was used as the acoustic model and MFCC for feature extraction. An overall accuracy of 83.1% was achieved [19]. In another study, language modelling using weighted finite-state transducers (WFSTs) for bilingual ASR was implemented. 4,152 Mandarin-Cantonese parallel transcribed sentences comprising 19.4 hours of speech, an additional 31M words of Mandarin transcriptions, and a lexicon of 28K entries were used. HMM was used as the acoustic model and MFCC for feature extraction. A 12.5% CER reduction over the baseline system was obtained [35].
End-to-end code-switching ASR was implemented for Hindi-English. Acoustic similarity and context-dependent transduction were developed. MFCC and LSTM were used for the acoustic model, while a factored language model was applied. The system achieved a word error rate of 29.79% when the Nabu toolkit was used [20].
The work on ASR of English-isiZulu code-switched speech from South African soap operas introduced a new corpus of spontaneous conversational English-isiZulu code-switched speech. Baseline ASR results were presented for monolingual English and isiZulu ASR systems, as well as for four configurations of code-switched English-isiZulu ASR systems, where HMM was used as the acoustic model, MFCC for feature extraction, SRILM for the language model, and HTK as the decoder.
A new language model configuration known as the language-dependent language model (LDLM), consisting of sub-models connected by explicit switch transitions, was proposed and evaluated. Experiments demonstrated that language-dependent acoustic modelling (LDAM) outperformed language-independent acoustic modelling, with an average improvement of 2.8%. The LDLM also outperformed the language-independent language model (LILM) on the test set. Although the gain is currently minimal, it is promising, considering that the context of the language model is currently not allowed to extend across the switch transition. For the LILM, such cross-language contexts do exist, because the language modelling data is simply pooled. Hence, despite a loss of context across language transitions, the strategy of combining sub-language-models with explicit language transitions is successful. The authors are addressing this shortcoming in ongoing work [17].
A grammar-constrained Mandarin-English bilingual ASR system for real-world music retrieval was developed using speech data from native Mandarin (865 hours), native English (232 hours), and Mandarin-accented English (74 hours). A two-pass phone-clustering method based on the confusion matrix was developed. The recognition of bilingual code-mixing phrases achieved a relative PER reduction of 8.9%. HMM and MFCC were used for the acoustic model and feature extraction, respectively [11].
The Frisian language is spoken in the north of the Netherlands, where speakers are considered bilingual, using both Frisian and Dutch in their daily conversation. The FAME! project involves the implementation of an ASR system that can recognise such code-switched conversations [7].
Cross-lingual Spanish and Romanian speech recognition was developed using the SpeechDat corpus. HMM was used to implement the acoustic model, while MFCC was used for feature extraction. The Romanian cross-lingual speech recognition system achieved a word accuracy of 80.48% [5].
Finding complex features for guest-language fragment recovery in English-Chinese code-mixed ASR was carried out using the KALDI toolkit. HMM/GMM was used to implement the acoustic model and MFCC was used for feature extraction [12].
Cross-lingual phone mapping for large-vocabulary ASR of under-resourced languages was implemented for Malay and Hungarian. HMM/MLP and conventional HMM/GMM were used as the acoustic models. A 9.0% WER was achieved when the ASR was trained with 55 minutes of English data, and a WER of 7.9% was obtained using the full 15 hours of training data [40]. Further details on the developed bilingual ASR systems are shown in Table 4. The most widely used toolkit for developing bilingual ASR systems is KALDI. MFCC was used in almost all bilingual ASR systems as the feature extraction technique. HMM, GMM, and DNN were used to implement the acoustic models. There are few techniques for implementing the language model, and the best toolkit was SRILM. Based on the works summarised in Table 4, it is clear that DNN, TDNN, RNN, and LSTM are deep learning techniques that can be used for bilingual ASR tasks.
Some techniques and procedures have been applied to enhance performance. Multi-task learning of deep neural networks (MTL-DNN) was applied using the Kaldi toolkit [7]. Subspace Gaussian mixture models (SGMMs) were used to generate the AM of low-resource languages; SGMMs factorise the AM parameters into a set shared among all the states of the HMM, again using the Kaldi toolkit [32].
A multi-level framework was used to find complex features for the recovery of the guest language in code-mixing. Phonotactic, prosodic, linguistic, and acoustic-phonetic cues were used to discriminate at the frame level between host- and guest-language segments, as well as to tune the data imbalance ratio between the host and guest languages. DNN-based methods, GMM-HMM, and the context-dependent HMM-DNN (CD-HMM-DNN) recognition framework were among the best techniques used [12].
An HMM-based code-switching system was implemented with the HTK toolkit and trained using HMMs to serve code-switching speech [17]. DNN-based systems were also used [7], [10]. Language-independent and language-dependent targets were used by merging the phones of both languages [7]. Semi-supervised and active learning techniques were used to generate transcriptions for acoustic events in a DNN-based detection system for code-switching speech segments [10].

D. EXISTING CORPORA
Within this literature, many corpora were used in bilingual ASR research. The available corpora are summarised in Table 5.

V. OPEN ISSUES AND CHALLENGES ON BILINGUAL ASR
Many people use more than one language while speaking. For instance, a mix of Arabic and English is normally used by Arab citizens when speaking. This means that an ASR system serving only an individual language will face difficulties recognising speech in any foreign language. In fact, foreign-language words will be mapped to the closest vocabulary in the main language of the ASR system. Therefore, a problem always exists when people speaking to a unilingual ASR system combine more than one language in a single utterance or conversation [3], [4], [10], [13]. A serious challenge for implementing bilingual ASR is the lack of bilingual data for training the models [23], [25], [40].
There are other problems that researchers face and that need to be solved. Some words cannot be recognised due to the low probability of the word when both languages are merged. In addition, there are multiple variants of a single word pronounced in the same way with the same meaning in both languages. For example, the word ''computer'' in English exists with the same pronunciation in Arabic as '' ''; even though computer has another variant in Arabic, the loanword is still used. Moreover, a major concern is the phoneme set used for developing the bilingual ASR system. Some phonemes are shared by both languages, which makes the development of the acoustic model and the lexicon hard [19], [34]. Furthermore, one language may have fewer phonemes than the foreign language; this may also cause recognition failures, because the bilingual system does not recognise the foreign phonemes and cannot predict the right output. Finally, there is the accent of bilingual speakers when they utter foreign words [3], [20]; speakers are mostly affected by their mother-tongue accent [3]. These issues cause many substitutions and deletions when the ASR system decodes bilingual utterances. To the best of our knowledge and based on our literature investigation, this problem is not well investigated and requires more research effort.

VI. CONCLUSION
This paper explores the bilingual ASR research conducted in the period between 2010 and 2021. It addresses the research background on bilingual ASR, including its architecture, requirements, components, and dominating toolkits for bilingual ASR development and evaluation, as well as the results, taxonomy, open issues, and challenges of bilingual ASR.
Bilingual ASR takes the form of code-switching or code-mixing as the interacting-language dimension, and it is further categorised based on phone set and language dependency. The most widely used toolkit is KALDI, while MFCC is the most popular speech feature extraction technique. SRILM and DNN are mostly used for the language model and acoustic model, respectively. The bilingual ASR field is still in need of future research to fill in the gaps by covering more languages and finding the best techniques with the highest accuracy.
TIEN-PING TAN received the Doctorate degree from Université Joseph Fourier, France, in 2008. He is currently a Senior Lecturer with the School of Computer Sciences, Universiti Sains Malaysia. His research interests include automatic speech recognition, machine translation, and natural language processing.