Exploration of End-to-End Framework for Code-Switching Speech Recognition Task: Challenges and Enhancements

The end-to-end (E2E) framework has emerged as a viable alternative to conventional hybrid systems in the automatic speech recognition (ASR) domain. Unlike the monolingual case, the challenges faced by an E2E system in the code-switching ASR task include (i) the expansion of the target set to account for the multiple languages involved, (ii) the requirement of a robust target-to-word (T2W) transduction, and (iii) the need for more effective context modeling. In this paper, we aim to address those challenges for reliable training of the E2E ASR system on a limited amount of code-switching data. The main contribution of this work lies in the E2E target set reduction by exploiting acoustic similarity and the proposal of a novel context-dependent T2W transduction scheme. Additionally, a novel textual feature is proposed to enhance the context modeling of code-switching data. The experiments are performed on a recently created Hindi-English code-switching corpus. For contrast purposes, the existing combined target set based system is also evaluated. The proposed system outperforms the existing one and yields a target error rate of 18.1% along with a word error rate of 29.79%.


I. INTRODUCTION
Multilingual speakers often alternate between two or more languages (or dialects) during a conversation. In the literature, this phenomenon is referred to as code-switching [1], [2]. The language to which the syntax of a code-switching sentence belongs is referred to as the native language, while that of the embedded foreign words is referred to as the non-native language [3]. The broad domains that carry out research on the code-switching phenomenon are (i) linguistics [4], [5], (ii) language identification and diarization [6], [7], (iii) automatic speech recognition (ASR) [8]-[10], and (iv) language modeling [11], [12]. The scope of this work is limited to building an ASR system for code-switching data.
Early works on code-switching ASR [9], [13], [14] employed the hybrid framework typically developed for the monolingual ASR task. The hybrid framework comprises three sub-modules, namely, a pronunciation model (PM), an acoustic model (AM), and a language model (LM). The PM takes into account the typical pronunciation variations of words by employing a dictionary. The AM involves the creation of the statistical models for the sub-word units such as phonemes or senones, given the acoustic features. The LM captures the conditional probabilities of the next words given the observed word sequences and helps in reducing the search space. All these sub-modules are trained and optimized separately, hence the resulting system can be sub-optimal. Towards addressing that, the end-to-end (E2E) framework was proposed and successfully explored in the monolingual ASR task [15]-[20]. Two variants of the E2E framework include (i) the connectionist temporal classification (CTC) [21], [22], and (ii) the sequence-to-sequence modeling with an attention mechanism [15], [16]. In both these variants, the network is trained with characters as the output targets and does not include any explicit PM or LM. Thus, the E2E ASR framework does not require phonetically labeled training data. With multiple languages being involved, these attributes become even more attractive in the case of code-switching ASR. Motivated by that, recent works have explored the E2E framework in the code-switching ASR domain. In the very first work [23], Seki et al. explored an E2E ASR system for the code-switching task on an artificially created dataset obtained by concatenating monolingual utterances. In contrast, Shan et al. [24] employed a real Mandarin-English code-switching dataset for developing the attention-based E2E ASR system.
For improving the ASR performance, the multi-task learning (MTL) framework involving the language identification (LID) [25] was employed. In another work, Li et al. [26] explored a CTC-based E2E ASR system combined with frame-level LID for recognizing Chinese-English code-switching data. In a recent work on Mandarin-English code-switching ASR, Zeng et al. [27] experimented with data augmentation, MTL for LID, byte-pair encoding, and expansion of vocabulary in LM for N-best rescoring in the context of attention-based E2E framework.
In the existing code-switching E2E ASR works, the target set is derived by simply combining the character sets of the languages involved. It is argued that such systems would suffer from high confusability among the cross-language targets unless a sufficiently large amount of data is available for training. The possible cause of the confusability lies in the broad acoustic similarity among the sound units involved in most code-switching language pairs. Also, for the enlarged target set, such systems would exhibit high computational complexity. In the context of low-resourced modeling, one can avoid such confusability if a common phone set covering the underlying languages in the code-switching data is used as the output target. In an earlier work [28], we proposed a common phone set towards building a hybrid ASR system for the Hindi-English code-switching task. Motivated by the above-cited reasons, we first explore the earlier defined common phone set as a reduced target set for developing an attention-based E2E Hindi-English code-switching ASR system using the HingCoS corpus [28]. Interestingly, the reduced target set based E2E ASR system outperformed the combined target set one in terms of the target error rate (TER). But, a reverse trend was noted when those target sequences were converted to word sequences, i.e., for computing the word error rate (WER). This degradation in WER is because of the enhanced confusability among the homophones (words having identical pronunciation but different spellings) within or across the languages involved. For addressing the same, we have also proposed a context-dependent target-to-word (T2W) transduction scheme that employs an explicit error model (EM) along with an LM to provide context information. Thus, this work is focused on addressing the following two issues in the context of code-switching E2E ASR:
• High confusability among cross-lingual targets for combined target set modeling.
• Degraded WER for reduced target set modeling, owing to the naive T2W transduction and the enhanced homophone confusability.
Further, to enhance the context information of the non-native language words, we have also proposed a novel textual feature referred to as the code-switching identification (CSI) feature. With the incorporation of the CSI feature in the training of factored language model (FLM), in particular, the recurrent neural network-based FLM (RNN-FLM) [29], more effective modeling of code-switching is achieved. The proposed context-dependent T2W transduction scheme is noted to achieve a relative improvement of 22% over the naive transduction scheme in the context of reduced target set based Hindi-English code-switching E2E ASR. Further improvements in the WER performances are achieved with the use of FLMs in T2W transduction. The proposed approaches are generic enough to be applied in any other code-switching context. In the context of code-switching E2E ASR, the notable contributions of this work are summarized as below.
• Development of a Hindi-English code-switching ASR system on the HingCoS corpus.
• Exploitation of acoustic similarity for the target set confusability reduction.
• Proposition of the context-dependent T2W transduction scheme for achieving enhanced WER performance.
• Proposition of a novel textual feature to deal with the intra-sentential code-switching.
The remainder of this paper is organized as follows: Section II presents a brief review of the code-switching phenomenon. Section III describes the development of attention-based E2E ASR systems on both the combined and the reduced target sets. It also includes a discussion about the challenges arising with the reduced target set. The proposed context-dependent T2W transduction scheme is presented in Section IV. The discussion about the proposed CSI textual feature for enhancing the context modeling of the code-switching data is presented in Section V. The details of the experimental setup, system description, and tuning of parameters for the various systems developed are given in Section VI. The experimental results and their discussion are reported in Section VII. Finally, the paper is concluded in Section VIII.

II. A REVIEW OF CODE-SWITCHING PHENOMENON
Code-switching is a phenomenon in linguistics which refers to the use of two or more languages, especially within the same discourse. This phenomenon can be broadly classified into two modes based on the locations of the non-native language words in the sentence. The code-switching that occurs at the boundary of the sentence is referred to as inter-sentential code-switching, while the one occurring within the sentence is referred to as intra-sentential code-switching [30]. In the literature [4], [5], [31]-[34], the possible reasons for code-switching are attributed to the lack of appropriate words in the native language, emphasizing a specific word/phrase, and showing expertise. As a result of colonization and other historical factors, many multilingual communities have emerged across the globe. In turn, that has led to the emergence of code-switching. The salient examples of the same are as follows: Spanish-English [35] in the United States of America, Arabic-English [12] in Egypt, French-German [36] in Switzerland, Frisian-Dutch [37] in the Netherlands, Malay-English [9] and Mandarin-English [38] in Malaysia, Mandarin-Taiwanese [6], [13] in Taiwan, Cantonese-English [39] in Hong Kong, English-isiXhosa, English-isiZulu, and English-Setswana [40] in South Africa, and Hindi-English [41]-[43] in India.
In recent times, there is a greater need for spoken-input systems to handle the code-switching phenomenon. According to [7], [20], [44], unlike the monolingual case, the salient challenges posed by the code-switching phenomenon are (i) the influence of the native language on the pronunciation of non-native language words within an utterance, (ii) the requirement of expert linguistic knowledge and dedicated tools to handle the involved languages, and (iii) the lack of publicly available domain-specific resources. For augmenting resources in the Indian context, we recently created the HingCoS corpus [28] containing Hindi-English code-switching text and speech data. A few example sentences from the said corpus along with their respective English translations are given in Table 1.

III. EXPLORATION OF E2E ASR FRAMEWORK FOR HINDI-ENGLISH CODE-SWITCHING TASK
In this section, we report a maiden exploration of the E2E framework for the Hindi-English code-switching ASR task. The primary experimentation has been done in the context of attention-based E2E framework. Later, in Section VII-C, the proposed techniques are revalidated in the context of the CTC-based E2E framework for completeness.
The attention-based E2E framework, used in the primary experimentation, employs a popular architecture referred to as listen, attend, and spell (LAS) [19]. The LAS architecture consists of three sub-modules: Listener, Attender, and Speller, as shown in Figure 1. The Listener is a pyramidal bidirectional long short-term memory (BLSTM) network which acts as an encoder. Given a set of input features {x_1, ..., x_N} corresponding to a predetermined context length N, the Listener produces an embedding h as

h = Listener(x_1, ..., x_N).
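The Listener's pyramidal structure halves the time resolution between successive BLSTM layers by grouping adjacent frames. As a minimal illustration of that time reduction alone (the recurrent computation itself is omitted, and the function name is ours, not from the LAS implementation):

```python
import numpy as np

def pyramid_reduce(frames: np.ndarray, step: int = 2) -> np.ndarray:
    """Reduce the time resolution by concatenating `step` consecutive
    frames along the feature axis, as done between pyramidal BLSTM layers."""
    n, d = frames.shape
    n_trim = (n // step) * step          # drop trailing frames that don't fill a group
    return frames[:n_trim].reshape(n_trim // step, d * step)

# A 3-layer pyramid with step 2 shortens a 100-frame sequence by a factor of ~8.
x = np.random.randn(100, 40)             # 100 frames of 40-dim filterbank features
h = x
for _ in range(3):
    h = pyramid_reduce(h)
print(h.shape)                            # (12, 320)
```

The shortened sequence reduces the number of encoder states the Attender must score, which is the practical motivation for the pyramid.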
For every time instance, the Attender produces a context c_i, given the embedding h and the decoder state s_i, as

c_i = Attender(s_i, h).

The decoder state s_i is computed by employing an LSTM network that takes the past decoded output label y_{i-1}, the previous state s_{i-1}, and the previous context c_{i-1} as

s_i = LSTM(s_{i-1}, y_{i-1}, c_{i-1}).

The Attender acts like an alignment generator that determines which portion of h needs to be considered for accurate prediction of the current output label y_i. In order to predict y_i, the current context c_i and the previous predicted output label y_{i-1} are passed to the Speller, which is an LSTM decoder, as

P(y_i | x, y_{<i}) = Speller(c_i, y_{i-1}).

The entire network is trained to optimize the posterior probability defined as

max_λ Σ_i log P(y_i | x, y*_{<i}; λ),

where λ represents the LAS model parameters and y*_{<i} refers to the ground truth of the previously decoded targets.

A. COMBINED TARGET SET MODELING
Typically, the E2E ASR systems are trained with the character set of the spoken language as the output targets, given the acoustic features. In languages which involve both upper- and lower-case characters, the transcription is normalized to either of the cases. Thus, in the context of code-switching, such systems have to model the combined character set of the underlying languages. In this work, this approach is referred to as combined target set modeling. The character sets of the Hindi and English languages are shown in Table 2. Using those, we built an E2E Hindi-English code-switching ASR system with 95 targets comprising the Hindi (68) and English (26) character sets and a special character '_' used for separating the words. For the details of the database and the LAS system used for the experimental evaluation, the readers are referred to Section VI. The TER of the developed system using the combined target set is reported in Table 3. For reference purposes, a few decoded character sequences are shown in Table 4. For converting the hypothesized target sequences to their corresponding word sequences, a scheme usually followed in the literature is employed, and the same has been referred to as the naive transduction scheme in this work. In that scheme, first, all white spaces between the targets are removed, and then each of the '_' labels is replaced by a single space to derive the hypothesized word sequence.

TABLE 2. Character sets of the Hindi (68) and English (26) languages that are used as targets in building the conventional E2E Hindi-English code-switching ASR system.

TABLE 3. Evaluation of the E2E ASR system trained on the combined target set for the Hindi-English code-switching task. The WER is computed after the transduction of the hypothesized targets. The percentage of invalid words generated indicates the naiveness of the transduction scheme employed.
From Table 4, it can be noted that the output character sequences often get corrupted with cross-lingual character substitutions. Thus, even for a single character in the output sequence being misclassified, the naive transduction scheme would not find any valid word match in the task wordlist. In those cases, it outputs an unknown (unk) label, which in turn degrades the WER. Table 3 also shows the WER for the developed E2E system along with the percentage of unk labels in the output. The high percentage of unk labels shows the naiveness of the transduction scheme. The issue of cross-lingual confusability can be addressed either by considering a sufficiently large amount of speech data for acoustic modeling or by deriving the target labels by exploiting the acoustic similarity. In the next subsection, we attempt to reduce the confusability among the cross-lingual targets.
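The naive transduction scheme and its failure mode can be sketched as follows; the function name and the toy wordlist are ours, used only for illustration:

```python
def naive_transduce(targets, wordlist):
    """Naive T2W transduction: join the hypothesized character targets,
    split on the word separator '_', and emit the unk label for any
    segment with no valid match in the task wordlist."""
    words = "".join(targets).split("_")
    return [w if w in wordlist else "<unk>" for w in words if w]

wordlist = {"book", "reading"}
hyp = "b o o k _ r e a d i n g".split()      # correctly decoded targets
print(naive_transduce(hyp, wordlist))        # ['book', 'reading']

# A single misclassified character makes the whole word unmatchable:
hyp_err = "b o u k _ r e a d i n g".split()
print(naive_transduce(hyp_err, wordlist))    # ['<unk>', 'reading']
```

As the second call shows, one character error discards the entire word, which is why the unk percentage in Table 3 is so high.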

B. REDUCED TARGET SET MODELING
The reduced target set modeling refers to employing a smaller number of target labels than those involved in the combined target set modeling based E2E code-switching ASR system. Recently, in the context of the multilingual ASR task [45], the authors successfully used the union of the phone sets of the underlying languages as targets for the E2E ASR system instead of the combined character set. Motivated by that, in an earlier work [28], we had defined a common phone set having 62 labels that covers both the Hindi and English languages. In the same work, that phone set was also explored for developing a hybrid Hindi-English code-switching ASR system. For ease of reference, the creation of the said phone set is briefly outlined next. We borrowed the phone set for the Hindi language from a composite phone set covering the majority of the Indian languages, already defined for computer processing [46]. As the Hindi phone set is bigger, the English phones were heuristically mapped to the corresponding Hindi phones having a broad acoustic similarity, and those which could not be mapped to Hindi phones were given unique labels. For more details about that mapping, the readers are referred to Table 4 in [28]. In this work, we employ that common phone set along with the special character '_' as the reduced target set in training the attention-based E2E ASR system for the Hindi-English code-switching task.
The reduced target set based E2E ASR system was trained following the identical setup as used for the combined target set based system discussed in the previous subsection. In Table 5, we show the decoded outputs produced by the reduced target set based E2E ASR system for the same set of example sentences as considered in Table 4. On comparing those tables, it can be noted that the reduced target set based E2E system exhibits a reduction in the cross-lingual target confusability and thus results in an improved TER performance, as given in Table 6.
For converting the reduced target set sequence to the corresponding word sequence, a pronunciation dictionary for all Hindi and English words in the HingCoS corpus is created. During T2W transduction, each target segment separated by the '_' labels is searched in the created pronunciation dictionary and is replaced with the word corresponding to it. In the case of homophone words, the one having the highest unigram count is chosen. If there is no match, then that target segment is replaced with the unk label. Following that, each of the '_' labels is replaced by a single space to produce the hypothesized word sequence. The WER, along with the percentage of unk labels, is also reported in Table 6. On comparing Tables 3 and 6, the proposed reduced target set modeling scheme is noted to provide a substantial reduction in the TER as well as in the unk labels in the output. On the flip side, the WER gets significantly degraded. The possible causes of the WER degradation are discussed in the following.
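The dictionary-based lookup described above can be sketched as below; the phone labels, pronunciations, and unigram counts are hypothetical stand-ins for the actual HingCoS entries:

```python
def dict_transduce(targets, pron_dict, unigram):
    """T2W transduction for the reduced (phone) target set: look each
    '_'-separated phone segment up in a pronunciation dictionary; among
    homophones pick the word with the highest unigram count; emit the
    unk label on a miss."""
    segments, seg, words = [], [], []
    for t in targets:
        if t == "_":
            segments.append(tuple(seg))
            seg = []
        else:
            seg.append(t)
    if seg:
        segments.append(tuple(seg))
    for s in segments:
        cands = pron_dict.get(s, [])
        words.append(max(cands, key=lambda w: unigram.get(w, 0)) if cands else "<unk>")
    return words

# Hypothetical entry: one phone sequence maps to the homophones "right"/"rite".
pron = {("r", "ai", "tx"): ["right", "rite"]}
counts = {"right": 120, "rite": 3}
print(dict_transduce("r ai tx".split(), pron, counts))   # ['right']
print(dict_transduce("r ai dx".split(), pron, counts))   # ['<unk>']
```

The unigram tie-break is exactly what fails for homophones that need sentence context, which motivates the context-dependent scheme of Section IV.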

1) NAIVETY IN T2W TRANSDUCTION
We first highlight the weakness of the T2W transduction approach typically employed to produce WERs for E2E ASR systems. Let {T^h_i}, i = 1, ..., n, be the segmented hypothesized target sequence, {T^c_j}, j = 1, ..., m, be the segmented correct target sequence, and {W_j}, j = 1, ..., m, be the desired word sequence associated with {T^c_j}. The objective of the decoder is to determine the desired word, given the corresponding segment of the hypothesized target sequence. In this transduction approach, the desired word W^h_i is produced by the decoder only when an exact match for a segment T^h_i is found in the pronunciation dictionary D. For a single entry of T^h_i being in error, no valid word corresponding to it would be found in D. In such cases, the unk label is emitted. More formally, the same can be expressed by a decoder function F as

F(T^h_i) = W_j, if T^h_i = T^c_j for some j; unk, otherwise. (2)

From Eqn. 2, it can be noted that, unless the TER of the E2E ASR system is very low, the derived word-level hypothesis will lead to a degraded WER due to the presence of unk labels.

TABLE 4. Two sample decoded output sequences of the attention-based E2E ASR system developed using a combined target set for the Hindi-English code-switching task. The English translations of the sentences are given in the braces. The errors obtained in the hypothesized character sequences have been highlighted. Note that the symbol '_' is used to mark the word boundaries. The invalid words produced by the transduction process are labeled as unk.

TABLE 5. Two sample decoded output sequences of the attention-based E2E ASR system trained on the reduced target set for the Hindi-English code-switching task. For contrast purposes, the sentences are kept the same as considered in Table 4.

2) HOMOPHONE CONFUSABILITY
Almost every language has some homophones, with the English language having a large number of them. In the context of code-switching, the homophone issue gets further enhanced as homophones may occur across the languages too. For ease of illustration, we have grouped them into five broad categories, namely proper nouns, internalized collective/abstract nouns, abbreviations, intra-language, and inter-language homophones. This categorization has been done based on parts-of-speech and language information. Table 7 lists a few examples of those homophone categories for the Hindi-English code-switching case. There are 96 homophones in the HingCoS corpus and their distribution in terms of the earlier defined broad categories is shown in Figure 2. It is worth noting that a large number of homophones belong to the inter-language category. Despite the reduction of target confusability with the reduced target set, it was observed that the confusability during T2W transduction gets enhanced for homophones. Referring to Example 2 in Table 5, it can be observed that, for a proper noun having the hypothesized target sequence ''h i n dx ii'', the T2W transduction process can yield either ''hindi'' or '' '' as the word output. A high frequency of such errors ends up degrading the WER.
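Homophone sets of the kind listed in Table 7 can be enumerated directly from a pronunciation dictionary by inverting it, as in this sketch (the lexicon entries and phone labels are hypothetical, not the actual HingCoS lexicon):

```python
from collections import defaultdict

def find_homophones(pron_dict):
    """Group words sharing an identical pronunciation, i.e. the homophone
    sets that become ambiguous during T2W transduction."""
    by_pron = defaultdict(list)
    for word, pron in pron_dict.items():
        by_pron[tuple(pron)].append(word)
    return [sorted(ws) for ws in by_pron.values() if len(ws) > 1]

lexicon = {
    "right": ["r", "ai", "tx"],
    "rite":  ["r", "ai", "tx"],
    "light": ["l", "ai", "tx"],
    "lite":  ["l", "ai", "tx"],
    "read":  ["r", "ii", "dx"],
}
print(find_homophones(lexicon))   # [['right', 'rite'], ['light', 'lite']]
```

With a common phone set covering both languages, the same inversion also surfaces the inter-language homophones, since Hindi and English words then share one pronunciation space.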
In the context of the hybrid ASR task involving Hindi-English code-switching, the authors in [47] faced a similar challenge and explored merging some identical sounding words based on the unigram counts in their database. It is argued that, by following such an approach, we can handle only a subset of homophones (proper nouns, abstract nouns, and abbreviations) but not the set comprising intra- and inter-language homophones. For effective handling of the latter set of homophones, we would require more in-depth context information rather than unigram counts.

IV. CONTEXT-DEPENDENT T2W TRANSDUCTION
In this section, we propose a context-dependent T2W transduction process that exploits modularized decoding to address the issues highlighted earlier.
In the hybrid ASR literature, a few works have already explored modularized decoding for T2W transduction. Demuynck et al. [48], [49] proposed a two-step decoding process that employed morpho-syntactic and morpho-phonologic constraints for T2W transduction. Following that work, Zweig and Nedel [50] presented an empirical study on the error-robustness of T2W transduction across a variety of languages. For that study, the decoding objective for the transduction of the i-th hypothesized segment is formulated as

{W*_i, T*_i} = arg max over (W_i, T^c_i) of P(W_i | S) · P(T^c_i | W_i) · P(T^h_i | T^c_i).

In the above formulation, the first and second factors respectively denote the LM and the PM, while the third factor accounts for the EM. With the maximization performed over all possible correct target sequences T^c_i and their corresponding words W_i, the earlier discussed unk and homophone issues can be resolved. Exploiting these observations, a novel T2W transduction scheme for E2E ASR systems has been evolved, as explained below. The hypothesized target sequences produced by an E2E ASR system may contain one or more errors. For the error modeling purpose, we have employed the Levenshtein (edit) distance-based search. Let {T^h_i}, i = 1, ..., n, denote a hypothesized target sequence having n segments. For each segment T^h_i, we determine all possible (say p) pronunciation sequences {T^c_j}, j = 1, ..., p, in the PM having edit distances up to a predetermined threshold. For the reduced target set case, those sequences may further map to homophones within or across the languages. Let a set {W_ik}, k = 1, ..., m, denote all possible (say m) words returned by that search. On appending each of those words to the current partial sentence S, a corresponding new partial candidate sentence is constructed. All those constructed sentences are then pruned based on the context information derived from an appropriate LM, and the 1-best sentence is generated. When all segments in T^h get processed, the 1-best output yields the final transduced output.
The overall flow diagram of the proposed T2W transduction scheme is shown in Figure 3.
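A minimal sketch of the scheme in Figure 3 is given below, with a Levenshtein-based error model and a beam of LM-scored partial sentences. The toy bigram LM and all function names are our illustrative assumptions; the actual system uses an RNN-FLM for the context scoring:

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two target (phone) sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def context_t2w(segments, pron_dict, lm_score, threshold=1, beam=3):
    """Context-dependent T2W sketch: for each hypothesized segment, collect
    every word whose pronunciation is within `threshold` edits (error model),
    extend each partial sentence with every candidate, and keep the `beam`
    highest-scoring partials under the LM (context model)."""
    partials = [([], 0.0)]
    for seg in segments:
        cands = [w for w, p in pron_dict.items() if edit_distance(seg, p) <= threshold]
        if not cands:
            cands = ["<unk>"]
        partials = sorted(
            ((sent + [w], score + lm_score(sent, w))
             for sent, score in partials for w in cands),
            key=lambda x: -x[1])[:beam]
    return partials[0][0]

# Toy setup: one phone error in the first segment; the LM resolves "light"/"lite".
pron = {"light": ["l", "ai", "tx"], "lite": ["l", "ai", "tx"], "on": ["o", "nx"]}
bigram = {("<s>", "light"): -1.0, ("<s>", "lite"): -4.0,
          ("light", "on"): -0.5, ("lite", "on"): -3.0}
lm = lambda sent, w: bigram.get((sent[-1] if sent else "<s>", w), -10.0)
hyp = [["l", "ai", "dx"], ["o", "nx"]]
print(context_t2w(hyp, pron, lm))   # ['light', 'on']
```

The error model recovers from the misdecoded phone ("dx" for "tx"), while the LM score picks "light" over its homophone "lite", mirroring the two attributes demonstrated in Table 8.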
The innovation in the proposed scheme is demonstrated with the help of an example shown in Table 8. The top two rows of that table show the word- and target-level reference transcriptions for an example utterance. The output generated by the reduced target set based E2E ASR system for that utterance is given in the third row, whereas the last two rows correspond to the outputs produced by the naive and the proposed transduction schemes. On comparison, it can be noted that the proposed scheme not only avoids unk labels but can also handle intra- and inter-language homophone pairs such as ''light-lite'' and ''co-'' , respectively. The first attribute refers to effective error modeling, while the second one is the result of context modeling. The LM scores of the candidate sentences clearly show how effectively the homophone issue gets resolved. Thus, the LM plays a vital role in the proposed T2W transduction scheme.
The effective language modeling of code-switching text data is a research problem in itself. Towards that end, a novel textual feature is discussed in the following section.

V. CODE-SWITCHING TEXTUAL FEATURE FOR ENHANCED CONTEXT MODELING
In the context of code-switching, a few works have already shown that the FLMs trained with parts-of-speech (POS) information are more effective than the traditional LMs [35], [51]. In intra-sentential code-switching, the non-native words are embedded into the native sentences, mostly without affecting their structure. In an earlier work [52], we had exploited those structures to adapt a monolingual LM to deal with code-switching data. In this section, we demonstrate how those structures could be exploited in the direct modeling of code-switching data. For that purpose, we propose a textual feature that can be used in FLM training similar to the POS tags. In the following, the extraction of the proposed textual feature and the RNN-FLM paradigm in which it is included are described.

A. PROPOSED TEXTUAL FEATURE
The proposed textual feature is referred to as the code-switching identification (CSI) feature in this work. It not only marks the location of code-switching but also provides information about the equivalent native (Hindi) word. The procedure for extracting the CSI feature for Hindi-English code-switching data is described next. In the HingCoS corpus, the Hindi and English words appear in Devanagari and Latin scripts, respectively. First, we pass every English-scripted word (w_E) through a machine translator to get its equivalent Hindi-scripted word (w_H). Later, a string {W-w_E:S-w_H:C-Yes} is emitted, which comprises the word w_E along with the CSI feature, where the identifiers W-, S-, and C- denote the input word, the switched word, and the code-switching status, respectively. Similarly, for a Hindi-scripted word (w_H), the string including the CSI feature is emitted as {W-w_H:S-w_H:C-No}. Note that, as there is no code-switching, W- and S- are tagged with the same word w_H. The structure of the above strings follows the syntax of the FLM toolkits [53], [54].
The algorithm for the proposed CSI feature extraction is given as a flowchart in Figure 4. Also, an example Hindi-English code-switching sentence tagged with the proposed CSI feature is shown in Figure 5.
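The flowchart in Figure 4 amounts to the following per-token tagging; the translator lookup and the English-word test are hypothetical stand-ins, with ASCII transliterations in place of the Devanagari script:

```python
def csi_tag(tokens, translate, is_english):
    """Emit the CSI-tagged string for each token, following the
    {W-:S-:C-} syntax described above. `translate` stands in for the
    machine translator that returns the Hindi equivalent of an
    English word."""
    tagged = []
    for w in tokens:
        if is_english(w):
            tagged.append(f"W-{w}:S-{translate(w)}:C-Yes")   # code-switched word
        else:
            tagged.append(f"W-{w}:S-{w}:C-No")               # native word
    return tagged

english_words = {"book"}
to_hindi = {"book": "kitaab"}    # stand-in machine translation
tags = csi_tag(["meri", "book"], to_hindi.get, english_words.__contains__)
print(tags)   # ['W-meri:S-meri:C-No', 'W-book:S-kitaab:C-Yes']
```

Each tagged token thus carries both the code-switching flag and the native-language fallback word that the FLM can exploit.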

B. RNN-FLM ARCHITECTURE
The FLMs incorporate morphological or linguistic features of the word w_t while training the LM [55]. For training the FLMs, the appropriate set of features is derived either using linguistic knowledge or using data-driven techniques. In this technique, each word w_t in the vocabulary is represented as a group of k features denoted as w_t ≡ {f_t^1, f_t^2, ..., f_t^k}. In recent works, the RNNs are also used in training the FLMs and have shown significant improvement in recognition performances [29], [56]. Motivated by that, in this work, we have employed the RNN-FLM architecture to model the code-switching data by including the CSI textual features. The RNN-FLM predicts the posterior probability of the current word w_t as

P(w_t | w_{t-1}, F_{t-1}, s_{t-1}, c_t),

where F_{t-1} denotes the feature vector corresponding to w_{t-1}, i.e., the previous word, s_{t-1} refers to the RNN state [57], and c_t represents the class to which the word w_t belongs [58]. Those classes are derived by partitioning the vocabulary of the training data into groups based on the word counts, which helps in reducing the search complexity. The network architecture of the RNN-FLM, highlighting the component variables, is shown in Figure 6.
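A single prediction step of such a class-based RNN-FLM can be sketched as below; the shapes, the initialization, and all names are illustrative assumptions, not the toolkit's internals:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnnflm_step(w_prev, f_prev, s_prev, params, word2class, class_words, w_t):
    """One step of a class-based RNN-FLM: the state is driven by the
    previous word AND its factor (feature), and the word probability is
    factored as P(c_t | s_t) * P(w_t | c_t, s_t)."""
    E_w, E_f, W_h, W_c, W_o = (params[k] for k in ("E_w", "E_f", "W_h", "W_c", "W_o"))
    x = E_w[w_prev] + E_f[f_prev]              # fuse word and factor embeddings
    s_t = np.tanh(x + W_h @ s_prev)            # recurrent state update
    p_class = softmax(W_c @ s_t)               # P(c | s_t) over all classes
    c_t = word2class[w_t]
    members = class_words[c_t]                 # words in w_t's class
    p_in_class = softmax(W_o[members] @ s_t)   # P(w | c_t, s_t) over class members
    return s_t, p_class[c_t] * p_in_class[members.index(w_t)]

rng = np.random.default_rng(0)
V, F, H, C = 6, 3, 8, 2                        # toy vocab, factor, hidden, class sizes
params = {"E_w": rng.normal(size=(V, H)), "E_f": rng.normal(size=(F, H)),
          "W_h": rng.normal(size=(H, H)), "W_c": rng.normal(size=(C, H)),
          "W_o": rng.normal(size=(V, H))}
word2class = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
class_words = {0: [0, 1, 2], 1: [3, 4, 5]}
s1, p = rnnflm_step(w_prev=0, f_prev=1, s_prev=np.zeros(H), params=params,
                    word2class=word2class, class_words=class_words, w_t=4)
print(0.0 < p < 1.0)   # True
```

The class factorization means each step scores only C classes plus the words of one class, rather than the full vocabulary, which is the search-complexity reduction mentioned above.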

VI. EXPERIMENTAL SETUP A. DATABASE
In this work, the recently created HingCoS corpus² has been used for the experimentation purpose. The text data in the HingCoS corpus consists of 25988 Hindi-English code-switching sentences and has a vocabulary of 14643 words (6029 Hindi and 8614 English). The lengths of the sentences vary from 3-57 words and, on average, there are 3-4 code-switching instances per sentence. For a total of 9251 Hindi-English text sentences in the HingCoS corpus, the corresponding speech data spoken by 101 speakers (61 males and 40 females) is also available. The speech data, being collected over telephone channels, is sampled at 8 kHz with a resolution of 16 bits/sample. The total size of the speech data is about 25 hours. The salient statistics of the HingCoS corpus are summarized in Table 9.
The available speech data is divided into three non-overlapping sets having 7115, 160, and 1976 utterances for training, development, and testing of the E2E ASR systems, respectively. These sets are also non-overlapping in terms of the speakers involved. For language modeling, excluding the earlier defined acoustic test set, the remaining text data is partitioned into training and development sets having 22700 and 1312 sentences, respectively. In this way, both the acoustic and language models are evaluated on the same test set.

2. www.iitg.ac.in/eee/emstlab/HingCoS_Database/HingCoS.html

B. SYSTEM DESCRIPTION AND PARAMETER TUNING 1) ATTENTION-BASED E2E ASR SYSTEM
The Nabu toolkit [59] is used for developing the LAS architecture-based E2E ASR system. The parameter settings used for analyzing the speech data include window length of 25 ms, window shift of 10 ms, and pre-emphasis factor of 0.97. The 40-dimensional log Mel-filterbank energies per speech frame are used as features for acoustic modeling. The details of the LAS architecture are as follows. The listener has 3 pyramidal BLSTM layers, with 256 units in each layer. The pyramidal step size is kept as 2, and the dropout rate in training is set to 0.5. The speller has 2 LSTM layers, with 256 units in the input layer. The dropout rate for the speller is also set to 0.5. The average cross-entropy loss is used as a loss function. The model is trained for 300 epochs with a batch size of 32 and learning rate set to 0.1 with decay 0.01. For decoding, a beam-search decoder with beam width set to 10 is employed.
For training the LAS network, the number of epochs and the number of hidden units in the input layer of the encoder are selected by performing tuning experiments on the acoustic development set. The tuning of the LAS network is done for the combined target set case, and the same parameters are fixed for the reduced target set case as well. The plots showing the trends of those experiments are given in Figure 7. Note that the tuning for the number of epochs is done by keeping the number of nodes fixed, and vice-versa. The remaining parameters are set to their default values as defined in the toolkit. The TER is found to saturate after 300 epochs, while it degrades beyond 256 nodes.

2) RNN-BASED LMs
For the experimentation purpose, both simple and factored modeling-based RNN-LMs are developed using the RNN-LM toolkit [54]. Both kinds of RNN-LMs are developed employing identical network architectures with sigmoid as the non-linearity function. After tuning the simple RNN-LM on the linguistic development set, the salient parameters of the architecture are: a hidden layer with 200 nodes, the value of the back-propagation-through-time variable set to 5, and the number of classes set to 100.

VII. EXPERIMENTAL RESULTS
In this section, we present the evaluation of both the proposed context-dependent T2W transduction scheme and the CSI textual feature in the context of the Hindi-English code-switching ASR task.

A. EVALUATION OF THE T2W TRANSDUCTION
For the primary proposal in this work, i.e., the reduced target set for the E2E ASR system, the results are already discussed in Section III. From the experiments done on the HingCoS corpus, it can be deduced that the proposed reduced target set modeling yields about 17% relative improvement in TER over the combined target set case. Towards addressing the challenges in T2W transduction with the reduced target set modeling, we have also proposed a context-dependent T2W transduction scheme as the secondary contribution. Table 10 presents its detailed evaluation while studying the impact of both the error model and the inclusion of context information. The proposed transduction scheme yields a WER of 31.09%, which corresponds to a 22.6% relative improvement over the naive transduction scheme. Further, to study the impact of context information on the transduction performance, we also evaluated the proposed scheme with a unigram LM and with no LM. The latter case refers to randomly choosing one among the word possibilities available during the error modeling. From those results, we can conclude that most of the improvement in the T2W transduction performance has been achieved on account of better context modeling. For comparison purposes, the results for the combined target set modeling case are also given in Table 10. Unlike the reduced target set modeling, the homophone issue does not arise in the combined target set case; even so, the reduced target set modeling still yields the best WER.
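The role of context in the transduction can be sketched with a toy example. Everything below (the candidate words, pseudo-probabilities, and bigram scores) is hypothetical and only illustrates the principle: the error model proposes word candidates for each hypothesized target chunk, and an LM disambiguates homophones instead of a random choice among them.

```python
# Illustrative sketch of context-dependent T2W transduction (all data
# and names hypothetical): the error model proposes word candidates per
# hypothesized chunk, and a bigram LM picks the most probable word
# sequence instead of choosing among homophones at random.

import math

# Error model: candidate words (with pseudo-probabilities) per chunk.
candidates = [
    {"mein": 0.6, "main": 0.4},     # homophones under the reduced targets
    {"school": 0.9, "skool": 0.1},
]

# Toy bigram LM P(w2 | w1); unseen pairs get a small floor probability.
bigram = {("<s>", "mein"): 0.5, ("<s>", "main"): 0.3,
          ("main", "school"): 0.6, ("mein", "school"): 0.1}
FLOOR = 1e-4

def best_sequence(cands, beam=5):
    """Beam search over candidates, scoring error model x bigram LM."""
    beams = [(0.0, ["<s>"])]
    for slot in cands:
        expanded = []
        for score, seq in beams:
            for word, p_err in slot.items():
                p_lm = bigram.get((seq[-1], word), FLOOR)
                expanded.append((score + math.log(p_err) + math.log(p_lm),
                                 seq + [word]))
        beams = sorted(expanded, reverse=True)[:beam]   # prune
    return max(beams)[1][1:]

# Although "mein" is acoustically more likely, the LM context prefers
# the sequence through "main", mirroring how context resolves homophones.
print(best_sequence(candidates))  # -> ['main', 'school']
```

With the LM removed (the "no LM" case in Table 10), the choice would fall back to the error-model score alone, selecting "mein" and illustrating why context matters.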

B. EVALUATION OF THE CSI TEXTUAL FEATURE
As argued earlier and also evident from Table 10, for the reduced target set case, the T2W transduction performance is highly dependent on the quality of the context information provided by the LM. Therefore, as the third contribution of this work, we have proposed the CSI textual feature for improving the code-switching LM. It has been evaluated separately in language modeling and speech recognition tasks. Table 11 shows that with the inclusion of the CSI feature, about 23% relative reduction has been achieved in the perplexity (PPL) score in comparison to the default RNN-LM. This improvement is attributed to (i) the binary categorization of code-switching, and (ii) the tagging of code-switching (English) words in the training data with their corresponding native (Hindi) words. For the code-switched words having little or no evidence in the training data, the FLM falls back to their equivalent Hindi words, if those exist in the vocabulary. For better contrast, we have also experimented with the POS textual features extracted by following a procedure similar to that used in [35]. The PPL scores for standalone inclusion of the POS features and their combination with the CSI feature are given in Table 11. Also, these textual features are evaluated for the proposed T2W transduction scheme, and the performances in terms of WER are reported in Table 12. From that table, it can be noted that the WER scores follow a trend similar to that of the PPL scores in Table 11. Note that, unlike the CSI feature, the grouping induced by the POS features is targeted towards sentence structure rather than code-switching. This could be the reason why the CSI feature not only outperformed the POS features but also provided an additive improvement when combined with them.
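The construction of the CSI feature described above can be sketched as follows. The tiny lexicon stands in for the machine translator the feature relies on, and every entry and name is illustrative: each token receives a binary code-switch flag and, for embedded English words, its native Hindi equivalent.

```python
# Hedged sketch of CSI-style feature extraction for a code-mixed
# sentence: each word gets (i) a binary code-switch flag and (ii) its
# native (Hindi) equivalent when it is an embedded English word. The
# lexicon below stands in for the machine translator; all entries are
# illustrative examples, not data from the paper.

eng_to_hindi = {"school": "vidyalaya", "book": "kitaab"}

def csi_features(words):
    """Return (word, is_code_switched, native_equivalent) per token."""
    feats = []
    for w in words:
        if w in eng_to_hindi:
            feats.append((w, 1, eng_to_hindi[w]))   # embedded English word
        else:
            feats.append((w, 0, w))                 # native Hindi word
    return feats

sent = ["main", "school", "jaana", "chahta", "hoon"]
for tok in csi_features(sent):
    print(tok)
```

Tying each English word to a Hindi equivalent is what allows the FLM to back off to the native word when the code-switched form is unseen in training.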

C. REVALIDATION IN ALTERNATE E2E FRAMEWORK
For a thorough evaluation, the proposed approaches have also been evaluated in the CTC-based E2E ASR framework. In contrast to the attention-based system, the CTC-based E2E system consists of two modules: a deep BLSTM encoder and a CTC decoder. The deep BLSTM network encodes the input feature vector x into a higher-level representation. The decoding is performed using the CTC, a loss function that assumes the outputs generated at different time steps to be conditionally independent. The CTC allows training of the network without requiring a prior alignment between input and output sequences. In the CTC decoder network, the output softmax layer has one unit for each of the targets in addition to a blank symbol denoting the null emission. For a given training utterance, there are many possible alignments. At every time step, the network decides whether to emit a symbol or not. As a result, the distribution over all possible alignments between the input and target sequences is obtained. To produce the probability of the output sequence given the input, a dynamic programming-based (forward-backward) algorithm is employed to obtain the sum over all possible alignments. Given a target transcription y and the input feature vector x, the network is trained to minimize the CTC loss function $\mathcal{L}_{\mathrm{CTC}} = -\log P(y|x)$, where $P(y|x) = \sum_{a \in \beta(y,x)} P(a|x)$, $a$ is an alignment, and $\beta(y,x)$ is the set of all possible alignments between y and x. In our implementation of the CTC-based system, the DBLSTM encoder network consists of 4 layers with 256 units in each layer. The remaining network training parameters are kept the same as mentioned in Section VI-B1. The CTC-based system is trained and evaluated on identical data partitions, as already described in Section VI-A. The evaluations of the proposed context-dependent T2W transduction scheme and the CSI textual feature for the CTC-based Hindi-English code-switching E2E ASR have been done, and the results are reported in Tables 13 and 14, respectively.
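The sum over alignments above is computed efficiently by the CTC forward (alpha) recursion. A minimal sketch follows, using hand-made per-frame posteriors rather than a trained network's outputs; symbol 0 serves as the blank.

```python
# Sketch of the CTC forward (alpha) recursion that computes
# P(y|x) = sum over all alignments collapsing to y. The per-frame
# posteriors below are hand-made toy values, not from a trained model;
# symbol 0 is the blank.

def ctc_forward(probs, labels, blank=0):
    """probs[t][k]: P(symbol k at frame t); labels: target, no blanks."""
    ext = [blank]                      # interleave blanks: b l1 b l2 b ...
    for l in labels:
        ext += [l, blank]
    T, S = len(probs), len(ext)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][ext[0]]
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]
            if s > 0:
                a += alpha[t - 1][s - 1]
            # Skip transition allowed between distinct non-blank labels.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]
            alpha[t][s] = a * probs[t][ext[s]]
    # Valid endings: last label, or trailing blank.
    return alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)

# 3 frames, alphabet {blank, 1, 2}; target sequence [1, 2].
probs = [[0.1, 0.8, 0.1],
         [0.2, 0.3, 0.5],
         [0.1, 0.1, 0.8]]
print(round(ctc_forward(probs, [1, 2]), 4))  # -> 0.704
```

The value 0.704 equals the brute-force sum over the five length-3 alignments (1 1 2, 1 2 2, 1 2 blank, 1 blank 2, blank 1 2) that collapse to the target, confirming the recursion; the CTC loss is then the negative log of this quantity.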
On comparing with the corresponding performances of the attention-based system, it can be noted that the CTC-based E2E framework exhibits similar performance trends. The proposed transduction scheme yields a WER of 35.07%, which corresponds to a 16.91% relative improvement over the naive transduction scheme. Also, when the LM is trained with the combined textual features, an improved WER of 33.17% is achieved.
To ease the assessment of the relative impact of the proposed context-dependent T2W transduction with and without textual features, the performances of the attention- and CTC-based E2E ASR systems are summarized in Figure 8. A similar trend in the performances is observed for both E2E frameworks. At the same time, we wish to point out that the developed CTC-based E2E system does not incorporate any character-level LM while decoding the combined/reduced targets. Therefore, the performances of the attention- and CTC-based E2E ASR systems cannot be directly compared.

D. COMPUTATIONAL COMPLEXITY
All systems are developed on an HP Z440 workstation. The memory requirement and the computational complexity for different systems, along with the key specifications of the said workstation, are given in Table 15. From that table, it can be noted that training the reduced target set based E2E ASR system takes much less memory and computational time than the combined target set case.

VIII. CONCLUSION
This paper explores the development of a code-switching E2E ASR system on limited resources. For efficient modeling of the code-switching E2E ASR system, an acoustic similarity-based target reduction scheme has been proposed. Towards converting the hypothesized target sequence to the desired word sequence, a context-dependent transduction scheme has been developed. Further, a novel textual feature has also been proposed, which enables more effective context modeling of code-switching data. The proposed approaches are noted to consistently outperform the combined target set based E2E ASR modeling in terms of target/word error rate. The work also presents a detailed description of the Hindi-English code-switching E2E ASR system. To the best of the authors' knowledge, for the Hindi-English code-switching task, such a system is yet to be reported.
Despite the evaluations being performed in the context of Hindi-English code-switching ASR, the proposed techniques are generic enough to be applied to any other code-switching context. Also, training the reduced target set based E2E ASR systems takes much less memory and computational time than the combined target set case. A few shortcomings of the proposed methods include (i) the requirement of a pronunciation model in the context-dependent T2W transduction scheme for the reduced target set case, and (ii) the dependency of the CSI and POS features on the quality of the machine translator and the POS tagger employed, respectively.
In the future, we would be interested in incorporating the T2W transduction into the E2E system so as to optimize the developed system directly at the word level instead of the character level.