A Systematic Review on Language Identification of Code-Mixed Text: Techniques, Data Availability, Challenges, and Framework Development

The mixing of a native language with other languages (code-mixing) on social media poses a severe challenge for language identification (LID) systems and has encouraged research on code-mixed LID solutions. This study identifies the techniques, challenges, and available datasets with corresponding quality criteria, and develops a comprehensive framework for code-mixed LID. We also identified gaps and future work opportunities in tackling code-mixed LID challenges. Based on our analysis of the reviewed studies, we outlined key points for future research in code-mixed LID. We presented a taxonomy of applied techniques for code-mixed LID and highlighted the different technique variants. In code-mixed LID tasks, we found four significant challenges: ambiguity, lexical borrowing, non-standard words, and intra-word code-mixing. This systematic literature review identified 32 code-mixed datasets available for LID. We proposed five features to describe dataset quality: the number of instances or sentences, the percentage of code-mixed types in the data, the number of tokens, the number of unique tokens, and the average sentence length. Finally, we synthesised the methodologies and proposed a conceptual framework for subsequent studies through our literature analysis.


I. INTRODUCTION
With the advent of social media, human interaction has become limitless. Social media platforms have become an integral and inseparable part of human life. We can connect with people from all over the world through social media to exchange and spread information. For instance, we can leverage social media to increase customer engagement and thus generate brand exposure, leads, sales, and revenue in the business domain.
The associate editor coordinating the review of this manuscript and approving it for publication was Kathiravan Srinivasan.
In social media, individuals write posts without adhering to the standard language of communication [1]. For example, multiple languages may appear in a single sentence or utterance within social media texts. It is common for people who live in a multilingual culture and know many languages to switch from one language to another [2]. People often express their thoughts on social media in mixed languages using their native language and English [3]. In linguistics, this is known as code-mixing, which refers to the embedding of linguistic units from one language into the usage of another language by using phrases, words, and morphemes [4].
Code-mixing is commonly encountered during spoken and written communication in multilingual communities [5], [6], for example, Indonesian-English [7], [8], Malay-English [9], Persian-English [10], Hindi-English [11], and English-Bengali [12]. Code-mixing can be divided into intra-sentential, intra-word, and inter-sentential types. Intra-sentential code-mixing refers to the mixing of languages within a sentence. Intra-word code-mixing refers to the mixing of languages within a word. Inter-sentential code-mixing happens when languages are mixed across sentences. Because of variations in spelling and grammar, processing code-mixed social media text is a daunting task in natural language processing [7]. Consequently, code-mixed text requires more pre-processing than monolingual text data [13].
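The three categories can be made concrete with a small sketch. It assumes, purely for illustration, that every token is already annotated with the set of languages found inside it; the function name `mixing_types` and the tags `id` and `en` are hypothetical and do not come from the reviewed studies.

```python
from typing import List, Set

# A sentence is a list of tokens; each token is represented only by the set
# of languages detected inside it (assumed to be supplied by an annotator).
Sentence = List[Set[str]]


def mixing_types(sentences: List[Sentence]) -> Set[str]:
    """Return the code-mixing types present in a sequence of tagged sentences."""
    found: Set[str] = set()
    monolingual_langs: Set[str] = set()
    for sent in sentences:
        # Intra-word: one token contains material from more than one language.
        if any(len(tok) > 1 for tok in sent):
            found.add("intra-word")
        # Languages of the tokens that belong to exactly one language.
        word_langs = {lang for tok in sent if len(tok) == 1 for lang in tok}
        if len(word_langs) > 1:
            # Intra-sentential: whole words of different languages share a sentence.
            found.add("intra-sentential")
        elif len(word_langs) == 1 and all(len(tok) == 1 for tok in sent):
            monolingual_langs |= word_langs
    # Inter-sentential: the text contains monolingual sentences in different languages.
    if len(monolingual_langs) > 1:
        found.add("inter-sentential")
    return found
```

For instance, `mixing_types([[{"id"}, {"en"}]])` reports intra-sentential mixing, whereas two monolingual sentences in different languages are reported as inter-sentential.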
One of the pre-processing tasks that is frequently applied in analysing code-mixed text is language identification (LID). LID refers to the automatic identification of languages used in a document [14], [15]. LID is crucial for downstream natural language processing (NLP) applications, such as sentiment analysis and machine translation [15], [16], [17], [18]. Applying LID for such NLP applications may significantly impact the system's performance [11].
Most LID studies, however, focus on identifying a single language at the document or sentence level. Determining the languages in a code-mixed text, therefore, remains an unresolved problem. Performing LID at the document or sentence level is frequently inadequate for extracting critical information from the text [18]. Also, relying on language tags at the document or sentence level causes language detection systems to misidentify languages because the sentences themselves are mixed [19], [20]. Thus, researchers were motivated to shift their focus from document- or sentence-level to token-level language identification.
One notable gap in current research is the need for code-mixed datasets for low-resource languages. Low-resource languages are less studied because resources for them are scarce [21]. Regarding code-mixed data, language pairs involving languages from South Asia (Hindi and Bengali) and English are prevalent [2]. Exploring additional language pairs for low-resource languages is therefore highly encouraged. New language-pair datasets are necessary to help solve code-mixed LID problems in languages that are commonly used but lack resources. Apart from that, the lexical look-up or dictionary-based approach cannot cope with the presence of borrowed words or code-mixing [22]. Another problem is the failure to capture context information due to ambiguity and irregular phonetic typing in code-mixed text [16], [23], [24].
We found a few literature reviews related to code-switching and code-mixing text with different research focuses, such as the application of code-mixing [25], a survey of code-switched datasets [2], and sentiment analysis of code-mixed text [24], [26], [27]. To the best of our knowledge, there has not been a comprehensive literature review that explicitly highlights the latest LID techniques and reviews their challenges for code-mixed texts. Such a survey would benefit relevant researchers in NLP and text processing.
This systematic review aims to examine the current state of research in the LID field for code-mixed texts. The objectives of this study are: (1) to investigate the most recent techniques developed for solving LID tasks for code-mixed content; (2) to explore the resolved and unresolved challenges associated with LID tasks for code-mixed text; (3) to investigate the availability and quality of code-mixed datasets for LID; and (4) to develop a general framework for code-mixed LID.
The remainder of this paper is organised as follows. Section II describes the research methodology. The results and discussion of this study are presented in Section III. Section IV presents the implications of this literature study. Finally, the conclusion is given in Section V. A list of abbreviations used in this literature review is presented in Table 1.

II. RESEARCH METHODOLOGY
This section sets out the guidelines for designing and analysing the studies of language identification in code-mixed data. This literature study follows the systematic literature review methodology of [28] and [29]. The review's strategy includes establishing the study population (studies of language identification in code-mixed data), identifying the resources from which the population is sourced, listing search string keywords, and determining the inclusion and exclusion criteria to generate the population relevant to this study. The research methodology applies the following review strategies: (1) designing research questions; (2) searching related studies from databases using defined search strings; (3) applying predetermined inclusion and exclusion criteria; and (4) applying quality assessment criteria.
VOLUME 10, 2022

A. RESEARCH QUESTION
This study aims to answer the following research questions to highlight critical practical aspects of language identification of code-mixed text. The four research questions addressed in this literature review are as follows:
RQ1: Which techniques and features have been used for LID of bilingual and multilingual code-mixed text?
The response to RQ1 allows us to learn about the techniques used to solve code-mixed language identification tasks. Examining previously used methods will provide insight into the state of the art, advantages, and limitations of LID in code-mixed data. The findings will show the most recommended techniques and features for dealing with code-mixed text LID. We also expect to identify additional features applied in LID for code-mixed text.
RQ2: What are the challenges in LID of code-mixed text?
RQ2 aims to identify the open challenges of LID for code-mixed text. Understanding the challenges and the current state of the art is necessary for determining the research gaps that previous studies have not addressed or answered adequately. The findings would also assist in directing future work, considering resolved and unresolved issues in code-mixed LID.
RQ3: What datasets are available for LID of code-mixed text? What are the quality criteria for the dataset?
RQ3 aims to determine the availability of code-mixed datasets and the quality criteria for language identification using code-mixed text from various languages. Investigating the availability of code-mixed datasets allows us to determine how many datasets are bilingual and how many are multilingual. We will also learn which mixed languages are popularly studied and which are less studied. Answering RQ3 allows us to identify the source benchmark datasets and prepare the scope of our experiments for evaluating our LID methodology on code-mixed text. The dataset quality criteria provide a set of properties and policies to determine the dataset's quality and completeness. We can evaluate a dataset's quality by identifying the relevant properties to measure as our proposed quality criteria.
RQ4: What is the standard workflow for language identification of code-mixed text for future research?
RQ4 allows us to identify future directions of code-mixed LID research. To answer RQ4, we propose a framework developed for the code-mixed LID task. The framework can be used as a standard guideline by those researching code-mixed LID.
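The five descriptive features that this review proposes as dataset quality criteria (RQ3) can be computed directly from a token-tagged corpus. The following is a minimal sketch under stated assumptions: the corpus is a list of sentences of `(token, language tag)` pairs, and the "percentage of code-mixed types" is interpreted here as the share of sentences containing more than one language tag. The function name `dataset_profile` and this interpretation are illustrative assumptions, not definitions taken from the reviewed studies.

```python
from typing import Dict, List, Tuple

# A sentence is a list of (token, language tag) pairs; the tags are assumed
# to come from an existing annotation.
TaggedSentence = List[Tuple[str, str]]


def dataset_profile(sentences: List[TaggedSentence]) -> Dict[str, float]:
    """Compute five descriptive statistics for a token-tagged corpus."""
    tokens = [tok.lower() for sent in sentences for tok, _ in sent]
    # Assumption: a sentence counts as code-mixed if it carries more than one tag.
    mixed = sum(1 for sent in sentences if len({tag for _, tag in sent}) > 1)
    n = len(sentences)
    return {
        "instances": n,
        "code_mixed_pct": 100.0 * mixed / n if n else 0.0,
        "tokens": len(tokens),
        "unique_tokens": len(set(tokens)),
        "avg_sentence_length": len(tokens) / n if n else 0.0,
    }
```

Reporting these figures alongside a dataset makes it easier to compare corpora of different sizes and mixing densities.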

B. DATA SOURCES AND SEARCH STRATEGY
In this work, we referred to [28] and [30] to find related articles through electronic databases. We selected five electronic sources to gather our references. Through the electronic sources, we investigated all available materials pertaining to the objectives of this systematic literature review [31]. Search strings (keywords) were developed to collect related research papers responding to the research questions. The search strings were developed using critical terms within the topic field and the purpose of the review [32]. The selected electronic sources and search strings for this literature study are provided in Table 2.

C. INCLUSION AND EXCLUSION CRITERIA
This section discusses the inclusion and exclusion criteria applied in our literature study. Metadata and abstracts of papers were reviewed to determine which studies should be included in the review and to remove irrelevant articles [33]. The following criteria were applied for inclusion: (I1) studies published between 2016 and 2021; (I2) full-text papers; (I3) papers written in English; (I4) papers related to language identification for code-mixing or code-switching text. We excluded articles that did not satisfy the inclusion criteria. Also, any publications that matched any of the exclusion criteria were excluded.
The inclusion and exclusion criteria for this study are presented in Table 3. The following exclusion criteria were applied to eliminate irrelevant papers: (E1) papers not written in the English language; (E2) papers that do not focus on natural language processing fields; (E3) papers that do not discuss language identification for code-mixing or code-switching text; (E4) grey literature, such as working papers, dissertations/theses, and research reports.

D. QUALITY ASSESSMENT
Quality assessment (QA) [28] is used in this systematic literature review to determine the strength of the selected studies [34]. The QA was developed using tools such as a checklist of all the aspects or queries to be applied to each study. The following questions were developed as the QA criteria for each study:

E. STUDY SELECTION PROCESS
This section explains the selection process used to determine relevant studies that address all the research questions. The study selection is done in four phases: identification, screening, eligibility, and included studies. In this study selection stage, we utilised the PRISMA flow diagram as a reporting guideline [35]. Figure 1 illustrates the PRISMA flow diagram of the systematic review protocol in this work. In the identification stage, we searched the literature from five electronic databases using predefined keywords and obtained 233 research papers. First, we screened the retrieved papers by removing duplicates. A total of 70 papers were removed in the screening stage. We applied the inclusion and exclusion criteria to the 163 remaining articles, and a total of 86 studies were eliminated. After that, the remaining 77 articles were assessed using the five quality criteria, and we excluded 37 papers at this stage. Finally, a list of 40 research papers (referred to as selected studies) was included in this literature review: 8 papers (20%) from journals and 32 papers (80%) from conferences. Table 4 shows the number of selected papers in each stage.
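The counts reported for each selection phase can be cross-checked with simple arithmetic. The short sketch below uses only the numbers stated in the paragraph above; the variable names are illustrative.

```python
# Counts reported in the study selection process.
identified = 233          # papers retrieved from the five databases
duplicates_removed = 70   # removed during screening
failed_criteria = 86      # eliminated by inclusion/exclusion criteria
failed_quality = 37       # excluded by the quality assessment

screened = identified - duplicates_removed   # papers checked against the criteria
eligible = screened - failed_criteria        # papers sent to quality assessment
included = eligible - failed_quality         # final selected studies

journal_papers, conference_papers = 8, 32
journal_share = 100 * journal_papers / included
conference_share = 100 * conference_papers / included
```

The intermediate values reproduce the 163 and 77 papers mentioned in the text, and the final count matches the 40 selected studies with a 20% and 80% split.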

III. RESULT AND DISCUSSION
This section presents the findings of the primary studies on LID for code-mixed text. We divided our discussion into four subsections in response to the respective research questions explained earlier. The first subsection addresses RQ1 regarding existing LID techniques in bilingual and multilingual code-mixed text. In the second subsection, we present our findings concerning RQ2, which investigates the challenges of code-mixed LID. Subsequently, in response to RQ3, we provide our findings regarding dataset availability and quality criteria for evaluating code-mixed LID tasks. Finally, we provide a framework in response to RQ4, which is explained in the last subsection.
In the following part, we examine the characteristics of existing techniques used in code-mixed LID from the 40 selected studies. We intend to highlight and discuss the existing techniques and their properties to inform future work.

1) APPROACHES AND APPLIED TECHNIQUES
Figure 2 depicts the taxonomy of approaches and applied techniques implemented for code-mixed LID in the selected studies. Based on our investigation, two primary approaches were identified: machine learning and non-machine learning. Machine learning can be divided into two main categories, supervised and unsupervised. We identified three groups for the supervised category: non-neural network-based (12 unique techniques), neural network-based (9 unique techniques), and hybrid (2 unique techniques). Support Vector Machine (SVM) and Conditional Random Fields (CRF) were the most utilised supervised techniques. SVM and CRF were each implemented in 14 studies, followed by Naïve Bayes in 12 studies. Logistic Regression and Random Forest were each used in 8 studies. Decision Tree and K-Nearest Neighbour (KNN) were applied in 6 and 3 studies, respectively. We found 2 studies that applied AdaBoost and HMM. The remaining methods (XGBoost, Linear Discriminant Analysis, and Quadratic Discriminant Analysis) were each utilised in one study.
We found several neural network-based techniques, such as Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and RNN variants, such as Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Bidirectional LSTM (BLSTM), and Segmental Recurrent Neural Network (SegRNN), as well as Transformer-based models. There are five Transformer-based methods, namely XLM-RoBERTa, ELECTRA, BERT, DistilBERT, and CamemBERT. We encountered three unique techniques for the unsupervised approach: Hidden Markov Model (HMM), Morfessor, and the unsupervised dictionary-based approach. As for non-machine learning, two techniques were recognised, rule-based and lexicon-based. We came across three previous studies that implemented rule-based techniques and two studies that used lexicon-based techniques.
Table 5 summarises the relevant literature for code-mixed language identification. We summarised the following information: year of publication, languages identified, applied techniques, language identification level, and the best reported performance. The technique with the best performance in the applied techniques column is highlighted in bold and italics. The best performance is presented to demonstrate the effectiveness of the technique used for that specific dataset or language identification problem. We identified the best performance based on the best result achieved by the applied techniques in the reported literature. Other methods were also compared but are not presented in the table. The best performance from each study is given based on the highest accuracy, F1 score, precision, or recall reported in the investigated papers. Each approach is discussed in detail in the following subsections.

a: MACHINE LEARNING APPROACH
This section describes the use of both supervised and unsupervised approaches. Among the 40 papers, supervised approaches were used more often than unsupervised approaches. The supervised learning approach requires annotated training data to build a model that predicts the output for new data [67], [68].
SVM was the most widely used technique by researchers in language identification tasks. SVM is often implemented due to its capability to build an efficient classifier model and produce good performance [46]. From the selected studies, SVM has shown impressive performance. Veena et al. [46] utilised a linear kernel SVM classifier and could achieve an accuracy of 93% for word-level Malayalam-English and 95% for Tamil-English code-mixed LID. Chaitanya et al. [47] incorporated several machine learning methods with Word2Vec embedding for Hindi-English. Based on their experiments, the SVM using Skip-gram reached the highest accuracy of 67.34%. SVM and word embedding were also implemented by Sarma et al. [18] in their study. Their work demonstrated that the SVM using word embedding obtained better results than Naïve Bayes and Convolutional Neural Network (CNN) with an F1 score of 90.61%.
Kalita and Saharia [20] applied a linear kernel SVM with N-gram and dictionary features to identify Assamese-English code-mixed language. They obtained 89.51% accuracy in word-level identification. Shanmugalingam et al. [50] showed that SVM with a linear kernel performed the best, with an accuracy of 89.46% for Tamil-English code-mixed LID. Kazi et al. [58] implemented the Support Vector Classifier (SVC), one of the SVM variants. The result showed that the SVM with an RBF kernel and N-gram features obtained the best accuracy of 92%.
The code-mixed LID task can be framed as a sequence tagging problem, and Conditional Random Fields (CRF) can be adopted to solve it. CRF is a statistical modelling method commonly utilised in sequence tagging problems, such as named entity tagging, POS tagging, and language identification. While an ordinary classifier may predict a label for a single sample without considering its neighbours, CRF can take context into account to make more accurate predictions [38]. In this literature study, we found that 8 of the 14 papers that utilised CRF reported satisfying results.
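The difference between labelling each token in isolation and labelling the sequence jointly can be illustrated with Viterbi decoding, the inference step used by linear-chain CRFs. The sketch below is not a full CRF: the emission and transition scores are hand-set, hypothetical values, whereas a trained CRF would learn them from annotated data. The names `viterbi`, `greedy`, `EMISSIONS`, and `TRANSITIONS` are illustrative.

```python
from typing import Dict, List, Tuple


def greedy(emissions: List[Dict[str, float]], labels: List[str]) -> List[str]:
    """Label every token independently, ignoring its neighbours."""
    return [max(labels, key=lambda y: emit[y]) for emit in emissions]


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float],
            labels: List[str]) -> List[str]:
    """Return the label sequence with the highest total emission + transition score."""
    best = [{y: emissions[0][y] for y in labels}]   # best path score ending in y
    back: List[Dict[str, str]] = []                 # back-pointers per position
    for emit in emissions[1:]:
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[(p, y)])
            scores[y] = best[-1][prev] + transitions[(prev, y)] + emit[y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


LABELS = ["en", "hi"]
# Hypothetical transition scores: staying in the same language is rewarded.
TRANSITIONS = {(a, b): (1.0 if a == b else -1.0) for a in LABELS for b in LABELS}
# Hypothetical emission scores: the middle token is ambiguous on its own.
EMISSIONS = [{"en": 0.0, "hi": 2.0}, {"en": 0.6, "hi": 0.5}, {"en": 0.0, "hi": 2.0}]
```

With these scores, `greedy` tags the ambiguous middle token as `en`, while `viterbi` tags all three tokens as `hi` because the neighbouring context outweighs the small emission difference.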
Lamabam and Chakma [38] developed a code-mixed LID system for Manipuri-English using CRF with characters as features and achieved an F1 score of 90%. Xia [41] implemented CRF to build tweet-level and token-level LID of code-mixed text for Spanish-English and Arabic-Modern Standard Arabic (MSA). CRF gave good results, with an F1 score of 83% at the tweet level and 94.9% overall token-level accuracy.
In Phadte and Wagh [44], CRF outperformed SVM and Random Forest with an accuracy of 94%. Gundapu and Mamidi [49] experimented with four machine learning approaches, namely Naïve Bayes, Random Forest, Hidden Markov Model (HMM), and CRF. Among these classifiers, CRF achieved the best accuracy of 91.28%. Yirmibeşoğlu and Eryiğit [52] obtained 95.6% micro-F1 using CRF with character-level N-grams. Mave et al. [15] found that the CRF model performed better than the deep learning models (LSTM and BLSTM). Their results showed F1 scores of 98% and 96% for the Hindi-English language pair. Barik et al. [7] demonstrated code-mixed LID for Indonesian-English using a small dataset from Twitter. In their work, CRF obtained an 89.58% F1 score and an accuracy of 90.11%. Finally, in Mishra and Sharma [55], CRF accurately identified multilingual code-mixed text with 97.77% accuracy and a 95% F1 score.
Naïve Bayes was the third most common technique for code-mixed LID, with 12 studies. It is often used as a baseline model due to its simplicity. It uses Bayes' theorem to predict the class of unseen data, and the model assumes that the input features are independent of one another [65]. The Naïve Bayes algorithm has proven to perform well in several studies. Gupta et al. [36] utilised supervised learning together with the edit distance method. The result showed that combining edit distance and Naïve Bayes on the N-gram Markov model could perform well, particularly when detecting the language of misspelt words. Lakshmi and Shambhavi [43] employed two different Naïve Bayes algorithms, Multinomial and Bernoulli Naïve Bayes. The Bernoulli Naïve Bayes combined with TF-IDF and a dictionary module outperformed the other methods (SVM, Random Forest, and Logistic Regression) with accuracy, precision, and recall of 94.8%, 96.3%, and 95.2%, respectively. A study by Kalita et al. [62] showed that Naïve Bayes outperformed Decision Tree and Multilayer Perceptron, achieving an F1 score of 65.9%, precision of 76.2%, and recall of 69.3%.
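To illustrate how a Naïve Bayes baseline works for word-level LID, the following is a minimal multinomial sketch over character bigrams with add-one smoothing. It is not the setup of any reviewed study: the class name `NgramNaiveBayes`, the boundary markers `^` and `$`, and the toy English and romanised Hindi training words are assumptions chosen for illustration only.

```python
import math
from collections import Counter, defaultdict
from typing import List


def char_ngrams(word: str, n: int = 2) -> List[str]:
    """Character n-grams of a lower-cased word with boundary markers."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams (add-one smoothing)."""

    def __init__(self, n: int = 2) -> None:
        self.n = n
        self.gram_counts = defaultdict(Counter)   # label -> n-gram counts
        self.word_counts = Counter()              # label -> number of training words
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, words: List[str], labels: List[str]) -> "NgramNaiveBayes":
        for word, label in zip(words, labels):
            grams = char_ngrams(word, self.n)
            self.gram_counts[label].update(grams)
            self.word_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, word: str) -> str:
        total = sum(self.word_counts.values())

        def log_posterior(label: str) -> float:
            denom = sum(self.gram_counts[label].values()) + len(self.vocab)
            score = math.log(self.word_counts[label] / total)   # class prior
            for gram in char_ngrams(word, self.n):
                # Features are treated as independent given the class.
                score += math.log((self.gram_counts[label][gram] + 1) / denom)
            return score

        return max(self.word_counts, key=log_posterior)


# Toy training data (illustrative only).
model = NgramNaiveBayes().fit(
    ["the", "this", "that", "with", "have", "kya", "nahi", "mera", "kaise", "hai"],
    ["en"] * 5 + ["hi"] * 5,
)
```

On this toy data, the unseen word `then` is labelled `en` and `kahan` is labelled `hi`, because their bigrams overlap mostly with the respective training words.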
Logistic Regression was applied in eight of the selected studies. Bansal et al. [16] used Logistic Regression for English-Punjabi code-mixed LID. Based on the experiments, Logistic Regression outperformed Decision Tree and Gaussian Naïve Bayes in word-level code-mixed LID with an accuracy of 86.63% and an F1 score of 88%.
Random Forest was also one of the popular machine learning techniques, implemented in eight studies. Among these studies, Shanmugalingam and Sumathipala [3] revealed that Random Forest performed the best among the machine learning techniques they compared, with an accuracy of 90.5% for word-level Sinhala-English code-mixed LID. Based on their experiments, the Random Forest model could identify Sinhala and English quite well, with an F-measure of 94.9% for Sinhala and 75.8% for English. However, the Random Forest model yielded unsatisfactory results, with an F-measure of 51.3%, for tokens other than Sinhala and English, such as named entities, acronyms, universal, mixed, and other language tags.
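The gap between a high overall accuracy and a low F-measure on a minority class, as in the result above, follows from how the metrics are defined. The sketch below computes token-level accuracy and per-class precision, recall, and F1 from gold and predicted tags; the function name `token_metrics` and the example tags are illustrative and are not taken from any reviewed study.

```python
from typing import Dict, List, Tuple


def token_metrics(gold: List[str], pred: List[str]) -> Tuple[float, Dict[str, Tuple[float, float, float]]]:
    """Return overall accuracy and per-class (precision, recall, F1)."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    per_class: Dict[str, Tuple[float, float, float]] = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_class[label] = (precision, recall, f1)
    return accuracy, per_class
```

Because accuracy is dominated by the frequent classes, a rare tag such as a named entity can be missed almost entirely while the overall accuracy stays high, which is why per-class scores are worth reporting.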
For the neural network-based group, we found the following techniques: Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and some variants of RNN, namely Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Bidirectional LSTM (BLSTM), and Segmental Recurrent Neural Network (SegRNN).
Mager et al. [54] and Sabty et al. [65] proposed SegRNN in their studies to address sub-word LID for intra-word code-switching (CS). The SegRNN models appear to perform better on language pairs with more intra-word CS, whereas pipeline approaches may perform equally well on language pairs with fewer mixed words [54]. In [65], SegRNN outperformed Naïve Bayes and BLSTM. The SegRNN obtained a 94.84% F1 score for intra-word labelling and 99.17% for segmentation.
Another RNN variant, LSTM, has shown satisfactory performance in identifying Hindi-English and Bengali-English code-mixed text [23], [51], [53]. In [51], the LSTM architecture gave a high average F1 score of 93.4% and an average accuracy of 96.1% across the three classes. LSTM with pre-trained word embeddings outperformed CRF and BLSTM in Bengali-English code-mixed LID [53]. Samih et al. [40] experimented with LSTM and CRF to improve LID performance. The results showed that integrating character and word representations with a char-word LSTM and adding CRF produced the highest overall accuracy of 96.3%.
In text processing problems, CNN is often implemented to extract text features before applying a machine learning algorithm. The convolution layer in CNN building blocks extracts text features by applying a convolutional filter or kernel to each window in the sequence of text. Sarma et al. [63] experimented with CNN and BLSTM, and CNN showed the best performance among the other techniques with an F1 score of 91.03%. Some studies combined two ANN modules, CNN and LSTM or CNN and BLSTM, in their neural network architecture.
Jaech et al. [37] incorporated the CNN and BLSTM for word-level LID in Spanish-English code-mixed text. In their architecture, the convolutional layer provides word vectors from the input characters by transforming them into vectors. Next, the BLSTM maps the word vector sequence to a language tag. BLSTM was selected due to its capability to capture long sequence dependencies. In BLSTM, the context of the observed sequence will be considered during the word-level identification process. The result achieved an F1 score of 95.1% for English and 94.1% for Spanish.
For the hybrid technique, we encountered three studies, [11], [60], and [61], that combined non-neural network and neural network techniques. They proposed similar architectures consisting of two modules: a multichannel neural network (MNN) and BLSTM-CRF. The MNN comprises three one-dimensional convolution layers and one LSTM layer. The one-dimensional convolution layers were used to capture the N-gram representation of the input text. Additionally, the BLSTM-CRF module aims to capture the context of the input text.
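The idea that one-dimensional convolutions of different widths capture N-gram patterns can be sketched without any deep learning library. In the toy example below, characters are one-hot encoded and each kernel is hand-built to respond to one character sequence; in a real multichannel network the kernels are learned and operate on dense embeddings. All names (`one_hot`, `conv1d_max`, `multichannel_features`) and the example patterns are assumptions for illustration.

```python
import string
from typing import List

ALPHABET = string.ascii_lowercase


def one_hot(text: str) -> List[List[float]]:
    """Encode each character as a one-hot vector over a-z (others become all zeros)."""
    return [[1.0 if ch == a else 0.0 for a in ALPHABET] for ch in text.lower()]


def conv1d_max(sequence: List[List[float]], kernel: List[List[float]]) -> float:
    """Slide the kernel over the sequence and return the max response (max pooling)."""
    width = len(kernel)
    responses = [
        sum(w * x
            for k_row, x_row in zip(kernel, sequence[start:start + width])
            for w, x in zip(k_row, x_row))
        for start in range(len(sequence) - width + 1)
    ]
    # Words shorter than the kernel produce no window at all.
    return max(responses, default=0.0)


def multichannel_features(word: str, patterns: List[str]) -> List[float]:
    """One channel per pattern; kernel widths follow the pattern lengths (e.g. 2, 3, 4)."""
    encoded = one_hot(word)
    return [conv1d_max(encoded, one_hot(p)) for p in patterns]
```

For example, `multichannel_features("meeting", ["ee", "ing", "ting"])` yields the maximum possible response in every channel, whereas a word containing none of these sequences yields zeros, so the pooled outputs act as N-gram detectors.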
Mandal and Singh [11] experimented on two code-mixed datasets (Bengali-English and Hindi-English). They implemented kernels of size 2, 3, and 4 in their multichannel architecture to capture the N-gram representations. Their study revealed that the combination of multichannel and BLSTM-CRF achieved an accuracy of 93.28% and 93.32% for Bengali-English and Hindi-English, respectively. Shekhar et al. [60] identified Hindi-English code-mixed text from social media platforms and obtained a best F1 score of 93.97%. Gupta et al. [61] implemented 3-gram word embedding in their MNN-BLSTM-CRF architecture and achieved their best result with an F1 score of 93.97%.
Since the introduction of the Transformer architecture by Vaswani et al. [69], Transformers have quickly become a popular and reliable technique for natural language processing, outperforming earlier neural networks such as CNN and RNN [70]. Recent work by Thara and Poornachandran [64] applied Transformer-based models derived from Bidirectional Encoder Representations from Transformers (BERT). The authors built a word-level code-mixed LID system for Malayalam-English. They experimented with five model architectures: BERT, DistilBERT, ELECTRA, XLM-RoBERTa, and CamemBERT. Overall, ELECTRA performed the best, with an F1 score of 99.33% and an accuracy of 99.41%.
Moreover, three out of 40 selected studies were found to be using the unsupervised approach. Nguyen and Cornips [42] carried out a study on Dutch-Limburgish using an unsupervised morphological segmentation approach. They utilised Morfessor tools to analyse text by slicing the words into smaller units.
Phadte and Wagh [44] built a word-level LID system for Konkani-English text by applying an unsupervised approach with dictionaries. They compared the unsupervised dictionary-based technique with supervised LID techniques, such as CRF, SVM, and Random Forest. In their experiments, the supervised approach performed better than the unsupervised dictionary-based technique. Rijhwani et al. [45] presented an unsupervised word-level code-mixed LID system for seven different language pairs without manual data annotation for the training process. The automatic annotation process was carried out using the Hidden Markov Model (HMM) and dictionary as baselines. In their study, the HMM was implemented in an unsupervised manner.

b: NON-MACHINE LEARNING
Kent and Claeser [22] incorporated a dictionary-based approach in a rule-based code-mixed LID system. They mentioned that there would be many word misclassifications if the system relied on a basic dictionary without adding rules. Kasmuri and Basiron [57] proposed a rule-based approach, and their research focused on distinguishing between code-switching and monolingual sentences in an English-Malay dataset. In their work, the rule-based approach was used with five dictionaries as look-up tools to identify the language. The rule-based solution employed the ratio of word presence in a phrase, with a 90% threshold for monolingual communication and various code-switching ratios. The study showed that the rule-based technique performed well, with more than 87% accuracy. Nguyen et al. [66] employed a rule-based approach for Hindi-English code-switched LID. Their study utilised a word list for each language to help identify the language of each token.
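A ratio-based rule of the kind described above can be sketched as follows. This is a simplified illustration, not the system of [57]: it uses two tiny hypothetical word lists instead of five dictionaries, and it implements only the 90% monolingual threshold, treating every other sentence with dictionary hits as code-switched. The function name `classify_sentence` and the lexicon contents are assumptions.

```python
from typing import Dict, List, Set


def classify_sentence(tokens: List[str],
                      lexicons: Dict[str, Set[str]],
                      monolingual_threshold: float = 0.9) -> str:
    """Label a sentence as one language, 'code-switched', or 'unknown'."""
    if not tokens:
        return "unknown"
    # Count how many tokens each language's word list recognises.
    hits = {lang: sum(tok.lower() in words for tok in tokens)
            for lang, words in lexicons.items()}
    best_lang = max(hits, key=hits.get)
    if hits[best_lang] == 0:
        return "unknown"            # no token was found in any word list
    if hits[best_lang] / len(tokens) >= monolingual_threshold:
        return best_lang            # at least 90% of tokens from one language
    return "code-switched"


# Hypothetical toy word lists for Malay ("ms") and English ("en").
LEXICONS = {
    "ms": {"saya", "suka", "makan", "nasi", "dan"},
    "en": {"i", "like", "to", "eat", "rice", "and"},
}
```

With these lists, `classify_sentence(["saya", "suka", "makan", "nasi"], LEXICONS)` returns `ms`, while a sentence that alternates between the two word lists is reported as code-switched.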
Although machine learning techniques are widely utilised, we found some studies that applied lexicon-based techniques using dictionaries to solve code-mixed LID. From the examined papers, we encountered one study that applied a dictionary-based technique. Claeser et al. [48] proposed a lexicon-based classification using Wikipedia as a dictionary to identify code-switching in Twitter data. Their dictionary-based technique correctly identified the language of word sequences and abbreviated words, such as 'jajaja' and 'omg'. However, the system could not determine the language of some irregular tokens.
In our investigation, we found an issue of ambiguity in the dictionary look-up technique. For instance, the language tag is sometimes incorrect when a particular word exists in more than one dictionary [36]. Some of the selected studies utilised dictionaries together with machine learning techniques. In the previous studies, the dictionaries were employed as features. The discussion regarding dictionaries as features will be presented in the features section.
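The ambiguity issue can be seen in a minimal look-up sketch. The word lists below are hypothetical; the token "air" is used because it is an English word and also the Malay word for water, so it appears in both lists. The function names `lookup_languages` and `tag_token` are illustrative.

```python
from typing import Dict, List, Set


def lookup_languages(token: str, lexicons: Dict[str, Set[str]]) -> List[str]:
    """Return every language whose word list contains the token."""
    return sorted(lang for lang, words in lexicons.items() if token.lower() in words)


def tag_token(token: str, lexicons: Dict[str, Set[str]]) -> str:
    """Tag a token by look-up alone, making the ambiguous case explicit."""
    langs = lookup_languages(token, lexicons)
    if not langs:
        return "unknown"
    if len(langs) > 1:
        return "ambiguous"   # a pure look-up cannot decide without context
    return langs[0]


# Hypothetical toy word lists; "air" belongs to both languages.
WORD_LISTS = {
    "ms": {"air", "makan", "saya"},
    "en": {"air", "eat", "i"},
}
```

A look-up that silently picks the first matching dictionary would mislabel such tokens, which is one reason dictionaries are often used as features for a context-aware model rather than as the sole decision rule.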

2) FEATURES
Selecting appropriate features is crucial for enhancing the performance of code-mixed LID systems [3]. Feature extraction enables us to generate a more accurate representation of the data, from which the model can produce good results. We listed some essential features used by researchers for code-mixed LID, such as N-gram, word embeddings, and dictionary features. Based on the reviewed literature, most studies implemented more than one type of feature to solve code-mixed LID tasks.

a: N-GRAM
N-gram was used as a feature in 16 out of 40 studies. We found two different N-gram techniques applied in the selected studies: word or token N-gram and character N-gram. Word or token N-gram has been used by [20], [45], and [62]. In the word-level code-mixed LID task, the character N-gram is more popular than the word N-gram, especially for identifying the language in code-mixed script [15], [16], [58]. Piergallini et al. [39] observed that using the Swahili regular expression together with the character N-gram was redundant. They suggested utilising N-gram features and capitalisation features to improve LID performance. The capitalisation feature identifies capitalised initial letters and whether they occurred at the beginning of a sentence. Veena et al. [46] stated that an adequate amount of embedding data could be sufficient to develop word-level features for code-mixed LID. Additionally, a few studies observed that the character N-gram successfully solved LID for code-mixed text [52], [56].
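Character N-gram and capitalisation features of the kind discussed above can be extracted with a few lines of code. The sketch below is a generic illustration, not the feature set of any particular study; the function name `token_features`, the boundary markers `^` and `$`, and the feature key format are assumptions.

```python
from typing import Dict, List, Tuple


def token_features(tokens: List[str], index: int,
                   n_values: Tuple[int, ...] = (2, 3)) -> Dict[str, bool]:
    """Build character N-gram and capitalisation features for one token."""
    token = tokens[index]
    features: Dict[str, bool] = {
        # Capitalisation feature: initial capital, and whether the token
        # opens the sentence (where capitals are less informative).
        "is_capitalised": token[:1].isupper(),
        "sentence_initial": index == 0,
    }
    padded = f"^{token.lower()}$"   # mark word boundaries
    for n in n_values:
        for start in range(len(padded) - n + 1):
            features[f"{n}gram={padded[start:start + n]}"] = True
    return features
```

A dictionary of this shape can be passed to most feature-based taggers, including CRF implementations, after any library-specific conversion.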

b: WORD EMBEDDING
The word embedding, or word vector representation, technique helps represent word similarity and context in texts. Xia [41] trained sub-word information of the input text using enhanced skip-gram word vector models. An experiment by [47] showed that the skip-gram model could improve the performance of their LID system. Sarma et al. [18] employed word embedding to help detect the language of a new word. In Jamatia et al. [53], pre-trained word embeddings improved the performance of the proposed code-mixed LID system. Studies by Shekhar et al. [60] and Gupta et al. [61] demonstrated that word embedding performed better than the character embedding technique.
The word embeddings can identify language separation by detecting the word origin and mapping it to the correct language label.

c: CHARACTER EMBEDDING
Character embedding is an embedding feature vector generated by splitting words into characters [46]. Applying character embedding to code-mixed LID can capture the morphological features of words and helps the model handle out-of-vocabulary words [15], [71]. Mandal and Singh [11] employed character embeddings of length 15 fed into the multichannel neural network layer. In [46], the vector size of character embedding was set to 100. Mave et al. [15] combined character and word embedding representations by applying two LSTM layers. The LSTM layers were used to train fixed-dimensional representations from the embedding layers. In [60] and [61], character embedding was also employed with word embedding.
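Before an embedding layer can be applied, each word is typically converted into a fixed-length sequence of character indices. The sketch below shows one possible encoding; it assumes that a length such as 15 refers to the number of character positions per word, and the reserved indices and function names are hypothetical.

```python
PAD, UNK = 0, 1  # reserved indices for padding and unseen characters

def build_char_vocab(words):
    """Map every character seen in the training words to an integer index."""
    chars = sorted({ch for word in words for ch in word})
    return {ch: i + 2 for i, ch in enumerate(chars)}

def encode_word(word, vocab, max_len=15):
    """Truncate or pad a word to max_len character indices."""
    ids = [vocab.get(ch, UNK) for ch in word[:max_len]]
    return ids + [PAD] * (max_len - len(ids))

vocab = build_char_vocab(["abc"])
print(encode_word("cabz", vocab, max_len=6))
```

Because unseen words are still decomposed into known characters, this representation remains usable for out-of-vocabulary tokens; an embedding layer would then map each index to a dense vector.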

d: TF-IDF (TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY)
TF-IDF is a count-based vectorisation technique that converts text data into vectors used as input to the classifier [58]. In natural language processing, TF-IDF is frequently applied with N-gram features. Smith and Thayasivam [56] trained their model using several TF-IDF variants: word-level TF-IDF, character N-gram TF-IDF, and N-gram TF-IDF. Mishra and Sharma [55] adopted TF-IDF to model the context of the sentence in a particular discussion. However, the TF-IDF feature has less impact on performance than N-gram features [58].
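To make the combination of TF-IDF with character N-grams concrete, the following Python sketch computes TF-IDF weights over character bigrams using only the standard library. It uses the plain formulation tf × log(N/df); the reviewed studies may have used smoothed or normalised variants, so this is an illustrative assumption rather than their exact weighting.

```python
import math
from collections import Counter

def char_bigrams(text, n=2):
    """Return the overlapping character N-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tfidf(docs, n=2):
    """Return one {ngram: weight} dictionary per document."""
    tokenised = [char_bigrams(doc.lower(), n) for doc in docs]
    # Document frequency: in how many documents does each N-gram occur?
    df = Counter(gram for grams in tokenised for gram in set(grams))
    n_docs = len(docs)
    vectors = []
    for grams in tokenised:
        counts = Counter(grams)
        total = len(grams) or 1
        vectors.append({
            gram: (count / total) * math.log(n_docs / df[gram])
            for gram, count in counts.items()
        })
    return vectors

vectors = tfidf(["aba", "abc"])
print(vectors[0])
```

An N-gram that occurs in every document (here 'ab') receives a weight of zero, whereas N-grams specific to one document receive a positive weight.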

e: DICTIONARY FEATURES
Treating the dictionary as a feature has been an effective method applied in code-mixed LID. Piergallini et al. [39] utilised English and Swahili dictionary features combined with two other features: capitalisation and regular expression features. Kalita and Saharia [20] incorporated three different dictionaries with N-gram features. Bansal et al. [16] used dictionaries as features to express the presence of words.
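A minimal sketch of dictionary features combined with capitalisation features is given below. The word lists are hypothetical toy stand-ins for full English and Swahili dictionaries, and the feature names are our own; the sketch only illustrates how dictionary membership can be expressed as binary features for a classifier.

```python
# Hypothetical toy word lists standing in for full dictionaries.
DICTIONARIES = {
    "en": {"good", "morning", "friend"},
    "sw": {"habari", "rafiki", "asubuhi"},
}

def token_features(tokens, index, dictionaries=DICTIONARIES):
    """Build a feature dictionary for the token at the given position."""
    token = tokens[index]
    word = token.lower()
    # One binary feature per dictionary: is the word present in it?
    features = {f"in_{lang}_dict": word in words for lang, words in dictionaries.items()}
    # Capitalisation features: initial capital and sentence-initial position.
    features["is_capitalised"] = token[:1].isupper()
    features["is_sentence_initial"] = index == 0
    return features

print(token_features(["Habari", "friend"], 0))
```

Such feature dictionaries can be merged with N-gram features before being passed to the classifier.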
To summarise RQ1, we have demonstrated a taxonomy of applied techniques for code-mixed LID, highlighting the different technique variants. We identified that machine learning, mainly supervised approaches, is more widely used than non-machine learning approaches for solving code-mixed LID problems. SVM and CRF are the most popular and recommended non-neural network techniques. Among the neural network-based techniques, the multichannel CNN-BLSTM-CRF has shown excellent performance. The transformer-based technique also emerged as a robust technique for code-mixed LID due to its impressive performance. In terms of applied features, we obtained four crucial features from the reviewed primary studies: N-gram, word embedding, dictionary features, and TF-IDF.

B. (RQ2) WHAT ARE THE CHALLENGES IN LID OF CODE-MIXED TEXT?
Detecting mixed-language text is an active topic in natural language processing research. Most existing language detectors do not identify mixed-language texts. Identifying multiple languages within code-mixed text requires different techniques from those used to identify languages at the document level [20]. Moreover, code-mixed text that uses various levels of combinations (sentence, clause, word, and sub-word) makes LID more complicated [15] than text expressed in one language. Dictionary look-up has shown poor performance in identifying language from code-mixed text due to spelling variations, loss of word context, and failure to differentiate some borrowed words [22], [72]. Accordingly, conducting LID on code-mixed text is more challenging than on non-code-mixed text.

1) AMBIGUITY
For bilingual code-mixed text, ambiguity is a challenge in language annotation when a particular word exists in two or more languages. This phenomenon takes two forms: a single word with the same meaning in two or more languages, and a single word with different meanings in two or more languages. A problem arises in the second case because the interpreted meaning then depends on the language identified by the system. Moreover, language ambiguity makes trilingual code-mixed text more challenging to identify than bilingual code-mixed text [53]. Table 6 shows examples of word ambiguity for several language pairs.

2) LEXICAL BORROWING
Another challenge related to ambiguity is lexical borrowing [63]. Lexical borrowing is defined as transferring or copying a particular lexical item from one language to the lexicon of another language [22], [74]. We identified two examples of lexical borrowing, in the Dutch-English and Hindi-English language pairs: sociaal (Dutch) and social (English) [22], and pajama (Hindi) and pyjama (English) [11]. The examples show that the words have nearly identical spellings. In this case, because the words are phonetically similar but spelled differently, the LID system may still assign the correct language tag.
However, an issue arises when the words have exactly the same spelling. In that case, it is difficult for a language detection system to distinguish between code-switching and borrowing. An identical spelling means that the word is valid in multiple languages. Therefore, the correct tag of the word depends on the context, that is, the surrounding words. For instance, the word 'school' is valid in Dutch and English [22]. In a tweet, the system will tag the word 'school' as Dutch instead of English if it is surrounded by Dutch words, and vice versa.
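The role of context can be sketched with a simple neighbour-voting heuristic in Python. Each token is given the set of languages in which it is valid, and a token valid in several languages takes the majority language of the unambiguous tokens within a window. The window size, function name, and voting rule are our assumptions; the reviewed systems typically learn such context with sequence models such as CRF or BLSTM.

```python
from collections import Counter

def resolve_by_context(candidate_tags, window=2):
    """Resolve ambiguous tokens by majority vote of unambiguous neighbours.

    candidate_tags[i] is the set of languages in which token i is valid.
    """
    resolved = []
    for i, candidates in enumerate(candidate_tags):
        if len(candidates) == 1:
            resolved.append(next(iter(candidates)))
            continue
        lo, hi = max(0, i - window), min(len(candidate_tags), i + window + 1)
        votes = Counter(
            next(iter(candidate_tags[j]))
            for j in range(lo, hi)
            if j != i and len(candidate_tags[j]) == 1
        )
        # Keep only languages in which the token itself is valid.
        ranked = [lang for lang, _ in votes.most_common() if lang in candidates]
        resolved.append(ranked[0] if ranked else "ambiguous")
    return resolved

# "ik ga naar school" versus "I go to school": only 'school' is valid in both.
dutch_context = [{"nl"}, {"nl"}, {"nl"}, {"nl", "en"}]
english_context = [{"en"}, {"en"}, {"en"}, {"en", "nl"}]
print(resolve_by_context(dutch_context), resolve_by_context(english_context))
```

The same word 'school' is therefore tagged as Dutch in the first sentence and as English in the second.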

4) INTRA-WORD CODE-MIXING
While word-level code-mixed LID has become a common task, identifying code-mixing at the sub-word level is more demanding. Only a few studies have addressed intra-word code-mixing issues [65]. Intra-word code-mixing occurs when speakers combine languages within a single token or word [38], [54]. This happens when a prefix or suffix from one language is attached to a word from another language. Table 8 provides examples of intra-word code-mixing. In Indonesian, people sometimes attach an informal prefix (nge-) to an English verb or a suffix (-nya) to an English noun [7]. Similar examples can also be found in other language pairs, such as Assamese-English [20], Bodo-English [62], Spanish-Wixarika [54], Dutch-Limburgish-English [42], Turkish-English [52], Turkish-German [48], Telugu-English [49], Kannada-English [43], and Manipuri-English [38].
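The Indonesian-English case can be sketched as a simple affix-stripping rule in Python. The English word list and the example tokens are hypothetical, and the rule covers only the two affixes mentioned above (nge- and -nya); it is an illustration of the phenomenon, not a method proposed in the reviewed studies.

```python
ENGLISH_WORDS = {"print", "share", "file", "password"}  # hypothetical toy lexicon

def split_intra_word(token, english=ENGLISH_WORDS):
    """Return (prefix, stem, suffix) when an English stem carries the
    Indonesian affixes nge- or -nya; return None otherwise."""
    word = token.lower().replace("-", "")
    prefix = "nge" if word.startswith("nge") else ""
    stem = word[len(prefix):]
    suffix = "nya" if stem.endswith("nya") else ""
    if suffix:
        stem = stem[:-len(suffix)]
    if (prefix or suffix) and stem in english:
        return prefix, stem, suffix
    return None

for token in ["nge-print", "filenya", "print"]:
    print(token, split_intra_word(token))
```

A token such as 'nge-print' is split into an Indonesian prefix and an English stem, so the two parts can receive different language tags.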
To sum up RQ2, we have identified four main challenges often found in code-mixed LID tasks: ambiguity, lexical borrowing, non-standard words, and intra-word code-mixing. These challenges are prevalent in social media text and pose a problem for code-mixed LID.
Ambiguity happens when a particular word or token is recognised in two or more languages. There are two issues relating to ambiguity: a word with a similar meaning and a word with a different meaning in two or more languages. Another challenge identified is lexical borrowing. In this study, the problem of lexical borrowing arises when a word has exactly the same spelling in more than one language. For non-standard words, four challenges have been identified: non-standard spelling, mixing of words with numeric or special characters, word exaggeration, and abbreviated words. The last challenge is intra-word code-mixing, which occurs when two or more languages are mixed in a word or token.
In this section, we analysed the code-mixed datasets from four perspectives: (1) the number of mixed languages in the datasets, considering only datasets that are bilingual or multilingual, (2) whether the datasets are English code-mixed or non-English code-mixed, (3) the language family combination, and (4) the source of the datasets. Figure 3 illustrates the data distribution from the selected studies. As shown in Figure 3 (a), bilingual code-mixed data dominates with 87.5% (28 datasets), while the percentage of multilingual data is only 12.5% (4 datasets). The multilingual code-mixed data include Bengali-Hindi-English, English-Assamese-Hindi-Bengali, English-French-Italian-Spanish, and Gujarati-Hindi-English. As shown in Figure 3 (b), English code-mixed and non-English code-mixed data account for 84.4% (27 datasets) and 15.6% (5 datasets), respectively. All non-English code-mixed data are bilingual, covering the language pairs Arabic-Modern Standard Arabic, Dutch-Limburgish, Dutch-Turkish, Spanish-Wixarika, and Turkish-German.
Among the 32 datasets, Hindi-English was the most frequent language pair with nine studies. Spanish-English is the second most studied language pair with five studies, followed by Bengali-English and Dutch-English with three studies each. Two studies each focused on the following mixed languages: Turkish-English, Turkish-German, Malayalam-English, Tamil-English, Assamese-Hindi Bengali-English, Sinhala-English, and Arabic-Modern Standard Arabic.
We also grouped the available code-mixed datasets by language family combination. To identify the language family, we referred to the study in [75]. Overall, we found 12 language family combinations. Most of the code-mixed data were combined with English, which belongs to the Germanic language family. Germanic is a branch of the Indo-European languages and is mainly spoken in the north of Europe, such as in England, Germany, and the Netherlands [76].
The most studied language family combination was Indo-Aryan and Germanic, with ten language mix combinations. Indo-Aryan is a branch of the Indo-European languages spoken mainly by people in South Asia [77]. In the datasets, the Indo-Aryan languages include Assamese, Bengali, Gujarati, Hindi, Konkani, Punjabi, and Sinhala. The combinations of Dravidian and Italic with Germanic were the second most studied, with four language mix combinations. We found several languages belonging to the Dravidian language family: Kannada, Malayalam, Tamil, and Telugu [77].
As for the Italic language family, we acquired French, Italian, Spanish, and Portuguese. We identified three mixed language combinations belonging to the combination of the Germanic language family: Dutch-English, Dutch-Limburgish, and German-English. In the Austronesian language family, we discovered Indonesian and Malay intermingled with English. We found two languages, Bodo and Manipuri, which are classified as the Sino-Tibetan language family. In terms of the Trans Eurasian family, we found Turkish, which was mixed with English and German. We also identified one language family combination for Germanic & Trans Eurasian, Italic & American, Niger-Congo & Germanic, Semitic & Germanic, and Semitic & Semitic. Table 9 presents the code-mixed dataset grouped by the language family combination.
We encountered eight unique data sources across the 32 inspected datasets: Twitter, Facebook, WhatsApp, YouTube comments, chat messages, blogs, frequently asked questions (FAQ) data, and interviews and internet forums. Twitter is the most used platform with 24 studies, followed by Facebook (21 studies), WhatsApp (8 studies), and YouTube comments (2 studies). Chat messages, blogs, FAQ data, and interviews and internet forums were each utilised in one study. Table 10 shows the source of code-mixed LID datasets from the investigated papers.

2) CODE-MIXED DATASET QUALITY CRITERIA
Good-quality data is necessary for conducting research. We attempted to determine the properties representing the quality of code-mixed data. To identify the quality criteria of the dataset, we followed the study by Jose et al. [2]. A set of items was defined: the number of instances, percentage of code-mixed data, number of tokens, number of unique tokens, and average sentence length.
The number of instances and the number of tokens indicate the size of the corpus. The number of words provides further insight into the corpus's structure, especially for language tagging tasks, such as language identification, named entity recognition, and POS tagging. The percentage of code-mixed data shows the diversity of code-mixing, the code-mixed types, and the ratio of the types in the entire dataset. The quantity of unique tokens represents the vocabulary size of the dataset. This allows us to discern the richness of text in the data. Finally, the average sentence length indicates completeness and grammatical complexity since the longer the sentence, the more complicated the syntactic and semantic structure [2]. From the examined studies, we found that all the studies described the number of instances except three studies: Bodo-English [62], Kannada-English [43], and Punjabi-English [16]. The number of tokens is presented by 23 out of 32 datasets. This high ratio indicates the importance of information regarding the number of instances and tokens from a particular dataset. Eleven studies presented a percentage of code-mixed data from the dataset. We observed that the number of unique tokens was reported in 6 code-mixed datasets from 5 papers, namely Assamese-English [20], Bodo-English [62], Hindi-English [15], Sinhala-English [3], Spanish-English [15], and Tamil-English [50]. Moreover, papers by [42] and [59] reported the average sentence length in their study. Table 11 provides the quality criteria of the datasets.
Further, 2 out of 32 datasets fulfil 4 out of 5 criteria. Kalita and Saharia [20] described their Assamese-English dataset with 1,012 instances, 227,329 tokens, and 5,977 unique tokens. They presented the percentage of code-mixed data in three groups: intra-sentential (26.69%), inter-sentential (69.26%), and intra-word (4.05%). Mave et al. [15] provided four criteria for the Spanish-English and Hindi-English language pairs. Unfortunately, neither [20] nor [15] reported the average sentence length of their datasets.
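The five quality criteria can be computed directly from a word-level annotated corpus, as the Python sketch below illustrates. The data layout (a list of sentences, each a list of token and language-tag pairs) and the toy corpus are assumptions of this sketch. For simplicity, the percentage of code-mixed data is computed here as the share of sentences containing more than one language tag, rather than being broken down by code-mixed type.

```python
def quality_criteria(sentences):
    """Compute the five dataset quality criteria for a tagged corpus.

    sentences: list of sentences, each a list of (token, language_tag) pairs.
    """
    n_instances = len(sentences)
    tokens = [token.lower() for sentence in sentences for token, _ in sentence]
    # A sentence counts as code-mixed when it carries more than one language tag.
    n_mixed = sum(1 for s in sentences if len({lang for _, lang in s}) > 1)
    return {
        "instances": n_instances,
        "tokens": len(tokens),
        "unique_tokens": len(set(tokens)),
        "avg_sentence_length": len(tokens) / n_instances if n_instances else 0.0,
        "pct_code_mixed": 100.0 * n_mixed / n_instances if n_instances else 0.0,
    }

# Hypothetical toy corpus with word-level language tags.
corpus = [
    [("I", "en"), ("love", "en"), ("chai", "hi")],
    [("hello", "en")],
]
print(quality_criteria(corpus))
```

Reporting all five values alongside a dataset would allow it to be compared with the datasets summarised in Table 11.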
Two main aspects have been discussed in RQ3: dataset availability and dataset quality criteria. This literature review has recognised 32 code-mixed datasets available for LID. Bilingual code-mixed datasets dominate at 87.5%, and only a few multilingual datasets are available for code-mixed text. In addition, we discovered that 84.38% of datasets are mixed with English. This finding is expected since English is an international language and has become an integral part of the education system in many countries. Moreover, 10 of 32 datasets (31.25%) combined Indo-Aryan languages with English.
We also proposed five features to describe dataset quality criteria: the number of instances or sentences, percentage of code-mixed types in the data, number of tokens, number of unique tokens, and average sentence length. These five items can be used as standard criteria for researchers building high-quality datasets in future research.
We unified the methodologies studied through our literature analysis and proposed a framework for researchers to use.
The framework consists of two parts: model development and a code-mixed LID system. The model development generates a classification model. The code-mixed LID system predicts language labels from the input. The model development is divided into seven stages: data collection, pre-processing, data annotation, quality criteria assessment, feature extraction, classification modelling, and evaluation. In the data collection stage, the language pair of interest is chosen. The code-mixed data is gathered by defining keywords or topics to search data from various sources, such as reviews, chats, social media, and speech transcription. The collected data is then stored.
Subsequently, data pre-processing tasks are carried out by removing duplicates or irrelevant data. The tokenisation task is then conducted by splitting text data (sentences, tweets, comments, or documents) into words. Moreover, case-folding can be applied to convert the words into the same case form, like lowercase.
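The pre-processing stage can be sketched in Python as follows. The regular expression, the function name, and the choice to keep hyphenated tokens intact (so that intra-word code-mixed forms are not split) are our assumptions; actual pipelines may differ in how they tokenise and filter the data.

```python
import re

# Words (optionally hyphenated) or single punctuation marks.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def preprocess(texts):
    """Remove empty and duplicate texts, apply case-folding, and tokenise."""
    seen, cleaned = set(), []
    for text in texts:
        # Case-folding and whitespace normalisation.
        key = " ".join(text.lower().split())
        if not key or key in seen:
            continue  # skip empty texts and duplicates
        seen.add(key)
        cleaned.append(TOKEN_RE.findall(key))
    return cleaned

print(preprocess(["Aku lagi nge-print!", "aku  lagi nge-print!", ""]))
```

The second input differs from the first only in case and spacing, so it is dropped as a duplicate after normalisation.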
The next stage is data annotation. Data annotation is one of the essential processing tasks in a language identification system [16]. Before annotating the data, we must first define the labels for the dataset. The annotation process can be done manually [65], [78] or semi-automatically. Shared tasks and crowdsourcing are the most common methods for manual data annotation. The semi-automated method combines manual annotation with a dictionary or machine learning techniques. Before moving on to the next stage, we must evaluate the data to ensure that our labelled data is valid. In addition, a quality criteria assessment is conducted in this stage to ensure that the dataset is of high quality.
The transformed texts are then processed in the subsequent stage, classification modelling. In our framework, code-mixed LID is a classification problem where every word is labelled with its corresponding language tag [78]. The classification modelling process aims to derive conclusions from the training data and predict the class label. Training, validation, and testing sets are sampled from the dataset in this stage. The training set is a subset of data fed into any machine learning or deep learning algorithm to uncover the dataset's hidden patterns. The validation set assesses the trained model, and the results from validation are used for fine-tuning until the best result is achieved. The model's performance is then determined by evaluating the best result on the testing set. We can use evaluation metrics, such as accuracy, precision, recall, and F-score, to assess the model's performance.
Finally, the best model generated from the classification modelling stage is used as a classification model for the code-mixed LID system. The system receives user input in the form of a word, sentence, paragraph, or document. The input is tokenised and transformed before it is fed into the classifier model. Tokens with the corresponding predicted labels are the system's output. Figure 4 depicts a comprehensive picture of the framework for code-mixed LID.
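The sampling and evaluation steps can be sketched in Python as follows. The 80/10/10 split ratio, the fixed random seed, and the function names are assumptions of this sketch; precision, recall, and F-score are computed for one language label at a time, alongside overall accuracy.

```python
import random

def split_dataset(items, train=0.8, val=0.1, seed=13):
    """Shuffle the items and split them into training, validation, and testing sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

def evaluate(gold, pred, label):
    """Return accuracy plus precision, recall, and F1 for one language label."""
    pairs = list(zip(gold, pred))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

train_set, val_set, test_set = split_dataset(range(10))
scores = evaluate(["en", "hi", "en", "hi"], ["en", "en", "en", "hi"], label="en")
print(len(train_set), len(val_set), len(test_set), scores)
```

In this toy example one Hindi token is mislabelled as English, which lowers precision for the English label while its recall stays at 1.0.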
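The flow of the code-mixed LID system can be sketched as a small Python pipeline in which the tokeniser, the transformation, and the classifier are passed in as components. All three components shown here are placeholders, and the toy vocabulary is hypothetical; in practice the classifier would be the best model obtained from the classification modelling stage.

```python
def identify_languages(text, tokenise, transform, classify):
    """Tokenise the input, transform each token, and attach the predicted label."""
    tokens = tokenise(text)
    return [(token, classify(transform(token))) for token in tokens]

# Placeholder classifier based on a hypothetical toy vocabulary.
HINDI_HINTS = {"bahut", "hai"}

def toy_classifier(feature):
    return "hi" if feature in HINDI_HINTS else "en"

output = identify_languages("Movie bahut slow hai", str.split, str.lower, toy_classifier)
print(output)
```

The output pairs each input token with its predicted language label, which corresponds to the system output described in the framework.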

IV. IMPLICATIONS
A. THEORETICAL IMPLICATIONS
This systematic review contributes to the theoretical advancement of code-mixed LID. First, this study identifies the techniques utilised to solve code-mixed LID problems. For the non-neural network techniques, SVM and CRF algorithms are the most recommended techniques. MNN-BLSTM-CRF can be considered an alternative technique for future studies due to its excellent performance. Moreover, transformer-based techniques have demonstrated more impressive performance than neural network-based and machine learning models [79], [80]. The transformer-based approach is a context-sensitive embedding method that can interpret a word from its context. A language identification system using such a technique, proposed by Thara and Poornachandran [64], demonstrated impressive performance for Malayalam-English code-mixed data. Other bilingual and multilingual code-mixed data can be evaluated using such a technique.
Second, previous studies typically built the code-mixed LID model using a supervised approach. However, a supervised approach requires sufficient annotated data. Since humans carry out the data annotation process, it is time-consuming and exhausting, especially for large datasets. For future research, large datasets can be developed using the pseudo-labelling technique. Pseudo-labelling is a semi-supervised learning technique that labels additional unlabelled data using a small amount of labelled data [81], [82]. Additionally, the pseudo-labelling technique can improve model performance [83].
Third, we have investigated the four main challenges in code-mixed LID: ambiguity, lexical borrowing, out-of-vocabulary, and intra-word code-mixing. These challenges in the code-mixed text may lead to incorrect language predictions in the LID system. Generally, inaccurate word tag predictions may be caused by the following factors: rare and noisy word forms and noisy context [63]. Neighbouring words that express the context of a text's body are critical for identifying the words' language correctly. Therefore, future studies are encouraged to develop a code-mixed LID system capable of dealing with these challenges. For example, combining information from external resources such as dictionaries and knowledge bases can help address these challenges and enhance the LID system's performance [63].
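The idea of pseudo-labelling can be sketched as a self-training loop in Python. The toy classifier below, which scores a word by its character-bigram overlap with the labelled words of each language, is purely illustrative and is not a method from the reviewed studies; the confidence threshold and the number of rounds are also assumptions.

```python
from collections import Counter, defaultdict

def bigrams(word):
    """Character bigrams of a word with boundary markers."""
    padded = f"<{word.lower()}>"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def train(labelled):
    """Toy model: predict the label whose training words share most bigrams."""
    counts = defaultdict(Counter)
    for word, label in labelled:
        counts[label].update(bigrams(word))

    def predict(word):
        grams = bigrams(word)
        scores = {lab: sum(g in seen for g in grams) for lab, seen in counts.items()}
        best = max(scores, key=scores.get)
        return best, scores[best] / len(grams)  # label and confidence in [0, 1]

    return predict

def pseudo_label(labelled, unlabelled, threshold=0.8, rounds=3):
    """Iteratively add confidently predicted words to the labelled set."""
    labelled, pool = list(labelled), list(unlabelled)
    for _ in range(rounds):
        predict = train(labelled)
        scored = [(word, *predict(word)) for word in pool]
        accepted = [(word, lab) for word, lab, conf in scored if conf >= threshold]
        if not accepted:
            break
        labelled.extend(accepted)
        accepted_words = {word for word, _ in accepted}
        pool = [word for word in pool if word not in accepted_words]
    return labelled, pool

new_labelled, remaining = pseudo_label(
    [("playing", "en"), ("khana", "hi")], ["paying", "qzx"]
)
print(new_labelled, remaining)
```

Only predictions above the confidence threshold are added to the labelled set, so low-confidence words such as 'qzx' remain unlabelled for manual annotation.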

B. PRACTICAL IMPLICATIONS
This study offers notable practical implications for subsequent researchers in their future studies. First, most previous researchers claimed to have obtained a good LID performance in terms of accuracy, precision, recall, or F1 score while using non-standardised datasets to train their classifiers. Therefore, the results from such studies cannot be directly compared and may be unreliable. Furthermore, current LID systems evaluated on one dataset may be tailored for this dataset only and cannot be generalised to other datasets. Therefore, standardisation of code-mixed datasets as a benchmark for LID is needed.
Second, the opportunity to develop a new code-mixed dataset is still widely open. Our findings showed high percentages of bilingual code-mixed data, especially English code-mixed text. Also, the available datasets are dominated by Indo-Aryan languages. We observed research opportunities in building the following code-mixed data: multilingual data, non-English code-mixed data, and code-mixed data for low-resource languages. These can help move toward a standard for evaluating code-mixed LID systems.
Third, we incorporated our findings into a conceptual framework for developing a code-mixed LID model. In the future, the framework can be beneficial for developing theories, conducting empirical research, and practical application in code-mixed LID-related studies. The framework provides general steps that can be used as a standard practical guideline for budding researchers in code-mixed LID studies. To build a code-mixed LID system for new languages, researchers should pay particular attention to the following stages in the framework: pre-processing, data annotation, feature extraction, and classification modelling. Pre-processing helps to select the relevant content from the raw dataset. Data annotation determines the identified labels. The feature extraction process assists in extracting the necessary parts of the texts. Finally, researchers apply the designed training scenario in classification modelling to obtain the best model.

V. CONCLUSION
This systematic literature review has presented the current state of studies in code-mixed LID and proposed a framework for future research. This review included 40 primary studies published from 2016 until 2021. Three main aspects of LID for code-mixed text were investigated: 1) techniques, 2) challenges, and 3) data availability with corresponding quality criteria.
Findings revealed that in some neural network-based studies, the multichannel CNN incorporated with BLSTM and CRF had shown excellent performance in solving code-mixed LID problems. As for the non-neural network techniques, SVM and CRF are recommended. Due to its remarkable performance, the transformer-based technique can be considered one of the most robust techniques for code-mixed LID. Subsequently, we encountered four significant challenges in code-mixed LID tasks: ambiguity, lexical borrowing, non-standard words, and intra-word code-mixing. From the examined papers, this study identified 32 code-mixed datasets for LID, of which 87.5% (28 datasets) are bilingual and 12.5% (4 datasets) are multilingual. Among the 32 datasets, English code-mixed and non-English code-mixed data account for 84.4% (27 datasets) and 15.6% (5 datasets), respectively. Furthermore, this study defined five quality criteria for evaluating dataset quality, which can serve as a benchmark for generating quality datasets in future studies. Finally, based on a detailed analysis of the recent literature and following the systematic approach, a framework for code-mixed LID was developed as a standard guideline for researchers.