Bangla-BERT: Transformer-Based Efficient Model for Transfer Learning and Language Understanding

The advent of pre-trained language models has ushered in a new era of Natural Language Processing (NLP), enabling us to create powerful language models. Among these, Transformer-based models such as BERT have grown in popularity because of their state-of-the-art effectiveness. However, these models are built mainly for resource-rich languages, forcing other languages into multilingual models (mBERT). Two fundamental limitations of mBERT become significantly more challenging in a resource-constrained language like Bangla: it was trained on a limited and organized dataset, and it contains weights for all other languages. Besides, current research on other languages suggests that a language-specific BERT model will exceed multilingual ones. This paper introduces Bangla-BERT (footnote a), a monolingual BERT model for the Bangla language. Despite the limited data available for NLP tasks in Bangla, we perform pre-training on the largest Bangla language model dataset, BanglaLM, which we constructed using 40 GB of text data. Bangla-BERT achieves the highest results on all datasets and vastly improves the state-of-the-art performance in binary linguistic classification, multilabel extraction, and named entity recognition, outperforming multilingual BERT and other previous research. The pre-trained model is assessed against several non-contextual models, such as Bangla fasttext and word2vec, on the downstream tasks. This model is also evaluated by transfer learning based on hybrid deep learning models such as LSTM, CNN, and CRF in NER, and it is observed that Bangla-BERT outperforms state-of-the-art methods. The proposed Bangla-BERT model is assessed using benchmark datasets, including Banfakenews, Sentiment Analysis on Bengali News Comments, and Cross-lingual Sentiment Analysis in Bengali. Overall, it is concluded that Bangla-BERT surpasses all prior state-of-the-art results by 3.52%, 2.2%, and 5.3%.


I. INTRODUCTION
Pre-trained language models based on the transformer architecture have become the standard for state-of-the-art performance on a wide variety of natural language processing applications [1]. BERT, a renowned transformer-based technique, brought a revolution that had a huge impact on the evolution of NLP [2]. Since its release as an academic research paper, this pioneering NLP model has amazed the AI world. It is the first deeply bidirectional and fully unsupervised technique for language representation that was pre-trained using only a plain text corpus [3]. Numerous advancements over BERT are now under way. One notable modification by Facebook is RoBERTa, which uses a more robust training approach with massive computational power and an enormous dataset [4]. Another method, XLNet, builds on BERT with an autoregressive formulation [5]. Both of these models require substantial computational power, which can be a problem in practice. To retain this power with less computation, DistilBERT was derived from BERT at the cost of only 5% performance degradation [6]; despite its small size, it runs faster and achieves almost identical performance on similar tasks. Google also announced ALBERT, a lite version of BERT. Even though it has fewer parameters than BERT, it produces significant outcomes [7]. BERT has thus established its supremacy over other language processing approaches.
Footnote a: https://huggingface.co/Kowsher/bangla-bert
The associate editor coordinating the review of this manuscript and approving it for publication was Diego Oliva.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The authors included a "multilingual" version of BERT (mBERT), pre-trained on the Wikipedia articles of 104 distinct languages, including Bangla, to serve as a resource for languages other than English. This renowned variation of BERT emphasizes contextual representation for several multilingual tasks [8]. The model showed promising results and obtained state-of-the-art performance on cross-lingual benchmarks by optimizing for language-specific tasks. This result led to a wave of BERT implementations on monolingual data. A monolingual implementation of the BERT model for a resource-constrained language like Bangla can open a new era of language modeling for Bangla. The majority of the latest BERT models are available only in English and other resource-rich languages such as Chinese, Arabic, and Spanish. A low-resource language like Bangla is still at the bottom of the heap, which leads to a lack of downstream task datasets and pre-trained language models. Additionally, mBERT, trained on 104 languages, has two significant gaps: it was prepared using only relatively structured and limited language data from Wikipedia, and its weights are aggregated over all 104 languages. This article addresses those deficiencies and proposes a monolingual BERT for the Bangla language based on a newly developed large Bangla language dataset, BanglaLM (see footnote 1). In addition, this paper discusses the pre-training process and architecture of the BERT transformer model for Bangla, which we refer to as Bangla-BERT. This model has been trained from scratch, and its performance is compared to mBERT and other Bangla pre-trained word embedding models on published datasets for sentiment analysis, binary and multilabel text classification, and NER. We have developed the largest Bangla language modeling dataset to train the proposed model.
The dataset is 40 GB, with three variants containing around 20 million samples each. The proposed model is first pre-trained on a substantial quantity of unlabeled data (BanglaLM) before fine-tuning. For fine-tuning, it is initialized with the pre-trained parameters and then trained on labeled data from the downstream tasks. Most research in Bangla has not examined the power of the transformer. Furthermore, none has used an extensive dataset for a pre-trained model, since resources are constrained.
Footnote 1: https://www.kaggle.com/gakowsher/bangla-language-model-dataset
This work examines that potential by fine-tuning Bangla-BERT on a range of Bangla downstream tasks. The proposed pre-trained model is compared with a range of non-contextual neural models, including Bangla fasttext (skip-gram and CBOW models; see footnote 2) and word2vec. We used several NLP datasets to analyze the performance of the proposed BERT model. We also compared it with classical machine learning and hybrid deep learning models [9], including LSTM [10], CNN [11], and CRF, and found that Bangla-BERT outperformed them. When we compared the outcomes to the current state of the art, Bangla-BERT came out on top [12]. In summary, our contributions are as follows:
• This work proposes a massive Bangla unsupervised language dataset (BanglaLM) for language modeling.
• This paper presents the whole mechanism for pre-training the context-aware BERT model using BanglaLM.
• This work includes training a language model with the largest dataset ever created for Bangla and exploring the possibility of fine-tuning a transformer model for a low-resource language like Bangla.
• This work addresses mBERT's limitations for Bangla: training on limited and more structured data only, and weights mixed across 104 languages.
• We examine Bangla-BERT and show its effectiveness on four NLP downstream tasks: sentiment analysis, named entity recognition, and binary and multilabel text classification. Compared with mBERT and non-contextual models such as Bangla fasttext (skip-gram and CBOW models) and word2vec on these downstream tasks, the proposed model outperforms them all by a wide margin.
• We make Bangla-BERT available on the popular site Hugging Face so that it can be adopted as a new baseline and advance Bangla NLP research.
The rest of this paper is organized as follows. Section 2 contains a brief overview of the previous research on language representation. The architecture of BERT is described in Section 3, and the method used to construct Bangla-BERT is described in Section 4. Section 5 illustrates the technique for developing the vocabulary. Section 6 describes the downstream NLP tasks and benchmark datasets in depth, together with the results obtained. Section 7 details the comparison with previous work. Section 8 discusses future work, and Section 9 concludes this paper.

II. RELATED WORK
Word2vec [13], proposed in 2013 to find meaningful word representations, began the modern era of language processing. It was followed by GloVe [14] and fastText [15]. None of these models captured context-specific knowledge, an issue that ELMo [16] addressed. A similar LSTM-based model, ULMFiT [17], sparked a revolution by bringing transfer learning into NLP tasks. While both handle polysemy admirably, their input methods pose a problem: ELMo demands character-based input, whereas ULMFiT's word-based input suffers from a limited vocabulary, which BERT resolves through sub-word tokenization. ELMo concatenates representations from both directions, whereas ULMFiT is unidirectional [18]. BERT outperforms both of these methods due to its bidirectional strategy. In parallel, GPT, a transformer-based architecture with a unidirectional topology, emerged and went through three versions in two years. Though the initial model required fine-tuning, the latest, GPT-3, does not. Additionally, OpenAI, the originator of GPT, offers it only commercially through an API, whereas Google made BERT open source [19], [20], [21].
When Google published the 104-language mBERT model, it generated great interest. In 2019, [22] and [23] demonstrated the effectiveness of mBERT on various natural language processing tasks, which led researchers to work on BERT models that can capture multiple languages. Simultaneously, [24] published a cross-lingual BERT model that covers 15 languages to exploit cross-lingual signals. [25] compared monolingual versions of BERT (English and German) to mBERT and reported that mBERT performed poorly on more challenging language generation tasks. This is one of the factors behind the recent wave of monolingual BERT models. Through its monolingual implementations, BERT not only achieves cutting-edge performance but also sets benchmarks in numerous languages by utilizing an attention strategy. Although BERT is English-centric, which pushes other languages to the resource-constrained multilingual model (mBERT), researchers have worked to bring it to their own languages. Their contributions include CamemBERT [26] and FlauBERT [27] for French, BERTje [28] and RobBERT [29] for Dutch, AraBERT [30] for Arabic, AlBERTo [31] for Italian, PersBERT [32] for Persian, FinBERT [33] for Finnish, and models for some 30 other languages such as Chinese [34], Spanish [35], Romanian [36], and Russian [37]. Even though these languages are highly diverse, the context-based approach yields promising results in all of them.
Self-attention mechanisms depart from conventional recurrent architectures, and BERT builds on them with pre-training based on token prediction through masked language modeling (MLM) [38], [39]. In an MLM test, Goldberg [39] shows that BERT consistently assigns higher scores to correct verb forms than to incorrect ones. Next sentence prediction (NSP), a binary classification task that allows the model to capture relationships between sentences, is another of BERT's underlying mechanisms [28]. RoBERTa, a robust upgrade of BERT, eliminates the NSP task and uses dynamic masking instead of static masking [4]. Rather than relying entirely on BERT's original design, researchers increasingly adapt it. For example, while the French and Spanish models apply a dynamic masking method, the Chinese model goes a step further by implementing the whole-word masking method recently developed by the authors of BERT [26], [34], [35].
Researchers created these models with less difficulty because their languages have abundant resources. We face more obstacles because the Bangla language has limited resources. We address this, as well as the mBERT issues of carrying the weights of additional languages and of limited, organized training data.

III. BERT ARCHITECTURE
Since the launch of Universal Language Model Fine-tuning (ULMFiT), transfer learning has become the gold standard for state-of-the-art results in NLP tasks. Since then, significant advancements have been made by combining the Transformer with transfer learning. OpenAI's GPT and Google AI's BERT are two notable examples of this coupling. The Encoder used in BERT is an attention-based NLP architecture introduced a few years ago in the paper Attention Is All You Need. That paper presents the Transformer architecture, which consists of two components: the Encoder and the Decoder. Since BERT employs only the Encoder, we discuss only that component here. We first look at the Encoder architecture outlined in Attention Is All You Need and then examine the alterations that contribute to the effectiveness of BERT.

A. INPUT EMBEDDING
The input is processed in three stages: tokenization, mapping of tokens to numeric ids, and embedding. Following tokenization, each token is mapped to a distinct integer from the corpus vocabulary, so each token obtains a unique numeric representation. In addition, padding is required to ensure that the input sequences in a batch are identical in length. Together, tokenization, mapping, and embedding convert words to vectors, much as neural word embeddings do. Consider the toy sentence "Bangladesh is a beautiful country." Tokenizing it gives the tokens ["Bangladesh", "is", "a", "beautiful", "country", "."]. This is followed by mapping, in which each token is assigned a unique integer from the vocabulary of the corpus, such as ["Bangladesh", "is", "a", "beautiful", "country", "."] → [34, 90, 15, 684, 55, 193]. Then, for each token in the sequence, we obtain its embedding. Every token in the sequence is associated with an emb_dim-dimensional vector that the model learns during training. Consider it a vector look-up for each token. The elements of those vectors are treated as model parameters and adjusted via back-propagation in the same way as other weights.
As a result, we look up the vector associated with each token; this look-up is depicted in the equation below. Then we generate a matrix Z of dimension (input_length) x (emb_dim) by stacking the vectors, as shown in Table 1.
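The mapping and look-up steps above can be sketched in a few lines of Python. This is only an illustration: the vocabulary ids are the ones from the toy sentence, while `EMB_DIM`, `VOCAB_SIZE`, and the randomly initialized `embedding_table` are assumptions made for the example and are not values used by Bangla-BERT.

```python
import random

# Toy vocabulary; the ids mirror the example sentence in the text.
vocab = {"Bangladesh": 34, "is": 90, "a": 15, "beautiful": 684, "country": 55, ".": 193}
EMB_DIM = 4        # assumed tiny embedding size, chosen for readability
VOCAB_SIZE = 1000  # assumed; it only has to exceed the largest token id

rng = random.Random(0)
# Embedding table: one row of EMB_DIM trainable numbers per vocabulary id.
embedding_table = [[rng.uniform(-1.0, 1.0) for _ in range(EMB_DIM)]
                   for _ in range(VOCAB_SIZE)]

tokens = ["Bangladesh", "is", "a", "beautiful", "country", "."]
token_ids = [vocab[t] for t in tokens]          # mapping step
Z = [embedding_table[i] for i in token_ids]     # look-up step, stacked row by row

print(token_ids)
print(len(Z), "x", len(Z[0]))                   # (input_length) x (emb_dim)
```

In a trained model the rows of `embedding_table` would be learned by back-propagation rather than drawn at random.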
It is critical to note that padding is used to ensure that all input sequences in a batch are identical in length; that is, we lengthen some of the sequences by adding 'pad' tokens. The sequence after padding to length 9 will be: [5, 5, 5, 34, 90, 15, 684, 55, 193]

B. POSITIONAL ENCODING
The BERT algorithm gains an advantage by learning positional embeddings. The sequence of tokens is represented as a matrix, but this representation does not account for the fact that a word can occur in a variety of positions. The model needs to be able to change the representation of a word based on its position. The aim is not to alter the word's complete representation but to alter it slightly to encode its position.
This analysis adopts the strategy of adding numbers in the range [-1, 1] to the token embeddings using non-learnable sinusoidal functions. The remainder of the encoder then sees the same word represented slightly differently depending on its position.
Additionally, the encoder can use the fact that some words occupy one position while other words occupy different positions within the same sequence. We want the network to comprehend both absolute and relative positions. In [38], the authors' choice of sinusoidal functions enables positions to be represented as linear combinations of one another, allowing the network to learn relationships among token positions.
To incorporate this information, we add a matrix P of positional encodings to Z, giving Z + P.
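BERT employs a combination of sinusoidal functions. Mathematically, the token's position in the sequence is denoted by i, and the position of the embedding feature is denoted by j. The sinusoidal function is described in the equation below.
$$P_{ij} = \begin{cases} \sin\left(\dfrac{i}{10000^{\,j/\text{emb\_dim}}}\right) & \text{if } j \text{ is even} \\ \cos\left(\dfrac{i}{10000^{\,(j-1)/\text{emb\_dim}}}\right) & \text{if } j \text{ is odd} \end{cases}$$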
More precisely, the positional embedding matrix P for a given text would be as shown in Table 2. This deterministic approach has a number of distinct advantages over learned positional representations. For example, the input length can be increased indefinitely because the functions can be calculated for any arbitrary position. Additionally, fewer parameters have to be learned, so the model can be trained more quickly.
The resulting matrix is X = Z + P, which has size (input_length) x (emb_dim). It is the input to the first encoder block.
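The construction of X = Z + P described above can be sketched as follows. The sketch uses the sinusoidal definition from [38] (sine on even feature indices, cosine on odd ones). The all-zero matrix `Z` is only a placeholder for real token embeddings, and the sizes are assumptions for the example.

```python
import math

def positional_encoding(input_length, emb_dim):
    """Sinusoidal positional matrix P of shape (input_length, emb_dim)."""
    P = [[0.0] * emb_dim for _ in range(input_length)]
    for i in range(input_length):        # i: token position in the sequence
        for j in range(emb_dim):         # j: embedding feature index
            # Even j uses exponent j/emb_dim, odd j uses (j-1)/emb_dim.
            angle = i / (10000 ** (2 * (j // 2) / emb_dim))
            P[i][j] = math.sin(angle) if j % 2 == 0 else math.cos(angle)
    return P

def add_matrices(A, B):
    """Element-wise sum of two matrices with the same shape."""
    return [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]

INPUT_LENGTH, EMB_DIM = 6, 8                          # assumed toy sizes
Z = [[0.0] * EMB_DIM for _ in range(INPUT_LENGTH)]    # placeholder token embeddings
P = positional_encoding(INPUT_LENGTH, EMB_DIM)
X = add_matrices(Z, P)                                # input to the first encoder block
```

Because the functions are deterministic, `positional_encoding` can be called with any `input_length` without adding parameters.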

C. ENCODER BLOCK
The BERT Encoder is a transformer-based encoder that combines an attention mechanism with a feed-forward neural network. It consists of multiple encoder blocks stacked on top of one another. Each encoder block comprises two feed-forward layers and a bidirectional self-attention layer [40].
When data passes through an encoder block, a matrix of dimensions (input_length) x (emb_dim) is returned for a given input sequence, which already carries positional information from the positional encoding. In total, N such blocks are chained to obtain the Encoder output. Each block is responsible for establishing relationships between the input representations and encoding them in its output. The architecture is illustrated in Figure 1.

D. MULTI-HEAD ATTENTION
The Encoder's architecture is built around multi-head attention. It computes attention h times using different weight matrices and then concatenates the results [38]. A head is the outcome of each of these parallel attention computations [12]. The subscript i is used to denote a particular head and its corresponding weight matrices. Concatenation occurs once all the heads have been computed, producing a matrix with dimensions (input_length) x (h * d_v). Finally, a linear layer with weight matrix $W^O$ of dimension (h * d_v) x (emb_dim) is applied, producing a final output with dimensions (input_length) x (emb_dim). In terms of mathematics:
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O, \qquad \text{head}_i = \text{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)$$
In this case, Q, K, and V serve as placeholders for the input matrices.

E. SCALED DOT-PRODUCT ATTENTION
In scaled dot-product attention, each head is defined by three distinct projections (matrix multiplications) specified by the matrices $W_i^K$, $W_i^Q$, and $W_i^V$, of dimensions (emb_dim) x d_k, (emb_dim) x d_k, and (emb_dim) x d_v, respectively. The input matrix X is projected separately through these weight matrices to compute the head: $K_i = X W_i^K$, $Q_i = X W_i^Q$, $V_i = X W_i^V$.
We use these $K_i$, $Q_i$, and $V_i$ to determine the scaled dot-product attention:
$$\text{Attention}(Q_i, K_i, V_i) = \text{softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right) V_i$$
Here, the dot product of the $K_i$ and $Q_i$ projections quantifies the similarity of token projections. Taking $m_i$ and $n_j$ as the projections of the i-th and j-th tokens via $K_i$ and $Q_i$, respectively, the dot product is
$$m_i \cdot n_j = \lVert m_i \rVert \, \lVert n_j \rVert \cos\theta_{ij}$$
where $\theta_{ij}$ is the angle between the two vectors, so it reflects the similarity in direction between $m_i$ and $n_j$. Following this, the matrix is scaled by dividing it element-wise by the square root of $d_k$. The next stage applies softmax row by row, so that the entries of each row lie between 0 and 1 and sum to 1. Lastly, this result is multiplied by $V_i$ to get the head [3].
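The steps for a single head (project, take dot products, scale by the square root of d_k, apply row-wise softmax, multiply by the value projection) can be sketched in plain Python. The toy input `X` and the identity projection matrices are assumptions chosen so the result is easy to inspect; real heads use learned weight matrices.

```python
import math

def matmul(A, B):
    """Matrix product of A (n x k) and B (k x m) as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_rows(S):
    """Row-wise softmax; the row maximum is subtracted for numerical stability."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention_head(X, W_q, W_k, W_v):
    """Return (attention weights, head output) for one head."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))                          # token-to-token dot products
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = softmax_rows(scaled)                            # each row sums to 1
    return weights, matmul(weights, V)

# Toy input: 3 tokens with emb_dim = 2, and d_k = d_v = 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, head = attention_head(X, identity, identity, identity)
```

With these inputs the first token attends equally to itself and to the third token, and less to the second, because its dot product with the second token is zero.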
Consider our toy example: "Bangladesh is a beautiful country." The resulting representation of "Bangladesh" could look something like Table 3.
Multiplying this by $V_i$ gives the outcome in Table 4. This generates a matrix in which each row is composed of the tokens' representations projected via $V_i$, as shown in Table 5.
A single head here captures the association between "Bangladesh" and "country." We can compute this h times (h heads), so that each encoder block can store h different relationships. Take the earlier case as the first head.
At this stage, "Bangladesh" is represented as a weighted combination of the projected token representations. Concatenating h such weighted variations of the token representations, obtained with h distinct learned projections, yields the token representation.
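The concatenation of h heads followed by the output projection can be sketched as below. All sizes and the random weight matrices are assumptions for illustration only; the point is the shape flow from (input_length) x (emb_dim) through (input_length) x (h * d_v) and back.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in matmul(Q, K_t)]
    return matmul(weights, V)

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) triples, one per head."""
    outputs = [attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
               for W_q, W_k, W_v in heads]
    # Concatenate the heads row by row: (input_length) x (h * d_v).
    concat = [[v for out in outputs for v in out[r]] for r in range(len(X))]
    return matmul(concat, W_o)           # back to (input_length) x (emb_dim)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
INPUT_LENGTH, EMB_DIM, H, D_K, D_V = 6, 8, 2, 4, 4    # assumed toy sizes
X = random_matrix(INPUT_LENGTH, EMB_DIM, rng)
heads = [(random_matrix(EMB_DIM, D_K, rng),
          random_matrix(EMB_DIM, D_K, rng),
          random_matrix(EMB_DIM, D_V, rng)) for _ in range(H)]
W_o = random_matrix(H * D_V, EMB_DIM, rng)
output = multi_head_attention(X, heads, W_o)
print(len(output), "x", len(output[0]))
```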
The following layers comprise the position-wise Feed-Forward Network. For every row x in the preceding layer's output, it applies two linear transformations with a non-linear activation in between, where $W_1$ and $W_2$ are (emb_dim) x (d_F) and (d_F) x (emb_dim) matrices, respectively. Token vector representations do not "interact" with one another in this step: it is equivalent to performing the computation row by row and then stacking the rows in a matrix. This step's output has the dimensions (input_length) x (emb_dim).
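A minimal sketch of such a position-wise network follows. The GELU activation is an assumption taken from the activation named in the model configuration later in the paper; the bias vectors, the toy sizes, and the random weights are likewise assumptions for the example.

```python
import math
import random

def gelu(x):
    """Exact GELU, computed with the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, W, b):
    """x: vector of length n_in, W: n_in x n_out matrix, b: vector of length n_out."""
    return [sum(xi * w for xi, w in zip(x, col)) + bj for col, bj in zip(zip(*W), b)]

def feed_forward(rows, W1, b1, W2, b2):
    """Apply the same two-layer network to every row independently."""
    return [linear([gelu(h) for h in linear(x, W1, b1)], W2, b2) for x in rows]

rng = random.Random(0)
EMB_DIM, D_F = 4, 16     # assumed toy sizes
W1 = [[rng.uniform(-1.0, 1.0) for _ in range(D_F)] for _ in range(EMB_DIM)]
b1 = [0.0] * D_F
W2 = [[rng.uniform(-1.0, 1.0) for _ in range(EMB_DIM)] for _ in range(D_F)]
b2 = [0.0] * EMB_DIM

X = [[rng.uniform(-1.0, 1.0) for _ in range(EMB_DIM)] for _ in range(3)]
out = feed_forward(X, W1, b1, W2, b2)    # shape (input_length) x (emb_dim)
```

Feeding a single row through `feed_forward` gives the same vector as the corresponding row of the full output, which is the sense in which tokens do not interact in this layer.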
The output of this step is then passed to the dropout, add, and norm layers. These layers wrap every sublayer, where a sublayer is a layer whose inputs and outputs have identical dimensions (Multi-Head Attention or Feed-Forward). Dropout is applied with a 10% probability following each sublayer. This is referred to as Dropout(Sublayer(x)).
This result is added to the input x of the sublayer, yielding x + Dropout(Sublayer(x)). In the Multi-Head Attention layer, this supplements the representation of a token based on its relationship to other tokens with its original representation x.
Finally, a token-wise (row-wise) normalization is computed using the mean and standard deviation of every row. This increases the network's stability.
These layers produce the following:
$$\text{LayerNorm}(x + \text{Dropout}(\text{Sublayer}(x)))$$
This is the architecture that underpins cutting-edge NLP.
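The dropout, add, and norm steps can be sketched as follows. The normalization here uses only the row mean and standard deviation, as described above; the learned scale and shift that practical layer normalization usually includes are omitted. The `double` sublayer is a hypothetical stand-in for attention or the feed-forward network.

```python
import math
import random

def dropout(row, p, rng, training):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    if not training or p == 0.0:
        return list(row)
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in row]

def layer_norm(row, eps=1e-12):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]

def add_and_norm(x, sublayer, p=0.1, rng=None, training=True):
    """Compute LayerNorm(x + Dropout(Sublayer(x))) row by row."""
    rng = rng or random.Random(0)
    result = []
    for x_row, s_row in zip(x, sublayer(x)):
        dropped = dropout(s_row, p, rng, training)
        result.append(layer_norm([a + b for a, b in zip(x_row, dropped)]))
    return result

def double(rows):
    """Hypothetical sublayer whose output has the same shape as its input."""
    return [[2.0 * v for v in row] for row in rows]

X = [[1.0, 2.0, 3.0, 4.0], [0.0, -1.0, 5.0, 2.0]]
Y = add_and_norm(X, double, p=0.1, training=False)   # dropout disabled at inference
```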

IV. METHODOLOGY
Bangla, the seventh most widely spoken language globally, continues to be resource-constrained, resulting in a shortage of downstream task datasets and pre-trained language models. This paper therefore aims to advance Bangla NLP by adopting a monolingual BERT, leaving behind the multilingual approach trained on a restricted subset of Bangla data. We introduce a pre-trained BERT model (Bangla-BERT) for Bangla natural language processing. Bangla-BERT is an optimized BERT variant that achieves state-of-the-art performance in Bangla NLP downstream tasks. Trained on a large-scale Bengali corpus, it is well matched to Bengali vocabulary and lexical characteristics. While keeping the same model architecture as BERT, we perform additional pre-processing steps to ensure that our massive Bangla corpus fits the architecture. The methodology consists of five main tasks in two phases. The first phase comprises the first two tasks: collecting and processing the data for the dataset. The second phase comprises the remaining three tasks, which concern the model: training setup, parameter estimation, and model training.

A. DATA COLLECTION
The initial step in training the Bangla-BERT model is to build a suitable unlabeled text corpus. Since BERT is a transformer-based mechanism, it needs a huge corpus for effective training. BERT was initially trained on 3.3 billion words retrieved from the English Wikipedia and the Book Corpus. Since the Bengali Wikipedia dumps are rather modest compared to the English ones, we developed the largest Bangla language modeling dataset to resolve this issue. It is an enormous corpus drawn from internet sources, including news, web discussions, blog sites, government journals, TED Talks, subtitles, newspapers, articles, and an internet crawl, assembled to produce a sufficiently large unannotated corpus for pre-training. The dataset also contains recent news articles from various prominent Bangla newspapers, including Prothom Alo, BD News, Jugantor, and Jaijaidin. Table 6 contains data samples.

B. DATA PROCESSING
It is essential to have a high-quality, structured Bangla corpus to train Bangla-BERT. Consequently, we built the structured BanglaLM dataset from raw data as part of this work. The BanglaLM dataset is available in three variants: raw, pre-processed V1, and pre-processed V2. While the raw version can be pre-processed to meet the requirements of any specific task, we used pre-processed V1 for pre-training the model and pre-processed V2 for fine-tuning. The exact size of the whole dataset, including all three versions, is 39 GB; the V1 and V2 variants each contain approximately 20 million observations. In addition, the training corpus consists of around 821 million words and 1.7 million unique words. The text data mainly consist of strings of varying lengths. An intensive cleaning and filtering process was applied to each subcorpus: noise, emoticons, URL tags, HTML tags, and other non-meaningful content such as telephone/fax numbers and email addresses were eliminated. No advanced linguistic operation such as stemming or lemmatization was applied, because BERT is context-based and has syntactic abilities, and reducing words to their roots would weaken the syntactic information and context. All foreign-language text except English was removed because it made up less than 0.01% of the data and had no meaningful impact. Punctuation was retained in pre-processed V1, unlike in V2, since it aids in recognizing word relations. Additionally, all sentences were required to have between 3 and 512 words. Table 7 summarizes the dataset's properties before fitting into the pre-trained model v2.
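A minimal sketch of this kind of cleaning and length filtering is shown below. The regular expressions and function names are assumptions written for illustration; they are not the filters used to build BanglaLM, and they do not cover every case listed above (for example, emoticons).

```python
import re

# Hypothetical patterns for the kinds of noise named in the text.
HTML_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
PHONE_RE = re.compile(r"\+?\d[\d\-\s]{6,}\d")

def clean_sentence(text):
    """Remove HTML tags, URLs, email addresses, and phone-like numbers."""
    for pattern in (HTML_RE, URL_RE, EMAIL_RE, PHONE_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())        # collapse the leftover whitespace

def keep_sentence(text, min_words=3, max_words=512):
    """Keep only sentences whose word count lies within the thresholds."""
    return min_words <= len(text.split()) <= max_words

raw = "<p>আমি ভাত খাই</p> https://example.com test@mail.com"
cleaned = clean_sentence(raw)
print(cleaned, keep_sentence(cleaned))
```

Note that `\d` also matches Bengali digits in Python, so the phone pattern would need tightening before use on text that contains long runs of numbers.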

C. TRAINING SETUP
The monolingual BERT procedure is nearly identical in all languages. The pre-training process begins with forming a vocabulary based on the available corpora. Then, byte-pair encoding (BPE) is mainly used to produce cased and uncased vocabularies. Proper execution of these steps considerably improves the model's performance. In addition, the model works better when tokenization splits each word into fewer parts, as such tokenized sentences are more accurate [29]. Our pre-training process is divided into two essential tasks: masked language modeling and next sentence prediction. We use cross-entropy loss to train a Masked Language Model (MLM) to predict randomly masked tokens. Given N tokens, 15% of them are randomly chosen for this purpose. Of the selected tokens, 80% are replaced with a special [MASK] token, 10% are replaced with a random token, and 10% remain untouched.
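The 15% selection and the 80/10/10 replacement rule can be sketched as follows. This is a simplification: each token is selected independently with probability 0.15 rather than choosing exactly 15% of the sequence, special tokens are not protected, and `MASK_ID` and the vocabulary size are hypothetical values.

```python
import random

MASK_ID = 103   # hypothetical id of the [MASK] token

def mask_tokens(token_ids, vocab_size, rng, select_prob=0.15):
    """Return (inputs, labels); labels are None where no prediction is required."""
    inputs = list(token_ids)
    labels = [None] * len(token_ids)
    for pos, tok in enumerate(token_ids):
        if rng.random() >= select_prob:
            continue                      # token not selected for prediction
        labels[pos] = tok                 # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[pos] = MASK_ID                      # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[pos] = rng.randrange(vocab_size)    # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return inputs, labels

rng = random.Random(0)
token_ids = [rng.randrange(1000, 2000) for _ in range(10000)]
inputs, labels = mask_tokens(token_ids, vocab_size=30000, rng=rng)
selected = [i for i, label in enumerate(labels) if label is not None]
```

The cross-entropy loss would then be computed only at the positions in `selected`.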
Our process is depicted in figure 2.

D. PARAMETER ESTIMATION
Our model's configuration is critical to obtaining the desired output, so we select the configuration values carefully. The size of the feed-forward layer, or intermediate size, is 3072. The pad token id is set to 0. The non-linear activation function of the encoder and pooler is gelu. The standard deviation of the truncated normal initializer used to initialize all weight matrices is 0.02. We set use_cache to True so that the model returns its most recent key/value attention states. The complete parameter estimation, or model configuration, with each parameter and its value, is presented in Table 8.
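The values stated in this paragraph can be collected in a small configuration mapping. The key names are an assumption: they follow the field names commonly used by the Hugging Face `BertConfig` class, and only the settings mentioned in the text are listed, not the full contents of Table 8.

```python
# Partial configuration; key names are assumed, values are the ones stated in the text.
bangla_bert_config = {
    "intermediate_size": 3072,    # size of the feed-forward (intermediate) layer
    "pad_token_id": 0,            # id of the padding token
    "hidden_act": "gelu",         # activation in the encoder and pooler
    "initializer_range": 0.02,    # std of the truncated normal weight initializer
    "use_cache": True,            # return the most recent key/value attention states
}

for name, value in bangla_bert_config.items():
    print(f"{name} = {value}")
```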

E. MODEL TRAINING
Our model is based on the BERT architecture, and in the training setup, this work mainly uses the original BERT [41], which is well-suited for cases involving a large amount of data or parameters, as BERT [3] demonstrates. A learning rate of 1e-6 with β1 = 0.900 and β2 = 0.999 was chosen, and an epsilon of 1e-6 was used for numeric stability. Pre-training was conducted entirely on Google's Cloud TPU v3 and took 120 hours to complete.
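The hyperparameters above (learning rate, β1, β2, and epsilon) have the form used by Adam-style optimizers. Assuming an Adam update, which the text does not name explicitly, a single step for one scalar parameter can be sketched as below; weight decay and learning-rate scheduling are left out.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-6, beta1=0.9, beta2=0.999, eps=1e-6):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad           # first-moment (mean) estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad    # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                 # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# First step from parameter 1.0 with gradient 1.0 and zero-initialized moments.
param, m, v = adam_step(1.0, 1.0, 0.0, 0.0, t=1)
print(param)
```

On the first step the bias-corrected moments equal the gradient and its square, so the parameter moves by roughly the learning rate.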

V. VOCABULARY BUILDING
Tokenization is the process of breaking down a phrase, sentence, paragraph, or even an entire text document into small chunks called tokens. Tokens are instances of a linguistic unit in speech or writing, as opposed to the type or class of linguistic unit of which they are an occurrence. Sub-word tokenization is the most effective of several tokenization techniques because it tackles the out-of-vocabulary (OOV) problem and considerably reduces the number of model parameters. It is based on the principle that frequently recurring words should be included in the vocabulary, while uncommon words should be divided into recurring sub-words. There are several methods of sub-word tokenization, one of which is the WordPiece tokenizer [42]. The proposed model employs the WordPiece tokenizer.
WordPiece begins by including all characters and symbols in its base vocabulary. Once the desired vocabulary size is set, sub-words are added until that size is reached. When expanding the vocabulary, WordPiece picks the candidate that maximizes the likelihood of the training data. To do so, it determines the frequency of individual symbols and adds symbol pairs to the vocabulary according to the count below [43].
$$\text{Count}(x, y) = \frac{\text{frequency}(x, y)}{\text{frequency}(x) \times \text{frequency}(y)}$$
The symbol pair with the highest count is selected for inclusion in the vocabulary. Whenever a pair is added to the vocabulary, the model is retrained with the new vocabulary. This procedure is repeated until the required vocabulary size is attained.
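The pair-scoring rule above can be sketched on a toy corpus. The word counts and the "##" continuation prefix are assumptions for the example, and only one scoring pass is shown, not the full loop that merges a pair and rescans the corpus.

```python
from collections import Counter

def wordpiece_pair_scores(word_counts):
    """Score each adjacent symbol pair by freq(x, y) / (freq(x) * freq(y))."""
    unit_freq, pair_freq = Counter(), Counter()
    for symbols, count in word_counts.items():
        for symbol in symbols:
            unit_freq[symbol] += count
        for pair in zip(symbols, symbols[1:]):
            pair_freq[pair] += count
    return {pair: freq / (unit_freq[pair[0]] * unit_freq[pair[1]])
            for pair, freq in pair_freq.items()}

# Toy corpus: each word is already split into symbols, with its frequency.
word_counts = {
    ("h", "##u", "##g"): 10,
    ("p", "##u", "##g"): 5,
    ("p", "##u", "##n"): 12,
    ("b", "##u", "##n"): 4,
    ("h", "##u", "##g", "##s"): 5,
}
scores = wordpiece_pair_scores(word_counts)
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])
```

The pair ("##g", "##s") wins here even though it is not the most frequent pair, because its parts rarely occur apart from each other; dividing by the individual frequencies is what distinguishes this score from plain pair counting.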

VI. EVALUATION AND RESULT
We assessed Bangla-BERT on four downstream tasks for Bangla language understanding: cross-lingual sentiment analysis, named entity recognition, binary text classification, and multi-class sentiment analysis. In addition, we compared Bangla-BERT against the multilingual variant of BERT and against other neural techniques such as fasttext [44] and word2vec [45], which serve as baselines for each task.

A. EVALUATION METRIC
1) ACCURACY
Accuracy is a performance metric for machine learning classification models, defined as the proportion of true positives and true negatives among the total number of positive and negative observations. In other words, accuracy is the proportion of predictions that the model gets right out of the total number of predictions it makes. Mathematically, it is the ratio of the sum of all true positives (TP) and true negatives (TN) to the total number of observations:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where FP and FN are the false positives and false negatives.
The F1 score expresses the model's score as a function of its recall and precision. It is an alternative to accuracy that gives equal weight to precision and recall when analyzing the performance of a machine learning model. It can be expressed mathematically as the harmonic mean of precision and recall:
$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
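Accuracy and the F1 score, as defined above, can be computed from the four confusion counts. The labels below are made-up toy data for a binary task, and the functions assume at least one predicted and one actual positive so that no division by zero occurs.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)   # harmonic mean

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, tn, fp, fn)
f1 = f1_score(tp, fp, fn)
print(tp, tn, fp, fn, acc, f1)
```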

6) HAMMING LOSS
The Hamming loss is the proportion of wrongly predicted labels.
Here, y_true denotes the actual labels and y_pred denotes the predicted labels.
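For a multilabel task, the Hamming loss can be sketched as the fraction of individual label slots that are predicted wrongly. The label matrices below are made-up toy data, and `y_pred` is assumed to hold hard 0/1 predictions rather than probabilities.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of label positions where prediction and truth disagree."""
    total = sum(len(row) for row in y_true)
    wrong = sum(1 for true_row, pred_row in zip(y_true, y_pred)
                for t, p in zip(true_row, pred_row) if t != p)
    return wrong / total

# Two samples with three labels each.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 1, 1], [0, 1, 1]]
loss = hamming_loss(y_true, y_pred)
print(loss)
```

Lower values are better; a loss of 0 means every label of every sample was predicted correctly.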

B. NAMED ENTITIY RECOGNITION
Named Entity Recognition (NER) categorizes various tokens in a text using pre-defined categories. It is structured as a categorization (or tagging) task at the word level, with classes referring to pre-defined groups such as persons, places, institutions, occurrences, and time expressions. While most machine learning approaches have been used previously to solve the Bangla named entity task, including Hidden Markov Model (HMM), Conditional Random Fields (CRF), Support Vector Machine (SVM), Maximum Entropy (ME), and Multi-Engine Method, the BERT approaches have yet to demonstrate their ability.
We used the dataset created by [46] for this task. It provides training and test sets, and the F1 score is used to assess effectiveness. The annotated corpus comprises 96697 tokens drawn from prominent newspapers, of which 67554 are used for training and 29143 for testing. We present an NER approach that produces state-of-the-art (SOTA) results and outperforms all previous methods on this benchmark.
The results in Table 9 show that our model far exceeds all previous work in this domain, obtaining an F1 score of 0.9995. It outperforms the preceding BGRU+CNN model by 0.2659 and the multilingual variant by 0.0481. mBERT has an F1 score of 0.9514, which is about 0.22 higher than the previous best model. The precision and recall of mBERT are 0.9587 and 0.9429, respectively, while Bangla-BERT reaches 0.9980 and 0.9999. As a result, Bangla-BERT becomes the new state of the art for NER on the Bengali NER corpus.
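Because NER is framed as word-level tagging, precision, recall, and F1 can be computed over the tokens that carry an entity tag. The sketch below shows one common token-level scheme; the exact scoring scheme used for Table 9 is not stated, so this is an assumption, and the tag names and sample sequences are hypothetical.

```python
def ner_token_scores(gold, pred, outside="O"):
    """Token-level precision, recall, and F1 over entity tags.

    Tokens tagged `outside` in both sequences are ignored, so the large
    number of non-entity tokens does not inflate the scores.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == p and g != outside)
    fp = sum(1 for g, p in pairs if p != outside and g != p)
    fn = sum(1 for g, p in pairs if g != outside and g != p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical tag sequences for a six-token sentence.
gold = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG"]
pred = ["B-PER", "O", "O", "B-LOC", "B-LOC", "B-ORG"]
p, r, f1 = ner_token_scores(gold, pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Here one entity token is missed and one non-entity token is tagged spuriously, so precision, recall, and F1 all equal 0.75.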

C. CROSS LINGUAL SENTIMENT ANALYSIS
Sentiment analysis is a sub-field of Natural Language Processing that focuses on examining individual views or emotions about a particular subject, gathered from various resources [47]. Owing to a lack of annotated data and a scarcity of language processing tools, sentiment analysis in low-resource languages such as Bangla remains underexplored. Salim Sazzed [48] produced and annotated an extensive corpus of approximately 12000 Bangla reviews, which became the benchmark for Bangla sentiment analysis. The corpus has 11807 annotated reviews, with about 2-300 Bengali words per review. The reviews were categorized as positive, negative, or non-subjective. The corpus is unbalanced, with 3307 negative reviews and 8500 positive ones. The training data has shape (9444, 524). We chose this dataset for the sentiment analysis experiments.
Table 10 illustrates the sharp difference in performance between the earlier word embedding approaches and the BERT architecture. Word2vec is the least effective, with an accuracy of 0.8587 and an F1 score of 0.73; its error of 0.1413 is the highest among all methods. The two variants of BanfastText (CBOW and skip-gram) achieve nearly identical accuracies of 0.9441 and 0.9373, with F1 scores of 0.9101 and 0.9134, respectively. The multilingual BERT variant performed marginally better than the earlier techniques, earning an accuracy of 0.92 and an F1 score of 0.93. Bangla-BERT significantly surpasses these approaches, reaching an accuracy of 0.9703 and an F1 score of 0.9621.
BanfastText shows a loss of approximately 0.0526, whereas the word2vec model yields a loss of 0.1414, almost three times the BanfastText (CBOW, skip-gram) value. A lower loss implies better classifier performance, and the BERT model gives the lowest loss, 0.0263, and the best result. The AUC score is a useful measure of a classifier because it is less sensitive to the class distribution of the dataset. Word2vec attains an AUC of 0.749, whereas BanfastText (CBOW, skip-gram) improves on it by 10% and 9%, respectively. The mBERT and Bangla-BERT models show higher AUC scores of 0.899 and 0.939, respectively. The error is lowest for Bangla-BERT, at only 0.0297, whereas mBERT shows an error of 0.0509.
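For binary labels, the AUC score can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison, which is adequate for small samples; the function name and the sample scores are hypothetical, and the experiments may have used a library routine instead.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative).

    Counts the fraction of positive/negative pairs that are ranked
    correctly, with ties counted as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and classifier scores.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
print(f"AUC = {roc_auc(y_true, scores):.4f}")  # 0.7778
```

Seven of the nine positive/negative pairs are ordered correctly, giving 7/9. The value depends only on the ranking of the scores, not on a decision threshold.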

D. BINARY TEXT CLASSIFICATION ON BANFAKE NEWS DATASET
We evaluated our model on the Bangla fake news dataset, which comprises about 50K Bangla news items and can be used to build automated fake news detection systems. This dataset establishes a new baseline in the Bangla language for binary text classification by considering a broad range of linguistic features. Its authors gathered authentic news from 22 of Bangladesh's most popular and widely trusted news portals. The content includes misleading, clickbait, and satirical items and is arranged under 12 categories, each further divided into authentic and fake news. The dataset contains 48678 authentic news items and 1299 fake ones [49].
The results of fine-tuning on the BanfakeNews dataset are shown in Table 11. Word2vec reaches an accuracy of 0.9455, an F1 score of 0.75, and a Hamming loss of 0.1814; its error of 0.0545 is the highest among all techniques. BanfastText (CBOW) performs considerably better, with an accuracy of 0.9814, an F1 score of 0.8812, and a Hamming loss of 0.1801. Both BanfastText variants, CBOW and skip-gram, reach the same level of accuracy. However, skip-gram produces a higher F1 score of 0.8923, with a Hamming loss of 0.1802, nearly equal to that of CBOW. mBERT has an accuracy of 0.9809 and an F1 score of 0.9201; its Hamming loss of 0.1123 is 0.0679 lower than that of the skip-gram model. Bangla-BERT delivers the best results on all three measures, with an accuracy of 0.9941, an F1 score of 0.9421, and the lowest Hamming loss of 0.1013. It also shows an error of 0.0059, the lowest among all techniques. Bangla-BERT's AUC score of 0.979 is the best among all methods. Thus, Bangla-BERT becomes the new state-of-the-art model for binary text classification.
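The class counts reported for this dataset (48678 authentic and 1299 fake items) explain why F1 and AUC are reported alongside accuracy. As a rough illustration, the sketch below scores a trivial classifier that always predicts the majority class over the full dataset; the function name is hypothetical, and the actual train/test split is not considered.

```python
def majority_baseline(n_majority, n_minority):
    """Accuracy and minority-class F1 of always predicting the majority class."""
    total = n_majority + n_minority
    accuracy = n_majority / total
    # No minority item is ever predicted, so TP = 0, FP = 0, FN = n_minority.
    tp, fp, fn = 0, 0, n_minority
    denom = 2 * tp + fp + fn
    minority_f1 = 2 * tp / denom if denom else 0.0
    return accuracy, minority_f1


acc, fake_f1 = majority_baseline(48678, 1299)
print(f"baseline accuracy = {acc:.4f}")      # 0.9740
print(f"fake-class F1     = {fake_f1:.4f}")  # 0.0000
```

A model can therefore reach about 0.974 accuracy without detecting a single fake item, so differences between methods show up far more clearly in the F1 score than in accuracy.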

E. BANGLA NEWS COMMENT BASELINE (MULTICLASS SENTIMENT ANALYSIS)
The Dataset for Sentiment Analysis on Bengali News Comments is a trustworthy dataset that is freely accessible for evaluating various models. We apply our model to this dataset for multi-class sentiment analysis. The data, comprising 13809 posts, was gathered from Prothom-Alo, a well-known online news portal. The dataset covers the ten most frequent comment topics, such as opinions, sports, the Bangladesh economy, entertainment, and technology. The data set is classified into five traditional sentiment categories: strongly positive, positive, neutral, negative, and strongly negative, with each input tagged three times to ensure the data set's validity and reliability. The slightly positive label contains 1436 observations; the positive, neutral, and negative labels have 2279, 2955, and 3936 observations, respectively; and the slightly negative label includes 3203 observations. The data set is not biased in any direction. Typical model evaluation methods fail to quantify model performance adequately when confronted with unbalanced data sets: the characteristics of the minority class are frequently dismissed as noise, so the minority class runs a considerably higher risk of being misclassified than the dominant class. There are 248562 words in all, 244432 of which are in Bengali and 4130 in other languages, and the remainder is numeric [50].
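The per-label counts given above can be turned into class shares to see how the five sentiment labels are distributed. The sketch below is a small illustration; the label names follow the sentence that reports the counts, and the helper names are hypothetical.

```python
# Observation counts per sentiment label, as reported in the text.
counts = {
    "slightly positive": 1436,
    "positive": 2279,
    "neutral": 2955,
    "negative": 3936,
    "slightly negative": 3203,
}


def class_shares(counts):
    """Share of the dataset that falls under each label."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def imbalance_ratio(counts):
    """Size of the largest class relative to the smallest."""
    return max(counts.values()) / min(counts.values())


for label, share in class_shares(counts).items():
    print(f"{label:18s} {share:.3f}")
print(f"total posts     = {sum(counts.values())}")        # 13809
print(f"imbalance ratio = {imbalance_ratio(counts):.2f}")  # 2.74
```

The counts sum to the 13809 posts reported for the dataset, and the largest class is about 2.74 times the size of the smallest.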
Table 12 summarizes the results of multiclass sentiment analysis. Moving from word2vec to Bangla-BERT, the accuracy increases steadily. The gap between the earliest feature extraction method and the most recent transfer learning methodology is relatively large, at 0.24. Word2vec achieves an accuracy of 0.5824 and an F1 score of 0.71 with a Hamming loss of 0.4175; its error of 0.4177 is the highest among all techniques. BanfastText CBOW and BanfastText skip-gram obtain accuracies of 0.76 and 0.75, respectively. Their F1 scores are nearly equal, at 0.7491 and 0.7427, a minor improvement over word2vec's 0.71. CBOW has a Hamming loss of 0.2966, while skip-gram has a Hamming loss of 0.2879. mBERT has a higher accuracy of 0.7992 and an F1 score of 0.7621, and its Hamming loss of 0.1942 is lower than that of all the preceding methods. Bangla-BERT yields a significantly lower Hamming loss of as little as 0.1013. Moreover, it outperforms all preceding methods in accuracy and F1 score, obtaining 0.84 and 0.81, respectively. The AUC score is also highest for Bangla-BERT, at 0.826, which is better than the skip-gram LSTM model by 9% and word2vec by 24%. It also shows an error of 0.1584, the lowest among all techniques.

VII. COMPARISON WITH THE PREVIOUS STUDIES
Table 13 highlights the performance of all state-of-the-art (SOTA) approaches and Bangla-BERT on some of the most prominent Bangla text classification datasets. For example, for binary classification on the Banfakenews dataset, the previous best performance is an F1 score of 0.91, obtained by [49] by combining all standard linguistic features with an SVM classifier. Bangla-BERT outperforms this with an F1 score of 0.9421.
On the dataset for multiclass sentiment analysis on Bengali news comments, the previous state-of-the-art result was an accuracy of 0.7474 and an F1 score of 0.79, obtained with an LSTM model. Bangla-BERT improves on this, establishing new state-of-the-art results of 0.8417 accuracy and 0.8104 F1 score for this dataset. Another dataset, Cross-lingual Sentiment Analysis in Bengali Using a New Benchmark Corpus, confirms Bangla-BERT's superiority: the previous best result does not match Bangla-BERT's. Using the LR classifier in transfer learning and the best unsupervised technique, TextBlob, yields nearly identical accuracy and F1 scores of approximately 0.82 and 0.78, respectively. Their supervised method in conjunction with SVM improves performance to 0.93 accuracy and a 0.91 F1 score. Bangla-BERT outperforms all three approaches and can be designated the state-of-the-art model for these datasets in Bangla.

VIII. FUTURE WORK
Using this work as an experiment for a resource-constrained language like Bangla, we have illustrated how powerful a pre-trained deep model can be. The next stage will evaluate RoBERTa and other BERT architectures such as DeeBERT, MobileBERT, SpanBERT, and ALBERT on Bangla to strengthen NLP research in Bangladesh. Additionally, we intend to provide a high-level API and a Python module so that developers can access and use the model in various applications. We will examine the usefulness of the Bangla-BERT model across various business applications. We will also explore how multiple levels of linguistic abstraction are encoded within Bangla-BERT, to properly understand and analyze what information the model acquires. Currently, the majority of users mix Bangla and English in a variety of contexts. However, our model has been trained only on Bangla, so Bangla-BERT cannot be used for such applications. In the future, we will combine Bangla and English in datasets to train a model.

IX. CONCLUSION
The emergence of Transformer-based pre-trained language models rapidly expanded the accessibility of high-performing models to the typical user. Although several established multilingual BERT models include Bangla, the only Bangla-specific BERT model known to date was trained on minimal website data. We used the most extensive Bangla text corpus to pre-train the language model. This paper efficiently pre-trains the Bangla-BERT model following the state-of-the-art BERT architecture. We make it available to the community together with the training corpus and evaluation benchmarks. Practitioners from fields other than computer science can fine-tune it for domain-specific downstream tasks. Because a pre-trained NLP model is easy to use, its use cases are much broader. By publishing our Bangla-BERT model, we intend to promote deep learning research and applications in Bangla-speaking nations. Additionally, the work will optimize Bangla NLP models in complexity, storage, and processing requirements.
ABDULLAH AS SAMI is currently pursuing the B.Sc. degree in computer science and engineering with the Chittagong University of Engineering and Technology, Chittagong, Bangladesh. He is also an Instructor in Python and machine learning at a number of prestigious online learning platforms. Additionally, he works as a part-time Freelancer and a Deep Learning Enthusiast. He has worked on a variety of deep learning and natural language processing projects. He has successfully developed numerous novel approaches to machine learning problems, implemented them in production, and demonstrated writing and research abilities that contribute to attaining productivity milestones. His research interests include Bengali language processing, machine translation, information retrieval (speech recognition), sentiment analysis and opinion mining, and machine learning algorithms.
NUSRAT JAHAN PROTTASHA received the B.Sc. degree in computer science and engineering from Daffodil International University, in 2021. She is currently working with Data Science Platform as a Research Assistant with several professors. In 2020, she received the Best Paper Award from the International Conference of Cyber Security and Computer Science. In addition, her university has recognized her for publishing four articles in Scopus-indexed journals.
MOHAMMAD SHAMSUL AREFIN (Senior Member, IEEE) received the Doctor of Engineering degree in information engineering from Hiroshima University, Japan, with the support of the scholarship of MEXT, Japan. He is in lien with the Chittagong University of Engineering and Technology (CUET), Bangladesh, and currently affiliated with the Department of Computer Science and Engineering (CSE), Daffodil International University, Dhaka, Bangladesh. Earlier, he was the Head of the Department of CSE, CUET. As a part of his Ph.D. research, he was with the IBM Yamato Software Laboratory, Japan. His research interests include privacy-preserving data publishing and mining, distributed and cloud computing, big data management, multilingual data management, semantic web, object-oriented system development, and IT for agriculture and the environment. He has more than 110 refereed publications in international journals, book series, and conference proceedings. He is a member of ACM, a fellow of IEB, and a fellow of BCS. He is the Organizing Chair of BIM 2021; the TPC Chair, ECCE 2017; the Organizing Co-Chair, ECCE 2019; and the Organizing Chair, BDML 2020. He visited Japan, Indonesia, Malaysia, Bhutan, Singapore, South Korea, Egypt, India, Saudi Arabia, and China, for different professional and social activities.
PRANAB KUMAR DHAR received the B.Sc. degree from the Chittagong University of Engineering and Technology (CUET), Chittagong, Bangladesh, in 2004, the M.Sc. degree from the University of Ulsan, Republic of Korea, in 2010, and the Ph.D. degree from Saitama University, Japan, in 2014. In 2005, he joined as a Lecturer with the Department of Computer Science and Engineering, CUET, where he is currently working as a Professor. He has published over 30 refereed journal articles and 40 conference papers. He is the author of two books, one book chapter, and one patent. His research interests include multimedia security, digital watermarking, steganography, multimedia data compression, sound synthesis, digital image processing, and digital signal processing. He is a member of the technical committee of several international conferences. He serves as a reviewer of various reputed journals, including IEEE, IEICE, Elsevier, and Springer.
TAKESHI KOSHIBA (Member, IEEE) received the B.E., M.E., and Ph.D. degrees from the Tokyo Institute of Technology, in 1990, 1992, and 2001, respectively. He is currently a Full Professor at the Department of Mathematics, Faculty of Education and Integrated Arts and Sciences, Waseda University, Japan. His research interests include theoretical and applied cryptography, randomness in algorithms, and quantum computing and cryptography.