BaNeP: An End-to-End Neural Network Based Model for Bangla Parts-of-Speech Tagging

In Natural Language Processing, parts-of-speech (POS) tagging is a vital component that significantly impacts applications like machine translation, spell checking, information retrieval, and speech processing. In languages such as English and Dutch, POS tagging is considered a solved problem (accuracy: ~97%). However, for low-resource languages like Bangla, significant challenges remain. In this article, we propose a novel RNN-based network named BaNeP to determine the parts of speech of Bangla words. The proposed network extracts structural features through a bidirectional LSTM-based sub-network, while intricate contextual relations among the words of a sentence are identified through an elaborate weighted-context extraction procedure. These features are then jointly used to generate the final parts-of-speech prediction. Training the model requires only an annotated dataset, eliminating the need for any hand-crafted features. Experimental results on the LDC2010T16 dataset show a significant accuracy improvement over existing Bangla POS taggers.

Earlier, when there was no language to communicate, humans used sign language to exchange their thoughts, much like how we communicate with our pets. Suppose we tell our dog, ''Cooper, we love you''; he responds by wagging his tail. This does not mean he actually understands what we say, but he can read our expressions and understand our emotions and gestures more than our words. As the most intellectual beings, humans have developed an understanding of many nuances of natural languages more than any other animal on this planet.
That is why, when someone says ''I LOVE you, Cooper'' vs. ''My LOVE, let's go for a long drive'', the word 'LOVE' has a different meaning. In the first phrase, 'LOVE' expresses the speaker's love for her/his pet, whereas in the second phrase, the speaker uses the word 'LOVE' to address her/his dearest person. As humans, we can understand the contextual meaning of the word LOVE in these two phrases, and that is why our responses will differ. Nevertheless, teaching our machines to understand these intricate contextual differences is onerous. So, in the future, when we develop a home-care robot that hears ''I LOVE you, Cooper'', it will understand that LOVE is a verb expressing the speaker's emotion toward the dog, so the robot should pay more attention to caring for Cooper. In contrast, on hearing ''My LOVE, let's go for a long drive'', where LOVE is a noun, the robot can understand this is none of its business and simply leave the room. This example highlights how important it is to identify the part of speech of a particular word to understand its meaning in different contexts. An application like text-to-speech conversion performs POS tagging as a part of preprocessing. For example, in the sentence ''They refuse to permit us to obtain the refuse permit'', the word 'refuse' appears first as a verb and then as a noun.

A word W can be viewed as a sequence of characters, W = {C_0, C_1, C_2, ..., C_n}. The main objective of the POS tagging method is to find the proper part of speech for each word W_k of a sentence S. For example, the POS tag for the word ' ' will be a common noun. So, we can narrow our task down to taking a sequence as input and finding a tag for it: since we view a word as a sequence of characters, this ultimately aligns with sequence labeling, a popular and crucial process in NLP.
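Framed as code, this sequence-labeling view is simple (an illustrative Python sketch; the tag lookup is a made-up placeholder, not the actual model):

```python
# A word W is viewed as its character sequence {C_0, ..., C_n};
# POS tagging assigns a tag to each word W_k of a sentence S.
def as_char_sequence(word):
    return list(word)

def tag_sentence(sentence, lookup):
    """Assign a tag to every word; 'lookup' stands in for a real tagger."""
    return [(w, lookup.get(w, "UNK")) for w in sentence.split()]

toy_lookup = {"the": "DET", "dog": "NOUN"}   # hypothetical tag dictionary
print(as_char_sequence("permit"))            # ['p', 'e', 'r', 'm', 'i', 't']
print(tag_sentence("the dog runs", toy_lookup))
```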
Generally, two types of information are processed to determine the part of speech of a particular word: the word's definition or structure, and its sense or contextual relation with other words in the sentence. For example, in English, words with suffixes such as -ion, -sion, -tion, -acy, and -ment are commonly categorized as nouns (population, accuracy, government). Similarly, words ending with -ly, -ful, -ous, etc., are generally considered adjectives. But exceptions exist: 'happily' ends with -ly and so seems to be an adjective, but it is an adverb. So, word definition or structure alone will not help much in identifying the POS tag of a word. The situation becomes more arduous when we need to understand the word's sense, or the purpose of its usage, within a particular sentence or phrase. For example,

• NOUN - He carried a log on his back.
• VERB - He did not back me in this case.
• ADJECTIVE - He went through the back door.
• ADVERB - He turned back to look at me.

Here, the word 'back' has different POS tags in different sentences depending on its sense and context. Some words can be converted to a different part of speech in English, like the verb 'choose' and its corresponding noun 'choice'. In this case, the word structure changes along with the part of speech. However, in the previous case, the POS tag of the word 'back' depends entirely on the context, which is quite difficult to determine not only for machines but even for intelligent humans. So, word structure provides some information regarding the POS tag, but the context of the word within the corpus has a far greater impact on determining the proper tag. Imagine a corpus that is a novel, poetry, or some intellectual metaphor [10], [11] written by a world-famous writer. How challenging will it be to extract word context from such a corpus? It is thus evident that POS tagging will be more challenging for languages with fewer structural features that depend heavily on contextual sense. That is why less-resourced languages like Bangla do not yet have POS tagging methodologies with noticeable accuracy. Let us take a more detailed look at the origin and evolution of the Bangla language to understand its challenges and associated difficulties.
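The limits of such structural cues can be seen in a minimal Python sketch (the rule table below is an illustrative assumption, not an exhaustive grammar):

```python
# Naive suffix-based tagger: it captures some structural cues,
# but mislabels exceptions such as the adverb 'happily'.
SUFFIX_RULES = [
    (("ion", "sion", "tion", "acy", "ment"), "NOUN"),
    (("ly", "ful", "ous"), "ADJECTIVE"),
]

def suffix_tag(word, default="UNKNOWN"):
    for suffixes, tag in SUFFIX_RULES:
        if word.endswith(suffixes):   # str.endswith accepts a tuple
            return tag
    return default

print(suffix_tag("population"))  # NOUN
print(suffix_tag("happily"))     # ADJECTIVE -- wrong, it is really an adverb
print(suffix_tag("back"))        # UNKNOWN -- structure alone says nothing
```

The last two calls show exactly the failure modes discussed above: a structural rule misfires on 'happily', and a context-dependent word like 'back' is left undecided.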

History reveals that the Bangla language can be traced back to 3500 B.C., to the Indo-European language family. Many scholars and linguists have assumed that Bangla was born from Sanskrit. But …

… Every type has subcategories, just like in English and other languages. To make the situation more challenging, almost every word in every subcategory has a ''shuddho/formal'' form and a ''cholito/casual'' form.

After the Battle of Plassey, western influence began shaping local cultural norms. As a result, the English language started to impose a significant impact on Bangla. Words such as 'Chair' …

… [18] jointly. However, these RNN-based models, rather than replacing hand-crafted features, combined them for better performance. As a result, the performance of such systems drops rapidly when the model depends solely on neural embeddings. To address these shortcomings, a sophisticated neural network architecture incorporating BiLSTM, CNN, and CRF was proposed by Ma and Hovy [19]. A CNN is used to encode character-level information of a word. Character- and word-level representations are then combined and fed to a BiLSTM and a CRF to decode labels for the entire sentence. That is how their model benefits from both word- and character-level information without requiring task-specific resources, feature engineering, or data preprocessing beyond word embeddings pre-trained on an unlabeled corpus. Experiments showed that the proposed model achieved 97.21% accuracy on POS tagging for the Penn Treebank WSJ corpus and 91.21% (F1) on named entity recognition for the CoNLL 2003 corpus [20]. In 2019, Akbik et al. came up with the idea of contextual string embeddings, where a character language model is trained and its internal states are leveraged to produce a novel word embedding system [21].
They treated each word as a sequence of characters, and their model was trained without any explicit notion of words. Besides, they used different embedded representations of the same word depending on its context. Their model's performance was evaluated on the CoNLL 2003 shared task dataset and achieved state-of-the-art F1 scores for English (93.09%) and German (88.32%). Tsai et al. [22] proposed a BERT-based model for the sequence labeling task. They were able to design a small, fast, high-accuracy multilingual model that beat state-of-the-art baselines [23], [24]. The problem is that in some languages like Bangla, determining the contextual relation of a particular word may require longer text, whereas BERT and other transformer-based models can only grasp contextual relations within a user-defined fixed length. An attention-based RNN model has been proposed by Lin et al., which relies on a hierarchical attention neural semi-Markov conditional random field (semi-CRF) model for the task of sequence labeling [25]. In their proposed model, character- and word-level information, along with an attention mechanism, is used to determine the sequence label. Their model was evaluated on three sequence labeling tasks: named entity recognition, chunking, and reference parsing. Experimental results showed that it achieves competitive and robust performance in all three tasks. This is undoubtedly remarkable work, but their model is designed for generic sequence labeling tasks. The authors did not pay particular attention to the complex cases that can occur in POS tagging, as mentioned in Section I, especially for languages that depend little on structural information and demand significant contextual processing, like Bangla.
As a result, their …

This section focuses on parts-of-speech taggers for languages that belong to the Indo-Aryan language family, like Bangla. One of the oldest approaches to parts-of-speech tagging is … Jain [26], which is based on a handwritten set of rules. These … Hindi POS tagging at that time. Another study, carried out by Pisceldo et al. [28], explored stochastic POS tagging techniques for the Indonesian language. They constructed their model using CRF and maximum entropy (ME) methods for assigning a POS tag to a word [28]. Over the years, researchers such …

In this section, we analyze the methodologies designed for Bangla POS tagging. We also extend our discussion to languages very similar to Bangla, such as Assamese. We start our discussion with a rule-based approach proposed by Chowdhury et al. [36]. In their proposed model, …

In between, several other studies on Bangla POS tagging have also been done, but none seems convincing in terms of accuracy, the dataset used, or model architecture. For instance, Kabir et al. trained a Deep Belief Network (DBN) on the LDC dataset for the Bangla POS tagging task and claimed to achieve an accuracy of 93.33% [41]. But the authors did not delineate their model's architecture, making it impossible to replicate their network. Even the optimal hyper-parameter combinations behind the reported accuracy are not mentioned. In 2018, Uddin et al. proposed a feed-forward neural network approach for Bangla POS tagging [42]. They built a trie to capture the structural similarity of words. But in POS tagging, a structurally similar word may have a different POS tag depending on the context. The authors then deployed a simple FNN to calculate a word's probability of having a particular POS tag. Their model did not capture any long-term dependency of a particular word within its context.
Moreover, they did not use any standard dataset to evaluate their model. A rule-based approach with an approximate accuracy of 94% was proposed by Roy et al. [43], where the authors constructed several grammatical rules to identify the POS tag of a particular word. As mentioned earlier, rule-based models have lower adaptability to new or unseen contexts, reducing their dependability. Besides, there is no guarantee that all rules have been exhaustively considered.

… Similarly, SFENet then concatenates the forward and backward cell states at time t = n, which, after training, essentially hold the structural characteristics of the word.
SFENet does not intend to extract the structure of the entire sentence, and the processing of each word is independent. This creates the opportunity for parallel processing of every word in the sentence (in fact, of all words of all sentences in a batch) given sufficient computational power. Thus, the character-level processing of SFENet does not create a time-dependent bottleneck for the entire POS-tagging task, especially for contextual feature extraction, which depends on word-level processing. The potential of SFENet stretches beyond part-of-speech tagging: applications like named-entity recognition, lemmatization, and word-sense disambiguation also require structural feature extraction, creating further scope for SFENet.
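The core idea can be sketched in NumPy (a toy with random, untrained parameters; the dimensions and function names are assumptions, not the actual SFENet configuration): run an LSTM over a word's characters in both directions and concatenate the two final cell states into the structural feature SF_k.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_final_cell(xs, W, U, b, hidden):
    """Run a bare LSTM over a list of input vectors; return the final cell state."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in xs:
        z = W @ x + U @ h + b                 # all four gates computed at once
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return c

def structural_feature(word, char_emb, params_f, params_b, hidden=8):
    xs = [char_emb[ch] for ch in word]
    fwd = lstm_final_cell(xs, *params_f, hidden)        # forward c at t = n
    bwd = lstm_final_cell(xs[::-1], *params_b, hidden)  # backward c at t = n
    return np.concatenate([fwd, bwd])                   # SF_k, length 2*hidden

rng = np.random.default_rng(0)
emb_dim, hidden = 4, 8
char_emb = {ch: rng.normal(size=emb_dim) for ch in "abcdefghijklmnopqrstuvwxyz"}
make = lambda: (rng.normal(size=(4 * hidden, emb_dim)),
                rng.normal(size=(4 * hidden, hidden)),
                np.zeros(4 * hidden))
params_f, params_b = make(), make()

sf = structural_feature("back", char_emb, params_f, params_b, hidden)
print(sf.shape)  # (16,)
```

Because each word is processed in isolation, calls to `structural_feature` for different words are trivially parallelizable, which is exactly the property discussed above.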

Almost every language has homonyms, placing POS tagging beyond the reach of structural features alone. Taking context into consideration is thus inevitable. Due to its highly inflectional nature, Bangla complicates the scenario manyfold. For many languages, POS detection has become near perfect with a simple pre-trained Word2Vec for semantic features, plus contextual sense disambiguation through a bidirectional LSTM applied to the merged semantic and structural features. Some languages require an extra CRF (Conditional Random Field) based procedure to properly capture the contextual impact of one word on another in a sentence. However, the inherently complex contextual nature of Bangla can hardly be captured with such a shallow architecture for context encoding. To address this need for deeper contextual consideration, we present a two-phase weighted-context generation procedure, taking inspiration from attention-based encoder-decoder models. In the first phase, a neural network generates an unweighted context for the entire sentence; in the second, another neural network generates a weighted context for each word. Figure 3 shows the model structure of WCGNet.

WCGNet is inspired by the LSTM-based decoder architecture of the sequence-to-sequence model. The unweighted context from the context encoder network passes through a Bahdanau attention layer along with WCGNet's previous hidden state. For the first word of the sentence, the context encoder's last hidden state is supplied instead of WCGNet's previous hidden state, as the latter does not yet exist. Equations 9 and 10 show the attention weight calculation mechanism for the first and later words, respectively. The output of the attention layer is a weight matrix, which is then multiplied with the unweighted context to generate the weighted context WC_k for Word_k. For the second word of the sentence, WC_0 (the weighted context of the first word) and a one-hot encoded start token are passed together through an LSTM cell. The hidden state of this LSTM cell is used to calculate the weight matrix for WC_1. From the third word onward, the LSTM cell output from the immediately previous timestamp replaces the one-hot encoded start token as input to the current timestamp's LSTM cell.
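The attention step can be sketched in NumPy (additive/Bahdanau attention with random placeholder matrices; sizes and names are assumptions for illustration, not the trained WCGNet parameters):

```python
import numpy as np

def bahdanau_weights(H, s_prev, Wa, Ua, va):
    """H: (T, d) unweighted context vectors; s_prev: (d,) previous hidden state."""
    scores = np.tanh(H @ Wa.T + s_prev @ Ua.T) @ va   # (T,) alignment scores
    e = np.exp(scores - scores.max())
    return e / e.sum()                                # attention weights, sum to 1

def weighted_context(H, s_prev, Wa, Ua, va):
    a = bahdanau_weights(H, s_prev, Wa, Ua, va)
    return a @ H                                      # WC_k: weighted sum, (d,)

rng = np.random.default_rng(1)
T, d, att = 5, 8, 6                     # sentence length, feature dim, attention dim
H = rng.normal(size=(T, d))             # unweighted context from the encoder
s_prev = rng.normal(size=d)             # encoder's last hidden state (first word)
Wa, Ua = rng.normal(size=(att, d)), rng.normal(size=(att, d))
va = rng.normal(size=att)

a = bahdanau_weights(H, s_prev, Wa, Ua, va)
wc0 = weighted_context(H, s_prev, Wa, Ua, va)
```

For later words, `s_prev` would be replaced by the hidden state of the WCGNet LSTM cell from the previous timestamp, as described above.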

For each word Word_k in a sentence, SFENet generates the structural feature SF_k, and WCGNet, with the help of the context encoder, generates the weighted context WC_k, which holds contextual and semantic information about the word. BaNeP's prediction generator network (shown in Figure 4) uses SF_k and WC_k jointly (F_k = SF_k ++ WC_k, i.e., their concatenation) to generate the parts-of-speech tag for each word.
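The fusion step can be sketched as follows (sizes, the linear layer, and the tag count are illustrative assumptions; the real prediction generator also runs a BiLSTM over all F_k first, which this sketch skips):

```python
import numpy as np

# F_k = SF_k ++ WC_k: concatenate the structural feature and the
# weighted context, then score the tag set with a linear layer + softmax.
rng = np.random.default_rng(2)
sf_k = rng.normal(size=16)            # structural feature from SFENet
wc_k = rng.normal(size=8)             # weighted context from WCGNet
f_k = np.concatenate([sf_k, wc_k])    # F_k, length 24

n_tags = 11                           # placeholder tag-set size
W, b = rng.normal(size=(n_tags, f_k.size)), np.zeros(n_tags)
logits = W @ f_k + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()                  # softmax over POS tags
```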

From Figure 4, we can see that, analogous to the context encoder, the prediction generator also runs a bidirectional LSTM over all the F_k s of a sentence. Although it may seem that BaNeP is repeatedly exploring contextual relations the context encoder has already extracted, this second sentence-level LSTM pass is designed to grasp contextual relations among words having similar structure. …

… values from yes (y) and no (n), and Emphatic can take values from yes (y) and no (n). So, in the above example, the word is neither singular nor plural, so for the Number attribute it takes the value Not-applicable (0). Similarly, the other attribute values are Not-applicable, Non-definite, and Non-emphatic; the complete tag should be …

The dataset contains 7168 sentences (102933 words), divided into two parts: Bangla1 (3684 sentences, 51091 words) and Bangla2 (3484 sentences, 51842 words). The authors collected data from blogs, Multikulti (http://www.multikulti.org.uk), Wikipedia, and a portion of the CIIL corpus, under the supervision of the Multilingual Systems Group, Microsoft Research Labs India. In our proposed model, we did not use any information other than the POS tag, to avoid dependency on hand-crafted features. BaNeP uses only words as input, and during the training phase it takes the true classes (the POS tag of each word) into consideration for loss calculation and backpropagation.

BaNeP has been trained with a wide variety of hyper-parameter combinations to trace down the combination that works best for Bangla POS tagging. The performance of BaNeP with optimal hyper-parameter values is compared with state-of-the-art approaches: BiLSTM-CRF for Bangla POS tagging and ASRNN for sequence labeling. This section focuses on hyper-parameter tuning for BaNeP. … rely on a predicted tag's correctness for it to be further used as a feature in dependent NLP tasks.

We have recorded the accuracy of the BaNeP model for two optimizers, i) Adam and ii) SGD, across various learning rates. We started with a higher initial learning rate of 0.002, assuming from prior knowledge that LSTM-based models perform better with the Adam optimizer than with SGD and that the optimal learning rate lies near 0.0003. So we started our learning-rate tracing at 0.002 and gradually decreased it to 0.0002 to find which value works best for BaNeP. Figure 5 shows the validation accuracy of BaNeP for various learning rates with both the Adam and SGD optimizers.
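The sweep procedure reduces to a simple selection loop (illustrative sketch: `train_and_validate` is a hypothetical stand-in for a full training run, and the returned accuracies here are placeholders, not the paper's measurements):

```python
# Pick the learning rate with the best validation accuracy.
def pick_best_lr(learning_rates, train_and_validate):
    results = {lr: train_and_validate(lr) for lr in learning_rates}
    best_lr = max(results, key=results.get)
    return best_lr, results

lrs = [0.002, 0.0015, 0.001, 0.0008, 0.0006, 0.0005, 0.0004, 0.0003, 0.0002]
fake_val_acc = lambda lr: 90.10 if lr == 0.0002 else 88.00  # placeholder scores
best_lr, results = pick_best_lr(lrs, fake_val_acc)
print(best_lr)  # 0.0002
```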

The learning rates we tried for the Adam optimizer are [0.002, 0.0015, 0.001, 0.0008, 0.0006, 0.0005, 0.0004, 0.0003, 0.0002]. For SGD, we tried learning rates selectively from the above set, as seen in the figure. Not to our surprise, the initial learning rate did not affect accuracy much: the Adam optimizer adjusts the learning rate automatically, so, except for some initial possibility of overshooting the global minimum, it is sturdy. The validation accuracy results also demonstrate this. Validation accuracy varied from 87.51% to 90.10%. The highest validation accuracy was found with the learning rate set to 0.0002, which was not much higher …

In the prediction generator network (Figure 4), we can see a block for a linear neural network, marked LNN, just before the softmax layer. We have experimented with the number of layers in this linear neural network to determine how it affects performance. Besides, we have also tried a CRF function instead of directly using softmax for class probability calculation. CRF is heavily used in sequence labeling tasks, especially in part-of-speech tagging. We introduced CRF after the LNN to check whether it brings any improvement to our proposed model. Figure 7 shows LNN-softmax and LNN-CRF performance from a single-layer LNN to a five-layer LNN.

For a single-layer LNN, CRF outperformed plain softmax: a single-layer LNN with softmax achieved 88.65% accuracy, whereas with CRF the validation accuracy was 89.12%. However, for a two-layer LNN, softmax and CRF performances were similar: 90.18% for softmax and 90.19% for CRF. When we increased the layer count to three, softmax performance remained the same, but CRF performance fell to 90.09%. For both the CRF and softmax approaches, we observed that increasing the number of layers beyond three does not increase accuracy but falls prey to overfitting. CRF mitigates this overfitting issue a bit, which is why its performance is slightly better than using mere softmax.

The primary role of CRF here is to generate a tag considering the context of the entire sentence. BaNeP already has a detailed two-phase BiLSTM-attention-based network (the context encoder and WCGNet) for contextual consideration. On top of that, the prediction generator also grasps inter-word structural dependency. Combined, these subnetworks capture what CRF aims to capture, and possibly more. Thus, augmenting BaNeP with CRF did not significantly improve the model's accuracy. CRF's impact on the network was highest when the single-layer LNN was tested; for the two- and three-layer LNNs, its impact is negligible. So we …

BiLSTM-CRF and ASRNN are very robust models that perform decently on sequence labeling tasks. However, the diverse character-level structure of Bangla makes sentence-level contextual sense disambiguation more important, which calls for elaborate contextual features to predict parts of speech accurately. BaNeP is designed to do exactly that. Both BiLSTM-CRF and ASRNN fall behind in such cases for Bangla. Figure 10 and Table 4 show the test-set accuracy comparison among the approaches mentioned above.

As expected, all of these models perform somewhat similarly in terms of accuracy. ASRNN showed marginally better accuracy than BiLSTM-CRF. However, both BaNeP and BaNeP-CRF perform slightly better than ASRNN and BiLSTM-CRF because they handle complex cases requiring deeper contextual consideration. In fact, the reason ASRNN outperforms BiLSTM-CRF is the same: ASRNN has a word-level attention layer that gives it an edge over BiLSTM-CRF for Bangla POS tagging.

As mentioned in Section IV, LDC2010T16 is a highly unbalanced dataset, so we cannot rely merely on the accuracy measure. To demonstrate how our proposed model performs for each class compared to ASRNN, we show class-wise precision and recall measures in Figures 11 and 12, respectively.

From the comparative performance analysis presented in this section, it can easily be said that both BaNeP and ASRNN show competitive performance for Bangla POS tagging. However, BaNeP managed to stay consistent even for smaller classes and for cases where contextual information is more important. ASRNN emphasizes character-level structure and word-level context equally. BaNeP, on the other hand, emphasizes word-level context more, which works in its favor, as Bangla words show weak structural patterns.