Event-Argument Linking in Disaster Domain

Linking event triggers with their respective arguments is an essential component of building an event extraction system. It is challenging to link event triggers with their corresponding argument triggers when a sentence contains multiple event and argument triggers. The task becomes even more challenging in a low-resource setup due to the unavailability of natural language processing resources and tools. In this paper, we study the event-argument linking task based on a disaster event ontology in a low-resource setup. We use BERT-based and non-BERT-based deep learning models for both monolingual and cross-lingual event-argument linking. We also perform an ablation study of various features, namely position embeddings (PE), position indicator (PI), and segment ID (SI), to understand their contribution to performance improvement in the non-BERT-based models. Using three different languages, viz. Hindi, Bengali, and Marathi, we compare the results with multilingual BERT-based deep neural models in both monolingual and cross-lingual scenarios. We observe that the multilingual BERT-based model outperforms the best-performing non-BERT-based model in cross-lingual settings. In monolingual settings, however, the performance is similar on the Hindi and Bengali datasets, and the multilingual BERT-based model is only slightly better on the Marathi dataset. We choose the disaster domain due to its social implications. Our current experiments can be helpful in mining important information related to disaster events from news articles and in building event knowledge graphs in low-resource languages.


I. INTRODUCTION
Due to the advancement of electronic media, a massive amount of digital content is uploaded to the Internet very frequently in today's world. Extracting relevant information manually from this vast data is impossible. Information extraction deals with developing tools and techniques to mine the most relevant information from such unstructured data. Event extraction falls under the broad research area of information extraction, an essential part of knowledge graph research. Thus, event extraction is essential in building an event knowledge graph. An event is an occurrence happening in a specific place during a particular time interval. The arguments of an event refer to its attributes, such as the location, the time of occurrence of the event, the participants involved, and so
on. Event and argument triggers refer to any word or phrase that describes an event or an argument, respectively. To build a robust event knowledge graph, we must identify all such vital information from the text documents. In an event extraction system, event trigger detection, event type classification, argument trigger detection, argument type classification, and event-argument linking are the necessary sub-tasks. In the literature, sub-tasks like event and argument trigger detection and classification have been studied extensively, especially for the English language. However, compared to the other sub-tasks, event-argument linking, i.e., predicting whether there exists any link between mentioned event and argument triggers of a sentence, has not been addressed much in the literature. […] language to another is more effective. In our current proposed work, we also investigate the effectiveness of cross-lingual embeddings in transferring knowledge from one language to another and compare the results with m-BERT-based models. Since the target languages in our experiments have grammatical and linguistic typological (Subject-Object-Verb) similarities, we assume that the transfer of knowledge between them will be effective.

For our experiments, we choose disaster as the domain because of its impact on society. To alert both the public and the government, extracting relevant information at the appropriate time is crucial. Equipped with such information, disaster management can be performed. However, it is impossible to mine information manually from the web due to its enormous size. Moreover, if all the information can be stored in a knowledge graph, then that information can be used further for various applications. Our current research aims to build a multilingual event knowledge graph in a low-resource setup.

II. RELATED WORK
Our current task is to determine if there is any relation between the event and the argument triggers. Thus the task is related to relation extraction, where the relations between a pair of entities are extracted. Currently, deep neural networks are widely used for relation extraction. The Convolutional Neural Network (CNN) [1] is a very useful feature extractor and has been used widely for various text classification tasks in the past. Zeng et al. [2] proposed to use a CNN for relation extraction for the first time, where the CNN was used for lexical and sentence-level feature extraction. The authors also proposed a novel position embeddings (PE) feature, which was very helpful in achieving high classification accuracy through a two-fold benefit. Firstly, it specifies the pair of words or phrases to which the predicted relation label will be assigned. Secondly, it encodes the relative distance of each word to the target words or phrases. Benefiting from this two-fold help, the proposed CNN-based approach improves classification accuracy. Similarly, to minimize the dependence on external toolkits and resources, the authors in [3] also proposed to use a CNN for relation extraction. They used filters of multiple window sizes in their CNN architecture to capture wider ranges of n-grams. Santos et al. [4] proposed a CNN-based relation classifier that performs classification by ranking (CR-CNN). The proposed model learns a vector representation for each class; the CNN representation of the input text is compared to each class representation to generate a per-class score. The authors introduced a pairwise ranking loss that helps in reducing the impact of the artificial class. Xu et al. [5] proposed to extract features from the shortest dependency path (SDP) using a CNN to avoid irrelevant words in long-distance relationships. Shen and Huang [6] proposed to use POS embeddings along with word and position embeddings to improve performance. […] Peng et al. [14] proposed to use a document graph to capture dependencies within and across sentences; they then used a graph-LSTM to encode the input text. Zhang et al. [15] […] The authors in [20] proposed syntax-aware entity embeddings based on tree-GRU for neural relation classification. Reinforcement learning was used in [21] for relation classification from noisy data. Inspired by Generative Adversarial Networks (GANs), Zeng et al. [22] proposed a GAN-based method for distantly supervised relation extraction. Reinforcement learning was also used by Zeng et al. [23] to learn sentence relation extractors from a distantly supervised dataset. The research community has also explored relation extraction using deep learning techniques in the cross-lingual setup. A relation extraction pipeline for any source language was developed in [24]. Lin et al. [25] proposed a multilingual attention-based neural relation extraction (MNRE) model, which employs monolingual attention to select the informative sentences within each language and cross-lingual attention to take advantage of pattern consistency and complementarity among languages. After the recent success of the Transformer architecture [26], pre-trained models like BERT [27] are being used by the research community for relation extraction.
In [28], the authors used pre-trained BERT models for relation classification. They also leveraged the entity information in their proposed model to improve performance. In a recent study [29], Liu et al. further extended this architecture. To capture the latent information around the target entities, the authors utilize piecewise convolution [30]. They also employ the focal loss function [31] to address the problem of class imbalance; a minimal sketch of this loss follows this paragraph. In another work [32], a task-agnostic representation from entity-linked text was proposed; the authors' main goal is to learn a mapping from relation statements to relation representations. While most of the previous work is based on English, very little work has been done on Indian languages. Recently, a benchmark corpus consisting of news data from the disaster domain was proposed in [33].
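The focal loss itself is a small, self-contained modification of cross-entropy. The following is a minimal PyTorch sketch using the default hyper-parameters from [31]; the values used in [29], and the per-class weighting they may apply, could differ.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Minimal multi-class focal loss sketch after Lin et al. [31]:
    FL(p_t) = -alpha * (1 - p_t)**gamma * log(p_t).

    Well-classified examples (p_t near 1) are down-weighted so training
    focuses on hard ones. Applying one scalar alpha to every class is a
    simplification of the original per-class weighting.
    """
    log_probs = F.log_softmax(logits, dim=-1)                      # (batch, classes)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p of true class
    pt = log_pt.exp()
    return (-alpha * (1.0 - pt) ** gamma * log_pt).mean()

# Toy usage; with gamma=0 and alpha=1 this reduces to ordinary cross-entropy.
logits = torch.tensor([[2.0, -1.0], [0.2, 0.1]])
targets = torch.tensor([0, 1])
loss = focal_loss(logits, targets)
```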

III. TASK DESCRIPTION
Event-argument linking can be defined as the task of finding a link between event and argument triggers marked in a given sentence. We formulate the task as a classification problem in which we classify a sentence, marked with event and argument triggers, into two labels, namely '1' and '0': if the event and argument triggers are linked, the predicted label is '1', and '0' otherwise. In the given example, there are two event triggers, namely '…' (earthquake) and '…' (tsunami), and three argument triggers, '…', '…' (Indonesia), and '…' (71 people). […] Marathi and Marathi-to-Hindi data for our deep learning-based models. We describe the features and deep learning architectures in the next section.
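To make the formulation concrete, the sketch below shows one possible encoding of such a training instance. The English sentence, the `<ev>`/`<arg>` position-indicator markers, and the field names are hypothetical stand-ins for our actual annotation scheme.

```python
# A hypothetical training instance for event-argument linking. The English
# sentence, the <ev>/<arg> position-indicator markers, and the field names
# are illustrative stand-ins for the actual corpus annotation.
instance = {
    "tokens": ["The", "<ev>", "earthquake", "</ev>", "killed",
               "<arg>", "71", "people", "</arg>", "in", "Indonesia", "."],
    "event_span": (2, 2),      # token indices of the event trigger
    "argument_span": (6, 7),   # token indices of the argument trigger
    "label": 1,                # 1 = triggers are linked, 0 = not linked
}
```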

IV. METHODOLOGY
This section discusses the various deep learning architectures we have used for building our models. We also discuss in detail the three features that we have used in our […] The main drawback of using a CNN is that it applies a single max-pool operation to each filter to capture the essential features. This strategy works well for the sentence classification task. However, a single max-pool operation is not sufficient for the relation classification task, where modeling the structural information around both entities is equally important. To address this problem, Zeng et al. [30] proposed the novel Piecewise Convolutional Neural Network (PCNN). To capture the structural as well as other latent information between and around the two entities, the authors divide the convolution output of each filter into three segments based on the given entity positions and perform a piecewise max-pooling operation instead of a single max-pooling operation. This pooling operation successfully captures the maximum value of each part, resulting in superior performance compared to the normal CNN. Similar to our CNN model, we use three filters of kernel sizes 3, 4, and 5, respectively, for the PCNN model and then perform a piecewise max-pooling operation on each filter (Fig. 1(b)).
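The following is a minimal sketch of the piecewise-pooling idea. Dimensions and the segment-boundary convention are illustrative choices, not the exact settings of our models; we assume the two trigger positions differ and lie strictly inside the sentence so that all three segments are non-empty.

```python
import torch
import torch.nn as nn

class PCNN(nn.Module):
    """Minimal piecewise max-pooling sketch in the spirit of Zeng et al. [30]."""

    def __init__(self, emb_dim=100, n_filters=64, num_classes=2):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, k, padding=k // 2) for k in (3, 4, 5)]
        )
        self.fc = nn.Linear(3 * 3 * n_filters, num_classes)  # 3 filters x 3 segments

    def forward(self, x, ev_pos, arg_pos):
        # x: (batch, seq_len, emb_dim); ev_pos, arg_pos: trigger positions (ints).
        x = x.transpose(1, 2)                          # (batch, emb_dim, seq_len)
        left, right = sorted((ev_pos, arg_pos))
        pooled = []
        for conv in self.convs:
            c = torch.relu(conv(x))                    # (batch, n_filters, L')
            for seg in (c[..., : left + 1],            # before/including first trigger
                        c[..., left + 1 : right + 1],  # between the two triggers
                        c[..., right + 1 :]):          # after the second trigger
                pooled.append(seg.max(dim=2).values)   # piecewise max-pool
        return self.fc(torch.cat(pooled, dim=1))       # logits: linked / not linked
```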

C. RCNN
The Recurrent Neural Network (RNN) is good at capturing contextual information in long texts. However, the RNN is a biased model in which later words have more influence than earlier words. The CNN copes with this bias problem and can identify the discriminative phrases in an input sentence with its max-pooling layer, but it has a fixed window-size problem: it is tough to determine an optimal window size that trades off the loss of crucial information against a large parameter space. To address the issues of both models, [36] proposed the Recurrent Convolutional Neural Network (RCNN) for text classification (Fig. 1(c)). The proposed model comprises a bi-directional recurrent structure followed by a max-pooling layer. The recurrent structure captures crucial information from a larger context window than a traditional CNN, and the max-pooling layer then automatically judges which features are key for the classification task. Thus, the architecture combines the advantages of both RNN and CNN. We use a Bi-directional Long Short-Term Memory (Bi-LSTM) network as the recurrent unit in our implementation, followed by a max-pooling layer. LSTM is a variant of the RNN with a 'memory cell' that stores information for a longer period, along with three gates that control the update, deletion, and output of information. Better control over the gradient flow helps to solve the vanishing and exploding gradient problems.
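A minimal sketch of this Bi-LSTM-plus-max-pooling structure follows. Hidden sizes are illustrative; the full RCNN of [36] additionally concatenates each word's embedding with recurrent left/right contexts, whereas we keep the simpler Bi-LSTM variant used in our implementation.

```python
import torch
import torch.nn as nn

class RCNN(nn.Module):
    """Minimal sketch of the Bi-LSTM + max-pooling structure described above."""

    def __init__(self, emb_dim=100, hidden=128, num_classes=2):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, hidden)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):                    # x: (batch, seq_len, emb_dim)
        h, _ = self.bilstm(x)                # (batch, seq_len, 2 * hidden)
        h = torch.tanh(self.proj(h))         # latent representation per word
        pooled = h.max(dim=1).values         # max-pooling judges the key features
        return self.fc(pooled)               # logits: linked / not linked
```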

The attention mechanism was introduced on top of recurrent […] In such scenarios, information about how each word is related to the others is of utmost significance. We apply a self-attention mechanism on top of the Bi-LSTM layer (Fig. 1(f)). […] Since we formulate the event-argument linking task as a relation extraction problem, we use the BERT-based relation extraction architecture proposed in [28] with slight modification (Fig. 2). Because we are working in multiple languages, we use multilingual BERT (m-BERT) instead of vanilla BERT for our experiments. To use the information about both the marked event and argument triggers, we first average the representations of the event and argument triggers marked in our input to obtain a vector representation of each. We then apply a fully connected layer followed by the tanh activation function to each of the two vectors, and concatenate both hidden representations with the pooled output (the [CLS] token) to obtain the final representation. Finally, we use a linear layer followed by a softmax function for classification, as in all the previous models.
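A sketch of this classification head is given below. The checkpoint name is the standard m-BERT release; `MBertLinker`, the mask-based trigger averaging, and the layer sizes are our illustrative rendering, and only the overall design (average the trigger tokens, tanh-activated dense layers, concatenation with the pooled [CLS] output) follows the description above.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class MBertLinker(nn.Module):
    """Illustrative m-BERT classification head in the style of [28]."""

    def __init__(self, name="bert-base-multilingual-cased", num_classes=2):
        super().__init__()
        self.bert = AutoModel.from_pretrained(name)
        h = self.bert.config.hidden_size
        self.dense_ev = nn.Linear(h, h)
        self.dense_arg = nn.Linear(h, h)
        self.classifier = nn.Linear(3 * h, num_classes)

    def forward(self, input_ids, attention_mask, ev_mask, arg_mask):
        # ev_mask / arg_mask: (batch, seq_len) floats, 1.0 on trigger tokens.
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        hidden = out.last_hidden_state                    # (batch, seq_len, h)

        def trigger_avg(mask):                            # masked mean over a span
            return (hidden * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True)

        ev = torch.tanh(self.dense_ev(trigger_avg(ev_mask)))
        arg = torch.tanh(self.dense_arg(trigger_avg(arg_mask)))
        final = torch.cat([out.pooler_output, ev, arg], dim=1)
        return self.classifier(final)                     # softmax applied in the loss
```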
We also use another BERT-based architecture, proposed in [29], with minor changes. This architecture is an extension of the previous BERT-based architecture proposed in [28]. On top of the BERT representation, an additional Piecewise Convolutional Neural Network (PCNN) [30] is employed to capture the latent information between and around the event and argument triggers in an input sentence. In addition, instead of the vanilla BERT used in the original paper, we employ multilingual BERT. We keep the focal loss function [31] proposed by the authors of [29], even though the class imbalance problem does not apply in our instance. […] we first calculate each word's relative distance with respect to the event and argument triggers, respectively. This relative distance can be positive or negative depending on the position of the word with respect to the trigger word. Each distance is converted to a randomly initialized vector of dimension 50. Fig. 3 shows how the relative distance is calculated for each word with respect to the event and argument triggers of the input instance. Table 6, Table 7, and Table 8 show the effect of PE.

The segment ID (SI) helps the model identify whether the event triggers and the argument triggers belong to the same sentence or not. For the first sentence, the segment ID of each word is 0; for the second sentence, it is 1. This feature is important because, in most cases, event and argument triggers that belong to the same sentence are linked to each other (the 'Yes' class), whereas event and argument triggers that belong to different sentences are not linked to each other (the 'No' class).
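The following sketch illustrates how the PE and SI inputs can be constructed for a single instance. The sentence is a hypothetical English stand-in for the actual data, and the clipping bound `max_dist` is an illustrative choice; only the relative-distance scheme and the 50-dimensional, randomly initialized vectors follow the description above.

```python
import numpy as np

# Hypothetical tokenized instance (an English stand-in for the actual data).
tokens = ["The", "earthquake", "killed", "71", "people", "in", "Indonesia", "."]
ev_pos, arg_pos = 1, 3      # start positions of the event and argument triggers

# Segment ID (SI): 0 for every token of the first sentence, 1 for the second.
# This instance is a single sentence, so every value is 0.
segment_ids = [0] * len(tokens)

# Position embeddings (PE): each word's relative distance to the two triggers,
# negative to the left of a trigger and positive to its right.
dist_ev = [i - ev_pos for i in range(len(tokens))]    # [-1, 0, 1, 2, 3, 4, 5, 6]
dist_arg = [i - arg_pos for i in range(len(tokens))]  # [-3, -2, -1, 0, 1, 2, 3, 4]

# Each distance indexes a randomly initialized 50-dimensional vector, learned
# along with the model; max_dist bounds the lookup table.
max_dist = 60
pe_table = np.random.randn(2 * max_dist + 1, 50)
pe_ev = pe_table[[d + max_dist for d in dist_ev]]     # (seq_len, 50)
pe_arg = pe_table[[d + max_dist for d in dist_arg]]   # concatenated with word vectors
```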

V. EXPERIMENTS
This section describes the datasets, hyper-parameter settings, comparison to the baseline architectures, and an ablation study to determine the contribution of the different aspects of our system. […] The downloaded raw files are pre-processed and converted into XML files. Similar to the Hindi dataset in [33], the annotation […]

The dataset statistics are described in Table 1.

In this section, we describe the experimental details. […] We perform the experiments using the hyper-parameter settings described in Table 3. We optimize the parameters using the […]

VI. RESULTS AND ANALYSIS
In this section, we describe the results obtained from our experiments. We first perform a quantitative and qualitative analysis of the monolingual experiments, followed by the cross-lingual experiments.

A. MONOLINGUAL EXPERIMENTS
The primary observation from Table 6, Table 7, and Table 8 is that, for the Hindi and Bengali monolingual experiments, RCNN and Bi-LSTM-CNN with all features, namely position embeddings (PE), position indicator (PI), and segment ID (SI), produce similar or better performance compared to the BERT-based model (refer to Table 6 and Table 8). However, the BERT-based Marathi model performs better than all non-BERT-based deep learning models in the monolingual experiments (refer to Table 7). We observe that the combined set of all features helps RCNN and Bi-LSTM-CNN, which take long-distance contextual information from both directions into account, perform at par with the BERT-based model. The possible reason behind such performance could be that, apart from the semantic information embedded in the vector representation, both of the above models use contextualized information similar to BERT. Moreover, the above models use features like segment ID and position information, which are also present in the proposed multilingual BERT models. We have also carried out an ablation study of the various features in the non-BERT-based deep learning models to understand their usefulness.

We also observe that the performance on the Marathi language is the best among all languages, and the performance on the Hindi language is the worst. This phenomenon can be understood from Fig. 5, which shows that the average sentence length and the average distance between event and argument triggers are highest in the case of Hindi and lowest in the case of Marathi. Also, the average event and argument trigger lengths are highest for Hindi and lowest for Marathi (refer to Fig. 4).

When both event and argument triggers belong to the same sentence, we call them intra-sentence event-argument triggers; when they belong to different sentences, we call them inter-sentence event-argument triggers. Table 4 shows that, across all languages and all models, the average distance between event-argument triggers in correct 'Yes' cases is much smaller than the average distance between event-argument triggers in correct 'No' cases. In contrast, for incorrect cases, the opposite situation is observed. Thus, inter-sentence 'Yes' cases and intra-sentence 'No' cases are more challenging than usual. […] (refer to Table 2). Table 9 shows the performance of various models on inter-sentence 'Yes' and intra-sentence 'No' cases. We observe that the accuracy percentage of intra- […] cases, which are also challenging in monolingual linking, is poor, especially in the case of the non-BERT-based deep learning models (refer to Table 12). Table 12 further reveals that, for some cases, the accuracy is as low as 0. We also observe a performance decline for inter-sentence 'No' cases. This observation is in line with our previous analysis, where we observed that the models easily learn and follow the usual scenario: nearer argument triggers have a higher probability of being linked than distant argument triggers. In other words, event and argument triggers in the same sentence have higher chances of being linked to each other (intra-sentence 'Yes'), and event and argument triggers in different sentences have higher chances of not being linked to each other (inter-sentence 'No'). Table 11 shows that whatever performance we achieve in the cross-lingual experiments is due to these two categories, which are easier to learn. Of the other two cases, viz. inter-sentence 'Yes' and intra-sentence 'No', the latter is tough to predict from the position and distance features and requires semantic knowledge from the text, which comes from the vector representations of the words themselves. With monolingual word embeddings, the embeddings of the two languages lie in different spaces. Cross-lingual word embeddings try to solve this problem by aligning the word vectors in a common space (English in our case). But in our case, we observe that the cross-lingual word embeddings do not capture sufficient semantic information to transfer knowledge from one language to another to predict intra-sentence 'No' cases. However, we observe superior performance of multilingual BERT in cross-lingual knowledge transfer. One striking observation from Table 12 is that the performance of the Bi-LSTM-CNN classifier with monolingual embeddings on inter-sentence 'Yes' cases is astonishingly higher than that of the other classifiers. Detailed analysis reveals that the classifier produces a very high recall […]

PUSHPAK BHATTACHARYYA is currently a Professor of Computer Science and Engineering at IIT Bombay. He is an outstanding researcher in natural language processing and machine learning. He has contributed to all areas of natural language processing, encompassing machine translation, sentiment and opinion mining, cross-lingual search, and multilingual information extraction. He has served as the President of the Association for Computational Linguistics (ACL), the highest body of NLP. He has received several awards and fellowships, such as the fellowship of the Indian National Academy of Engineering, the IBM Faculty Award, and the Yahoo Faculty Award.