CAMM: Cross-Attention Multimodal Classification of Disaster-Related Tweets

During the past decade, social media platforms have been extensively used for information dissemination by affected communities and humanitarian agencies during disasters. Although many recent studies have classified social media posts into informative and non-informative messages, most are unimodal, i.e., they use either textual or visual data independently to build deep learning models. In the present study, we integrate the complementary information provided by text and image messages about the same event posted by the affected community on the social media platform Twitter and build a multimodal deep learning model based on the attention mechanism. The attention mechanism is a recent breakthrough that has revolutionized the field of deep learning. Just as humans pay more attention to a specific part of a text or image while ignoring the rest, neural networks can be trained to concentrate on the most relevant features through the attention mechanism. We propose a novel Cross-Attention Multi-Modal (CAMM) deep neural network for classifying multimodal disaster data, which uses the attention mask of the textual modality to highlight the features of the visual modality. We compare CAMM with unimodal models and with the most popular bilinear multimodal models, MUTAN and BLOCK, generally used for visual question answering. CAMM achieves an average F1-score of 84.08%, better than the MUTAN and BLOCK methods by 6.31% and 5.91%, respectively. The proposed cross-attention-based multimodal deep learning method outperforms the current state-of-the-art fusion methods on the benchmark multimodal disaster dataset by highlighting more relevant cross-domain features of text and image tweets.


I. INTRODUCTION
In the past decade, emergency managers and safety organizations have started using social media platforms to share critical information for planning and implementing rescue operations during a disaster. Decision-makers utilize the timely, first-hand, and location-based messages posted by eyewitnesses on social media platforms to deploy resources and enhance their response efforts. Innovative use of these platforms allows humanitarian teams to engage directly with the affected public during all phases of disaster management. Among several social media platforms, Twitter is the most prevalent during natural disasters [1], [2]. Twitter text messages, called tweets, consist of up to 280 characters and give
first-hand information about the event almost in real time. A massive flood of tweets is generated on Twitter within minutes of a disaster striking [3], [4], [5], [6]. With advancing mobile technologies, text tweets are often accompanied by related images or videos, providing complementary information to better understand the situation at the disaster site. Analyzing these multimodal posts together for an event allows government authorities and humanitarian organizations to assess the post-disaster situation from different angles and perspectives and take appropriate action. While these tweets provide crucial information during an emergency, filtering informative and actionable messages from a vast pool of noisy messages is challenging [7], [8].

The limitation of existing early and late fusion methods is that they assign a fixed weight to each modality, whereas attention-based fusion computes these weights dynamically. This allows the attention model to choose relevant, more prominent, and complementary features from each modality. Our motivation for using the multimodal approach is to explore the relationship between the two media and use them harmoniously to achieve better results. The only constraint of the attention-based method is the additional computation of the attention weights, which is outweighed by the improved network performance.

Based on the above discussion, the main contributions of the present study are:
• We propose a deep multimodal network designed to learn the prominent features from the textual and visual modalities using a novel Cross-Attention Multi-Modal (CAMM) framework for the binary classification of disaster tweets into 'informative' and 'non_informative' classes. CAMM is designed to utilize the complementary information from the tweets' textual and imaging modalities. The attention mask of the text modality is used to highlight the features of the imaging modality. Our goal is to weight the image features according to the relationship between the words in the tweet and different spatial regions in the image.

To the best of our knowledge, a cross-attention-based multimodal fusion approach has not yet been explored in the context of social media disaster data classification.

The rest of the paper is structured as follows: Section II discusses research related to unimodal and multimodal techniques for disaster management proposed in the recent past. Section III covers the architecture of the proposed deep multimodal neural network, CAMM. A brief overview of two baseline multimodal models is given in Section IV. The experimental setup in Section V includes the dataset, metrics, hyperparameters, and the baseline methods used for performing the experiments. The implementation details of the experiments performed under various setups are given in Section VI. We list and discuss the results obtained after training the networks under five different setups in Section VII. Finally, in Section VIII, we discuss the limitations and future scope of the work.

II. RELATED WORK
In a recent study by Ahadzadeh and Mohammad [52], the machine learning methods Support Vector Machine and Naïve Bayes are applied to tweet images to assess the damage caused by earthquakes. Studies by Khattar and Quadri compared simple transfer learning, unsupervised domain adaptation, and semi-supervised domain adaptation approaches applied to natural and biological disaster image datasets [16], [53]. Robertson et al. [54] fine-tuned the pre-trained VGG-16 model on Hurricane Harvey images to classify them on an 'urgency' and 'time-period' basis. In a similar study, Li et al. [15] applied a Domain Adversarial Neural Network (DANN) to images of four disasters for binary classification into 'Damage' and 'No-damage' classes.

Under multimodal analysis, Gautam et al. [25] proposed a diffusion method for the classification of Twitter data (text and images) of seven disasters of the CrisisMMD dataset [33] into two classes, 'informative' and 'non-informative', and compared their model with unimodal models based on text-only and image-only modalities. For the text-only modality, they applied N-gram, LSTM, BiLSTM, and CNN+GloVe methods, and for the image-only modality, they used six pre-trained models (VGG-16, VGG-19, ResNet50, InceptionV2, Xception, and DenseNet) for transfer learning. Finally, they compared the results based on three logistic regression decision policies. Their results confirm that the logistic regression decision policy with bigrams for text and ResNet50 for images gives the best results.

III. PROPOSED CAMM ARCHITECTURE
We propose a novel architecture to build a binary classifier that integrates information about the same event expressed in two different ways, in the form of words and pictures. Fig. 1 shows the complete architecture of the proposed multimodal DCNN that uses annotated text and image tweets posted on Twitter during seven disasters. We are given a tweet text T and a tweet image I, and we need to fuse the features of T and I to predict the final class as 'informative' or 'non_informative'. As is commonly done in multimodal architectures, the text T and image I are first converted to vector representations. These representations are then fused to extract the most meaningful interactions between the text and the image to get the predicted class. In this study, we propose a new fusion technique called Cross-Attention Multi-Modal (CAMM) fusion, where the features extracted for each word in a tweet ''attend'' to different spatial regions of the image features.
Our motivation for this approach is that different words in a tweet can accentuate relevant image features, which significantly improves the model's performance. As shown in Fig. 1, we use the pre-trained VGG-16 model to extract features from the input image I. The output of the convolution layers is passed through the Tanh() activation function to limit the range of features between -1 and 1. For a tweet T with n words, we use a Bi-LSTM with two layers to learn hidden representations of dimension d_T for each word. Finally, we represent the image and text features by F_I and F_T, respectively.
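To make the two feature extractors concrete, the following is a minimal PyTorch sketch of the encoders described above. The paper provides no code; the class names, the torchvision weights flag, and the hidden size (set to 512 so that the bidirectional output is 1024-dimensional, the feature size used later in the fusion) are our assumptions.

```python
import torch.nn as nn
from torchvision import models

class ImageEncoder(nn.Module):
    """VGG-16 convolutional features squashed with Tanh (illustrative)."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights="IMAGENET1K_V1")
        self.features = vgg.features   # (B, 512, 7, 7) for a 224x224 input
        self.tanh = nn.Tanh()          # limit feature values to [-1, 1]

    def forward(self, image):          # image: (B, 3, 224, 224)
        return self.tanh(self.features(image))

class TextEncoder(nn.Module):
    """Two-layer Bi-LSTM over per-word embeddings (illustrative)."""
    def __init__(self, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                              batch_first=True, bidirectional=True)

    def forward(self, embeddings):     # embeddings: (B, n, 300) word vectors
        out, _ = self.bilstm(embeddings)
        return out                     # (B, n, 1024) per-word features F_T
```

The embedding size of 300 matches the GloVe vectors described later in the experimental setup.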

In self-attention, a feature vector is generated for each word in the string; three weight matrices W_K, W_Q, and W_V are then applied to extract the key, query, and value vectors for each word. In the proposed cross-attention structure, our goal is to weight the image features according to the relationship between the words in the tweet and different spatial regions of the image. The W_K matrix is used to extract key vectors for each word in the tweet, and W_Q is applied to the feature vector of the image obtained using a CNN. The key vectors obtained for each word and the query vectors obtained for each spatial region of the image are then combined to create the attention map (Eq. 1).

We have filtered only those tweets with the same label for the text and the corresponding image for the present study. Table 1 gives the details of the filtered dataset, with 12762 tweets each for the image and the text modalities, of which 8463 are informative and 4299 are non-informative. The filtered dataset is further split into train, validation, and test sets in the ratio 80:10:10. Most disaster-related studies based on multimodal data analysis in the recent past have used the CrisisMMD dataset.

Although grid search is relatively slow, it helps find the best values for the network parameters. The parameters selected through grid search for training all the models are: a learning rate of 1.00e-03, a weight decay of 5.00e-04, a momentum of 0.9, weighted CrossEntropyLoss as the loss function, and Stochastic Gradient Descent (SGD) as the optimizer. We performed 50 epochs for the baseline models and 100 epochs for the proposed model.

After the preprocessing step, the words in the tweet need to be represented as real-valued vectors for further processing. We have used the pre-trained word embedding GloVe (Global Vectors for Word Representation) [56] to get the word embeddings. GloVe converts words into vectors such that similar words have similar vector representations. To capture complete information from each word, we used GloVe embeddings of dimension 300.

Once the vector matrix of tweet words is obtained, we use the Bi-LSTM model to extract the features for classification. This study uses two LSTMs, one in the forward and one in the backward direction.

The LSTM model was first proposed by Hochreiter and Schmidhuber [57] to handle the shortcomings of Recurrent Neural Networks, which could not handle long-term dependencies. LSTMs are designed to remember information for a longer time through a series of LSTM units. Each LSTM unit has a forget gate, an input gate, an output gate, and a cell state. The forget gate consists of a sigmoid function that outputs a number between 0 and 1 depending on the previous and the current state; a '0' represents discard or forget, and a '1' represents keep or remember. The input gate also has a sigmoid function that decides which values are to be updated, and a tanh function provides the new updated values, resulting in the output for the next hidden state. These gates allow the model to keep only the critical information and forget the rest. We have used two layers of LSTM.

We follow the same preprocessing technique for CAMM as mentioned in the unimodal classification for text and image data. The output of the convolution layers of VGG-16 for an input image has dimensions (7, 7, 512); it is passed through an additional 1 × 1 convolution and a Tanh() activation to increase the number of channels to 1024. Each cell in the 7 × 7 grid represents features for a different spatial region of the input image; these cells are subsequently reshaped to 49 × 1024, which represents F_I.
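As a sketch of this projection and reshaping step (our illustration; the variable names and batch size are invented):

```python
import torch
import torch.nn as nn

# VGG-16 conv output (B, 512, 7, 7) -> 1x1 conv to 1024 channels -> Tanh
# -> 49 spatial regions with 1024-d features each.
project = nn.Sequential(
    nn.Conv2d(512, 1024, kernel_size=1),  # 1x1 convolution: 512 -> 1024 channels
    nn.Tanh(),
)

vgg_out = torch.randn(8, 512, 7, 7)       # stand-in for VGG-16 conv features
F_I = project(vgg_out)                    # (8, 1024, 7, 7)
F_I = F_I.flatten(2).transpose(1, 2)      # (8, 49, 1024): one row per region
```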

The Bi-LSTM takes the GloVe embeddings of all the words in a tweet as input and generates features of dimension 1024 for each word, represented as F_T = (n, 1024), where n is the number of words in the tweet. The two feature vectors F_I and F_T are then used to generate the key, query, and value vectors, where W_K, W_Q, and W_V are three separate fully connected layers with input and output size 1024. Next, the key and query vectors are used to generate the attention map M (refer to Eq. 1), which represents the relevance of each word in a tweet against different spatial regions of the image. Finally, this attention map is applied to the value vector to obtain a fused feature vector F_CA (refer to Eq. 2). The final fused vector is passed through a linear classifier consisting of three fully connected layers with ReLU activations in between, and the output of the last layer is passed through a SoftMax function to get the probabilities assigned to each label.
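Since Eqs. (1) and (2) are not reproduced in this excerpt, the sketch below shows one plausible reading of the fusion in PyTorch: keys from the per-word text features, queries and values from the per-region image features, a softmax attention map M, and a fused vector F_CA fed to the classifier. The scaled dot product, the choice of image features as values, the mean pooling over words, and the hidden sizes of the classifier are our assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionFusion(nn.Module):
    """One plausible reading of the CAMM fusion; details are assumptions."""
    def __init__(self, dim=1024, num_classes=2):
        super().__init__()
        self.W_K = nn.Linear(dim, dim)    # keys from per-word text features
        self.W_Q = nn.Linear(dim, dim)    # queries from per-region image features
        self.W_V = nn.Linear(dim, dim)    # values (assumed: image features)
        self.classifier = nn.Sequential(  # three FC layers with ReLU in between
            nn.Linear(dim, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, F_T, F_I):
        # F_T: (B, n, 1024) word features; F_I: (B, 49, 1024) region features
        K, Q, V = self.W_K(F_T), self.W_Q(F_I), self.W_V(F_I)
        # Eq. (1), assumed form: word-vs-region attention map
        M = F.softmax(K @ Q.transpose(1, 2) / K.size(-1) ** 0.5, dim=-1)  # (B, n, 49)
        # Eq. (2), assumed form: attend over regions, then pool over words
        F_CA = (M @ V).mean(dim=1)        # (B, 1024)
        return self.classifier(F_CA)      # logits
```

In practice the SoftMax is typically folded into the cross-entropy loss during training, so the module above returns logits.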

To validate the performance of the proposed model CAMM, we conduct extensive experiments on the benchmark multimodal disaster dataset CrisisMMD and compare the results with baseline unimodal and multimodal methods, as shown in Fig. 6(a). We can see a clear progression from unimodal to multimodal classifiers, with the best F1-score achieved by CAMM. We also compare the five models' AUC metric, taken as the average over all disasters, in Fig. 6.

We also compare the results of CAMM with recent state-of-the-art multimodal models in Table 6.

The above discussion confirms that the proposed cross-attention fusion of the text and image modalities selects the prominent features from the two modalities that are most relevant for the task, resulting in a better classifier.

For the proposed CAMM architecture, the hyperparameters are fine-tuned using grid search. The results of experiments performed for selecting the backbone architecture are shown in Table 2. We trained the network on the Hurricane Harvey image dataset with EfficientNet-B3, ResNet50, DenseNet201, VGG-16, and VGG-19 as backbone architectures. The results confirm that VGG-16 achieves the highest F1-score of 78.19% and is hence the best choice for all the experiments performed in this study.
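For completeness, here is a minimal sketch of the grid-searched training configuration reported in Section V, assuming PyTorch; the inverse-frequency class weights are our illustration of the 'weighted CrossEntropyLoss', not the authors' exact scheme.

```python
import torch
import torch.nn as nn

model = CrossAttentionFusion()  # the fusion sketch above

# Weighted CrossEntropyLoss: classes weighted inversely to their frequency
# (8463 informative vs. 4299 non-informative tweets in the filtered dataset).
counts = torch.tensor([8463.0, 4299.0])
criterion = nn.CrossEntropyLoss(weight=counts.sum() / (2.0 * counts))

# Grid-searched values reported in the paper.
optimizer = torch.optim.SGD(model.parameters(),
                            lr=1e-3, momentum=0.9, weight_decay=5e-4)
```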

VIII. CONCLUSION
In this study, we proposed a novel Cross-Attention Multi-Modal (CAMM) deep neural network for classifying multimodal disaster-related tweets. CAMM achieves an average F1-score of 84.08%, which is 6.31% better than the F1-score