A Deep Attentive Multimodal Learning Approach for Disaster Identification from Social Media Posts

Microblogging platforms such as Twitter have become indispensable for disseminating valuable information, especially at times of natural and man-made disasters. People often post multimedia content with images and/or videos to report important information such as casualties, infrastructure damage, and the urgent needs of affected people. Such information can be very helpful for humanitarian organizations in planning an adequate response in a time-critical manner. However, identifying disaster information from a vast number of posts is an arduous task, which calls for an automatic system that can separate actionable disaster-related information from the rest of the social media stream. While many studies have shown the effectiveness of combining text and image content for disaster identification, most previous work focused on analyzing only the textual modality and/or applied traditional recurrent neural network (RNN) or convolutional neural network (CNN) architectures, which can degrade performance on long input sequences. This paper presents a multimodal disaster identification system that utilizes both visual and textual data in a synergistic way by conjoining the influential word features with the visual features to classify tweets. Specifically, we utilize a pretrained convolutional neural network (e.g., ResNet50) to extract visual features and a bidirectional long short-term memory (BiLSTM) network with an attention mechanism to extract textual features. We then aggregate both visual and textual features through a feature fusion approach, followed by a softmax classifier. The evaluations demonstrate that the proposed multimodal system outperforms existing baselines, including both unimodal and multimodal models, with performance improvements of approximately 1% and 7%, respectively.


I. INTRODUCTION
In times of disaster events such as earthquakes, floods, and hurricanes, social media platforms can play a critical role in spreading a large volume of important information [1]-[3]. People frequently use these platforms to communicate at different hierarchies, such as individual to individual, individual to government, individual to community, and government to people [3], [4]. Victims often share information about disaster events on Twitter, such as reports about injured or deceased people and infrastructural damage. Affected people also request urgent aid by posting images, tweets, and videos. Analyzing such social media posts and extracting actionable insights in real time can be very helpful for humanitarian organizations in assisting the affected people [5], [6]. However, it is a very difficult and time-consuming task to manually analyze and extract actionable insights from a large amount of crisis-related tweets.
The humanitarian computing community has attempted to address the above challenge by developing automated systems that can extract and classify crisis-related social media posts [7]-[9]. For example, researchers have developed classifiers to identify event types (e.g., flood, hurricane) [10], whether a post is informative or not [11], as well as humanitarian information types (e.g., types of damages) [12]. Despite such recent progress, existing works are primarily limited in two ways. First, most works on damage or disaster response from social media posts have concentrated on textual or image content analysis independently. However, recent studies suggest that information from both texts and images often provides valuable insights about an event and thus leads to more precise inferences than learning from unimodal content [13]. Second, the few works that utilize multimodal features focus on applying CNN or RNN models for text feature representation [7], [8], which might not work well for longer sentences.
In this work, our goal is to develop an effective computational model for identifying disaster-related information by synergistically integrating features from the visual and textual modalities. More specifically, we extract the image features using a pre-trained visual model (i.e., ResNet50). We extract the textual features by integrating an attention mechanism with a BiLSTM network to address the long-range dependency problem of traditional RNN and CNN architectures. We then aggregate both types of features using deep-level fusion, followed by a softmax layer to classify the given tweet. We perform extensive experiments on a multimodal damage dataset, where the goal is to classify the damage type (e.g., fires, floods, infrastructure damage) from an image-tweet pair. We compare our models with several baselines that do not utilize multimodal features or do not apply attention mechanisms (Section V). The key findings from these experiments are: (i) utilizing multimodal features is more effective than unimodal features, and (ii) an RNN model with an attention mechanism can substantially improve performance compared to its counterpart without such a mechanism.
The primary contributions of our work are:
• We propose a multimodal architecture that utilizes ResNet50 and a BiLSTM recurrent neural network with an attention mechanism to classify damage-related posts by exploiting both visual and textual information.
• We compare the performance of the proposed model with a set of existing unimodal (i.e., image, text) and multimodal classification techniques.
• We empirically evaluate the proposed model on a benchmark dataset and demonstrate how introducing attention can enhance system performance through an intrinsic evaluation.
• We perform both quantitative and qualitative analyses to gain deeper insights into the error types, which provide future directions for improving the model.
The remainder of the article is organized as follows. First, we provide an overview of related research on disaster tweet classification in Section II. We then formulate the problem and describe the dataset in Section III, and present our proposed method in Section IV. Next, we present our experimental setup, key findings, and error analysis in Section V. Finally, we conclude the paper with possible future directions in Section VI.

II. RELATED WORK
A significant amount of work has been done to classify, extract, and summarize disaster-relevant information from social media; see [14] for a detailed survey. Here, we broadly categorize computational models that are closely related to our damage/disaster classification task in two ways: (i) unimodal approaches, which consider either text or images, and (ii) multimodal approaches, which consider both types of information. We discuss both types of approaches below.

A. UNIMODAL APPROACHES

1) Text-based Disaster Identification
Many previous studies have utilized social media texts and leveraged them for damage or disaster identification [15]. Early works focused on feature-engineering-based approaches and used models such as support vector machines (SVM) [16], random forests [17], and logistic regression classifiers [18]. Later, researchers widely used deep learning architectures such as CNN [19] and BiLSTM [20] for classifying disaster-related tweets. Caragea et al. [21] and Nguyen et al. [19] proposed CNN-based models to classify tweets into informative and non-informative categories, which provided significant improvements over feature-engineering-based approaches. Aipe et al. [22] also proposed a CNN-based approach, but they focused on multilabel classification rather than simple binary classification to label disaster-related tweets. Similarly, Yu et al. [23] used CNN, logistic regression, and SVM to classify tweets related to different hurricanes into multiple categories; their CNN-based model outperformed SVM and LR. In contrast to CNN-based approaches, we consider BiLSTMs with attention mechanisms with the aim of better capturing dependencies between word tokens.
Some researchers have focused on domain adaptation and cross-domain classification [24], [25]. Li et al. [24] studied the feasibility of domain adaptation for analyzing disaster tweets by applying a naive Bayes classifier to the Boston Marathon bombing and Hurricane Sandy datasets. Graf et al. [25] focused on cross-domain classification so that the classifier can be used across different types of disaster events; they employed a cross-domain classifier and utilized emotional, sentimental, and linguistic features extracted from damage-related tweets. Others have focused on text mining and summarization approaches [26], [27]. For example, Rudra et al. [26] assigned tweets to different situational classes and then summarized those tweets. Cameron et al. [27] proposed an Emergency Situation Awareness-Automated Web Text Mining (ESA-AWTM) system that detects informative damage-related Twitter messages to inform charitable organizations about the incidents of a disaster. Unlike these systems, which broadly focused on text mining and summarization, we focus specifically on a multi-class classification problem over disaster-related tweets.

2) Image-based Disaster Identification
Most works on identifying disasters from social media images have applied CNN-based classifiers. For example, Chaudhuri and Bose [28] used a CNN-based model to locate human body parts in wreckage images. Nguyen et al. [29] developed a deep CNN architecture to label social media images into multiple damage categories (i.e., severe, mild, and no damage). Similarly, Alam et al. [30] proposed a pretrained CNN (VGG16) based framework that can identify disaster images uploaded to online platforms. Daly and Thom [31] culled Flickr images to detect fire events using pretrained classifiers. Finally, Lagerstrom et al. [32] developed a system to classify whether an image indicates a fire event or not. In contrast to these works, which broadly developed binary classifiers to separate disaster from non-disaster images using CNN approaches, we focus on identifying multiple disaster categories from disaster-related images.

B. MULTIMODAL APPROACHES
In recent years, researchers have used multimodal data (i.e., image and text) to find disaster-related information on social media, as information from both modalities often provides valuable insights for disaster classification. Most of these works employed a fusion-based [33] approach to aggregate the multimodal features. Chen et al. [34] studied the relation between images and texts and utilized visual features along with socially relevant contextual features (e.g., time of posting, number of comments, retweets) to identify disaster information. Mouzannar et al. [7] explored damage detection by focusing on human and environmental damage-related posts; they used the Inception pre-trained model for visual feature extraction and designed a CNN architecture for textual features. Similarly, Rizk et al. [35] proposed a multimodal architecture to classify Twitter data into infrastructure and natural damage categories. Ferda et al. [8] also presented a multimodal approach for classifying tweets along two tasks: an informative task (e.g., informative vs. non-informative) and a humanitarian task (e.g., affected individuals; rescue, volunteering, or donation effort; infrastructure and utility damage). They used a CNN-based approach for extracting the visual and textual features. Gautam et al. [36] compared unimodal and multimodal methods on the CrisisMMD [37] dataset, utilizing the late fusion [38] approach for combining the image-tweet pairs. All of these works reported significant performance improvements from multimodal information in contrast to counterparts that utilize unimodal information.
Motivated by the success of these multimodal approaches, we focus on effectively utilizing features from text and images using BiLSTM and CNN models and then fusing them to form a joint representation for classification. However, unlike the above multimodal approaches, which used simple CNN/RNN models or n-gram features, we extract the textual features using a BiLSTM network with an attention mechanism to address the long-range dependency problem.

III. PROBLEM FORMULATION AND THE DATASET
In this work, our goal is to automatically classify disaster types such as floods, fires, and earthquakes from social media posts. Formally, we are given a dataset with M examples, where the i-th sample can be represented as (P_i, Y_i). Here, P_i and Y_i denote the post and the associated class label for the i-th data point. Each post P_i consists of two modalities: visual (v_i) and textual (t_i). Our model utilizes both v_i and t_i simultaneously to classify P_i into one of the K classes. We discuss the disaster types and analyze the dataset below.

A. DISASTER TYPES
We experiment with a benchmark multimodal damage dataset 1 from Mouzannar et al. [7], which consists of damage-related images along with their associated tweets. The dataset contains image-tweet pairs from five damage categories, namely damage infrastructure (DI), damaged nature (DN), fires, floods, and human damage (HD), as well as one non-damage (ND) category.

B. DATASET ANALYSIS
The dataset from Mouzannar et al. [7] consists of a total of 5,831 image-tweet pairs, where the training and test sets contain 5,247 and 584 samples, respectively. The class-wise breakdown of the train and test sets is reported in Table 1. We have also analyzed basic linguistic statistics, including token statistics and tweet lengths. Table 2 shows that the average number of words per tweet is over 28 in all classes. We also notice that the ND class contains the highest number of total and unique words, as this class has the maximum number of instances (2,666) in the dataset. On the other hand, HD has the lowest number of total and unique words. Figure 1 further shows how the tweet length varies across the different classes. We observe that shorter tweets are generally more common than longer ones, and most tweets contain fewer than 100 words. Overall, this distribution guides the choice of input text length during the training phase.

IV. PROPOSED METHODOLOGY
Figure 2 depicts our proposed multimodal architecture for disaster identification. The model consists of two parallel networks: one for visual feature extraction and another for textual feature extraction. We apply a pre-trained convolutional neural network (i.e., ResNet50) to extract the visual features and a BiLSTM model with an attention mechanism to obtain the textual features from the tweets. The features from both modalities are then aggregated to form a combined representation and passed into a softmax layer for classification. Each constituent part of the architecture is described in the following subsections.

A. DATA PREPROCESSING
We pre-process each image by resizing it to 150 × 150 × 3 so that all images have the same size and can be processed more efficiently. Furthermore, the image pixels are scaled between 0 and 1 to reduce the computational complexity during classifier training. Concerning the textual modality, we discard all hyperlinks in a tweet as well as special characters (e.g., !, @, $, %, &), punctuation symbols, and emoticons.
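A minimal sketch of this preprocessing is shown below; the exact set of characters removed is an assumption for illustration (the text only lists !, @, $, %, & as examples):

```python
import re

def clean_tweet(text):
    """Drop hyperlinks, special characters, punctuation, and emoticons from a tweet.
    The character whitelist (letters, digits, whitespace) is an illustrative choice."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove hyperlinks
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)         # remove special chars / punctuation
    return re.sub(r"\s+", " ", text).strip()

def scale_pixels(pixels):
    """Scale 8-bit pixel intensities from [0, 255] into [0, 1]."""
    return [p / 255.0 for p in pixels]

print(clean_tweet("Severe #flood in the city! http://t.co/xyz"))  # → "Severe flood in the city"
```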

B. VISUAL FEATURE EXTRACTION
We apply the transfer learning [39] technique to obtain the visual features from the image. To this end, we use the pretrained ResNet50 [40] model, mainly because it can address the vanishing gradient problem by utilizing skip connections across different layers [41]. To adjust ResNet50 for our task, we exclude the top two layers of the default model. We freeze the initial 40 layers to reuse the weights of the higher-level visual features previously learned on the ImageNet [42] task. The last 10 layers of the ResNet50 model, a global average max-pooling layer, and a dense layer are retrained with new weights. The dense layer computes the visual features according to the following equation:

V_f^(i)[k] = ReLU( Σ_j W_kj · G_j + b_k )    (1)

Here, V_f^(i) ∈ R^(1×d) represents the visual semantic expression extracted by ResNet50 for the i-th image, and d denotes the number of hidden neurons in the dense layer. Also, G_j indicates the j-th feature produced by the global average max-pooling layer, W_kj denotes the weight matrix, b_k represents the bias for the k-th dense node, and ReLU is the activation function.
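A sketch of this transfer-learning setup in Keras is given below; the exact layer slicing is an assumption, and we load `weights=None` to keep the sketch self-contained (in practice `weights="imagenet"` supplies the pretrained filters):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Pretrained backbone without its top classification layers.
# weights=None keeps this sketch offline; use weights="imagenet" for transfer learning.
base = tf.keras.applications.ResNet50(include_top=False, weights=None,
                                      input_shape=(150, 150, 3))

# Freeze every layer except the last 10, which are retrained with new weights.
for layer in base.layers[:-10]:
    layer.trainable = False

d = 200  # hidden neurons in the dense layer, as chosen in the text
visual_branch = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),     # pooled feature maps G_j
    layers.Dense(d, activation="relu"),  # visual feature vector V_f, Eq. (1)
])

feat = visual_branch(tf.zeros((1, 150, 150, 3)))  # one d-dimensional visual feature
```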

C. TEXTUAL FEATURE EXTRACTION
For textual feature extraction, we first transform the tweet into a vector representation and then use an embedding layer to obtain semantic representations (embedding features) of the words. We then feed the embedding features to the BiLSTM network, which produces a context-level feature vector for each word. Finally, the attention layer finds the most significant textual features from these feature vectors. We now describe each of these steps in detail.

1) Text to Vector Representation
To generate an initial vector representation of a tweet, we first generate a numeric mapping of the words of T = {t_1, t_2, ..., t_M}, where t_i represents a tweet. To obtain this mapping, we first create a vocabulary V = {uw_1, uw_2, ..., uw_ν} consisting of ν unique words. Each word in a tweet t_j = [w_1, w_2, ..., w_l'] is substituted by the index of that word in V. By doing so, a tweet t_j is transformed into a sequence vector s = [i_1, i_2, ..., i_l']. However, at this point, the obtained sequence vectors S' = {s'_1, s'_2, ..., s'_M} have variable lengths (l'), which is not appropriate for feature extraction and training. Therefore, we transform S' into fixed-length sequences S = {s_1, s_2, ..., s_M}, where each sequence s_k of S is a vector of size l. We choose l = 150 empirically based on the observation that most tweets in the dataset contain fewer than 100 words; choosing such a large dimension for the vector therefore allows us to capture the important information from different tweets.
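The mapping and padding steps can be sketched without any framework; reserving index 0 for both padding and out-of-vocabulary words is an illustrative assumption:

```python
def build_vocab(tweets):
    """Assign each unique word an index starting at 1 (0 is reserved for padding)."""
    vocab = {}
    for tweet in tweets:
        for word in tweet.split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab

def to_sequence(tweet, vocab, l=150):
    """Replace each word with its vocabulary index and pad/truncate to length l."""
    seq = [vocab.get(word, 0) for word in tweet.split()]
    return (seq + [0] * l)[:l]

vocab = build_vocab(["flood hits the city", "fire near the city"])
s = to_sequence("flood near the city", vocab, l=6)
print(s)  # → [1, 6, 3, 4, 0, 0]
```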

2) Embedding Layer
After creating the initial vector representation S, it is necessary to encode the semantic information of the words (w_i) of a tweet into a global vector s_e^k. For this purpose, we first pass each sequence vector s_k in S into the Keras embedding layer to obtain word embedding vectors (we_i). We then concatenate these word embedding vectors according to Eq. (2) to preserve the sequence of words:

s_e^k = we_1 ⊕ we_2 ⊕ ... ⊕ we_l    (2)

Here, we_i ∈ R^(1×ed) represents the embedding vector of the i-th word. We keep the embedding dimension large enough (ed = 100) to capture the relationships between words.

3) BiLSTM Layers
We apply a bidirectional LSTM to generate a contextual representation of the input text from both the backward and forward directions. Bidirectional LSTM [43] is an extension of the long-established LSTM RNN architecture and is suited to abating the vanishing gradient problem that occurs due to long context sizes. The model processes the tweet from we_1 to we_l with the forward LSTM and from we_l to we_1 with the backward LSTM. For each word w_i, the forward LSTM generates the word feature →h_i and the backward LSTM generates the word feature ←h_i using its memory blocks. The combined feature h_i is calculated by Eq. (3):

h_i = →h_i ⊕ ←h_i    (3)

Here, h_i ∈ R^(1×2N) denotes the BiLSTM feature generated for the i-th word in the p-th layer, where N represents the number of hidden units in an LSTM cell and ⊕ is the concatenation operator.
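The embedding and BiLSTM layers can be sketched in Keras as follows; the vocabulary size is a placeholder, ed = 100 follows the text, and N = 128 is borrowed from the baseline configuration. Each of the l time steps yields a 2N-dimensional feature h_i:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

l, vocab_size, ed, N = 150, 5000, 100, 128  # seq length, vocab size (placeholder), embed dim, LSTM units

text_encoder = models.Sequential([
    layers.Embedding(vocab_size, ed),                             # word embeddings we_i
    layers.Bidirectional(layers.LSTM(N, return_sequences=True)),  # h_i = forward ⊕ backward, Eq. (3)
])

H = text_encoder.predict(np.zeros((1, l), dtype="int32"))
print(H.shape)  # one 2N-dimensional feature per word: (1, 150, 256)
```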

4) Attention Layer
Generally, not all words in a tweet contribute equally to deciding whether the tweet belongs to a particular class. Therefore, we utilize the attention mechanism [44] to emphasize the most important words during classification. The attention mechanism assigns a weight att_j to each individual word feature h_j of the BiLSTM layer with a focus on the output labels. Finally, we perform a weighted-sum operation to generate an attentive feature vector av_t for the t-th tweet. More formally,

e_i = tanh( W h[2]_i + b )    (4)
att_i = exp( e_i^T c_w ) / Σ_j exp( e_j^T c_w )    (5)
av_t = Σ_{i=1}^{l} att_i · h[2]_i    (6)

Here, l is the length of a tweet and h[2]_i is the word feature vector obtained from the second BiLSTM layer, which is passed to a two-layer neural network to obtain e_i as a hidden representation of h[2]_i. The weight matrix W and bias vector b are initialized during neural network training. The influence of a word is measured by calculating the similarity between e_i and a randomly initialized word-level context vector c_w. Afterward, a normalized weight att_i is obtained for each word i in a tweet t using the softmax function; the attention weights for a tweet satisfy Σ_{i=1}^{l} att_i = 1. The larger att_i is, the more significant the word is for classification. Finally, the attentive feature av_t for a tweet is fed to a dense layer consisting of d neurons. The output can be represented as in Eq. (7):

T_f^(t)[k] = ReLU( Σ_j W_kj · av_t[j] + b_k )    (7)

Here, T_f^(t) ∈ R^(1×d) represents the d-dimensional feature vector for the t-th tweet, where d is the number of hidden neurons in the dense layer, and W_kj, b_k, and ReLU are the weight matrix, bias vector, and activation function, respectively.
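The attention computation described above can be sketched directly in NumPy; the dimensions below are illustrative:

```python
import numpy as np

def attention(H, W, b, c_w):
    """H: (l, 2N) word features from the BiLSTM; returns av (2N,) and weights att (l,)."""
    E = np.tanh(H @ W + b)            # hidden representation e_i of each word feature
    scores = E @ c_w                  # similarity with the word-level context vector c_w
    att = np.exp(scores - scores.max())
    att /= att.sum()                  # softmax-normalized weights, sum_i att_i = 1
    av = att @ H                      # attentive feature: weighted sum of word features
    return av, att

rng = np.random.default_rng(0)
l, two_n = 4, 6                       # tweet length, 2N feature size (illustrative)
H = rng.normal(size=(l, two_n))
W = rng.normal(size=(two_n, two_n)); b = np.zeros(two_n); c_w = rng.normal(size=two_n)
av, att = attention(H, W, b, c_w)
```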

D. DEEP LEVEL FUSION AND CLASSIFICATION
To create a shared representation of both modalities, we concatenate the outputs of the dense layers obtained from the visual (V_f) and textual (T_f) modalities. To attain the deep-level representation, we utilize an early fusion approach [45], which concatenates the visual and textual features. We use the same number of hidden nodes (d) in the last dense layer of both modalities, so that both the visual and textual sides contribute equally. We set d = 200 empirically based on the highest accuracy on the validation set.
Suppose the dataset contains M posts, where each post P_i consists of two types of information: visual (v_i) and textual (t_i). The visual and textual feature extractors produce the feature vectors V_f^(i) and T_f^(i), respectively. The fusion of these vectors is then computed as in Eq. (8):

FF^(i) = V_f^(i) ⊕ T_f^(i)    (8)

Here, FF^(i) ∈ R^(1×2d) is the concatenation (⊕) of the i-th visual and textual features. We then pass the fused feature vector through a final hidden layer of n neurons, followed by a softmax layer for classification. To mitigate the effect of overfitting, a dropout [46] layer is added before the hidden layer. The process is illustrated in Eqs. (9) and (10):

F_h^(i)[q] = ReLU( Σ_j W_qj · FF^(i)[j] + b_q )    (9)
Ŷ^(i) = softmax( W_s F_h^(i) + b_s )    (10)

Here, F_h^(i) ∈ R^(1×n) represents the final hidden layer output, where n = 50. The parameters W_qj and b_q are the weights and biases of the hidden layer, W_s and b_s are the parameters of the softmax layer, and K represents the number of classes for the classification task.
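The fusion head can be sketched in Keras as follows, with d = 200, n = 50, and K = 6 classes as stated in the text; the 10% dropout rate is borrowed from the multimodal baseline description and is an assumption for the proposed model:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

d, n, K = 200, 50, 6  # per-modality dense size, final hidden units, number of classes

v_in = layers.Input(shape=(d,))                  # visual features V_f from the ResNet50 branch
t_in = layers.Input(shape=(d,))                  # textual features T_f from the BiLSTM+attention branch
ff = layers.Concatenate()([v_in, t_in])          # FF = V_f ⊕ T_f, a 2d-dimensional joint vector, Eq. (8)
ff = layers.Dropout(0.1)(ff)                     # mitigate overfitting before the hidden layer
h = layers.Dense(n, activation="relu")(ff)       # final hidden layer F_h, Eq. (9)
out = layers.Dense(K, activation="softmax")(h)   # class probabilities over K disaster types, Eq. (10)
fusion_head = Model([v_in, t_in], out)

probs = fusion_head.predict([np.zeros((1, d)), np.zeros((1, d))])
```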

E. MODEL HYPERPARAMETERS
We use the Keras Tuner [47] to optimize hyperparameters such as the learning rate and batch size. We first configure the search space with different values for each hyperparameter (e.g., optimizer, learning rate) and then leverage the Hyperband [48] search algorithm to find the best hyperparameter values for the proposed model. The values are adjusted based on their impact on the validation set performance (i.e., accuracy). However, to reduce the computational cost, other hyperparameters such as the number of hidden units, the number of LSTM cells, the dropout rate, and the embedding dimension are not tuned; they are selected empirically. Table 3 shows the optimized hyperparameter values of the proposed model.
The proposed model is compiled using the categorical cross-entropy loss function and the Adam optimizer with a learning rate of 3e-3. Training is performed for 100 epochs with a batch size of 64. Additionally, the Keras checkpoint mechanism is utilized to stop overtraining of the model by monitoring the validation accuracy for up to five consecutive epochs. The code of this work is available at the link 2.
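A configuration sketch of the Hyperband search with Keras Tuner is shown below; the search-space values are illustrative rather than the paper's exact grid, and `build_fusion_model` is a hypothetical helper standing in for the full multimodal architecture described above:

```python
import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
    # Only the tuned hyperparameters are shown; build_fusion_model is a
    # hypothetical helper returning the uncompiled multimodal model.
    lr = hp.Choice("learning_rate", [1e-2, 3e-3, 1e-3, 1e-4])
    opt_name = hp.Choice("optimizer", ["adam", "rmsprop"])
    model = build_fusion_model()
    model.compile(
        optimizer=tf.keras.optimizers.get({"class_name": opt_name,
                                           "config": {"learning_rate": lr}}),
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

tuner = kt.Hyperband(build_model, objective="val_accuracy", max_epochs=100)
# tuner.search(train_inputs, train_labels, validation_split=0.1, batch_size=64)
```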

V. EXPERIMENTS AND ANALYSIS
In this section, we first describe the baseline models that we compared with. We then present the comparative performance analysis of the proposed approach with these baselines. Finally, we provide an in-depth error analysis along with intrinsic performance analysis.

A. BASELINES
We consider three types of baselines based on the features they use: (i) visual only, (ii) textual only, and (iii) visual + textual.

1) Visual Only
For the visual (i.e., image) modality, we consider two state-of-the-art pretrained CNN architectures, VGG19 and InceptionV3, along with ResNet50 (described in Section IV-B). These architectures are used for a wide range of image classification tasks. A variant of the VGG [49] model, VGG19 consists of 19 weight layers (16 convolutional and 3 fully connected), using a fixed 3×3 kernel in each convolutional layer. In contrast, InceptionV3 [50] is an advanced version of GoogLeNet [51], comprising several inception modules. Each module contains a series of stacked convolutional filters (1 × 1, 3 × 3, 5 × 5), making the architecture more robust in learning with fewer parameters. We excluded the top layers from both architectures and froze all but the last 10 layers of the networks. We used the pre-trained weights of the initial layers, while we retrained the last 10 layers and a global average max-pooling layer with new weights. Finally, a softmax layer is added for classification.

2) Textual Only
We apply the following deep learning models for classifying the damage types using only textual features: BiLSTM [52], CNN [53], BiLSTM + CNN [54], and BiLSTM + Attention [55]. We utilize word embedding features with each model: the Keras embedding layer is initialized with an embedding dimension of 100, and the input text length is set to 150. The computed features are then passed to each model. The BiLSTM network consists of one layer with 128 hidden units, and the final hidden state of the BiLSTM layer is passed to a softmax layer for classification. Similarly, the CNN architecture has one convolutional layer with 128 filters of kernel size 2 and a max-pooling layer with window size 2; a flattening layer is added before the softmax classification. The BiLSTM + CNN network is configured by stacking the BiLSTM and CNN architectures with the same parameters, and the output of the stacked network is passed to the softmax layer. Finally, an attention layer is added after the BiLSTM layer to create the BiLSTM + Attention model (described in Section IV-C4); the obtained attention vector is passed into a dense layer of 20 neurons, followed by a softmax layer for classification.

3) Multimodal (Textual + Visual) based Models
We experimented with 11 different models for combining the text and image modalities, namely VGG19 + BiLSTM, VGG19 + CNNText, VGG19 + BiLSTM + CNN, VGG19 + BiLSTM + Attention, Inception + BiLSTM, Inception + CNNText, Inception + BiLSTM + CNN, Inception + BiLSTM + Attention, ResNet50 + BiLSTM, ResNet50 + CNNText, and ResNet50 + BiLSTM + CNNText. Instead of a softmax layer at the end of each model, a hidden layer of 200 neurons is placed. The hidden layers from the visual and textual sides are then concatenated using the early fusion approach (described in Section IV-D) to produce a shared representation of both modalities. We then pass the joint representation into a dense layer of 64 neurons, followed by a softmax layer. After the concatenation operation, we add a dropout layer (dropout rate = 10%) to reduce the chance of overfitting.

B. IMPLEMENTATION SETTINGS
All the visual and textual models are compiled using the Adam optimizer with learning rates of 1e-5 and 1e-4, respectively. For the multimodal case, the models with VGG19 and Inception use the RMSProp optimizer with a learning rate of 1e-3, whereas the multimodal models with ResNet50 are compiled using the Adam optimizer (learning rate = 3e-3). The other hyperparameters (i.e., loss function, batch size, epochs) and the training configuration (i.e., Keras checkpoint) are kept the same as described in Section IV-E.
Training and testing of the models are conducted on the Google Colab platform using Python 3.6.9. Models are implemented using Keras 2.4.0 with the TensorFlow 2.3.0 framework. For data preparation and evaluation, the pandas and Scikit-learn 0.22.2 packages are used. We use 10% of the training dataset for validation and the remaining data for training. Finally, the trained models are evaluated on the test set instances.

C. EVALUATION MEASURES
For performance comparison, we use precision (P), recall (R), and the weighted F1-score. For efficient comparison of the models' performance across different classes, the misclassification rate (MR) is used as an additional measure. Precision is the fraction of predicted instances of a class that are correct, recall is the fraction of actual instances of a class that are retrieved, and the F1-score is the harmonic mean of precision and recall (F = 2PR / (P + R)). However, considering the data imbalance scenario, we also calculate the weighted F1-score (WF), defined as

WF = (1 / TS) Σ_j n_j · F_j

where TS, F_j, and n_j denote the total number of samples in the test set, the F1-score of class j, and the number of samples in class j, respectively.
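These two scores can be computed as a simple check (the values below are illustrative, not results from the paper):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall: F = 2PR / (P + R)."""
    return 2 * precision * recall / (precision + recall)

def weighted_f1(f1_scores, supports):
    """WF = (1/TS) * sum_j n_j * F_j, weighting each class by its test-set support n_j."""
    ts = sum(supports)
    return sum(f * n for f, n in zip(f1_scores, supports)) / ts

wf = weighted_f1([0.9, 0.5], [90, 10])  # a strong majority class dominates the average
```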
We use the weighted F1-score as the primary metric to compare the models. However, we also report P, R, and MR for a deeper analysis of each model's performance on the individual classes.

D. RESULTS
Table 4 shows the results of both the unimodal (textual only and visual only) and multimodal (textual + visual) models. We observe that among the models utilizing the visual modality only, VGG19 and ResNet50 perform slightly better than the Inception model in terms of weighted F1. Textual models perform better than their visual-only counterparts; among them, CNNText and BiLSTM + CNNText perform similarly. Interestingly, performance increases sharply, by 3.18%, when attention is incorporated with the BiLSTM (BiLSTM + Attention) compared to the BiLSTM-only model. Overall, this suggests the usefulness of incorporating an attention mechanism for our disaster type classification task.
Among the models that aggregate both visual and textual features, only two performed better than the best unimodal counterpart (BiLSTM + Attention). In particular, the multimodal model VGG19 + BiLSTM + Attention showed a noticeable rise in WF-score (89.19%). Our proposed method (ResNet50 + BiLSTM + Attention) achieves the state-of-the-art result, attaining the highest WF-score of 93.21%, a margin of 4.02% over VGG19 + BiLSTM + Attention.
To verify the robustness of the models' performance, we performed 5-fold cross-validation [56] with seven models, including the proposed model (ResNet50 + BiLSTM + Attention), the best visual model (ResNet50), the best textual model (BiLSTM + Attention), and the best four other multimodal models (i.e., ResNet50 + BiLSTM + CNNText, Inception + BiLSTM + Attention, ResNet50 + BiLSTM, and VGG19 + BiLSTM + Attention). Table 5 shows the cross-validation results. The proposed model achieved the highest mean weighted F1-score of 93.10%, with a standard deviation of 0.36031. The mean and standard deviation values of the models reveal that different splits of the dataset have no significant impact on the models' performance.
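The 5-fold protocol can be sketched with scikit-learn; the model-fitting step is elided, and only the split bookkeeping is shown:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100)  # stand-in for the training samples
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []
for train_idx, val_idx in kf.split(X):
    # Fit the multimodal model on X[train_idx] and record the weighted F1
    # on X[val_idx] here; in this sketch we only record the fold sizes.
    fold_sizes.append(len(val_idx))

print(fold_sizes)  # each fold holds out one fifth of the data
```

The per-fold weighted F1 scores would then be summarized by their mean and standard deviation, as reported in Table 5.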

1) Investigating Classification Reports
To obtain deeper insights into the performance of the different models on the individual classes, we examine their classification reports (Figure 3). For the best visual model (ResNet50), the DI, HD, and ND classes obtained high precision values (0.804, 0.736, 0.891) but lower recall scores (0.770, 0.666, 0.872), as shown in Figure 3(a). These results indicate that some instances of each class are incorrectly identified as other classes. In contrast, the DN, Fires, and Flood classes attain high recall but low precision scores. We also notice that the DN, Flood, and HD classes have relatively low f1-scores of 0.632, 0.72, and 0.70, respectively, compared to DI (0.787), Fires (0.846), and ND (0.881).
For the best textual model (BiLSTM + Attention), the overall performance increases across the classes, as depicted in Figure 3(b). However, the F1-score of the Fires class is surprisingly reduced from 0.846 to 0.75, which suggests that visual information is more critical than textual information for this particular class. Finally, for the proposed model, the precision and recall scores of all the classes improve by considerable margins compared to the best textual model (BiLSTM+Attention). Overall, the results show that the proposed approach (ResNet50 + BiLSTM + Attention) outperforms all the visual, textual, and multimodal models in classifying disaster information. ResNet50 obtained the highest WF-score among the visual models, while BiLSTM+Attention attained the highest WF-score among the textual-only models. We also notice that models that utilize both textual and visual features do not necessarily improve over the textual-only models. This indicates that the superior performance of our model is primarily due to the incorporation of attention on the textual side, which effectively captures the salient parts of the input text.
To further analyze the cases where our model makes a difference, Figure 4 shows the confusion matrices of the different models. We notice that the visual-only model (ResNet50) wrongly classified 18 out of 144 instances of "Damage Infrastructure" (DI) as the "Non Damage" (ND) class (Figure 4(a)). In contrast, the textual and proposed models incorrectly predicted only 6 and 3 instances, respectively (Figures 4(b) and 4(c)). These results indicate that fusing information from both modalities helps the proposed model comprehend the "Damage Infrastructure (DI)" category better, significantly curtailing the prediction errors. A different phenomenon is observed where the predicted label is "Fires" but the actual label is "Non Damage": the textual model did not misclassify a single instance (Figure 4(b)), whereas the visual and multimodal models wrongly classified 5 and 3 instances, respectively (Figures 4(a) and 4(c)). This suggests that for certain categories, unimodal models can be more effective, and further investigation is required to address the noise that may be introduced when the two modalities are combined.

Figure 5 compares the proportion of instances that are misclassified in each class. Most of the mistakes made by the best visual model (ResNet50) occur in the "HD" (33%), "DN" (32.72%), "Flood" (25%), and "DI" (23%) categories. In contrast, the misclassification rates for the "ND" (12.71%) and "Fires" (10.8%) classes are comparatively low, which is also evident from the counts of misclassified instances in Figure 4(a). For the textual model (BiLSTM + Attention), the MR is reduced for almost every class except "Fires", which rises from 10.8% to 35%. The MR of the textual model decreases by particularly large margins of approximately 10%, 12%, and 29% for the "ND", "Flood", and "HD" classes, respectively.
Finally, the proposed model produces the fewest mistakes across the different classes (Figure 4(c)). While the MR of the "ND" category increases by approximately 1% (from 1.74% to 2.74%) compared to the textual model, the other classes experience considerable drops of approximately 5% (DI), 15% (DN), 30% (Fires), and 5% (Flood).
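The per-class misclassification rates discussed above come directly off the diagonal of the confusion matrix. A minimal sketch with invented labels (not the paper's actual predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["DI", "DN", "Fires", "Flood", "HD", "ND"]
# Hypothetical gold and predicted labels for illustration only.
y_true = ["DI", "DI", "DI", "ND", "ND", "Fires", "Flood", "HD"]
y_pred = ["DI", "ND", "DI", "ND", "Fires", "Fires", "Flood", "DN"]

cm = confusion_matrix(y_true, y_pred, labels=labels)

# Per-class misclassification rate: the share of a class's true
# instances that fall off the diagonal of the confusion matrix.
support = cm.sum(axis=1)
mr = 1 - np.diag(cm) / np.maximum(support, 1)
for name, rate, n in zip(labels, mr, support):
    if n:
        print(f"{name}: MR = {rate:.2%} over {n} instances")
```

Reading MR per class, rather than a single aggregate error rate, is what surfaces asymmetries such as the "Fires" class degrading under the textual model while every other class improves.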
2) Qualitative Analysis
Table 6 shows representative samples along with the outputs of the different models. Overall, these outputs illustrate the need for combining the two modalities. For example, in Table 6 sample (1), the visual model wrongly classifies the image into the "Damage Nature (DN)" class since the image contains some trees and leaves. The text-only model also classifies the sample as DN because of the presence of words like '#fallentree' and '#treebranch'. However, when the features from these two modalities are combined, they no longer provide evidence for the model to infer this image-text pair as "DN". Likewise, in Table 6 sample (4), the visual model classifies the image as "Damage Infrastructure (DI)", and the textual model also makes an incorrect prediction due to the presence of the word '#buildingcollapse'. However, fusing information from both modalities leads the model to correctly predict the "Fires" category. Finally, in Table 6 sample (5), the visual model considers the image to be "DI" because it shows broken-road-like patterns, whereas the textual model assigns the text to the "Flood" class as it mentions flood-related words such as '#flood' and '#tsunami'. However, the proposed multimodal model conjoins the information from both modalities and yields the correct prediction (i.e., "Damage Nature"). Overall, these analyses confirm the suitability of the proposed multimodal approach over the other models in classifying damage information.

3) Intrinsic Performance Analysis
To further understand the possible reason for the superior performance of our proposed approach over the other models, we performed an intrinsic performance analysis. In this analysis, we focus on how the attention layer impacts the performance of the proposed method by comparing it with its counterpart (ResNet50+BiLSTM), where the only difference is the absence of the attention mechanism. Figure 6 shows the feature visualizations of the two multimodal models (with and without attention). The projected data points are obtained by applying principal component analysis (PCA) [57] to the extracted hidden features. We observe that the multimodal model without attention (Figure 6(a)) does not separate all the classes as clearly as the model with attention (Figure 6(b)). Incorporating the attention layer makes classes such as "ND", "Fires", "DI", and "DN" more separable and thus enhances the performance across the different classes.
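The PCA projection behind Figure 6 takes only a few lines; the random feature matrix below is a placeholder for the model's actual hidden activations:

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder for the fused hidden features of 300 samples (e.g., the
# penultimate-layer activations of the multimodal model) and their labels.
rng = np.random.default_rng(7)
features = rng.standard_normal((300, 128))
labels = rng.integers(0, 6, size=300)

# Project to 2-D with PCA for visual inspection of class separability.
proj = PCA(n_components=2).fit_transform(features)
print(proj.shape)  # two coordinates per sample

# With matplotlib available, one would scatter-plot the projection
# colored by class: plt.scatter(proj[:, 0], proj[:, 1], c=labels)
```

Well-separated clusters in this 2-D view suggest that the hidden representation is nearly linearly separable, which is what the attention-equipped model exhibits in Figure 6(b).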

4) Proposed vs Existing Methods
To the best of our knowledge, no significant work has been conducted on the multimodal dataset used in this research except the work by [7]. However, that study is not directly comparable with the proposed method due to differences in evaluation measures and dataset distribution. Therefore, for comparison, we adopt several recent techniques that have been explored on similar tasks. For uniformity, the existing methods [7], [8], [11], [18], [21]-[23], [29] have been implemented on the same dataset as the proposed method. Table 7 shows the results of the comparison. Mouzannar et al. [7] developed a multimodal model that uses a pre-trained Inception model for the image modality and a convolutional neural network [58] for the textual modality; by replicating their architecture, we obtained a WF-score of 92.21%. Ferda et al. [8] utilized VGG16 + CNNText, which achieved a WF-score of 75.11%. Kumar et al. [11] applied VGG16 (for image) + LSTM (for text) and obtained a WF-score of 77.84%. Three other works [21]-[23] were implemented considering only the textual modality, applying custom and pre-trained CNNs; these three methods also achieved WF-scores below 80%. Another work [18] employed a logistic regression classifier for text-based classification and achieved an 86.05% WF-score. The comparative analysis illustrates that the proposed technique outperforms the existing works by achieving the highest WF-score of 93.21%. In particular, it is almost 1% higher than the multimodal method (92.14%) [7] and 7% higher than the best unimodal technique (86.05%) [18].

TABLE 7. Performance comparison of the proposed method with existing methods.

Method                 Modality     WF (%)
Mouzannar et al. [7]   Image+Text   92.14
Ferda et al. [8]       Image+Text   75.11
Kumar et al. [11]      Image+Text   77.84
Nguyen et al. [29]     Image-only   75.17
Caragea et al. [21]    Text-only    75.23
Aipe et al. [22]       Text-only    76.76
Yu et al. [23]         Text-only    78.47
Xiao et al. [18]       Text-only    86.05
Proposed               Image+Text   93.21

VI. CONCLUSION
We have presented a multimodal approach that can effectively learn from image and text data to classify damage-related contents from Twitter. We utilize the pre-trained ResNet model for visual feature extraction and a BiLSTM model with an attention mechanism to extract the tweet features. An early fusion approach is used to aggregate the features of both modalities. Besides, this work investigated various visual (i.e., VGG19, Inception) and textual (i.e., BiLSTM, CNN, BiLSTM+CNN, BiLSTM+Attention) approaches for the baseline evaluation and constructed several multimodal models by combining them. The evaluation results revealed that the proposed model outperforms the baseline unimodal (i.e., image, text) and multimodal models by attaining the highest weighted F1-score of 93.21%. Moreover, the comparative analysis illustrated that the proposed method is approximately 1% and 7% ahead of the existing state-of-the-art multimodal and unimodal models, respectively. Thus, the results confirm the effectiveness of the proposed method in identifying disaster content based on multimodal information. The error analysis further showed that it is difficult to identify damage and non-damage contents by analyzing only one modality. At the same time, the intrinsic performance analysis elucidated that incorporating an attention mechanism boosts the overall performance. Despite achieving better performance than unimodal approaches, there is still room for improving the proposed method. In the future, we would like to explore different multimodal fusion approaches along with multitask learning techniques for the disaster identification task.
Besides, we aim to capture the combination of visual and textual features more effectively by employing state-of-the-art visual (e.g., Vision Transformer [59]), textual (e.g., BERT [60], XLM-R [61]), and multimodal (e.g., VL-BERT [62], Visual BERT [63]) transformer models.