Robust Training of Social Media Image Classification Models

Images shared on social media help crisis managers gain situational awareness and assess incurred damages, among other response tasks. As the volume and velocity of such content are typically high, real-time image classification has become an urgent need for faster disaster response. Recent advances in computer vision and deep neural networks have enabled the development of image classification models for a number of tasks, including detecting crisis incidents, filtering irrelevant images, classifying images into specific humanitarian categories, and assessing the severity of the damage. To develop robust models, it is necessary to understand the capability of the publicly available pretrained models for these tasks, which remains underexplored in the crisis informatics literature. In this study, we address this limitation by investigating ten different network architectures for four different tasks using the largest publicly available datasets for these tasks. We also explore various data augmentation strategies, semi-supervised techniques, and a multitask learning setup. In our extensive experiments, we achieve promising results.


Introduction
Social media is widely used during natural or human-induced disasters to disseminate information and obtain valuable insights quickly. People post content (i.e., through different modalities such as text, image, and video) on social media to ask for help, to offer support, to identify urgent needs, or to share their feelings. Such information is helpful for humanitarian organizations to plan and launch relief operations. As the volume and velocity of the content are high, it is crucial to have real-time systems that process social media content automatically to facilitate rapid response. There has been a surge of research in this domain in the past couple of years. The focus has been to analyze social media data and develop computational models using varying modalities to extract actionable information. Among different modalities (e.g., text and image), more focus has been given to textual content analysis than to imagery content (see [31,59,33] for comprehensive surveys). However, many past research works have demonstrated that images shared on social media during a disaster event can also assist humanitarian organizations. For example, Nguyen et al. [49] use images shared on Twitter to assess the severity of the infrastructure damage, and Mouzannar et al. [47] focus on identifying damages in infrastructure as well as environmental elements.
For a clear understanding, we provide an example pipeline in Figure 1a, which demonstrates how different disaster-related image classification models can be used in real time for information categorization. As presented in the figure, the four classification tasks, namely (i) disaster types, (ii) informativeness, (iii) humanitarian, and (iv) damage severity assessment, can significantly help crisis responders during disaster events. For example, the disaster type classification model can be used for real-time event detection, as shown in Figure 1b. Similarly, the informativeness model can be used to filter non-informative images, the humanitarian model can be used to discover fine-grained categories, and the damage severity model can be used to assess the impact of the disaster. Current literature reports either one or two tasks using one or two network architectures. Another limitation is that there have been limited datasets for disaster-related image classification. Very recently, the study by Alam et al. [9] developed a benchmark dataset, which is consolidated from existing publicly available resources. The development process of this dataset consists of data curation from different existing sources, development of new data for new tasks, and creation of non-overlapping training, development, and test sets. The reported benchmark dataset targets the four tasks shown in Figure 1a.
In this study, we build upon [9] and address the aforementioned limitations by posing the following Research Questions (RQs):
- RQ1: Can data consolidation help?
- RQ2: Among various neural network architectures with pre-trained weights, which one is more suitable for different downstream disaster-related image classification tasks?
- RQ3: Does data augmentation or semi-supervised learning help to improve the performance?
- RQ4: Is multitask learning an ideal solution to reduce computational complexity when there is a need to make predictions for multiple tasks simultaneously?
To understand the benefits of data consolidation (RQ1), we extended the work by Alam et al. [9] with more in-depth analysis. Our motivation for RQ2 is that there has been significant progress in neural network architectures for image processing in the past few years; however, they have not been widely explored in the crisis informatics domain for disaster response tasks. Hence, we investigated several neural network architectures for different disaster-related image classification tasks. Since augmentation and self-training-based techniques [16,42] have been shown to yield more generalized models and sometimes improve performance, we posed RQ3 and investigated them for the mentioned tasks. For the real-time social media image classification tasks shown in Figure 1, it is necessary to run the mentioned models in sequence or in parallel for the same input image. Running multiple models can be prohibitively expensive when there is a need to analyze many social media images in real time. Having a single model that deals with multiple tasks can significantly reduce the computational cost. Hence, we posed RQ4 to instigate research in this direction. The Crisis Benchmark Dataset was not originally developed for a multitask learning setup. However, the related metadata (e.g., image ids) are available, and we utilized this information to create data splits for multitask learning while trying to maintain the same training, development, and test splits. As our experiments show, this is challenging due to the incomplete labels for different tasks (see more details in Section 4.6).
To summarize, our contributions in this study are as follows:
- We present more detailed results highlighting the benefit of data consolidation.
- We address four tasks using several state-of-the-art neural network architectures on different data splits.
- We investigate various data augmentation techniques and show that model generalization improves with data augmentation.
- We explore semi-supervised learning and multitask learning to have a single model while addressing multiple tasks. Based on the findings, we provide research directions for future studies.
- We also provide insights using Gradient-weighted Class Activation Mapping [62] to demonstrate what class-specific discriminative properties are learned by the networks.
The rest of the paper is organized as follows. Section 2 provides a brief overview of the existing work. Section 3 introduces the tasks and describes the datasets used in this study. Section 4 explains the experiments, Section 5 presents the results, and Section 6 provides a discussion. Finally, we conclude the paper in Section 8.

Social Media Images for Disaster Response
The studies on image processing in the crisis informatics domain are relatively few compared to the studies on analyzing textual content for humanitarian aid. With recent successes of deep learning for image classification, research works have started to use social media images for humanitarian aid. The importance of imagery content on social media for disaster response tasks has been reported in many studies [56,17,15,48,49,5,8]. For instance, the analysis of flood images has been studied in [56], in which the authors reported that images accompanied by relevant textual content are more informative. Similarly, the study by Daly and Thom [17] analyzed fire event images extracted from social media data. Their findings suggest that images with geotagged information are helpful to locate the fire-affected areas.
The analysis of imagery content shared on social media has recently been explored using deep learning techniques for damage assessment purposes. Most of these studies categorize the severity of damage into discrete levels [48,49,5], whereas others quantify the damage severity as a continuous-valued index [50,44]. Other related work includes addressing the data scarcity issue by employing more sophisticated models such as adversarial networks [43,57], disaster image retrieval [4], image classification in the context of bush fire emergency [40], a flooding photo screening system [51], sentiment analysis from disaster images [22], monitoring natural disasters using satellite images [3], and flood detection using visual features [34].

Real-time Systems
Recently, Alam et al. [8] presented an image processing pipeline to extract meaningful information from social media images during a crisis situation, which has been developed using deep learning-based techniques. Their image processing pipeline includes collecting images, removing duplicates, filtering irrelevant images, and finally classifying them by damage severity. Such a system has been used during several disaster events; one example is the deployment during Hurricane Dorian, reported in [30]. The system was deployed for 13 days and collected ∼280K images. These images were then automatically classified and used by a volunteer response organization, Montgomery County Maryland Community Emergency Response Team (MCCERT). Another example use case is the early detection of disaster-related damage to cultural heritage [39].

Multimodality (Image and Text)
The exploration of multimodality has also received attention in the research community [2,1]. In [2], the authors explore different fusion strategies for multimodal learning. Similarly, in [1] a cross-attention-based network is exploited for multimodal fusion. The study in [28] reports a multimodal system for flood image detection, which achieves a precision of 87.4% on a balanced test set. In another study, the authors propose a similar multimodal system for on-topic vs. off-topic social media post classification and report an accuracy of 92.94% with imagery content [27]. The study in [20] explores different classical machine learning algorithms to classify relevant vs. irrelevant tweets using textual and imagery information. On the imagery content, they achieved an F1-score of 87.74% using XGBoost [14]. The study in [58] proposes a simple, computationally inexpensive, multimodal two-stage framework to classify tweets (text and image) with built-infrastructure damage vs. nature damage. The study evaluated the approach using a home-grown dataset and the SUN dataset [73]. The study by Mouzannar et al. [47] proposed a multimodal dataset, which has been developed for training a damage detection model. Similarly, Ofli et al. [52] explore unimodal as well as different multimodal modeling approaches based on a collection of multimodal social media posts.

Transfer Learning for Image Classification
For the image classification task, transfer learning has been a popular approach, where a pre-trained neural network is used to train a model for a new task [76,64,55,54,52,47]. For this study, we follow the same approach using different deep learning architectures.

Datasets
Currently, publicly available datasets include the damage severity assessment dataset [49], CrisisMMD [7], and the damage identification multimodal dataset [47]. The first dataset is only annotated for images, whereas the last two are annotated for both text and images. Other relevant datasets are Disaster Image Retrieval from Social Media (DIRSM) [13] and MediaEval 2018 [10]. The dataset reported in [21] is constructed for detecting damage as an anomaly using pre- and post-disaster images. It consists of 700,000 building annotations. A similar and relevant work is the development of the Incidents dataset [72], which consists of 446,684 manually labeled Web images with 43 incident categories. The Crisis Benchmark Dataset reported in [9] is the largest so far for social media disaster image classification.
For this study, we use the Crisis Benchmark Dataset, and our study differs from [9] in a number of ways. We provide more detailed experimental results on dataset comparison (i.e., individual vs. consolidated), compare different network architectures with a statistical significance test, and report the efficacy of data augmentation. We have also utilized a large unlabeled dataset to enhance the capability of the current model. We created multitask data splits from the Crisis Benchmark Dataset and report experimental results using both missing/incomplete and complete labels, which can serve as a baseline for future work.

Tasks and Datasets
For this study, we addressed four different disaster-related tasks that are important for humanitarian aid. Below we provide details of each task and the associated class labels.

Disaster type detection
When ingesting images from unfiltered social media streams, it is important to detect different disaster types automatically from these images. For instance, an image can depict a wildfire, flood, earthquake, hurricane, or another type of disaster. In the literature, disaster types have been defined in different hierarchical categories such as natural, human-induced, and hybrid [63]. Natural disasters are events that result from natural phenomena (e.g., fire, flood, earthquake). Human-induced disasters result from human actions (e.g., terrorist attacks, accidents, wars, and conflicts). Hybrid disasters result from human actions that affect natural phenomena afterward (e.g., deforestation results in soil erosion and climate change). The class labels for disaster type include (i) earthquake, (ii) fire, (iii) flood, (iv) hurricane, (v) landslide, (vi) other disaster, to cover all other disaster types (e.g., plane crash), and (vii) not disaster, for images that do not show any identifiable disasters.

Informativeness
Images posted on social media during disasters do not always contain informative or useful content for humanitarian aid (an informative image is, e.g., one showing damaged infrastructure due to flood, fire, or any other disaster event). It is necessary to remove any irrelevant or redundant content to support crisis responders' efforts more effectively. Therefore, the purpose of this classification task is to filter out irrelevant images. The class labels for this task are (i) informative and (ii) not informative.

Humanitarian
An important task for crisis responders is to assist people based on their needs, which requires information to be classified into more fine-grained categories so that specific actions can be taken. In the literature, humanitarian categories often include affected individuals; injured or dead people; infrastructure and utility damage; missing or found people; rescue, volunteering, or donation effort; and vehicle damage [7]. In this study, we focus on four categories that are deemed to be the most prominent and important for crisis responders: (i) affected, injured, or dead people, (ii) infrastructure and utility damage, (iii) rescue volunteering or donation effort, and (iv) not humanitarian.

Damage severity
Assessing the severity of the damage is important to help the affected community during disaster events. The severity of damage can be assessed based on the physical destruction to a built structure visible in an image (e.g., destruction of bridges, roads, buildings, burned houses, and forests). Following the work reported in [49], we define the categories for this classification task as (i) severe damage, (ii) mild damage, and (iii) little or none. Figure 2 shows an example image with the labels for all four tasks.
Fig. 2: An image annotated as (i) fire event, (ii) informative, (iii) infrastructure and utility damage, and (iv) severe damage.

Datasets
As mentioned earlier, we used the dataset reported in [9]. This dataset has been developed by curating existing publicly available sources and creating non-overlapping training, development, and test splits. For the sake of clarity and completeness, we provide a brief overview of the dataset. More details of the dataset curation and consolidation process can be found in [9].

Damage Assessment Dataset (DAD)
The damage assessment dataset consists of labeled imagery data with damage severity levels such as severe, mild, and little-to-no damage [49]. The images have been collected from two sources: AIDR [32] and Google. To crawl data from Google, the authors used the following keywords: damage building, damage bridge, and damage road. The images from AIDR were collected from Twitter during different disaster events such as Typhoon Ruby, Nepal Earthquake, Ecuador Earthquake, and Hurricane Matthew. The dataset contains ∼25K images annotated by paid workers as well as volunteers. In this study, we use this dataset for the informativeness and damage severity tasks. For the informativeness task, the study in [9] mapped the mild and severe images into the informative class and manually categorized the little-to-no damage images into informative and not informative categories. For the damage severity task, the label little-to-no damage was mapped into little or none to align with other datasets.

CrisisMMD
This is a multimodal (i.e., text and image) dataset, which consists of 18,082 images collected from tweets during seven disaster events crawled by the AIDR system [7]. The data is annotated by crowd workers using the Figure-Eight platform for three different tasks: (i) informativeness with binary labels (i.e., informative vs. not informative), (ii) humanitarian with eight class labels (i.e., "infrastructure and utility damage", "vehicle damage", "rescue, volunteering, or donation effort", "injured or dead people", "affected individuals", "missing or found people", "other relevant information", and "not relevant"), and (iii) damage severity assessment with three labels (i.e., severe, mild, and "little or no damage"). For the humanitarian task, similar class labels are grouped together. The images with labels injured or dead people and affected individuals are mapped into one class label, affected, injured, or dead people; infrastructure and utility damage and vehicle damage are mapped into infrastructure and utility damage; other relevant information and not relevant are mapped into not humanitarian.
The images with the label missing or found people are removed, as they are difficult to identify. This results in four class labels for the humanitarian task.
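The label grouping described above can be written down as a lookup table. The following is a minimal sketch, not the code used to build the dataset: the names `HUMANITARIAN_MAP`, `DROPPED`, and `consolidate` are hypothetical, and the identity mapping for the rescue class is an assumption inferred from the four final class labels.

```python
# Hypothetical lookup table; label strings follow the text above.
HUMANITARIAN_MAP = {
    "injured or dead people": "affected, injured, or dead people",
    "affected individuals": "affected, injured, or dead people",
    "infrastructure and utility damage": "infrastructure and utility damage",
    "vehicle damage": "infrastructure and utility damage",
    # Assumption: the rescue class is kept as its own class.
    "rescue, volunteering, or donation effort": "rescue volunteering or donation effort",
    "other relevant information": "not humanitarian",
    "not relevant": "not humanitarian",
}

# Labels whose images are removed instead of being mapped.
DROPPED = {"missing or found people"}


def consolidate(records):
    """Map (image_id, original_label) pairs to the four consolidated labels.

    Images carrying a dropped label are skipped.
    """
    mapped = []
    for image_id, label in records:
        if label in DROPPED:
            continue
        mapped.append((image_id, HUMANITARIAN_MAP[label]))
    return mapped
```

For example, an image labeled "vehicle damage" ends up in "infrastructure and utility damage", while an image labeled "missing or found people" is not returned at all.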

AIDR Disaster Type Dataset (AIDR-DT)
The AIDR-DT dataset consists of tweets collected from 17 disaster events and 3 general collections. The tweets of these collections have been collected by the AIDR system [32]. The 17 disaster events include flood, earthquake, fire, hurricane, terrorist attack, and armed-conflict. The tweets in general collections contain keywords related to natural disasters, human-induced disasters, and security incidents. Images are crawled from these collections for disaster type annotation. The labeling of these images was performed in two steps. First, a set of images were labeled as earthquake, fire, flood, hurricane, and none of these categories. Then, a sample of ∼2,200 images labeled as none of these categories in the previous step was selected for annotating the not disaster and other disaster categories.
For the landslide category, images are crawled from Google, Bing, and Flickr using the keywords landslide, mudslide, "mud slides", landslip, "rock slides", rockfall, "land slide", earthslip, rockslide, and "land collapse". As the images have been collected from different sources, the collection contained duplicates. Duplicate filtering has been applied to remove exact and near-duplicate images to resolve this issue. Then, the remaining images were manually labeled as landslide and not landslide. The resulting annotated dataset consists of labeled images with the seven categories defined in Section 3.1.1.

Damage Multimodal Dataset (DMD)
The multimodal damage identification dataset consists of 5,878 images collected from Instagram and Google [47]. The authors of the study crawled the images using more than 100 hashtags, which are proposed in the crisis lexicon [53]. The manually labeled data consist of six damage class labels: fires, floods, natural landscape, infrastructural, human, and non-damage. The non-damage class includes cartoons, advertisements, and images that are not relevant or useful for humanitarian tasks. The study by Alam et al. [9] re-labeled the images for all four tasks (disaster type, informativeness, humanitarian, and damage severity) using the same class labels discussed in the previous section.

Data Analysis
To understand different aspects of the dataset, we analyze the distribution of images shared during different events, images shared by different types of users (e.g., verified vs. unverified), and other characteristics. The dataset comprises images collected from different sources such as Google, Bing, Yahoo, and Twitter. Since only the images collected from Twitter contain social media information, we analyzed only those images that have Twitter's JSON objects (∼27K images). In Table 1, we report statistics of the collected tweets and images for different events. It appears that people share images in only 1 to 5% of the posts. We investigated the effect of the images shared by verified vs. unverified users. In Figure 3, we show two example images, one from a verified user (a) and another from an unverified user (b). We notice that images shared by verified users get more retweets than those shared by unverified users. For example, the image in Figure 3a has been retweeted 4,268 times and liked 11.7K times, whereas the image in Figure 3b has not been retweeted even though it shows similar severe infrastructure damage. Among the ∼27K images, there are 5,527 images from verified users and 22,207 images from unverified users. The users who shared a higher number of images are mostly news agencies. For example, in the annotated ∼27K images, we found that California Top News shared 49 images during the 2017 California wildfires, and 30 of them are about "infrastructure and utility damage" or "rescue volunteering or donation effort".
We also analyzed image sharing behavior during disaster events, where we considered all collected images with or without labels. We observed that more images are posted during the early days of a disaster and that the number gradually decreases, as illustrated in Figure 4.

Data Split
Before consolidating the datasets, each dataset has been divided into training (train), development (dev), and test sets with a 70:10:20 ratio, respectively. The purpose was threefold: (i) train and evaluate individual datasets on each task, (ii) have a close-to-equal distribution from each dataset in the final consolidated dataset, and (iii) provide the research community an opportunity to use the splits independently. After the data split, duplicate images are identified across sets and moved into the training set to create a non-overlapping test set.
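The split-then-deduplicate procedure can be illustrated with a small sketch. This is an illustration under stated assumptions, not the procedure from [9]: the function names are hypothetical, the shuffle is a plain random permutation, and `fingerprint` stands in for whatever exact or near-duplicate signature is used (the actual duplicate identification process is described in [9]).

```python
import random


def split_70_10_20(items, seed=0):
    """Shuffle items and cut them into train/dev/test with a 70:10:20 ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 7 // 10
    n_dev = len(items) // 10
    return {
        "train": items[:n_train],
        "dev": items[n_train:n_train + n_dev],
        "test": items[n_train + n_dev:],
    }


def move_duplicates_to_train(splits, fingerprint):
    """Move dev/test items whose fingerprint occurs in more than one split to train.

    `fingerprint` is a hypothetical callable returning a duplicate signature
    (for example, an image hash) for an item.
    """
    seen_in = {}
    for name, items in splits.items():
        for item in items:
            seen_in.setdefault(fingerprint(item), set()).add(name)
    shared = {fp for fp, names in seen_in.items() if len(names) > 1}

    cleaned = {"train": list(splits["train"]), "dev": [], "test": []}
    for name in ("dev", "test"):
        for item in splits[name]:
            target = "train" if fingerprint(item) in shared else name
            cleaned[target].append(item)
    return cleaned
```

One consequence of this design is that the train set can grow beyond 70% while the dev and test sets shrink, which keeps evaluation data free of images the model has seen.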

Data Consolidation
The primary motivation to perform data consolidation is to develop robust deep learning models with large amounts of data. For this purpose, all train, dev, and test sets are merged into the consolidated train, dev, and test sets, respectively. While doing so, duplicate images from the dev and test sets are moved into the train set to create non-overlapping splits. More detail about the duplicate identification process can be found in [9].

Data Statistics
Tables 2, 3, 4, 5, and 6 show the label distribution of all datasets for all four tasks. Some class labels are skewed in individual datasets. For example, in the disaster type datasets (Table 2), the distribution of the "other disaster" label is low in the AIDR-DT dataset, whereas the distribution of the "landslide" label is low in the DMD dataset. For the informativeness task, low distribution is observed for the "informative" label. Moreover, for the humanitarian task, we have low distribution for the "rescue volunteering or donation effort" label in the DMD dataset, and for the damage severity task, for the "mild" label in the CrisisMMD and DMD datasets. However, the consolidated dataset creates a fair balance across class labels for different tasks, as shown in Table 6.

Experiments
Our experiments consist of (i) individual vs. consolidated datasets comparison (RQ1), (ii) neural network architectures comparison (RQ2) on the consolidated dataset, (iii) data augmentation (RQ3), (iv) semi-supervised learning (RQ3), and (v) multitask learning (RQ4). Below we first provide the experimental settings and then discuss the different experiments that we conducted for this study.
We use the weights of the networks pre-trained using ImageNet [19] to initialize our model. We adapt the last layer (i.e., softmax layer) of the network according to the particular classification task at hand instead of the original 1,000-way classification. The transfer learning approach allows us to transfer the features and the parameters of the network from the broad domain (i.e., large-scale image classification) to the specific one. Specifically, we designed a binary classifier for the informativeness task and multi-class classifiers for the remaining three tasks. We train the models using the Adam optimizer [36] with an initial learning rate of $10^{-5}$, which is decreased by a factor of 10 when accuracy on the dev set stops improving for 10 epochs. The models were trained for 150 epochs. To measure the performance of each classifier, we use weighted average precision (P), recall (R), and F1-score (F1).
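As a reference for the evaluation measures, weighted averaging computes precision, recall, and F1 per class and then averages them with weights proportional to each class's share of the true labels. The sketch below is a plain-Python illustration of that definition, not the evaluation code used in the experiments; the function name `weighted_prf` is hypothetical, and classes that are never predicted are assumed to contribute a precision of 0.

```python
from collections import Counter


def weighted_prf(y_true, y_pred):
    """Return support-weighted average precision, recall, and F1."""
    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    p_avg = r_avg = f_avg = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0  # assumption: 0 if never predicted
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weight = n_true / total
        p_avg += weight * precision
        r_avg += weight * recall
        f_avg += weight * f1
    return p_avg, r_avg, f_avg
```

Note that with this weighting, the weighted recall equals the overall accuracy, so differences between models are easier to read from the precision and F1 columns on skewed label distributions.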

Dataset Comparison
To determine whether consolidated data helps in achieving better performance, we train the models using training sets from the individual and consolidated datasets. However, we always test the models on the consolidated test set. As our test data is the same across different experiments, this ensures that results are comparable. Since we have four different tasks, consisting of fifteen different datasets, we only experimented with the ResNet18 [23] network architecture to manage the computational load.

Network Architectures
Currently available neural network architectures come with different computational complexity. As one of our goals is to deploy the models in real-time applications, we compare several of them to understand their performance differences. Another motivation is that current literature in crisis informatics only reports results using one or two network architectures (e.g., VGG16 in [52], InceptionNet in [47]), which may lead to sub-optimal outcomes.

Data Augmentation
Data augmentation is a commonly used technique to improve the generalization of deep neural networks in the absence of large-scale datasets. We experiment with the recently proposed RandAugment [16] method for image augmentation. In the literature, RandAugment was proposed as a fast alternative to learned augmentation strategies. We used the PyTorch implementation in our experiments. To increase the diversity of generated examples, we used 16 different transformations. RandAugment has two hyperparameters: N, the number of augmentation transformations to apply sequentially, and M, the magnitude for all the transformations.
The magnitude of each transformation resides on an integer scale from 0 to 30, with 30 being the maximum strength. In our experiments, we use a constant magnitude M for all augmentations. The augmentation method then boils down to randomly selecting N transformations and applying each transformation sequentially with the strength corresponding to M.
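The selection logic can be sketched in a few lines. The example below is a toy illustration only: it operates on a flat list of pixel intensities in [0, 1] rather than real images, and the three operations (`brightness`, `contrast`, `solarize`) with their magnitude mappings are hypothetical stand-ins for the 16 transformations used in the experiments. Sampling with replacement is also an assumption of this sketch.

```python
import random

MAX_MAGNITUDE = 30  # integer magnitude scale described in the text


def brightness(pixels, m):
    """Toy op: shift intensities upward in proportion to the magnitude."""
    shift = 0.5 * m / MAX_MAGNITUDE
    return [min(1.0, p + shift) for p in pixels]


def contrast(pixels, m):
    """Toy op: stretch intensities around their mean."""
    factor = 1.0 + m / MAX_MAGNITUDE
    mean = sum(pixels) / len(pixels)
    return [min(1.0, max(0.0, mean + factor * (p - mean))) for p in pixels]


def solarize(pixels, m):
    """Toy op: invert intensities at or above a magnitude-dependent threshold."""
    threshold = 1.0 - m / MAX_MAGNITUDE
    return [1.0 - p if p >= threshold else p for p in pixels]


TOY_OPS = [brightness, contrast, solarize]  # hypothetical; the paper uses 16 ops


def rand_augment(pixels, n, m, rng=random):
    """Pick n ops uniformly at random and apply them in sequence at magnitude m."""
    for op in rng.choices(TOY_OPS, k=n):
        pixels = op(pixels, m)
    return pixels
```

A call such as `rand_augment(image, n=5, m=12)` then yields a different augmented view each time it is invoked, while the search space stays limited to the two integers N and M.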
In addition, we used weight decay, which is one of the most commonly used techniques for regularizing parametric machine learning models [46]. This helps to reduce overfitting of the models and to avoid exploding gradients.
We have conducted the data augmentation experiments using all ten neural network architectures. We used a weight decay of $10^{-3}$, and the other hyperparameters remain the same as discussed in Section 4.1.

Semi-supervised Learning
State-of-the-art image classification models are often trained with a large amount of labeled data, which is prohibitively expensive to collect in many applications. Semi-supervised learning is a powerful approach to mitigate this issue and leverage unlabeled data to improve the performance of machine learning models. Since unlabeled data can be obtained without significant human labor, the performance boost gained from semi-supervised learning comes at low cost and can be scaled easily. In the literature, many semi-supervised techniques have been proposed focusing on deep learning [75,66,11,12,41,42,45,60,70,71,74,6]. Among them, the self-training approach is one of the earliest [61], and it has been adopted for deep neural networks. The self-training approach, also called pseudo-labeling [42], uses the model's prediction as a label and retrains the model against it.
For this study, we use Noisy student training (i.e., a simple self-training approach), which was proposed in [75] as a semi-supervised learning approach to improve the accuracy and robustness of state-of-the-art image classification models. The algorithm consists of three main steps:
Step 1: Train a teacher model on labeled images.
Step 2: Use the teacher model to generate pseudo labels on unlabeled images.
Step 3: Train a student model on combined labeled and pseudo-labeled images.
The algorithm can be iterated multiple times by treating the student as the new teacher and labeling the unlabeled images with this model. During the learning phase of the student, different noises can be injected, such as dropout [67] and data augmentation via RandAugment [16]. The student model is made larger than or equal to the teacher. The presence of noise and larger model capacity help the student model generalize better than the teacher.
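The three steps and the optional iteration can be summarized as a short loop. This is a schematic sketch under stated assumptions rather than the training code: `noisy_student`, `train_fn`, and `predict_fn` are hypothetical names, the teacher is assumed to be trained without injected noise, and the toy nearest-centroid classifier exists only to make the example runnable (it ignores the `noisy` flag, where a real student would enable dropout and RandAugment).

```python
def noisy_student(labeled, unlabeled, train_fn, predict_fn, iterations=1):
    """Schematic self-training loop.

    labeled: list of (x, y) pairs; unlabeled: list of x.
    train_fn(examples, noisy) -> model; predict_fn(model, x) -> label.
    """
    teacher = train_fn(labeled, noisy=False)                       # Step 1
    for _ in range(iterations):
        pseudo = [(x, predict_fn(teacher, x)) for x in unlabeled]  # Step 2
        student = train_fn(labeled + pseudo, noisy=True)           # Step 3
        teacher = student  # the student becomes the next teacher
    return teacher


def train_centroids(examples, noisy):
    """Toy 1-D classifier: store the mean of x for each label."""
    sums, counts = {}, {}
    for x, y in examples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_centroid(model, x):
    """Toy prediction: label of the closest centroid."""
    return min(model, key=lambda y: abs(model[y] - x))


model = noisy_student(
    labeled=[(0.0, 0), (1.0, 1)],
    unlabeled=[0.1, 0.2, 0.9],
    train_fn=train_centroids,
    predict_fn=predict_centroid,
)
```

Passing the training and prediction routines as callables keeps the loop independent of the model family, so the same skeleton applies whether the student has the same capacity as the teacher or a larger one.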
Labeled dataset: As for the labeled dataset, we used our consolidated datasets and ran the experiments for all tasks.
Unlabeled dataset: To obtain unlabeled images, we crawled images from the tweets of 20 different disaster collections (as mentioned in Section 3.2.3). We removed duplicates and ensured that the same images are not in our labeled dataset by matching their ids and applying duplicate filtering. The resulting unlabeled dataset consists of 1,514,497 images.
Architecture: We ran our experiments using the EfficientNet (b1) architecture as it performed better than the other models. In addition, it is one of the models used in the Noisy student experiments reported in [75]. One significant difference between [75] and our work is that we initialize our student model's weights with ImageNet pre-trained weights. In contrast, in [75], the weights are trained from scratch. Since our labeled dataset is significantly smaller than the ImageNet dataset, training from scratch substantially degrades performance in our experiments.

Training details:
We first trained the model using the EfficientNet (b1) architecture on the labeled dataset (Step 1), which is referred to as the teacher model. We then predicted the output for the unlabeled images (Step 2). After that, we trained the student EfficientNet (b1) model by combining labeled and pseudo-labeled images (Step 3). In this step, we filtered and balanced the unlabeled data. We selected the images whose confidence score is greater than a certain task-specific threshold. After this, we balanced the training data so that each class has the same number of images as the class having the lowest number of images. To do this, for each class, we take the images having the highest confidence scores.
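The filtering and balancing of pseudo-labeled images can be sketched as follows. This is an illustrative reading of the description above, not the authors' code: the function name `select_pseudo_labeled` and the tuple layout are hypothetical, and a class with no image above the threshold is simply absent from the result, which is an assumption of this sketch.

```python
from collections import defaultdict


def select_pseudo_labeled(predictions, threshold):
    """Keep confident pseudo labels and balance the classes.

    predictions: list of (image_id, predicted_label, confidence).
    Returns a list of (image_id, predicted_label) pairs.
    """
    by_class = defaultdict(list)
    for image_id, label, confidence in predictions:
        if confidence > threshold:  # task-specific confidence threshold
            by_class[label].append((confidence, image_id))
    if not by_class:
        return []

    # Every class is cut down to the size of the smallest surviving class.
    k = min(len(items) for items in by_class.values())
    selected = []
    for label, items in by_class.items():
        items.sort(reverse=True)  # highest confidence first
        selected.extend((image_id, label) for _, image_id in items[:k])
    return selected
```

Balancing by truncating to the smallest class discards many confident predictions for frequent classes, but it prevents the student from inheriting the class skew of the teacher's predictions on the unlabeled pool.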
For the experiments, we used a batch size of 16 for labeled images and 48 for unlabeled images. Labeled and unlabeled images are concatenated together to compute the average cross-entropy loss. We used RandAugment with the number of augmentations N = 5 and the strength of augmentation M = 12. We optimized the confidence thresholds separately for different tasks using the dev sets. The thresholds for the disaster types, informativeness, humanitarian, and damage severity tasks were 0.7, 0.8, 0.45, and 0.45, respectively. Similar to the data augmentation experiments, we used a weight decay of $10^{-3}$ and kept the other hyperparameters the same as discussed in Section 4.1.

Multi-task Learning
Since the tasks share similar properties, we also consider training the model in multitask settings with shared parameters. The benefits of multitask settings can be twofold: (i) learning a shared representation can help the model generalize better and improve performance on individual tasks, and (ii) training a single model instead of four different models will yield a significant speed-up and reduce the computational load during training and inference. It is important to mention that the Crisis Benchmark Dataset was not designed for multitask learning; rather, it was prepared for each task separately. Hence, we needed to prepare the datasets for the multitask setup. Creating multitask learning datasets from the Crisis Benchmark Dataset introduced a challenge: there is an overlap between train and test set images among different tasks. We therefore prepare the datasets for the multitask setting using the following strategy:
1. We merge the test sets from different tasks into a combined test set. If an image in the combined test set is present in the train or dev set of some task, we remove it from that split and add the label of that task in the test set.
2. We merge the dev sets of the four tasks into the combined dev set. If an image in the combined dev set is present in the train set of some task, we remove it from that train split and add the label of that task in the dev set.
3. We merge the train sets of the four tasks into the combined train set. Since we removed the images that overlap with the dev and test sets in the previous steps, no image from the train set is present in the other splits.
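The three merging steps can be illustrated with a short sketch. The data layout (one dictionary per task mapping split names to `{image_id: label}`) and the function name `merge_splits` are assumptions made for this example and do not describe the actual data files.

```python
def merge_splits(tasks):
    """Merge per-task splits into combined, mutually disjoint splits.

    tasks: {task_name: {"train": {img: label}, "dev": {...}, "test": {...}}}
    Returns {"train" | "dev" | "test": {img: {task_name: label}}}.
    """
    # Work on copies so the caller's data is left untouched.
    tasks = {t: {s: dict(items) for s, items in splits.items()}
             for t, splits in tasks.items()}
    combined = {"train": {}, "dev": {}, "test": {}}
    # Test has the highest priority, then dev, then train.
    order = ["test", "dev", "train"]
    for rank, split in enumerate(order):
        # Merge this split across all tasks.
        for task, splits in tasks.items():
            for img, label in splits[split].items():
                combined[split].setdefault(img, {})[task] = label
        # Pull overlapping images out of the lower-priority splits,
        # carrying their task labels over to the current split.
        for img in combined[split]:
            for task, splits in tasks.items():
                for lower in order[rank + 1:]:
                    if img in splits[lower]:
                        combined[split][img][task] = splits[lower].pop(img)
    return combined
```

Because images are moved towards the test and dev sets, the combined train set shrinks while the combined dev and test sets grow.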
Since not all images have annotations for all four tasks, there is a discrepancy in the number of images available for different tasks. We report the distribution of the data splits for the multi-task setting in Table 7. Overall, there are 49,353 images in the train set, 6,157 images in the dev set, and 15,688 images in the test set. Due to the overlap of images in different splits for different tasks, there is also a discrepancy between the number of images available in the multi-task and single-task settings. As an example, for the disaster types task, there are 12,846 images in the train set, 1,470 images in the dev set, and 3,195 images in the test set in the single-task setting. In the multi-task setting, these numbers are 10,996, 1,797, and 4,718, respectively. As a consequence of our merging procedure, there are more images in the test and dev sets and fewer images in the train set. A few approaches have been proposed in the literature to address the issue of incomplete/missing labels in multi-task settings. They usually work by generating the missing task labels using different methods, including Bayesian networks [35], a rule-based approach [37], and knowledge distillation from another model [18]. In our experiments, we opt for a simpler alternative. Specifically, we do not compute the loss for a task if its label is missing. Since the tasks have varying numbers of training images, we calculate the loss for each task and aggregate the task losses in a batch. This ensures that the loss of each task is weighted equally. The steps are detailed in Algorithm 1.
We also experiment with images having complete aligned labels for different tasks. We identified three such combinations that have a substantial number of images in different classes. Two of them are two-task subsets. The first one is informativeness and humanitarian, which has 7,960 aligned images in total. The second one is informativeness and damage severity, which has 25,830 images in total. The data distribution for these two settings is reported in Table 8. The final subset consists of images having labels for all four tasks and contains 5,558 images. The data distribution for this set is reported in Table 9.
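A minimal sketch of the masked loss is given below. It is not a transcription of Algorithm 1: it assumes that the per-task loss is the mean cross-entropy over the images in the batch that carry a label for that task, and that the task losses are summed. The function name and the use of `None` for a missing label are illustrative choices.

```python
import math


def masked_multitask_loss(batch_probs, batch_labels):
    """Sum over tasks of the mean cross-entropy on the labeled images.

    batch_probs: list of {task: [class probabilities]}, one entry per image.
    batch_labels: list of {task: class_index or None}; None marks a missing label.
    """
    per_task = {}
    for probs, labels in zip(batch_probs, batch_labels):
        for task, label in labels.items():
            if label is None:
                continue  # masked: a missing label contributes no loss
            per_task.setdefault(task, []).append(-math.log(probs[task][label]))
    # Averaging within each task first gives every task the same weight,
    # regardless of how many labeled images it has in the batch.
    return sum(sum(losses) / len(losses) for losses in per_task.values())
```

A task with no labeled image in a batch contributes nothing to that batch's loss under this sketch.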

Dataset Comparison
In Table 10, we report classification results for different tasks and different datasets using the ResNet18 network architecture. The performances on different tasks are not directly comparable as the tasks have different levels of complexity (e.g., varying number of class labels, class imbalance, etc.). For example, informativeness classification is a binary task, which is computationally simpler than a classification task with more labels (e.g., seven labels in disaster type). Hence, the performance is comparatively higher for informativeness. An example of a class imbalance issue can be seen in Table 6 for the damage severity task. The mild class has relatively few images, which is reflected in its own performance and in the overall performance. The mild class label is also less distinctive than the other class labels, and we noticed that classifiers often confuse it with the other two class labels. Similar findings have also been reported in [49]. For the disaster type task, the performance of the AIDR-DT model is higher than that of the DMD model. We observe that the DMD dataset is comparatively small, and the model does not perform well on the consolidated dataset. This characteristic is observed in other tasks as well. For the damage severity task, the model trained on CrisisMMD performs worse, which is consistent with its dataset size, i.e., 2,493 images in the training set, as shown in Table 5. As expected, for all tasks, the models trained on the consolidated datasets outperform those trained on individual datasets.

Fig. 5: Average F1 scores from all four tasks with different network architectures, which shows that on average EfficientNet (b1) performs better than other architectures.

Network Architectures Comparison
In Table 11, we report results using different network architectures on consolidated datasets for different tasks, i.e., trained and tested using a consolidated dataset. Across different tasks, EfficientNet (b1) performs better than the other models, as shown in Figure 5, except for the humanitarian task, for which VGG16 outperforms the other models. The second-best models are VGG16, ResNet50, ResNet101, and DenseNet (121). From the results of different tasks, we observe that InceptionNet (v3) is the worst-performing model.
The performance differences among models such as EfficientNet (b1), VGG16, ResNet50, ResNet101, and DenseNet (121) are small; hence, we ran statistical tests to understand whether such small differences are significant. We used McNemar's test for the binary classification task (i.e., informativeness) and Bowker's test for the multiclass classification tasks. More details of these tests can be found in [24]. We ran the tests between pairs of models to see the pair-wise differences. In Figures 6 and 7, we report the results of the significance tests. The value in each cell is the P-value, and light yellow cells indicate statistically significant differences (P < 0.05). From Figure 6, we see that for the disaster type task the P-value is higher than 0.05 for the comparisons between EfficientNet (b1) and ResNet50, ResNet101, and DenseNet (121), which is consistent with the small differences reported in Table 11. Similarly, the difference is very low between EfficientNet (b1) vs. VGG16 and DenseNet (121). For the humanitarian and damage severity tasks, we observed similar behaviors. By analyzing all four tasks, it appears that VGG16 is the second-best performing model.
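For the binary case, a pair-wise comparison can be sketched as below. The text does not state which variant of McNemar's test was used; this sketch assumes the chi-square approximation with continuity correction, and the function name `mcnemar_test` is illustrative. Bowker's test for the multiclass tasks is not covered here.

```python
import math


def mcnemar_test(y_true, pred_a, pred_b):
    """McNemar's test (chi-square form with continuity correction).

    Compares two classifiers evaluated on the same test images.
    Returns (statistic, p_value).
    """
    # b: model A is correct and model B is wrong; c: the opposite case.
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    if b + c == 0:
        return 0.0, 1.0  # the two models never disagree
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square variable with one degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value
```

Only the images on which the two models disagree enter the statistic, which is why two models with close accuracies can still differ significantly, or not, depending on how their errors overlap.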
In Table 12, we also report different neural network models with their number of layers, parameters, and memory consumption during inference for the informativeness task. There is usually a trade-off between the performance and the computational complexity of different deep neural networks. In terms of memory consumption and the number of parameters, VGG16 is more expensive than the others. Based on the performance and computational complexity, we conclude that EfficientNet can be the best option for real-time applications. We computed the throughput for EfficientNet using a batch size of 128, and it can process ∼260 images per second on an NVIDIA Tesla P100 GPU. ResNet18 is a reasonable choice among the ResNet models, given that its computational complexity is significantly lower than that of the other ResNet models.

Data Augmentation
To reduce overfitting and obtain more generalized models, we used data augmentation and weight decay. In Table 13, we report the results for all tasks using all network architectures. The column Diff. reports the difference from the results presented in Table 11, where no RandAugment or weight decay was applied. The improved results are highlighted in light blue for all tasks. Out of 40 experiments (10 network architectures × 4 tasks), augmentation with weight decay improved the performance in 26 cases. For the improved cases, we also computed a statistical significance test between the models without RandAugment and the models with RandAugment and weight decay. We found that the improvements for the models with InceptionNet (v3) are statistically significant in all tasks. With EfficientNet (b1), only the improvement for the damage severity task is statistically significant; for the other tasks, the improvements are not. We also investigated the training and validation losses over the number of epochs. In Figures 8 and 9, we report training and validation losses and accuracies of the EfficientNet (b1) model for the Informativeness and Humanitarian tasks, respectively. From Figures 8a and 9a, we clearly see that the models are overfitting, whereas Figures 8b and 9b show that the models are more generalized. These findings demonstrate the benefits of augmentation and weight decay.

Semi-supervised Learning
In Table 14, we present the results of the Noisy student-based self-training approach alongside the results without and with RandAugment. We obtain an ∼1% improvement for the Informativeness task. For the Humanitarian task, the performance is similar to RandAugment. For the Damage severity task, the performance of Noisy student is the same as without RandAugment but lower than with RandAugment.
We postulate the following possible reasons for the lack of improvements in the semi-supervised learning experiments: 1. Semi-supervised learning usually performs better when trained from scratch instead of fine-tuning from a pretrained model. This phenomenon is explored in [77], where the authors report that the performance gains from semi-supervised learning methods are usually smaller when trained from a [...]

[...] problem, which we addressed using masking, i.e., for an unlabeled output, we do not compute the loss for that particular task. In Table 15, we report the results of multitask learning with missing labels, where we address all tasks. We also investigated different task combinations where all labels are present. In Table 16, we report the results of different task combinations that have complete aligned labels. For different task combinations, performances differ due to their data sizes, label distributions, and task settings. The results with multitask learning are not directly comparable with our single-task setup. However, they can serve as a baseline for future studies.

Visual Explanation using Grad-CAM
We explore how the neural networks arrive at their decisions by utilizing Gradient-weighted Class Activation Mapping (Grad-CAM) [62]. Grad-CAM uses the gradient of a target class flowing into the final convolution layer to produce a localization map highlighting the important regions in the image for that specific class. We report results for two candidate networks, i.e., VGG16 and EfficientNet, on two tasks, i.e., informativeness and disaster type. We use the models trained using RandAugment for this experiment.
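The core Grad-CAM computation can be sketched in a few lines, assuming that the feature maps of the final convolution layer and the gradients of the class score with respect to them have already been extracted from the network. The nested-list input format and the final rescaling to [0, 1] (a common visualization convenience) are choices made for this illustration.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM localization map.

    activations, gradients: nested lists shaped [channels][height][width] holding
    the final convolution feature maps and d(class score)/d(feature map).
    Returns a [height][width] map rescaled to [0, 1] for display.
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weights: global average of the gradients over spatial positions.
    weights = [sum(map(sum, grad)) / (height * width) for grad in gradients]
    cam = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * fmap[i][j]
    # ReLU keeps only the regions that push the class score up.
    cam = [[max(value, 0.0) for value in row] for row in cam]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam
```

The resulting map has the spatial resolution of the final convolution layer and is typically upsampled to the input image size before being overlaid as a heatmap.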
In Figure 10, we show the activation map for the predicted class for some images from the informativeness test set. From these images, it is apparent that EfficientNet performs better at localizing important regions in the image for the class of interest. VGG16 tends to depend on smaller regions for decision-making. The last row shows an image where VGG16 misclassified an informative image as not informative.
We show the activation map for some images from the test set of the disaster type task in Figure 11. Here, the difference in localization quality between the two models is even more pronounced. The activation maps from VGG16 are difficult to interpret in the first and third images, even though the model classifies them correctly. The second image shows that VGG16 may focus on the smoke regions for classifying fire images. This explains why it identifies the last image as fire, misclassifying the clouds as smoke.
Overall, these results suggest that EfficientNet not only outperforms other models in the numeric measures but also produces activation maps that are easier to interpret.

Fig. 10: Grad-CAM visualization of some images for the informativeness task.
6 Discussion and Future Work

Our Findings
Real-time event detection from social media content is an important problem. Our proposed pipeline and models are suitable for deployment in real-time applications. The proposed models can also be used independently. For example, the disaster type model can be used to monitor real-time disaster events.

Fig. 11: Grad-CAM visualization of some images for the disaster type task.

Our experiments were based on the research questions discussed in Section 1. Below, we report our findings for each of them.
RQ1: Our dataset comparison suggests that data consolidation helps, which answers our first research question.
RQ2: We also explored several deep learning models, which vary in performance and complexity. Among them, EfficientNet (b1) appears to be a reasonable option. Note that EfficientNet has a series of network architectures (b0-b7), and for this study we only reported results with EfficientNet (b1). We aim to further explore the other architectures. A small and low-latency model is desirable for deployment in mobile and handheld embedded computer vision applications. The development of MobileNet [25] sheds light in that direction. Our experimental results suggest that it is computationally simpler and provides a reasonable accuracy, only 2-3% lower than the best models for different tasks. These findings answer our second research question.
RQ3: We observe that strong data augmentation can improve performance, although this is not consistent across different tasks and models. Semi-supervised learning does not usually yield performance gains when trained using pretrained models and can sometimes even degrade performance.
RQ4: Multi-task learning can be an ideal solution for a real-time system as it can potentially provide speed-ups of multiple factors during inference. However, some tasks may perform worse than in their single-task settings in the presence of incomplete labels. Having aligned complete labels for different tasks can mitigate this issue.

Comparison with the State of the Art
We compared our results with recent and related state-of-the-art results, reported in Table 17. However, it is not possible to have an end-to-end comparison for a few reasons: (i) different datasets and sizes (see the second and third columns in Table 17), (ii) different data splits (train/dev/test vs. Cross Validation (CV) fold) even when using the same dataset (see the Data Split column in the same table), and (iii) different evaluation measures such as weighted P/R/F1-measure (first two rows) [52] vs. accuracy (third row) [47] vs. CV fold (fourth to sixth rows; it is unspecified in [2] whether the measures are macro, micro, or weighted). Even if they are not exactly comparable, we observe that on the informativeness and humanitarian tasks, previously reported results (weighted F1) are 0.832 and 0.763, respectively, using the CrisisMMD dataset [52]. The authors in [47] reported a test accuracy of 0.840 ± 0.0172 for a six-class disaster type task using the DMD dataset with a five-fold cross-validation run. The study in [2] reports an F1 of 0.820 for informativeness, 0.920 for infrastructure damage, and 0.940 for damage severity. In another study, using the CrisisMMD dataset, the authors report weighted F1 of 0.812 and 0.870 for the informativeness and humanitarian tasks, respectively [1]. They used a small subset of the whole CrisisMMD dataset in their study. From Table 17, we observe that the F1 for the informativeness task ranges from 0.812 to 0.832 across studies, for the humanitarian task it varies from 0.763 to 0.870, and for damage severity it varies from 0.661 to 0.940. Compared to them, our best results (weighted F1) for disaster types, informativeness, humanitarian, and damage severity are 0.835, 0.876, 0.784, and 0.765, respectively, on the consolidated single-task dataset.
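To illustrate why the averaging scheme matters when comparing reported F1 scores, the sketch below computes macro and weighted F1 from the same predictions. The function name `f1_scores` and the toy labels in the usage note are made up for this example and are not taken from any of the cited studies.

```python
def f1_scores(y_true, y_pred):
    """Return (macro_f1, weighted_f1) over the classes present in y_true."""
    classes = sorted(set(y_true))
    per_class = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        per_class[c] = 2 * tp / denom if denom else 0.0
    # Macro: every class counts equally.
    macro = sum(per_class.values()) / len(classes)
    # Weighted: every class counts in proportion to its number of true instances.
    weighted = sum(per_class[c] * y_true.count(c) for c in classes) / len(y_true)
    return macro, weighted
```

On an imbalanced toy set with eight images of one class and two of another, where one minority image is misclassified, the weighted score is noticeably higher than the macro score, so numbers reported under different schemes should not be compared directly.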

Future Work
As for future work, we foresee several interesting research avenues. (i) Further exploration of semi-supervised learning to leverage a large amount of unlabeled social media data and address the limitations highlighted in Section 5.4. We believe addressing such limitations can help to advance the state of the art. (ii) In the multitask setup, one possible research direction is to address the problem of incomplete/missing labels, and another is to manually label the Crisis Benchmark Dataset to complete the missing labels for all tasks. Both approaches will give the community grounds to explore multitask learning for real-time social media image classification.

Applications
There are many application scenarios for the proposed models; in this section, we discuss the ones that are highly relevant for crisis responders in humanitarian organizations.
Information for Situational Awareness: The information posted on social media during natural or human-induced disasters varies greatly. Studies have revealed that a large proportion of social media data consists of irrelevant information that is not useful for any kind of relief operation. For the decision-making process, humanitarian organizations are interested in having concise information about the ongoing situation to be aware of the event. The proposed models can help in filtering out irrelevant content and providing a concrete summary.
Actionable Information: Depending on their roles and mandate, humanitarian organizations differ in terms of their information needs. Several rapid response and relief agencies look for fine-grained information about specific incidents, which is also actionable. Such information types include reports of injured or dead people, critical infrastructure damage (e.g., a collapsed bridge), and rescue demands, among others. Our study covered coarse (i.e., binary) to fine-grained labels while also addressing four different but related tasks. Applications can be developed on top of our models to serve critical humanitarian information needs in crisis situations.
Real-time Crisis Event Detection: The proposed models (e.g., the disaster type model) can be deployed in real time to continuously monitor social media and detect emergent events (e.g., fire, flood) around the world.

Conclusions
The imagery and textual content available on social media have been used by humanitarian organizations in times of disaster events. There has been limited work on disaster response image classification tasks compared to text. In this study, we posed four research questions and performed extensive experiments on four tasks, namely disaster type, informativeness, humanitarian, and damage severity, to answer those questions. Our experimental results on individual and consolidated datasets suggest that data consolidation helps. We investigated the four tasks using various state-of-the-art neural network architectures and reported the best-performing models. The findings on data augmentation suggest that a more generalized model can be obtained with such approaches. Our investigation of semi-supervised and multitask learning suggests new research directions for the community. We also provide some insights from activation maps to demonstrate what class-specific information is learned by the network.

Fig. 1: Disaster image classification pipeline that demonstrates a real use case: landslide image classification. (a) Disaster image classification pipeline. (b) Event detection use case showing landslide images.

Fig. 3: Images shared by verified vs. unverified users. Both images show severe infrastructure damage. (a) Image shared by a verified user. (b) Image shared by an unverified user.

Fig. 6: Statistical significance tests among the different network architectures for the Disaster type and Informativeness tasks. P-values are presented in cells. Light yellow cells indicate statistical significance with P < 0.05.

Fig. 7: Statistical significance tests among the different network architectures for the Humanitarian and Damage severity tasks. P-values are presented in cells. Light yellow cells indicate statistical significance with P < 0.05.

Fig. 8: Training/validation losses and accuracies without (a) and with (b) RandAugment and weight decay for the Informativeness task.

Fig. 9: Training/validation losses and accuracies without (a) and with (b) RandAugment and weight decay for the Humanitarian task.

Table 1: Number of tweets and images collected during different disaster events (columns: # tweets, # images, % of images, start date, end date).

Table 2: Data split for the disaster type task.

Table 3: Data split for the informativeness task.

Table 4: Data split for the humanitarian task.

Table 5: Data split for the damage severity task.

Table 6: Data splits for the consolidated dataset for all tasks.

Table 7: Data split for the multi-task setting with incomplete/missing labels. DT: Disaster types, Info: Informative, Hum: Humanitarian, DS: Damage Severity.

Table 8: Data split for the multi-task setting with complete aligned labels for the different combinations of two tasks.

Table 9: Data split for the multi-task setting with complete aligned labels for four tasks: DT, Info, Hum, and DS.

Table 11: Results using different neural network models on the consolidated dataset with four different tasks. Trained and tested using the consolidated dataset. Comparable results are shown in bold and best results are underlined. IncepNet: InceptionNet, MobNet: MobileNet, EffiNet: EfficientNet.

Table 12: Different neural network models with number of layers, parameters, and memory requirement during the inference of a binary (Informativeness) classification task.

Table 13: Results with data augmentation and weight decay using different neural network models on the consolidated dataset for all four tasks. Diff. represents the difference from the results without RandAugment presented in Table 11. * represents statistically significant (with P < 0.05) compared to the results without RandAugment.

Table 16: Results of multitask learning with different task combinations and complete labels. DT: Disaster Type, Info: Informative, Hum: Humanitarian, DS: Damage Severity.

Table 17: Recent relevant results reported in the literature. # C: Number of class labels, Cls: Classification task, B: Binary, M: Multiclass, Incep: InceptionNet (v4), Info: Informativeness, Hum: Humanitarian, Event: Disaster event types, Infra.: Infrastructural damage, Severity: Severity Assessment. We converted some numbers from percentage (as reported in the literature) to decimal for an easier comparison.