Multimodal Information Fusion for Weather Systems and Clouds Identification From Satellite Images

Observing clouds to understand the weather is one of the important means by which people forecast weather. Deep learning has made some progress in weather forecasting, especially in the automatic understanding of disastrous weather from satellite images, which can be treated as an image classification problem. Publicly available satellite image benchmark databases try to link weather directly with satellite images. However, the image modality alone is far from enough to correctly identify weather systems and clouds. Thus, we integrate images with meteorological elements, labeling five kinds of elements: season, month, date stamp, geographic longitude, and geographic latitude. To effectively use these modalities for cloud and weather system identification through satellite image classification tasks, we propose a new satellite image classification framework: the multimodal auxiliary network (MANET). MANET consists of three parts: an image feature extraction module based on a convolutional neural network, a meteorological information feature extraction module based on a perceptron, and layer-level multimodal fusion. MANET successfully integrates the multimodal information, including meteorological elements and satellite images. The experimental results show that MANET achieves better classification results for weather systems, clouds, and land cover based on satellite images.


I. INTRODUCTION
About 75% of global economic losses are due to disastrous weather, and more than 10 000 people die every year due to severe weather [1]. Disastrous weather, including tropical cyclones [2], [3], [4], severe convection [5], [6], [7], and sandstorms [8], [9], seriously threatens people's lives and property. Monitoring the formation and development of disastrous weather is the basis for weather forecasting. Clouds play an important role in weather systems, since cloud type, cloud phase, and cloud height [10] profoundly affect the generation and development of weather systems. Remote sensing (RS) images are one of the most powerful tools for monitoring clouds and weather systems.
As one kind of RS image that provides top-down observations of cloud cover and the Earth's surface, satellite images can be used to understand different weather conditions, evaluate their strength and future development trends, and provide an all-weather basis for weather forecasts and disastrous weather predictions. This article monitors clouds and weather systems, such as tropical cyclones, extratropical cyclones [11], [12], [13], and other possibly disastrous weather [14], through satellite image classification tasks. Classification tasks can be categorized in several ways. In terms of output, they can be divided into single-label classification and multilabel classification. The former aims to find the most significant label of an image, whereas the latter allows multiple correct labels to be output. Multilabel classification is more suitable for describing complex images that contain multiple objects, because multilabel classification models learn not only label information but also semantic and spatial relationships. In terms of input, data can be single-modal or multimodal. The former contains only one form of data, for example images, while the latter combines several forms. Multimodal classification [15], [16], [17], [18], [19] has recently become a hot topic. Various sensors, such as radar, infrared sensors, and cameras, collect various kinds of data, and each kind of data can be seen as a modality. Single-modal learning aims to find a mapping from data to a low-dimensional representation, while multimodal learning can further exploit the complementarity of diverse data and extract more powerful joint features. However, most existing research on RS image classification still focuses on single-modal classification of ground-based images.
Different from satellite images, ground-based images are captured by a vision sensor located on the ground, and ground-based image classification mostly focuses on the single image modality. For example, Li et al. [20] propose an SVM-based cloud image detection method that removes thick-cloud data to reduce the data volume and improve processing efficiency. However, it does not take other modal information into consideration and uses only sub-block cloud images as learning samples for the SVM classifier. Zhang et al. [21] propose a ground-based cloud image dataset consisting of 11 categories under meteorological standards, as well as CloudNet for ground-based cloud image classification. Haut  In summary, most research focuses on single-modal ground-based cloud image classification. Hence, how to understand weather systems and clouds from satellite images using multimodal information is an interesting topic. In this study, the large-scale satellite cloud image database for meteorological research (LSCIDMR) [23] is upgraded to a multimodal database named the LSCIDMR database with meteorological elements (LSCIDMRME). Different from LSCIDMR, LSCIDMRME has not only image label information but also season, month, date stamp, geographic longitude, and geographic latitude information. LSCIDMRME contains 521 950 multimodal label tags that provide a more complete description of weather information from multiple angles. At the same time, we design a network framework to fuse the feature information of the different modalities. The results of comprehensive comparison experiments show that multimodal image classification achieves better performance than single-modal image classification. The main contributions of this article can be summarized as follows.
1) We upgrade the original single-modal dataset LSCIDMR into a multimodal dataset, LSCIDMRME, which will be uploaded to IEEE DataPort to attract more researchers to deep-learning-based meteorological research.
2) LSCIDMRME has six kinds of information in total: image, season, month, date stamp, geographic longitude, and geographic latitude. The 104 390 images carry 414 211 multilabels and 40 625 unique labels. Each image has exactly one label for each of season, month, date stamp, geographic longitude, and geographic latitude, that is, five multimodal labels per image. LSCIDMRME therefore contains 104 390 × 5 = 521 950 multimodal labels.
3) A multimodal auxiliary network (MANET) for satellite image classification is proposed to fuse multimodal information. MANET consists of three parts: an image feature extraction module (IFEM) based on a convolutional neural network (CNN), a meteorological information feature extraction module (MIFEM) based on a perceptron, and layer-level multimodal fusion. Experimental results show that the proposed MANET achieves better classification performance than classification with the image modality alone.
The remainder of this article is organized as follows. Section II presents related work on single-modal image classification, multimodal image classification, and RS image classification. The multimodal database LSCIDMRME is detailed in Section III. The proposed MANET is presented in Section IV, followed by the experimental evaluation in Section V. Finally, Section VI gives the conclusion and perspectives.

II. RELATED WORK
A. Single-Modal Image Classification
Image classification is a basic task in computer vision. From the 10-class gray-scale handwritten digit recognition task on MNIST, to the 10-class CIFAR-10 and 100-class CIFAR-100 tasks, and then to the later ImageNet [24], progress in image classification has accompanied the growth of datasets. Nowadays, thanks to datasets such as ImageNet that contain more than 10 million images and more than 20 000 categories, the accuracy of image classification has been reported to surpass that of humans. Classical convolutional networks, such as LeNet, AlexNet, GoogleNet, ResNet, and EfficientNet, apply deep learning to single-modal image classification. LeNet [25] is a multilayer neural network trained with the backpropagation algorithm and marks the emergence of CNNs. AlexNet increases the depth of the network and adopts dropout, which reduces overfitting and significantly improves the accuracy of image classification. GoogleNet [26] increases the depth of the model without a corresponding increase in computational complexity. ResNet [27] introduces residual connections that make much deeper networks trainable and thereby improves classification accuracy. EfficientNet [28] systematically studies model scaling and confirms that carefully balancing network depth, width, and resolution brings better results. Based on this observation, its authors propose a new scaling method that uses a simple and efficient compound coefficient to uniformly scale all dimensions, including depth, width, and resolution. However, the models mentioned above only improve the accuracy of single-modal image classification; none of them takes advantage of the mutual enhancement between different modalities.

B. Multimodal Image Classification
Single-modal learning aims to learn a high-level representation of images, while multimodal learning attempts to extract complementary information from diverse forms of data. Different multimodal fusion classification algorithms have been designed for the classification tasks of different multimodal datasets. Camps-Valls et al. [29] use a cross-kernel function to map datasets of two modalities into the same feature space, which improves the versatility of classification. Couprie et al. [30] feed the multimodal data into the CNN as multichannel input; however, the multichannel input method may interfere with the classification process. Wang et al. [31] propose a training structure that trains the two modalities separately and feeds the results into two fully connected layers. Wang et al. [32] concatenate the activations in the joint loss function to establish the correlation between the two modalities. In summary, there are many different multimodal tasks, including classification, detection, and segmentation. In this article, we propose a classification framework that focuses on multimodal classification in meteorological research.

C. RS Image Classification
In terms of classification granularity, pixel-level classification (PLC) and image-level classification (ILC) are required by different applications in the RS field. The target of PLC is to generate a classification map of the given images. In other words, PLC is designed to find the corresponding category for every pixel in the given images. Some PLC benchmarks, such as Houston2013, Houston2018, and CWI [33], have been proposed for various purposes. For ILC tasks, labels are annotated at the image level, and ILC can be further subdivided into single-label classification and multilabel classification. The former aims to find the most significant category of a given image, while the latter allows multiple correct labels for a single image. Most existing classification datasets are single-labeled [34], [35], [36]. However, because complex images with multiple objects need to be described, some multilabeled datasets [37], [38] have also been proposed. In terms of data modality, besides single-modal image datasets, multimodal benchmarks are also available, such as the above-mentioned Houston2013 and Houston2018, and BigEarthNet-MM [39], which is the extended version of BigEarthNet [38].
CNNs are the mainstream solution in RS image classification. Furthermore, many specific methods have been developed in consideration of the nature of RS images. To balance performance and efficiency, a lightweight discriminative model [40] and LCNN-BEF [41] have been proposed. Moreover, LCNN-BEF considers the validity of both deep and shallow CNNs, and for the same reason, the best representation branch model [42] and SCCov [43] have also been brought forward. Inconsistencies in the scales of RS images motivated SEMSD-Net [44] and SF-CNN [45]. To fully utilize the spectral-wise information of hyperspectral images, HybridSN [46] and mixed-convolution [47] design 3D-2D hybrid CNNs. To narrow the gap between the amount of annotated data and raw RS data, semisupervised and unsupervised methods have been proposed, such as the GAN-based methods MARTA-GAN [48] and Attention-GAN [49], the similarity-based auxiliary training method Siamese-CNN [50], and kernel collaborative representation [51]. To take multimodal inputs, FUSION-FCN [52], deep-shallow [53], and the two-branch network [54] have been developed, and modal fusion techniques are studied in detail in [55].
The related works discussed above address single-label RS image classification. For multilabel RS image classification, the model should learn not only the most significant semantic representation but also the semantic and spatial relationships between different labels. In the RS field specifically, some approaches are transplanted directly from the general CV field, using off-the-shelf deep learning tools such as CNNs [56]. However, there are also methods designed especially for RS images. Two-branch networks [57], [58], attention mechanisms [59], [60], [61], and GCNs [62] have been introduced in the RS field to model the above-mentioned semantic and spatial relationships between objects in images.
Clouds play an important role in the formation and development of weather systems, so cloud and weather classification from images is of great significance. The work in [63] uses a traditional method to extract features from satellite cloud imagery. Modern deep learning methods are powerful tools for solving cloud and weather image classification tasks. Li et al. [64] detect and classify clouds with deep neural networks from the perspective of radiance. Besides RS images, ground-based cloud images have also been explored for cloud classification [21].
In conclusion, research on cloud and weather system classification is limited, especially research based on satellite images and deep learning. This article aims, to some extent, to fill this research gap.

III. LSCIDMRME: LSCIDMR DATABASE WITH METEOROLOGICAL ELEMENT LABEL
LSCIDMR [23] is the first publicly available large-scale cloud image database for meteorological research. This database has 104 390 images with a size of 1000 × 1000 pixels. Two forms of the database are available: the single-labeled LSCIDMR-S and the multilabeled LSCIDMR-M. Table I lists detailed information on LSCIDMR. The ratio of a specific label in LSCIDMR-S equals the number of that label divided by the total number of labels in the database without Labeless. The ratio of a specific label in LSCIDMR-M equals the number of that label divided by the total number of images in the database. Fig. 1 gives two example images for each category of LSCIDMR-S. In LSCIDMR-S, each image is annotated with a single label, which cannot convey the rich information in the image. LSCIDMR-M therefore has a total of 414 211 multiple labels that provide more information about each image. The second and third columns of Table II give five examples of the two annotation methods.
However, all the labels of LSCIDMR come only from the image information, which is not enough for recognizing clouds and weather systems. In fact, weather conditions are essentially connected with geographic location and seasonal information. Hence, labels for the following information elements are added: Season, Month, Date, Longitude, and Latitude. Adding these elements enriches LSCIDMR from image information to seasonal and geographical information. The motivations for choosing these five elements are as follows.
Season [65]: The climate passes through several stages in a year, which are generally divided into spring, summer, autumn, and winter. Fig. 2 shows the statistical analysis of typical weather systems in different seasons. From this figure, we can see tropical cyclones and extratropical cyclones in all seasons, but we observe tropical cyclones mostly in summer.
Month: Different types of catastrophic weather have different probabilities in the early, middle, and late stages of the same season. Thus, we add Month as a factor for meteorological research.
Date stamp: A date is a specific time that can be named, for example, a particular day or a particular year. The probability of the same type of disastrous weather occurring also differs among the first ten days, the middle ten days, and the last ten days of the same month. Hence, we take the Date stamp into consideration.
Longitude and latitude: The probability of different severe weather also differs across geographic locations, because different locations have different geographic characteristics, such as oceans, deserts, and vegetation. All of them have a certain impact on the formation of weather systems. Thus, geographic information, including longitude and latitude, is also added to the database.
The weather system [66] is a very complex system, and many factors must be considered. The elements mentioned above are chosen because they can be extracted directly from the Himawari-8 satellite data.

IV. PROPOSED MANET
A. Overview
The structure of MANET is shown in Fig. 3, and Algorithm 1 shows the pipeline for training MANET. Our framework contains three main modules: IFEM, MIFEM, and a multimodal fusion network (MFN). The main purpose of IFEM is to use a deep learning method to extract the image representation from images. MIFEM contains two main steps: first, it performs dimensionless processing on each meteorological element, and then it performs feature extraction on the processed data with a multilayer perceptron. MFN is a self-designed neural network that fuses the image and meteorological features extracted by IFEM and MIFEM, respectively. After further feature extraction on the fused multimodal representation, the multimodal feature is classified. These three modules are detailed in the following three sections.

B. Image Feature Extraction Module
The main task of IFEM is to use CNNs to extract high-dimensional features from images. The four kinds of layers in CNNs are the convolution layer, pooling layer, fully connected layer, and activation layer. The convolution layer is used for image feature extraction. The pooling layer compresses the input feature map. On the one hand, it makes the feature map smaller and

C. Meteorological Information Feature Extraction Module
MIFEM is used to extract the feature of meteorological information other than images. Fig. 4 is the flowchart of overall processing of this module that mainly includes two parts: data processing and meteorological feature extraction.
In the practice of machine learning, we often need to convert data of different specifications to the same specification, or to convert data from different distributions to a specific distribution. This requirement is collectively referred to as making the data "dimensionless." Linear dimensionless processing [67] includes centering (zero-centering or mean subtraction) and scaling. The essence of centering is to subtract a fixed value from all records, that is, to translate the data samples to a certain position. The essence of scaling is to fix the data in a certain range by dividing by a fixed value. Taking the logarithm is also a kind of scaling. Considering the characteristics of meteorological information, we choose the Min-Max scaling method to process it. When the data are centered on the minimum value and then scaled by the range (maximum minus minimum), the data are shifted by the minimum value and mapped into the range [0, 1]. This process is called Min-Max scaling:
$$x_i^{*} = \frac{x_i - \min(x_i)}{\max(x_i) - \min(x_i)} \tag{1}$$
In (1), $x_i$ represents the ith data item in the modality, $\min(x_i)$ and $\max(x_i)$ represent the smallest and the largest value in this modality, respectively, and $x_i^{*}$ is the ith data item after Min-Max scaling.
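The Min-Max scaling step can be illustrated with a minimal Python sketch. The function name `min_max_scale`, the sample values, and the convention for a constant-valued modality are assumptions made for this illustration; they are not taken from the LSCIDMRME pipeline.

```python
def min_max_scale(values):
    """Map a sequence of numbers linearly onto [0, 1] (Min-Max scaling)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Degenerate case (all values equal): return zeros by convention.
        # This convention is an assumption, not something the text specifies.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Hypothetical samples: month index and longitude (degrees) of four images.
months = [1, 4, 7, 12]
longitudes = [80.0, 110.0, 140.0, 200.0]

scaled_months = min_max_scale(months)
scaled_longitudes = min_max_scale(longitudes)
print(scaled_months)
print(scaled_longitudes)
```

Scaling each element separately keeps quantities with very different ranges, such as month indices and longitudes, on a comparable footing before they enter the perceptron.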
After processing the data of each modality, we input the processed data into a self-built multilayer perceptron, which mainly consists of two fully connected layers. In (2), i is the index of a neuron in the previous layer (or of an input-layer node), j is the index of a neuron in the current (hidden) layer, and $w_{ij}$ is the weight connecting neuron i in the previous layer to neuron j in the current layer. $h_j$ represents the weighted sum of all inputs of the current node.
In (3), $a_j$ represents the output value of the hidden-layer neuron. $f_{\text{meteorological}}$ in (4) represents the output value of the output layer, which is also the meteorological feature extracted by the network. $h_k$ represents the weighted sum of the inputs of neuron k in the output layer.
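The displayed forms of (2)-(4) are not reproduced in this text. One plausible reading that is consistent with the symbol definitions is sketched below. The bias terms $b_j$ and $b_k$, the output-layer weights $w_{jk}$, and the generic activation $\sigma$ are assumptions introduced for this sketch; the text does not state whether biases are used or which activation is applied.

```latex
\begin{align}
h_j &= \sum_{i} w_{ij}\, x_i^{*} + b_j, \tag{2}\\
a_j &= \sigma(h_j), \tag{3}\\
f_{\text{meteorological}} &= \sigma(h_k), \qquad h_k = \sum_{j} w_{jk}\, a_j + b_k. \tag{4}
\end{align}
```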
We add dropout [68] to the multilayer perceptron to prevent possible overfitting. Briefly, dropout deactivates the activation value of a neuron with a certain probability p during forward propagation in training, which makes the model generalize better because it does not rely too heavily on particular local features.
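A two-layer perceptron with dropout can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the layer sizes, the ReLU activation, the dropout rate of 0.5, the inverted-dropout rescaling, and all names (`mifem_forward`, `params`, and so on) are hypothetical and are not specified by the text.

```python
import random


def dropout(activations, p, training, rng):
    """Inverted dropout: zero each activation with probability p in training.

    Survivors are rescaled by 1 / (1 - p) so the expected value is unchanged
    and inference can skip the layer (a common convention, assumed here).
    """
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def dense(inputs, weights, biases):
    """Fully connected layer: weights[j][i] links input i to output j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def mifem_forward(elements, params, p=0.5, training=False, rng=None):
    """Two fully connected layers with dropout after the hidden layer."""
    rng = rng or random.Random(0)
    hidden = relu(dense(elements, params["w1"], params["b1"]))
    hidden = dropout(hidden, p, training, rng)
    return dense(hidden, params["w2"], params["b2"])


# Hypothetical sizes: 5 scaled elements -> 8 hidden units -> 4 output features.
init = random.Random(42)
n_in, n_hidden, n_out = 5, 8, 4
params = {
    "w1": [[init.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)],
    "b1": [0.0] * n_hidden,
    "w2": [[init.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)],
    "b2": [0.0] * n_out,
}

# Made-up scaled season, month, date stamp, longitude, latitude of one image.
elements = [0.33, 0.55, 0.40, 0.25, 0.60]
f_meteorological = mifem_forward(elements, params, training=False)
print(len(f_meteorological))
```

With `training=False` the dropout layer is the identity, so repeated calls on the same input return the same feature vector.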

D. Multimodal Fusion Network
The multimodal fusion network (MFN) is used to obtain the joint feature $f_{\text{multimodal}}$ after features have been extracted from the image modality and the meteorological modality. Moreover, MFN further processes the fused features for the final classification. The features $f_{\text{image}}$ and $f_{\text{meteorological}}$, extracted by IFEM and MIFEM, respectively, are the input of our self-built MFN. Unlike IFEM, which uses a mainstream CNN as the feature extractor, MFN is designed on the basis of a multilayer perceptron. MFN mainly includes three fully connected layers and uses ReLU as the activation function. We also add two dropout layers, after the first and second fully connected layers, to avoid overfitting. The activation function of the last fully connected layer depends on the specific classification task. We choose the softmax activation function for single-label classification:
$$\text{Softmax}(Z_j) = \frac{e^{Z_j}}{\sum_{k=1}^{K} e^{Z_k}} \tag{5}$$
In (5), K represents the total number of outputs in MFN and $Z_j$ represents the jth original output value. The denominator $\sum_{k=1}^{K} e^{Z_k}$ sums over all original output values, which means that the probabilities obtained by the Softmax function are related to each other.
Different from single-label classification, we choose the sigmoid function [69] for multilabel classification to project the output logits to the probability domain, because in a multilabel classification problem one image can have multiple correct answers and the different logits must be processed separately:
$$\text{Sigmoid}(Z_j) = \frac{1}{1 + e^{-Z_j}} \tag{6}$$
In (6), $Z_j$ represents the jth original output value.
We get the final classification result based on the output value of the final activation function.
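The difference between the two output activations can be seen in a short Python sketch. The logits are made-up values, the function names are illustrative, and the 0.5 decision threshold follows the setting reported for the multilabel experiments.

```python
import math


def softmax(logits):
    """Softmax as referenced in (5): coupled probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow; result unchanged
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Sigmoid as referenced in (6), written in a numerically stable form."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


logits = [2.0, 0.5, -1.0]  # made-up raw outputs for three classes

# Single-label: pick the one class with the highest softmax probability.
single_label_probs = softmax(logits)
predicted_class = max(range(len(logits)), key=lambda j: single_label_probs[j])

# Multilabel: each logit is judged on its own against the threshold.
multi_label_probs = [sigmoid(z) for z in logits]
predicted_labels = [j for j, p in enumerate(multi_label_probs) if p > 0.5]

print(predicted_class, predicted_labels)
```

Because softmax normalizes over all outputs, raising one class probability lowers the others, whereas the sigmoid outputs are independent and several labels can pass the threshold at once.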

V. EXPERIMENTS AND ANALYSIS
Experiments are conducted on an NVIDIA Quadro RTX A6000 GPU with 48 GB of memory. LSCIDMRME is composed of LSCIDMRME-S and LSCIDMRME-M, which deal with single-label and multilabel classification, respectively. Hence, two groups of experiments are carried out in this section.

A. Experiments on LSCIDMRME-S
1) Baseline Model Used in the Image Feature Extraction Module: We utilize three classical CNNs as the base models of IFEM in the experiments: AlexNet [24], ResNet-101 [70], and EfficientNet-B5 [28]. These base models and the corresponding CNN part of MANET are pretrained on ImageNet, and other
2) Parameter Setting: Two different testing ratios are considered in order to obtain a more comprehensive evaluation: 10% and 20%. For the former, 10% of the data in each category is used for testing and the rest for training. For the latter, 20% of the data in each category is used for testing, while the rest serves as the training set. During training, the input to the CNN model is a batch of RGB images whose size is fixed at 256 × 256 pixels. Simple data augmentation is performed, such as random vertical flipping with a certain probability and proportional cropping. We choose cross-entropy loss as the loss function and stochastic gradient descent (SGD) as the optimizer. The momentum of SGD is set to 0.9 and the learning rate is initialized to 1 × 10⁻⁵. Training lasts 100 epochs, and the learning rate is decreased by a factor of 5 every 20 epochs.
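The learning-rate schedule described in the parameter setting can be written as a simple step decay. The sketch below assumes that "decreased by a factor of 5" means division by 5 and that the first decay happens at epoch 20 with epochs counted from 0. The function name is illustrative.

```python
def learning_rate(epoch, initial_lr=1e-5, factor=5.0, step=20):
    """Step decay: divide the rate by `factor` once every `step` epochs."""
    return initial_lr / (factor ** (epoch // step))


# Learning rate for each of the 100 training epochs (epochs counted from 0).
schedule = [learning_rate(epoch) for epoch in range(100)]
print(schedule[0], schedule[20], schedule[99])
```

Under these assumptions the rate is 1 × 10⁻⁵ for epochs 0 to 19, one fifth of that for epochs 20 to 39, and so on, ending at the initial rate divided by 5⁴ for the last 20 epochs.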
3) Evaluation Metrics: Overall accuracy [71] and the confusion matrix [72] are used to evaluate the performance of the image classification models. We perform training of these networks with 10 epochs, and the means and standard deviations of the overall accuracy are calculated. During the training process, the model that achieves the best overall accuracy is saved as the best model for the computation of the confusion matrix. Then, for each of these best models, the correct and incorrect classifications of each category are counted and placed in the corresponding positions of the confusion matrix.
4) Results and Analyses: Table III presents the means and standard deviations of the overall accuracy of each model under the different testing ratios. From this table, it can be observed that the proposed MANET is more effective than classification with a single modality: the overall classification accuracy improves after the information of the meteorological modalities is fused. Accuracy is higher with a 10% test ratio than with a 20% test ratio, because more data are available for training and the model obtains more information during this process. MANET-ResNet-101 achieves the highest classification accuracy among all models. Figs. 5 and 6 show the confusion matrices of the baseline models and MANET, respectively, under different test ratios. Comparing these two figures shows that, with the same baseline model, adding meteorological information improves the classification accuracy of each category, especially for weather systems and clouds. Take Frontal surface as an example: in Fig. 5(a), its accuracy with AlexNet is 0.30, and if we add meteorological information to the model in the form of MANET, the accuracy improves to 0.43, as shown in Fig. 6(a).
This kind of improvement for specific categories can be seen across different baseline models and test ratios. Overall, the accuracy for weather systems and cloud types is improved to varying degrees with the help of meteorological information. This indicates that MANET achieves the purpose of improving the identification of weather systems and clouds. The improvement comes from the introduction of prior knowledge about which weather systems or clouds emerge at which time and location. Moreover, because 60% of the samples are in the Labeless category, a shortcut solution for a model is to simply predict unrecognized images as Labeless, which guarantees a 60% probability of guessing correctly. However, the confusion matrices show that MANET mitigates this harmful behavior, since the number of false Labeless predictions of MANET is smaller than that of the baseline models.
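The two single-label metrics can be computed with a few lines of Python. This is a sketch with made-up labels. Reading the per-category accuracy as the diagonal of a row-normalized confusion matrix is an assumption about how the figures are built.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """counts[t][p] = number of samples of true class t predicted as class p."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    return counts


def overall_accuracy(counts):
    """Fraction of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in counts)
    correct = sum(counts[c][c] for c in range(len(counts)))
    return correct / total if total else 0.0


def per_class_accuracy(counts):
    """Diagonal of the row-normalized matrix (assumed figure convention)."""
    return [row[c] / sum(row) if sum(row) else 0.0
            for c, row in enumerate(counts)]


# Made-up ground truth and predictions for six samples and three classes.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]

counts = confusion_matrix(y_true, y_pred, num_classes=3)
oa = overall_accuracy(counts)
per_class = per_class_accuracy(counts)
print(counts, oa, per_class)
```

A dominant class can inflate the overall accuracy, which is why per-category values from the confusion matrix are inspected alongside it.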
B. Experiments on LSCIDMRME-M
1) Parameter Setting: For LSCIDMRME-M, the testing ratios are the same as for LSCIDMRME-S: 10% and 20%. The network structures mentioned above are used with slight modifications, since the processes of multilabel classification and single-label classification differ. The input image size of the networks is 256 × 256 pixels, the same as for LSCIDMRME-S. The activation function in the added fully connected layer of these networks is changed to Sigmoid [69], and the loss function is replaced by binary cross entropy [73], [74]. Sigmoid maps the prediction score of each category into the probability domain, that is, the range 0-1. The threshold is set to 0.5: if the prediction score of a sample for a category is greater than this threshold, the sample is assigned to that class. We again choose SGD as the optimizer. The initial learning rate and the learning rate adjustment strategy are the same as in the above-mentioned experiments on LSCIDMRME-S.
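The multilabel setting can be sketched in Python: binary cross entropy is evaluated per label, and a 0.5 threshold turns probabilities into label decisions. The probabilities and targets are made-up values. Averaging the loss over the labels and clamping the probabilities are common conventions assumed here, not details given in the text.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross entropy over the labels of one sample."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp so that log() stays finite
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def predict_labels(probs, threshold=0.5):
    """A label is assigned only if its score is strictly above the threshold."""
    return [1 if p > threshold else 0 for p in probs]


# Made-up sigmoid outputs and ground-truth indicators for four labels.
probs = [0.9, 0.5, 0.2, 0.7]
targets = [1, 0, 0, 1]

loss = binary_cross_entropy(probs, targets)
predicted = predict_labels(probs)
print(round(loss, 4), predicted)
```

Each label contributes its own term to the loss, so the model behaves like a group of binary classifiers that share one feature extractor.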
2) Method Standard: Precision, recall, accuracy, and AbsoluteTrue are introduced as four indicators to evaluate our multilabel classification models. The specific formulas and principles of these four metrics are as follows. To better illustrate the relationship between the ground truth and the predicted labels of the same patch, we draw a chart in Fig. 7. Let N be the number of patches in the dataset, let $L_k$ be the subset that contains every true label of the kth patch, and let $L_k^{*}$ be the subset that contains every predicted label of the kth patch. In (9), $L_k \cup L_k^{*}$ denotes the set in which every element is a member of $L_k$ or $L_k^{*}$. An element of $L_k \cap L_k^{*}$ in (7)-(9) belongs to both $L_k$ and $L_k^{*}$. $\|\cdot\|$ denotes the number of elements of a set.
Precision [75]: Precision is the ratio of the number of correctly predicted labels to the number of all predicted labels.
$$\text{Precision} = \frac{1}{N}\sum_{k=1}^{N}\frac{\|L_k \cap L_k^{*}\|}{\|L_k^{*}\|} \tag{7}$$
Recall [75], [76]: Recall is the ratio of the number of correctly predicted labels to the number of real labels.
$$\text{Recall} = \frac{1}{N}\sum_{k=1}^{N}\frac{\|L_k \cap L_k^{*}\|}{\|L_k\|} \tag{8}$$
Accuracy [71]: Accuracy is the ratio of the number of correctly predicted labels to the total number of labels, which includes the correctly and incorrectly predicted labels as well as the real labels missed in the prediction.
$$\text{Accuracy} = \frac{1}{N}\sum_{k=1}^{N}\frac{\|L_k \cap L_k^{*}\|}{\|L_k \cup L_k^{*}\|} \tag{9}$$
AbsoluteTrue [76]: The kth sample is scored 1 if and only if all of its predicted labels are identical to its true labels; otherwise, it is scored 0.
$$\text{AbsoluteTrue} = \frac{1}{N}\sum_{k=1}^{N}\Delta(L_k, L_k^{*}), \qquad \Delta(L_k, L_k^{*}) = \begin{cases} 1, & L_k = L_k^{*} \\ 0, & \text{otherwise} \end{cases} \tag{10}$$
To calculate the four above-mentioned metrics, the best model of each deep learning network under each training and testing ratio is saved during the training process.
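The four multilabel metrics can be computed per patch from the sets of true and predicted labels and then averaged over the N patches, as in the minimal Python sketch below. The example label sets are made up. Scoring an empty prediction as zero precision, and an empty union as accuracy 1, are conventions assumed for this sketch.

```python
def multilabel_metrics(true_sets, pred_sets):
    """Example-based precision, recall, accuracy, and AbsoluteTrue."""
    n = len(true_sets)
    precision = recall = accuracy = absolute_true = 0.0
    for truth, pred in zip(true_sets, pred_sets):
        hit = len(truth & pred)      # correctly predicted labels
        union = len(truth | pred)    # true labels plus predicted labels
        precision += hit / len(pred) if pred else 0.0   # assumed convention
        recall += hit / len(truth) if truth else 0.0    # assumed convention
        accuracy += hit / union if union else 1.0       # assumed convention
        absolute_true += 1.0 if truth == pred else 0.0
    return {
        "precision": precision / n,
        "recall": recall / n,
        "accuracy": accuracy / n,
        "absolute_true": absolute_true / n,
    }


# Made-up true and predicted label sets for three patches.
true_sets = [
    {"TropicalCyclone", "Vegetation"},
    {"Desert"},
    {"FrontalSurface", "Vegetation"},
]
pred_sets = [
    {"Vegetation"},
    {"Desert"},
    {"FrontalSurface", "Desert"},
]

metrics = multilabel_metrics(true_sets, pred_sets)
print(metrics)
```

AbsoluteTrue is the strictest of the four, because a patch only counts when its predicted set matches the true set exactly.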
3) Results and Analyses: The metrics of the baseline models and MANET are listed in Tables IV and V, respectively. The addition of meteorological information improves the performance of multilabel classification on most metrics for all three baseline models. Moreover, as in the single-label classification experiments, a smaller test ratio brings better performance because of the bigger training set. Generally speaking, consistent with the experimental results on LSCIDMRME-S, Tables IV and V show that MANET is able to improve the overall classification performance.
TABLE VII: MULTIMODAL IMAGE CLASSIFICATION ON LSCIDMRME-M: EXAMPLE OF LSCIDMR-M IMAGES WITH THE TRUE MULTILABELS AND THE MULTILABELS ASSIGNED BY DIFFERENT METHODS IN DIFFERENT TEST RATIOS
From the perspective of model implementation, the multilabel classification task is realized by changing the last Softmax function of the single-label classification model to the Sigmoid function. In other words, we remove the mutual-exclusivity assumption of single-label classification so that the model can predict multiple labels. This makes our models effectively groups of binary classifiers that model the different labels separately. Some examples of prediction results are listed in Tables VI and VII. The first example image in Table VI is annotated with Tropical Cyclone, and none of the baseline models identifies that label; with the help of meteorological information, however, MANET-AlexNet and MANET-EfficientNet-B5 are able to detect it. These two tables also show that MANET is helpful for identifying land cover labels such as Desert and Vegetation. These results indicate that the proposed MANET performs better in modeling weather systems and clouds, as well as other land cover labels.

VI. CONCLUSION
In this article, we upgrade the original single-modal dataset LSCIDMR into a multimodal dataset, LSCIDMRME. Compared with LSCIDMR, LSCIDMRME has 521 950 multimodal tags. The new multimodal dataset will be uploaded to GitHub and IEEE DataPort. We also design MANET for satellite image classification by fusing the multimodal information of LSCIDMRME. The experimental results show that the proposed framework achieves better classification performance than classification with the image modality alone. The identification of specific cloud types and weather systems through RS image classification tasks also benefits from MANET.