Interband Retrieval and Classification Using the Multilabeled Sentinel-2 BigEarthNet Archive

Abstract-Conventional remote sensing data analysis techniques have a significant bottleneck: they operate on selectively chosen small-scale datasets. The availability of an enormous volume of data demands handling large-scale, diverse data, which has been made possible by neural network-based architectures. This article exploits the ability of deep neural networks to capture contextual information, particularly investigating multispectral band properties from Sentinel-2 image patches. Besides, an increase in the spatial resolution often leads to nonlinear mixing of land-cover types within a target resolution cell. We recognize this fact, group the bands according to their spatial resolutions, and propose a classification and retrieval framework. We design a representation learning framework for classifying the multispectral data by first utilizing all the bands and then using the bands grouped according to their spatial resolutions. We also propose a novel triplet-loss function for multilabeled images and use it to design an interband group retrieval framework. We demonstrate its effectiveness over the conventional triplet-loss function. Finally, we present a comprehensive discussion of the obtained results. We thoroughly analyze the performance of the band groups on various land-cover and land-use areas from agro-forestry regions, water bodies, and human-made structures. Experimental results for the classification and retrieval framework on the benchmarked BigEarthNet dataset exhibit marked improvements over existing studies.

Index Terms-Interband retrieval, multilabel classification, multilabel cross triplet loss, multimodal classification, Sentinel-2, land-cover classification.

I. INTRODUCTION
IMAGES from multispectral and hyperspectral sensors have found wide applications in mining [1], oceanography [2], agriculture [3], meteorological studies [4], and geological observations [5], to name a few. Multispectral satellites carry several spectral bands, which image the land surface at multiple spatial resolutions. Each spectral band captures specific physical information from distinct land surface covers. This information depends on the interaction of electromagnetic waves of particular wavelengths with the physical and geochemical characteristics of the land cover surface within a sensor resolution cell. Therefore, this information helps in efficiently characterizing different land cover classes, such as vegetation and water bodies.
With advances in deep learning frameworks, neural networks have become crucial for understanding such data. A conventional dense network usually neglects information about neighboring pixels, which holds vital cues about the change of pixel characteristics [6]. For example, consider an image that contains a water body and a beach. A conventional dense network might not focus on the boundary between these two land features. However, a convolutional neural network (CNN) considers spatial heterogeneity in terms of spatial distribution and neighborhood pixel information. Hence, CNNs are able to differentiate these land cover types along with their corresponding boundaries. Moreover, the main advantage of CNNs is that they automatically detect essential features from context [7].
Several studies aim to bridge the gap between the feature embeddings of multisensor imagery. However, as different sensors acquire images at different times, a particular region in the acquired data may be affected by changes in the local weather or land cover (e.g., during a harvest season or after a natural disaster). In such cases, multisensor images are better suited to a change detection task than to a fusion or cross/multimodal retrieval task. Motivated by this, we look into multimodal data classification and retrieval in which there has been no change between acquisitions. To this end, we consider imagery acquired by a single satellite and propose using bands grouped by their spatial resolution as different data modalities. This article primarily aims to study classification and retrieval among interband groups with data from precisely the same region, without any externally induced change.
To utilize all the bands together, we must downsize, interpolate, or super-resolve a few bands to bring them to a common spatial resolution. However, a significant drawback in multilabeled data is that we often get a nonlinear mixing of end-members within a target land-cover region with increased spatial resolution. Therefore, if we downsize the images, we lose a considerable amount of information. Hence, in this study, we group the bands according to their spatial resolutions and propose a classification and retrieval framework.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Several studies recommended using handcrafted feature extractors to develop a robust multilabel classification model. The normalized difference vegetation index (NDVI) is the most widely used descriptor for detecting green vegetation [8]. Likewise, several other conventional handcrafted indices are used for detecting various land-cover regions. With the onset of CNN-based feature extraction, the complexity of the classes addressed grew rapidly, with deep neural networks predominantly handling the classification task [9], [10]. Several classification studies have been reported using multispectral satellite data [9], [11]-[13].
While classification has remained a classical problem in remote sensing (RS), retrieval tasks have received more attention recently owing to extensive data acquisition by various satellite missions using different sensor technologies. This creates the need for an interband group retrieval framework [14]. Since each band provides distinct land-cover reflectance properties, it is imperative to have such a framework. Recently, there has been considerable focus on cross-sensor/cross-modal retrieval techniques in RS using various learning techniques [15], [16]. Some of the notable works in this domain are presented in [17]-[20]. Several works in this domain have tried to exploit the conventional triplet loss or the Siamese loss function to discriminate classes within a fine-grained dataset. One major shortcoming of using these losses for a multilabeled dataset is that they ignore the presence of classes shared between a positive image and a negative image with respect to the anchor image. This leads to the learning of scattered clusters for each class in the embedding space. To overcome this concern, we propose a modified triplet loss for multilabeled images.
Once we establish an interband group retrieval framework, it is easier to group bands based on their spatial resolution and consequently study the band properties of those modalities. To better understand channel properties in SAR RS applications, various works have been proposed (e.g., [21], [22]). However, to the best of the authors' knowledge, no such work has been carried out on multispectral sensors, which have many more widespread applications. We elaborate on the classification and retrieval performance of our framework on various band groups and interpret the overall band group properties. We show an overview of the problem statement in Fig. 1.
How are we different? In this work, we perform a comprehensive study of the band properties of the Sentinel-2 multispectral data. First, we design a multilabel classification network for classifying the BigEarthNet dataset. The proposed network is better than the state-of-the-art (SOTA) and yields a good performance.
We then split the data into three groups based on the spatial resolution of each band. For each of these groups of bands, we test the interband multilabel classification performance. For this purpose, we utilize the same model, which outperforms the SOTA in classification with all the bands combined. While keeping the network architecture intact, we vary the number of convolution kernel channels depending on the number of bands in the corresponding modality. The proposed method produces good discriminative features to describe the multilabeled land cover classes with lower dimensions.
Furthermore, we propose an interband multilabel retrieval architecture using a novel triplet loss function. We demonstrate that this function is better than the conventional triplet loss function for multilabeled images. This study provides an assessment of the bands that effectively contribute toward better land cover retrieval. Finally, we also study the properties of each band and explore their contribution in classifying each land use/land cover class. To the best of our knowledge, no prior work using multispectral satellites has studied their band properties.
We summarize the main contributions of this article as follows.
1) We propose a multilabel classification network based on representation learning that uses all the bands of Sentinel-2 and performs comparably to the SOTA on the large-scale benchmarked BigEarthNet dataset [23].
2) Using the abovementioned framework, we analyze the multilabel classification performance band group-wise (bands clubbed together by their spatial resolution). We also find class-wise identification of each land-cover class in different band groups.
3) We propose a novel modified cross-triplet loss-based metric learning technique for retrieving multilabeled images and demonstrate its efficacy over the conventional triplet loss in the experimental section.
4) Using the proposed modified cross-triplet loss, we design an interband group retrieval framework among these modalities.

II. RELATED WORKS
With the onset of deep learning technologies, various heterogeneous applications in RS have been handled successfully over the past decade. The deep learning-based approach has primarily demonstrated its superiority in feature extraction for several types of satellite images. This strategy has opened up a new research avenue in RS for data classification and retrieval. The following sections discuss current relevant work using deep learning in: 1) classification; 2) retrieval; and 3) understanding band properties, respectively.

A. Classification in RS
In RS, deep learning techniques have been successfully utilized in classifying images acquired from different sensors. Different sensors capture different information, which necessitates robust classifiers compatible with various forms of RS data. For multispectral satellite data classification, various studies have been conducted [9], [11]-[13]. Chaib et al. [9] proposed a simple framework by exploiting the features constructed by a VGG pretrained network [24]. They perform a discriminant correlation analysis on these features to refine the original features using information fusion, enabling them to obtain a good scene classification framework. Xia et al. [11] proposed a benchmarked dataset for aerial scene classification. In [12], the authors reported a similar very-high resolution land cover classification by augmenting the RS data and using standard transfer learning from a pretrained network. Cheng et al. [13] reviewed the important literature in RS for scene classification while also proposing another benchmarked dataset. Similarly, quite a few studies have also reported hyperspectral image classification [25]-[28], synthetic aperture radar (SAR) classification [29], [30], polarimetric SAR (PolSAR) classification [21], [31], RS object detection [16], etc.
Xu et al. [32] classified land cover types using the features derived from a CNN architecture. They fed the derived features directly into a support vector machine classifier for classification. According to their study, these derived features enhanced the classification accuracy by 2.65% as compared to other traditional methods. In another study, Runyu et al. [33] proposed a semisupervised multi-CNN ensemble learning method to classify different land cover types. Their proposed method outperforms other existing methods by 3% to 4% in overall accuracy. Therefore, these extracted features are essential in classifying several land cover classes within a study area [34]-[39]. Moreover, CNNs have gained importance in land cover classification due to their flexibility and adaptability to several land cover scenarios for different sensing platforms.

B. Retrieval in RS
There exists a plethora of work in the literature on image retrieval in RS [40]-[43]. Most retrieval frameworks create a hashed feature space, as it is much faster to search for the nearest neighbor using a Hamming distance measure. In [41], the authors proposed a kernel-based nonlinear hashing technique for retrieval from large-scale RS archives. They leverage the semantic similarity of the annotated images in the dataset to construct the hashed space. Xia et al. [44] provide a detailed review of the important works in RS unimodal data retrieval.
While most of the existing work is on unimodal data retrieval, new studies that exploit the robust feature extraction capability of CNNs are appearing to address cross-modal data retrieval. The need for multimodal approaches is particularly evident in RS data analysis due to the availability of a large number of satellite missions with various sensors and the complementary information obtained by multiple cross-sensor acquisitions [14], [19]. A similar study [15] has addressed this strategy using SAR and multispectral images and thereby proposed the SEN12MS dataset, which consists of Sentinel-1 and Sentinel-2 images over the same patch across various geographical locations. SAR images provide backscatter information of a target at microwave wavelengths that penetrate the atmospheric layer, and they often provide additional information beyond that of multispectral data. These cross-sensor retrieval techniques also extend to retrieving unseen class images upon deployment, commonly called zero-shot cross-modal retrieval [45]. As one of the major bottlenecks of solving RS problems using deep learning is the lack of annotated samples for training, cross-sensor zero-shot retrieval has received a lot of attention recently [18].

C. Physical Parameter Estimation or Understanding Band Properties
Various studies have been proposed using multiple sensors to interpret physical parameters from RS data. For example, one uses the double-bounce scattering mechanism to classify urban areas and city blocks. In contrast, volume scattering helps detect forests and dense vegetation areas from PolSAR data. Similarly, surface scattering characterizes flat terrains and ocean covers. Zhao et al. [21] used a contrastive-regulated CNN to learn the physical parameters from PolSAR images. In [22], the authors utilized SAR images to learn the spatial texture and backscattering patterns from target areas.
Several studies have proposed classification by exploiting the physical properties of each spectral band and the corresponding target behaviors using multispectral data [46]. However, most of these studies directly utilize these bands for specific tasks of classification and retrieval. For example, one uses thermal infrared bands to measure land surface temperature changes. Similarly, bands three, five, and seven from Landsat-8 are used for snow cover detection [47]. Likewise, each band has its unique physical characteristics, which are suitably exploited for various applications. We tabulate specific critical applications using each spectral band of Sentinel-2 as reported in the literature in Table I.
Robinson et al. [48] proposed a multiresolution data fusion method for high-resolution land cover mapping. They identify the challenges of deep learning-based land cover mapping and propose techniques to overcome these challenges. They primarily tackle a multimodal/multiresolution data fusion task to develop an efficient land cover mapping framework. To this end, the authors use a large-scale database comprising over eight trillion pixels to train the model.

A. Data Preparation
The Sentinel-2 satellite has 13 bands with different spatial resolutions and different image sizes. Sumbul et al. [23] reported the BigEarthNet dataset that was created using the Sentinel-2 images. Here, we demonstrate our work on this BigEarthNet dataset. Fig. 2 shows the true color composites (TCCs) and false-color composites (FCCs) of a few sample images from the BigEarthNet dataset over a diverse subset of land-cover classes. The FCCs are made using bands 8, 4, and 3, while the TCCs are made using bands 4, 3, and 2.
In the literature, the spatial resolution is shown to be one of the critical parameters for the classification of diverse land cover types [49], [50]. The effectiveness of any classification or retrieval method primarily depends on the texture information in the data and the existence of a pure end-member within a pixel. With an increase in the spatial resolution, the possibility of detecting a pure end-member in a pixel reduces due to the high chance of nonlinear mixing of several other end-members within the resolution cell. This characteristic becomes especially crucial and challenging for multilabeled RS datasets.
It is noteworthy that, due to the tradeoff between spectral and spatial resolutions, lower frequency and high spectral resolution bands require coarser spatial resolution. The BigEarthNet multilabeled data obtained from the Sentinel-2 sensor provide bands at three different spatial resolutions, depending on their bandwidths and central frequencies. Therefore, in this study, we group the available Sentinel-2 bands based on their spatial resolution. Band 10 (out of the 13 bands) was excluded from the dataset due to the lack of surface information. We thus have two bands with 60 m resolution, i.e., bands 1 (coastal aerosol) and 9 (water vapor), which we refer to as M1. There are six 20 m bands, i.e., bands 5 (vegetation red edge), 6 (vegetation red edge), 7 (vegetation red edge), 8A (vegetation red edge), 11 (short wave infrared), and 12 (short wave infrared), which we refer to as M2. There are four 10 m bands, i.e., bands 2 (blue), 3 (green), 4 (red), and 8 (near-infrared), which we refer to as M3.
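As a minimal sketch of the grouping described above, the three modalities can be written as a lookup table. The names `BAND_GROUPS`, `RESOLUTION_M`, and `input_channels` are hypothetical, and the string band identifiers are chosen here for illustration only.

```python
# Hypothetical lookup table for the three band groups described in the text.
# Band 10 is omitted because it is excluded from the dataset.
BAND_GROUPS = {
    "M1": ["B01", "B09"],                              # 60 m bands
    "M2": ["B05", "B06", "B07", "B8A", "B11", "B12"],  # 20 m bands
    "M3": ["B02", "B03", "B04", "B08"],                # 10 m bands
}

# Spatial resolution of each group in meters.
RESOLUTION_M = {"M1": 60, "M2": 20, "M3": 10}


def input_channels(group):
    """Number of bands, and hence input channels, for a band group."""
    return len(BAND_GROUPS[group])
```

Under this layout, a group-wise model would take `input_channels(group)` channels at its first convolution.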
For the initial classification, we combine all the band groups (i.e., M1, M2, and M3). The three groups of data are acquired simultaneously from the same satellite (i.e., the same sensor). In this work, we also use these groups, i.e., M1, M2, and M3, separately to analyze their contribution to each land-use/land-cover class. Further investigations involve group-wise classification and interband group retrieval along with the interpretation of band properties.
BigEarthNet is a multilabel dataset with labels from 43 different land cover categories. These labels were taken from the CORINE Land Cover database. The dataset consists of a total of 590 326 image patches generated from 125 Sentinel-2 image tiles. These tiles were selected from data acquired from June 2017 to May 2018.

B. Training Data Imbalance
Fig. 3 shows the bar graph of the number of image instances corresponding to each land-cover class. One can see that the difference between the most and the least frequently occurring classes exceeds approximately 200 000 samples. This disparity biases the network toward learning the more commonly occurring classes, while the less frequently occurring classes are seldom learned. So during training, we need to account for this fact either by weighting the classes according to the number of samples in each class or by carefully selecting the batches while training the network.
Some of the commonly used strategies for handling data imbalance in the literature are based on rebalancing the dataset and cost-sensitive learning of the classifier [51]. Naive over- and undersampling [52], selective decontamination [53], SMOTE [54], and GAN-based augmentation [55] are some of the commonly used rebalancing techniques. The cost-sensitive learning approach involves adding focal loss [56] and diversity regularizers [51].

IV. METHODOLOGY
Preliminaries: The idea behind the multilabel classification task is to find a mapping function that predicts the labels of a given input image. We conduct our studies through the following steps.
1) Multilabel image classification by individual band groups (M1, M2, and M3) and studying the importance of each band group for each class.
2) Designing an interband group, multilabel image retrieval network.
3) Understanding the band properties and their analysis.
We explain these steps in detail in the following sections.

A. Interband Classification
Multilabel image classification is a more challenging problem than single-label image classification, mainly because the classifier must account for every minute detail from different categories within one image. For a dataset with a set of distinct land-cover classes $Y = \{1, 2, \ldots, L\}$, the label set of an image is a subset $Y_i \subseteq Y$, i.e., an element of $2^Y$. Here, $2^Y$ is the power set of the set of all labels, which contains $2^L$ possible label combinations. This shows how the challenge of classifying multilabel images increases manyfold.
Ideally, we use a softmax function after the last layer of the neural network for a single-label classification problem. For a multilabel classification framework, we instead use a sigmoid activation function after the last layer to get the class labels. Whether a class is present is given by an indicator function, which we set high only when the predicted probability of the class is higher than a preset threshold. Formally, we define this in (1)

$$A_n = \begin{cases} 1, & \text{if } p_n > \alpha \\ 0, & \text{otherwise.} \end{cases} \tag{1}$$

Here, $A_n$ is the indicator function for class $n$, $p_n$ is the probability of the $n$th class, and $\alpha$ is the threshold. We try to keep the precision-to-recall ratio as close to one as possible. If the ratio is less than one, we increase the threshold constant $\alpha$. Similarly, if the ratio is greater than one, we reduce the value of the threshold constant. We tune the threshold value in this way using a grid-search-based method.

We use the three spatial resolution-wise compiled groups of the data (M1, M2, and M3) from the same satellite for this set of experiments. Since all three modalities come from the same satellite, all the groups capture the image simultaneously, with no time delay between them. Here, we train the previously designed network for group-wise multilabel classification to identify the land-cover classes of the image patches.
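The thresholding rule and its grid-search tuning can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the pooled (micro-averaged) counting of true and false positives, and the candidate grid are all hypothetical.

```python
def apply_threshold(probs, alpha):
    """Return the set of class indices n with p_n > alpha (indicator A_n = 1)."""
    return {n for n, p in enumerate(probs) if p > alpha}


def precision_recall_ratio(prob_rows, true_sets, alpha):
    """Precision / recall over all images, pooling counts across classes.

    Pooling the counts (micro-averaging) is an assumption of this sketch.
    Returns None when there are no true positives.
    """
    tp = fp = fn = 0
    for probs, truth in zip(prob_rows, true_sets):
        pred = apply_threshold(probs, alpha)
        tp += len(pred & truth)
        fp += len(pred - truth)
        fn += len(truth - pred)
    if tp == 0:
        return None
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision / recall


def tune_threshold(prob_rows, true_sets, grid):
    """Pick the alpha in `grid` whose precision/recall ratio is closest to one."""
    best_alpha, best_gap = None, float("inf")
    for alpha in grid:
        ratio = precision_recall_ratio(prob_rows, true_sets, alpha)
        if ratio is None:
            continue
        gap = abs(ratio - 1.0)
        if gap < best_gap:
            best_alpha, best_gap = alpha, gap
    return best_alpha
```

A ratio below one means precision is lower than recall, so the search favors a larger threshold, which matches the update rule described in the text.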
Training: We design a representation learning network for tackling the problem of multilabel classification. We apply cubic interpolation to the 20 m and the 60 m bands of each image to bring them to the same image dimension as the 120-pixel images. We choose to interpolate the data to be consistent with the other state-of-the-art architectures [23], [57], [58] and to keep the comparison among them fair.
We use four sets of convolution-pooling-nonlinearity blocks. The size of the initial convolution kernel depends on the group that we are using. For M1, we use two channels, six channels for M2, and four channels for M3, matching the number of bands in each group. We use the leaky_ReLU(.) function to inject nonlinearity into the designed model. We also use batch normalization after every convolution and dropout after every pooling layer. Instead of using only max-pooling, we perform max-pooling on half the channels and average pooling on the other half. This is then followed by deconvolution layers and up-sampling. The middle layer acts as the bottleneck layer, which holds most of the propagated information at a much smaller dimension. We add an auxiliary classifier at this bottleneck layer, as used in InceptionNet [59]. The auxiliary classifier is attached to intermediate layers of the network, and it improves convergence during training by combating the vanishing gradient problem. The auxiliary classifier prevents the bottleneck weights from dying out. Besides, we use two deconvolution layers with dimensions similar to those of the convolution layers. We perform up-sampling before each deconvolution layer. The final layer is followed by two fully connected layers of dimensions 256 and 128. Finally, a sigmoid activation function after the last layer gives the multilabel outputs. The final loss is the sum of the auxiliary classifier loss and the final layer multilabel classification loss.
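The mixed pooling step can be illustrated with a small framework-free sketch. The 2 x 2 non-overlapping window, the choice of which half of the channels is max-pooled, and all function names are assumptions made here for illustration; the text does not specify them.

```python
def pool2x2(channel, reduce_fn):
    """Non-overlapping 2x2 pooling on one channel given as a list of rows.

    A trailing odd row or column is dropped.
    """
    h, w = len(channel), len(channel[0])
    return [
        [
            reduce_fn([channel[r][c], channel[r][c + 1],
                       channel[r + 1][c], channel[r + 1][c + 1]])
            for c in range(0, w - 1, 2)
        ]
        for r in range(0, h - 1, 2)
    ]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def mixed_pool(feature_map):
    """Max-pool the first half of the channels and average-pool the rest.

    `feature_map` is a list of channels. Which half receives which
    pooling operator is an assumption of this sketch.
    """
    half = len(feature_map) // 2
    pooled_max = [pool2x2(ch, max) for ch in feature_map[:half]]
    pooled_avg = [pool2x2(ch, mean) for ch in feature_map[half:]]
    return pooled_max + pooled_avg
```

In a deep learning framework the same idea would typically be expressed by splitting the channel axis, applying the two pooling layers, and concatenating the results.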
We split the dataset into a 70:30 train:test ratio. To account for the considerable class imbalance and ensure proper training, we construct training batches that contain at least one instance of each class. This ensures that all the classes contribute to the model during training. We initialize the model using Xavier weights. Since we do not have much prior information about the data, Xavier initialization draws weights from a Gaussian distribution with zero mean and a finite variance. Xavier weights keep the variance the same across layers, preventing vanishing or exploding gradient problems. We train the network using the standard back-propagation algorithm with mini-batch stochastic gradient descent and a momentum optimizer.
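One way to build batches that contain every class at least once is sketched below. This is a hedged illustration, not the authors' sampler: the greedy covering strategy, the random fill, and the name `make_batch` are assumptions, and the sketch requires every class to occur in at least one training sample.

```python
import random


def make_batch(label_sets, num_classes, batch_size, rng):
    """Return sample indices for one batch that covers every class.

    label_sets: list of sets of class indices, one set per training image.
    Assumes each class appears in at least one image. If covering all
    classes needs more than `batch_size` samples, the batch is longer.
    """
    # Index the samples by the classes they contain.
    by_class = {c: [] for c in range(num_classes)}
    for idx, labels in enumerate(label_sets):
        for c in labels:
            by_class[c].append(idx)

    batch, covered = [], set()
    for c in range(num_classes):
        if c in covered:
            continue  # already present through an earlier multilabel sample
        idx = rng.choice(by_class[c])
        batch.append(idx)
        covered |= label_sets[idx]

    # Fill the remaining slots uniformly at random.
    while len(batch) < batch_size:
        batch.append(rng.randrange(len(label_sets)))
    return batch
```

Passing a seeded `random.Random` instance as `rng` keeps the batches reproducible across runs.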

B. Interband Group Retrieval
This section aims to learn a shared latent space that equivalently represents all three band groups, M1, M2, and M3. The main idea behind designing this unified latent space is to bring instances of similar classes from different band groups close together while pushing instances of dissimilar classes from different band groups far apart. This effectively implies that we reduce the intergroup distance within each class and increase the distance between different classes.
To synthesize this common shared space, we design a pipeline, as shown in Fig. 4. We extract the 128-d features from the network previously trained for multilabel, band group-wise classification. We save these final-layer features for each image instance from the corresponding trained classifier. On top of the features from the three pretrained band groups, we add a series of fully connected neural network layers. For this part, we stack three fully connected layers of dimensions 128, 128, and 64.

Modified cross-triplet loss: The conventional triplet loss is a type of similarity learning loss function. This work uses the basic idea of training a multilabeled triplet loss in conjunction with satellite images from different band groups. We aim to design a domain-agnostic latent space for an interband group retrieval setup. To achieve this, we employ three branches of fully connected networks from the input data stream. Each input is derived from the pretrained features of the corresponding instance. We train this network, with learnable parameters represented by θ, using the proposed modified cross-triplet loss. We train the network until the loss reaches a very low value.
For this purpose, we sample the positive exemplars by considering image samples from different band groups comprising the same classes. For an image $m_1^n \in M_1$ with $n$ multilabels $c_1, c_2, \ldots, c_n \in C$, any image $m_2^n \in M_2$ or $m_3^n \in M_3$ having either the same ground-truth labels or any subset of the ground-truth labels $c_1, c_2, \ldots, c_n$ is considered a positive exemplar. An instance that has labels apart from $c_1, c_2, \ldots, c_n$ along with a subset of these labels is also considered a valid positive exemplar.
Conversely, to select the negative examples, an instance from a separate band group having any labels apart from $c_1, c_2, \ldots, c_n$ is considered a negative example for that image. Therefore, while selecting the negative exemplar for a given training sample, we ensure that $y_i^p \cap y_i^n = \emptyset$. The standard triplet loss is defined as (2), where $A$ denotes the anchor image, $P$ represents the positive exemplar, $N$ denotes a negative example, and $f(\cdot)$ is the learned embedding

$$L(A, P, N) = \max\left(\|f(A) - f(P)\|^2 - \|f(A) - f(N)\|^2 + \alpha,\ 0\right). \tag{2}$$

If the distance between the anchor and the positive pair is more than the distance between the anchor and its negative pair, we update the weights by minimizing the loss function. We try to bring the positive sample close to the anchor and push the negative sample farther from the anchor beyond a margin $\alpha$. In this case, we do not consider the distance between the positive and the negative pairs. Since the data has multiple labels, we modify the loss function to make the learning more robust. The proposed similarity loss is given in (3). Here, the first loss term is similar to the conventional triplet loss function. However, contrary to the conventional triplet loss, we use anchor, positive, and negative data instances from three different band groups. This term pushes the negative data instance away from the positive and the anchor data beyond a margin of $\alpha$. The second term is added to push the positive and the negative exemplars apart from each other. This term is conditional: it is considered only if the classes of the positive and negative exemplars are completely nonoverlapping. To make it conditional, we use an indicator function $K$, where $K = 1$ if $P \cap N = \emptyset$, and $\alpha$ and $\beta$ are the two margin values. To extend this to an interband setup, we choose $A$, $P$, and $N$ from different band groups. We train the network with all six combinations of triads. The distribution gap between the band groups is further bridged by using a decoder network. An illustration of the proposed modified-cross-triplet metric learning is provided in Fig. 5.

Selection of positive and negative pairs: To train the network, we need to feed it with triads whose inputs come from different band groups. For a multilabeled dataset, it is evident that the number of possible combinations of negative pairs is far greater than the number of possible positive pairs. This could lead the network to learn embedding features with a large intraclass distance in different band groups. To avoid this, how we choose the triads during training is crucial. For this purpose, we feed the network with a similar number of all six combinations of triads in each mini-batch. Classes with a higher Euclidean distance between their embedding features lie farther apart from each other in the shared embedding space. For example, classes comprising vegetation covers would be farther from classes containing water. Similarly, a few classes are closer to each other in the embedding space, as their mutual Euclidean distance is much smaller. These are classes such as sea, ocean, water bodies, and water courses. These classes are required to fine-tune the boundaries of the representation of instances from each class in the embedding space.
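The balanced sampling of triads across the six band-group orderings can be sketched as follows. This is an illustrative reading of the selection rules rather than the authors' code: the sketch assumes that a positive shares at least one label with the anchor and that a negative shares none, and the names `GROUP_TRIADS`, `sample_triad`, and `balanced_triads` are hypothetical.

```python
import random
from itertools import permutations

# All six orderings of (anchor group, positive group, negative group).
GROUP_TRIADS = list(permutations(("M1", "M2", "M3")))


def sample_triad(archive, order, rng):
    """Draw one (anchor, positive, negative) triad of patch indices.

    archive maps a band group name to a list of label sets, one per patch.
    Assumption: a positive shares at least one label with the anchor,
    and a negative shares no label with the anchor.
    """
    g_a, g_p, g_n = order
    a = rng.randrange(len(archive[g_a]))
    anchor_labels = archive[g_a][a]
    positives = [i for i, y in enumerate(archive[g_p]) if y & anchor_labels]
    negatives = [i for i, y in enumerate(archive[g_n])
                 if not (y & anchor_labels)]
    if not positives or not negatives:
        return None  # no valid triad for this anchor
    return a, rng.choice(positives), rng.choice(negatives)


def balanced_triads(archive, per_order, rng):
    """Collect the same number of triads for each of the six orderings.

    Assumes valid triads exist for every ordering; otherwise the inner
    loop would not terminate.
    """
    batch = []
    for order in GROUP_TRIADS:
        found = 0
        while found < per_order:
            triad = sample_triad(archive, order, rng)
            if triad is not None:
                batch.append((order, triad))
                found += 1
    return batch
```

Drawing an equal count per ordering keeps any single anchor group from dominating a mini-batch.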
Objective function: In the experiments, we noticed that solely minimizing the triplet loss is insufficient to train the network for such a fine-grained multilabeled dataset. While pushing two different class samples apart, we also need to ensure that the class that is driven away does not get cluttered with some other class. For this purpose, we also add a sigmoid layer to minimize the cross-entropy loss in the network to encode the class information (5). The cross-entropy loss helps maintain differentiating attributes among each class within a group, while the cross-triplet loss aids in bridging the domain gap between the different groups. Algorithm 1 demonstrates the overall working framework. The overall loss function is the sum of the cross-triplet loss and the cross-entropy loss functions. We refer to the cumulative loss as L and define it as (6).
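The combined objective can be sketched as follows. This is a minimal Python illustration, assuming a per-label sigmoid followed by binary cross-entropy averaged over the labels; the function names and the averaging are assumptions, not the exact formulation of (5) and (6).

```python
import math


def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def multilabel_cross_entropy(logits, targets, eps=1e-12):
    """Binary cross-entropy over independent sigmoid outputs, averaged over labels."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Clamp the probability to avoid log(0).
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)


def total_loss(cross_triplet, cross_entropy):
    """Cumulative loss L: the plain sum of the two terms, as stated in the text."""
    return cross_triplet + cross_entropy


if __name__ == "__main__":
    ce = multilabel_cross_entropy([2.0, -1.0, 0.5], [1, 0, 1])
    print(total_loss(0.25, ce))
```

The unweighted sum follows the description in the text; any relative weighting of the two terms would be an additional design choice.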

C. Band Group Properties
It is universally acknowledged in RS that different bands of a multispectral sensor are essential in capturing diverse land-use/land-cover regions owing to their absorption characteristics in that band. However, to the best of the authors' knowledge, no study has hitherto shown the essential contribution of each of these bands in capturing their band properties. This article uses a modified pairwise-triplet similarity loss-based architecture for interband retrieval and a representation learning network-based architecture for classifying the different groups.
To study the contribution of each band group in the detection of each land-use/land-cover class, we examine the class-wise precision of each land-cover category from the multilabel network trained using each group individually. Finding the accuracy of recognizing each land-cover class using the three band groups would provide us an insight into the properties of the bands in that group. We shed more light on this in the results and discussion in Section V.

A. Experimental Setup
1) Evaluation Metrics: To evaluate the performance of the multilabel classification for both the group-wise and the cumulative framework, we use the conventional precision and recall measures. Precision is defined as the proportion of true positives among all predicted positives (true positives + false positives). Likewise, recall is defined as the proportion of true positives among all actual positives (true positives + false negatives). Conventionally, there is a tradeoff between achieving a high precision and a high recall.
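The definitions above translate directly into code. The following Python sketch counts true positives, false positives, and false negatives over label sets and derives precision, recall, and the F1 score; micro-averaging over all patches and the function name are assumptions made for illustration.

```python
def precision_recall_f1(true_label_sets, predicted_label_sets):
    """Micro-averaged precision, recall, and F1 for multilabel predictions."""
    tp = fp = fn = 0
    for true, pred in zip(true_label_sets, predicted_label_sets):
        true, pred = set(true), set(pred)
        tp += len(true & pred)   # labels predicted and actually present
        fp += len(pred - true)   # labels predicted but absent
        fn += len(true - pred)   # labels present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


if __name__ == "__main__":
    truth = [{"forest", "water"}, {"urban"}]
    predicted = [{"forest"}, {"urban", "water"}]
    print(precision_recall_f1(truth, predicted))
```

Predicting more labels per patch tends to raise recall while lowering precision, which is the tradeoff noted above.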
2) Parameter Settings: While training the multilabel classification network, we choose the threshold α = 0.5. Since there are many classes, we did not intend to set a very high threshold value. For the interband retrieval network, we again chose a value of 0.5 for both the margins α and β. Too high a value often leads to dispersed clusters, while too low a value does not sufficiently separate two classes.
The bold signifies the best performance.
3) Implementation Details: Following the experimental protocol of [23], we split the entire dataset into a 60:20:20 train:val:test split. For all the subproblems, we choose a batch size of 50 for training the network. We select the batches to have at least one image instance of each of the 43 labels in each batch. For training the classification networks, we used a momentum optimizer with a small learning rate of 0.001. We trained the network for about 1000 epochs until the losses converged. We saved the model after every 20 epochs and loaded the best-trained model for the final test. We chose an even lower learning rate of 0.0001 with a stochastic gradient descent optimizer for the interband retrieval network. We trained it for about 2000 epochs before saving the best model. We constructed the triplets as mentioned in Section IV-B and ensured that each batch contained at least one anchor image from one of the 43 distinct land-cover classes. Some of the Sentinel-2 image patches contain a considerable amount of seasonal snow. Also, while most of the images in this dataset are selected from regions with less than 1% cloud cover, the cloud cover is localized within some patches in some cases. This includes a substantial number of patches (13%). We conducted another set of experiments by eliminating these patches from the dataset and refer to this data as BigEarthNet-subset (S2) subsequently.
Table II shows the comparison of various existing works in the literature on the BigEarthNet and BigEarthNet-subset datasets. Sumbul et al. [57] report their results on two variants of their model, where they train the network using the RGB channels and using all the channels. For our experiments as well, we have reported the performance of both variants. There is a tradeoff between precision and recall values, and maximizing one can lead to a fall in the other. Hence, it is important to balance precision and recall to attain a high F1 score. In addition to this, we also compare our results with [23], [24], [59], [60]. Although the work in [23] attains a high recall value, its corresponding precision value falls considerably, taking a toll on its F1 score. Some existing works like [17], [58], [61]-[63] utilize variants/subsets of the BigEarthNet dataset with either a different experimental protocol or a different aim than classification/retrieval tasks (e.g., colorization, noisy label detection, interband retrieval). Hence, to maintain fairness in comparison, we do not directly compare with their results.
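A 60:20:20 split of the kind described above can be sketched in a few lines of Python. This is only an illustration: the function name, the fixed seed, and the plain random shuffle are assumptions, and the actual protocol of [23] may differ (for example, in how patches are assigned to each part).

```python
import random


def split_dataset(patch_ids, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle patch identifiers and cut them into train/val/test parts."""
    ids = list(patch_ids)
    random.Random(seed).shuffle(ids)  # seeded for a reproducible split
    n_train = int(round(ratios[0] * len(ids)))
    n_val = int(round(ratios[1] * len(ids)))
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]      # remainder goes to the test part
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_dataset(range(1000))
    print(len(train), len(val), len(test))
```

The batch-level constraint mentioned in the text (at least one instance of each label per batch) would need a separate sampler and is not covered by this sketch.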

B. Band Group Classification Results
We chose the train-test samples randomly to avoid training bias. Typically for classification problems, it is more common to report the results in terms of accuracy. However, in multilabel classification, accuracy becomes ambiguous when the network predicts only a subset of the true labels, or predicts all the true labels along with a few incorrect ones. To avoid this, precision, recall, and mean average precision (mAP) values are considered for multilabel classification. We report the performance of the network in Table II in terms of precision at top-10 (P@10) and recall values. Since the all-channel multilabel classification model outperforms the existing works on this data, we can state that the current network is suitable for group-wise classification without loss of generality. Some of the experiments from the literature have reported their performance on only one of the variants of the BigEarthNet dataset; hence, the alternate variant results are unavailable.
Furthermore, from Table II, we see that the classification performance using just the 20 × 20 pixel (60 m) band group yields inferior results. This is primarily because there are too few bands in this group. Moreover, the spatial resolution of the bands is very low. This also considerably affects the classification performance of the images. Finally, and most importantly, this group comprises the coastal aerosol and the water vapor bands. If we study the land-cover classes in this dataset carefully from Fig. 3, there are a few land-cover categories that we can distinctly recognize using this band group. The classes that gave the highest precision using this band group are burnt areas and continuous/discontinuous urban fabrics.
The classification performance using just the 60 × 60 pixel (20 m) band group yields slightly inferior results to the 120 × 120 pixel (10 m) band group. Even though this group has the largest number of bands, i.e., six, the quantity of information contained in these bands seems less than that in the 10 m bands. This band group comprises the four vegetation red edge bands and two short wave IR bands. From Fig. 3, it can be seen that there are plenty of vegetation and forest cover classes in the dataset. The M2 band group can classify within these classes much more robustly than the other groups. However, when it comes to the other nonvegetation cover classes, such as airports, salines, and burnt areas, to name a few, the classification performance is drastically affected. The classes that gave this group the highest precision are coniferous forests, nonirrigated arable lands, and mixed forests. Typically, mixed forests are a combination of coniferous and broad-leaved deciduous forests.
M3 was seen to classify the best among the three groups. One obvious contributing factor is its high spatial resolution of 10 m (120 × 120 pixel images). Although this group has fewer bands (four) than M2 (six bands), the effective information content spanning different classes is higher. The intuition is that the overall interclass distance between the broad classes in the embedding space is much higher than in the other groups. However, the distinction between the finer classes, such as broad-leaved forests, mixed forests, and coniferous forests, is not high and seems cluttered. This group mainly comprises the RGB and the infrared bands. We can see from Table II that this band group alone can yield more or less comparable performance to that of the full data model. The classes that gave the highest precision using this band group are again coniferous forests and mixed forests. The road and rail networks are also detected well using this band group, which provides crisp, sharp, and detailed images due to its high spatial resolution. Water courses are also detected very well using this group.
Fig. 6 illustrates the model performance with 10%, 50%, and 100% of the training data, obtained by stratified random sampling. We plot the precision values along the vertical axis and plot the model performance with all the RGB channels, M1, M2, and M3 band groups.

C. Cross/Interband Retrieval
In this set of experiments, we aim to realize a shared embedding space for the instances of all three band groups. The shared features are designed to be discriminative while reducing the intergroup domain gap. We do so by minimizing the overall objective function (6). Given a query image from any of the three groups, we can find the k-nearest neighbours to that query feature from the target band group. Table III reports the results of this interband retrieval in terms of P@10 and mAP values on the dataset. Table III shows the ablation for each of these three losses and highlights the advantage of the proposed modified cross-triplet loss for interband data retrieval for multilabeled images.
One can see from Table III that the conventional triplet loss is not able to bridge the domain gap between the two band groups. In contrast, the cross-triplet loss can handle retrieval among different bands by bridging the band groups. However, for this multilabeled dataset, the proposed modified cross-triplet loss outperforms the other losses by a margin of almost 2% in most of the results. This helps to highlight the efficacy of the proposed loss.
We observe that the interband retrieval performances between the M2 and M3 band groups are the highest. Intuitively, this is due to the high spatial resolution and information content of the two groups. It is also an important observation that when the query has higher information content (M3), the retrieved instances from a lower information content group (M2) are better than the other way around.
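The retrieval step and its two metrics can be sketched as follows. This Python illustration ranks target-group embeddings by squared Euclidean distance to the query embedding and scores the ranking with precision at k and average precision. Treating a retrieved patch as relevant when it shares at least one label with the query is an assumption made here for the multilabel case, as are all function names.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def retrieve(query, gallery, k=10):
    """Indices of the k gallery embeddings nearest to the query embedding."""
    ranked = sorted(range(len(gallery)), key=lambda i: sq_dist(query, gallery[i]))
    return ranked[:k]


def precision_at_k(query_labels, retrieved_label_sets, k=10):
    """Fraction of the top-k results sharing at least one label with the query."""
    top = retrieved_label_sets[:k]
    if not top:
        return 0.0
    return sum(1 for labels in top if query_labels & labels) / len(top)


def average_precision(query_labels, retrieved_label_sets):
    """Mean of the precision values measured at each relevant rank."""
    hits, total = 0, 0.0
    for rank, labels in enumerate(retrieved_label_sets, start=1):
        if query_labels & labels:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


if __name__ == "__main__":
    # Query embedding from one band group, gallery embeddings from another.
    gallery = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [2.0, 0.0]]
    gallery_labels = [{"water"}, {"forest"}, {"water"}, {"urban"}]
    order = retrieve([0.1, 0.0], gallery, k=4)
    ranked_labels = [gallery_labels[i] for i in order]
    print(order, precision_at_k({"water"}, ranked_labels, k=2),
          average_precision({"water"}, ranked_labels))
```

Averaging `average_precision` over all queries gives the mAP; repeating the procedure for each ordered pair of query and target groups covers every interband direction.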

D. Understanding Band Properties
As mentioned in Section IV-C, to study the contribution of each band group for the categorization of each land-use/land-cover class, we examine the class-wise precision of each category. The experiments were conducted on the BigEarthNet-subset dataset devoid of cloud, shadow, and snow cover. The class-wise precision values obtained on the subset data are provided as bar plots in Figs. 7-9. We can see that M3 yields a better class-wise classification performance than M1 and M2. This section briefly discusses the observations by classifying the classes into three categories: agro-forestry, water bodies, and human-made areas. In addition, it can also be seen from the above plots that we have successfully addressed the training data imbalance characteristics. Moreover, the classes with fewer instances perform adequately.
Agro-forestry regions: We identified 20 classes, which are either of the agricultural or forest type classes. We group them together and thoroughly inspect them in this section. Fig. 7 shows the bar plot of the class-wise precision obtained in each group. Theoretically, the vegetation red edge bands in the M2 group help calculate NDVI and help indicate chlorophyll and are, hence, helpful in distinguishing between healthy and unhealthy vegetation. Studies have also shown that the presence of bands 5 and 6 assists in obtaining the biophysical properties of vegetation, such as leaf area index (LAI) and biomass [8]. Band 2 (blue) often finds application in discriminating between coniferous and deciduous forests. Experimentally, we observe the following.
1) Certain vegetation classes seem to be better discriminated in M2 than M1, e.g., coniferous forest, transitional woodland/shrub, broad-leaved forest, etc. Likewise, certain forest classes are also better discriminated in M2 than M1, e.g., nonirrigated arable land, significant natural vegetation, to name a few.
2) The abovementioned observations strongly indicate that the finer-grained agro-forestry classes are more discernible in the feature space of the M2 group than M3. While the spectral signature along the six bands helps establish the category of regions, the spatial texture and geometrical patterns also play an essential role in classifying these classes.
3) It is observed that in M2, most of the vegetated surfaces are retrieved with a precision ≥ 60%. This observation also partially confirms that most of these vegetation covers have proper irrigation and drainage facilities in healthy conditions.
4) The relatively lower performance of the agriculture and natural vegetation class could be because of its confusion with the complex cultivation pattern class. During the time of data acquisition, possibly both these regions had cultivated crops. However, their spatial microtextures and geometrical patterns help to quite an extent in making them somewhat discernible.
5) Mixed and coniferous forests show very high accuracy in both M2 and M3. While the primary reason behind this is the significant number of training samples available for these classes, the unique leaf structure of their respective categories also makes them discernible. The coniferous forest predominantly comprises cone-bearing needle-leaved evergreen trees. Mixed forest, on the other hand, is a combination of coniferous and broad-leaved deciduous forests. This property degrades the capacity to discern broad-leaved forests from mixed forests, as there is ample confusion between the two. The same factors affect the broad-leaved forest class, which also has considerably fewer training samples than the former.
Water bodies: For this category, we select ten classes that broadly consist of water. We group them together and thoroughly inspect them in this section. Fig. 8 shows the bar plot of the class-wise precision obtained in each band group. Theoretically, band 2 (categorized under M3) is in the visible blue region and finds several applications in bathymetric mappings, which involve the study of the underwater depth of ocean floors or lake floors [67]. While M3 seems superior in classifying these classes, a few red edge vegetation bands (under M2) are sensitive to moisture content and help in calculating the moisture index. Experimentally, we observe the following.
1) It is largely expected that one can detect water bodies better with the M3 group. This view is supported theoretically by the high reflectance of the water body in the blue wavelength region and near-complete absorption of the other wavelengths. Although most of the classes follow this trend, there are a few classes where M2 performs better than M3.
2) The differences in precision values obtained using M2 in certain water body classes are assumed to be caused by coastal algal bloom [68] and the presence of low to moderately dense chlorophyll content due to the phytoplankton population [69]. The presence of phytoplankton is responsible for an increase in the reflectance in the red edge region, making the spectral signature of the class different from regular water bodies.
3) An interesting subclass under the water bodies is coastal lagoons. The coastal lagoons are essentially transitional zones between land and sea. They are shallow inland water bodies that are intermittently connected to oceans, blocked by land barriers. Therefore, they contain the spectral signatures of both land and water. The precision of this class is observed to be relatively lower than the other classes. This is assumed to be because of its confusion with the individual irrigated lands and water bodies. Besides, there also exists the dilemma of turbidity and eutrophication of water, which is a very common ecological phenomenon [70].
4) Estuaries, intertidal flats, and salines comprise too few training samples, and hence their performance takes a hit due to the largely imbalanced data classes.
Man-made regions: For this category, we select 11 classes that broadly consist of urban areas. We group them together and carefully inspect them in this section. Fig. 9 shows the bar plot of the class-wise precision captured in each band group. Human-made areas are assumed to be captured well by the M3 group as it has a very high spatial resolution of 10 m [71]. Experimentally, we observe the following.
1) Human-made structures can be captured well by high spatial resolution bands [71]. This nature helps the M3 band group of spatial resolution 10 m in providing superior results.
2) In certain regions like the port areas and construction sites, the greenness value is close to 0. Hence, its corresponding reflectance in the red edge bands is also close to 0. Due to this reason, M2, comprising red edge bands, cannot identify these regions successfully.
3) M1 consists of the coastal aerosol and the water vapor bands. The presence of specific disposed or waste material causes the aerosol bands to help detect dump sites. It is also our assumption that the port areas and construction sites from which the dataset was obtained consisted of specific amounts of fly ash and aerosols [72], which led to a better assessment of these classes in this group.
4) Surprisingly, continuous and discontinuous urban fabrics were also captured well by this band group. This could be because these urban areas were high on particle pollutants [73].
Overall, it is observed that M3 has lesser class-wise precision variation than M2, which has a high sensitivity to the agricultural and forest areas. The precision of M2 is much lower for urban, road, and different human-made targets. We can observe from the plots that the precision mainly falls over nonvegetated areas (in M2). The literature affirms that the bands B5, B6, B7, and B8A are sensitive to the greenness of vegetation, while B9 and B11 are sensitive to the moisture content, and B12 is excellent at detecting geological features. Hence, in certain human-made classes such as industrial units, sports and leisure facilities, road and rail networks, bare rock, mineral extraction sites, construction sites, airports, dump sites, port areas, and burnt areas, the amount of greenness is little to almost negligible. This is because of their nearly nonexistent vegetation canopy. These areas have intrinsic sharp patterns and are captured well in the M3 group with high spatial resolution. Therefore, while M2 classifies the fine-grained vegetation and forest classes, M3 classifies the overall broad spectrum classes more robustly, such as airports, salt marshes, rail and road networks, bare rock, and coastal lagoons.
There exists a substantial class-wise data imbalance in the dataset. The class-wise precision subplots are arranged in decreasing order of the number of samples from left to right. It can be noted from Figs. 7-9 that the classes having fewer samples gradually show decreasing results, as the network could not learn them adequately. Also, another critical contributory factor in the multilabel classification performance is that certain classes appear in combinations throughout the dataset. Hence, correctly classifying at least one of the distinct subclasses of a multilabeled instance often ensures the classification of the other classes in that instance. This phenomenon affects the performance of the model immensely in conventional multilabel classification.

VI. CONCLUSION
We exploit a simplistic yet efficient representation learning network for multilabel classification that yields superior results to the existing literature. We then group the bands of Sentinel-2 multispectral data based on their spatial resolutions. The effectiveness of any classification or retrieval problem primarily depends on the texture information in the data and the existence of a pure end-member within a pixel. With an increase in the spatial resolution, the possibility of detecting a pure end-member in a pixel reduces due to the high chance of nonlinear mixing of several other end-members within the resolution cell.
From these grouped bands, we demonstrate the identifiability of each of the land-cover classes in all these band groups. We further interpret the observations from the abovementioned framework and study the band properties of this multispectral RS data. The BigEarthNet dataset was created by exploiting the labels from the Corine land-use/land-cover classes. This results in the presence of mixed classes in the patches. Our experiments have supported this observation and are discussed in detail by exploiting agriculture and forestry domain knowledge. In addition, it can also be seen from the abovementioned results that we have successfully addressed the training data imbalance, and even the classes with fewer instances perform decently well.
Furthermore, we introduced a novel modified cross-triplet loss for multilabeled data for metric learning and established its efficacy over the conventional triplet loss. The standard triplets do not consider the possibility of having a common subset of classes between the positive and negative examples of a multilabeled anchor image and, therefore, can spread apart the intraclass distances. In the proposed loss, we consider this fact and observe a distinct improvement in the overall results.
We thoroughly demonstrate our classification and retrieval results on the large-scale benchmark BigEarthNet data. To validate our claim, we show that the proposed framework outperforms the current literature in all the evaluation metrics. In the future, we would like to extend this study to investigate the effect of a weighted grouping of bands and a learnable band selection process. We also plan to perform a few case studies using these specific band groups from forest-fire and snow/cloud detection problems and study their efficacy.

Fig. 3. Land-cover classes versus the number of image patches of each of these categories. Plot distribution highlights heavy data imbalance.
split the bands according to their spatial resolutions (viz., 10 m, 20 m, and 60 m), and analyze their contributions to segregate different land-cover classes.The database contains image patches of size: 1) 120 × 120 pixels in the 10 m bands; 2) 60 × 60 pixels in the 20 m bands; 3) 20 × 20 pixels in the 60 m bands.
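The band grouping can be written down as a small lookup table. In the Python sketch below, the group names M1, M2, and M3 follow the usage elsewhere in the article, the patch sizes and resolutions are those listed above, and the band identifiers follow the standard Sentinel-2 band naming; the dictionary layout and the helper function are illustrative assumptions.

```python
# Band groups by spatial resolution. Band identifiers use the standard
# Sentinel-2 naming; the dictionary layout itself is only illustrative.
BAND_GROUPS = {
    "M1": {"resolution_m": 60, "patch_px": 20,
           "bands": ["B01", "B09"]},
    "M2": {"resolution_m": 20, "patch_px": 60,
           "bands": ["B05", "B06", "B07", "B8A", "B11", "B12"]},
    "M3": {"resolution_m": 10, "patch_px": 120,
           "bands": ["B02", "B03", "B04", "B08"]},
}


def ground_extent_m(group_name):
    """Side length on the ground, in metres, covered by one patch of a group."""
    group = BAND_GROUPS[group_name]
    return group["resolution_m"] * group["patch_px"]


if __name__ == "__main__":
    for name in BAND_GROUPS:
        print(name, len(BAND_GROUPS[name]["bands"]), ground_extent_m(name))
```

Every group covers the same ground footprint per patch (pixel count multiplied by resolution), so the three inputs describe the same scene at different sampling densities.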

Fig. 4. Overall pipeline of the proposed interband retrieval framework using the modified-triplet function and cross-entropy loss. The inference phase from the synthesized shared latent space is shown on the right.

Fig. 7. Class-wise precision of the three groups for agro-forestry classes.

Fig. 8. Class-wise precision of the three band groups for various water-body classes.

Fig. 9. Class-wise precision of the three band groups for various man-made regions.

TABLE I
EXAMPLES OF FEW TYPICAL APPLICATIONS OF EACH SPECTRAL BAND OF SENTINEL 2

TABLE II
MULTILABEL CLASSIFICATION PERFORMANCE OF THE PROPOSED FRAMEWORK

TABLE III
INTERBAND RETRIEVAL PERFORMANCE OF THE PROPOSED FRAMEWORK