Facing the Void: Overcoming Missing Data in Multi-View Imagery

In some scenarios, a single input image may not be enough to classify an object. In those cases, it is crucial to explore the complementary information extracted from images presenting the same object from multiple perspectives (or views) in order to enhance the general scene understanding and, consequently, increase the classification performance. However, this task, commonly called multi-view image classification, has a major challenge: missing data. In this paper, we propose a novel technique for multi-view image classification robust to this problem. The proposed method, based on state-of-the-art deep learning approaches and metric learning, can be easily adapted and exploited in other applications and domains. A systematic evaluation of the proposed algorithm was conducted using two multi-view aerial-ground datasets with very distinct properties. Results show that the proposed algorithm improves multi-view image classification accuracy when compared to state-of-the-art methods. Code is available at https://github.com/Gabriellm2003/remote_sensing_missing_data


Introduction
Standard image classification models are trained using a single data point as input. However, in some cases, using only one input image is not enough to allow its categorization. One reason for this is the perspective of the object presented in the image, which may not provide enough information to allow its identification. For example, aerial images allow us to observe objects from above, providing information about the general shape and structure of the objects and facilitating the classification of objects such as bridges and streets. On the other hand, ground images give us a closer and frontal view of the object, providing information about fine details and helping the recognition of, for instance, statues and buildings with specific facades (such as schools).
In this context, researchers noticed that it would be essential to exploit the complementary information extracted from images depicting the same object from multiple perspectives (or views) in order to enhance the general scene understanding and, consequently, increase the performance. This important task, commonly called multi-view image classification, has been successfully explored for distinct applications, including geo-localization [1], mammography analysis [2], and land use mapping [3]. However, although essential and impactful, this task has a major challenge: missing data. When working with multi-view data, it is very common to have one or more views missing due to malfunction of a sensor, noise, or simply lack of data. This is even worse when considering that multi-view samples with missing data are often discarded entirely, resulting in a severe loss of available information, as presented in Figure 1. This issue is especially relevant for domains in which it is difficult to obtain annotated multi-view samples, such as the medical and remote sensing ones [4].
In this paper, we propose a novel framework for multi-view image classification capable of dealing with missing data. Technically, this method can be divided into two parts. In the first one, a retrieval network, trained using metric learning, receives an input instance and retrieves samples that can be used to fill its missing data gap. Then, in the second part, information extracted from both the input instance and the top-k retrieved images is further processed using state-of-the-art deep learning-based approaches and then late-fused using standard algorithms in order to perform the final classification. We evaluate the proposed framework on two 2-view (aerial and ground) datasets from the literature, achieving state-of-the-art results. Our methodology, however, can be easily expanded for more complex scenarios with more than two views. The paper is structured as follows. Related works are presented in Section 2 while the proposed technique is introduced in Section 3. Section 4 presents the experimental protocol and Section 5 reports and discusses the obtained results. Finally, in Section 6 we conclude the paper and point at promising directions for future work.

Related Work
Although several methods have been proposed to tackle multi-view image classification [4,5], only a few works have investigated and conceived approaches to handle missing data [6,7,8,3,9,10].
Zhang et al. [6] proposed a feature-level completion method for the missing view of multi-view data. Technically, their approach first linearly maps multi-view data to a feature-isomorphic subspace, unfolding the shared information from different views. Then, features of this isomorphic space are used to train a model, which is responsible for retrieving features to represent the missing view and, consequently, completing the multi-view data. In [8], the authors proposed a multi-modal classification framework, called EmbraceNet, that is robust to missing data. This framework uses a multinomial distribution to select the most relevant features of each view. To make the model robust to the partial absence of data, the authors readjust the multinomial distribution so that features are selected only from the existing modalities. By doing this, they argue that the information lost with a missing modality can be covered by the other modalities. Finally, Srivastava et al. [3] proposed a two-stream network that extracts discriminative features from aerial and ground images and then combines them for the final classification. In order to make their approach robust to missing data, they used these discriminative features to create an embedding space (using Canonical Correlation Analysis (CCA) [11]), which is exploited to retrieve samples that would be similar to the missing data and thus complete the data.
More recently, Generative Adversarial Networks (GANs) [12] have gained popularity in the missing data completion field, due to their ability to generate synthetic samples. One of the first approaches to investigate the use of GANs for missing data completion was proposed by Cai et al. [7]. In this work, the authors proposed a conditional GAN [13] that takes data from one view as input and generates synthetic samples of the corresponding missing modality. These data (real and synthetic images) are then used as input to a (discriminator) network which is responsible for the final classification. Lee et al. [9] introduced a multi-modal framework, called CollaGAN, for handling missing data imputation. Precisely, this framework converts the missing data imputation problem into a multi-domain image-to-image translation task. In this way, a single GAN can successfully generate the missing data using the remaining (complete) data set. Another GAN-based work was proposed by Aversano et al. [10]. In this work, a multi-branch network jointly encodes data of different modalities/views, generating a common embedding space. Feature representations of this space are then used in the generation of synthetic data (using the generator network), thus performing data imputation and completing the problematic sample for further processing.
In this work, we propose a multi-view image classification framework that uses a retrieval Convolutional Neural Network (CNN) to handle missing data. This network was trained using deep metric learning/cross-view matching. Our framework is capable of recovering samples from a different domain just by using the data from the available view. Several differences may be pointed out between the proposed approach and the aforementioned works:
1. Differently from [13,9,10], the proposed technique does not generate synthetic data for missing samples. Compared to retrieval, the task of generating images for one point of view using as input a completely different view of the same object is considerably harder [14], and might introduce more bias.
2. Our framework handles missing data at the input level, instead of the feature level [6,8].
3. Although both [3] and our method use an embedding space to retrieve samples, our method can learn specific features that optimize the class distribution in the embedding space. This is because our model is trained with deep metric learning, instead of using classification features to optimize a CCA model [11].

Methodology
The proposed multi-view data classification approach, presented in Figure 2, can be split into two parts. In the first one (the retrieval part), a multi-view sample with missing data is processed using a retrieval network responsible for ranking the images (of an auxiliary database composed of scenes of several classes but from the same domain of the missing data) based on their potential to be paired with this input data. The original image and the generated ranking are then forwarded to the second part, i.e., the multi-view classification. In this step, retrieved images and original input are processed using their corresponding networks and then fused to produce the final classification. Observe that instead of pairing the input image with only the best retrieved image, we select the top-k images in order to alleviate any potential bias and improve generalization.
More details on each of those components, i.e., the retrieval and the classification parts, are presented in the next sections.

Retrieval
As introduced, the retrieval network is responsible for recovering images that could potentially be used as a pair for an input example with missing data.
Technically, such a model receives, as input, multi-view image pairs, and is trained using the weighted soft-margin triplet loss [15], in which the main objective is to pull these input pairs (commonly called anchor and positive samples) close together while pushing dissimilar data (i.e., negative examples) apart. Based on previous works [16,17,18], the negative instances are mined from the batch using the exhaustive mini-batch strategy [19], which proposes to use all other samples (excluding the current anchor and positive pairs) as negative examples. Due to this, each batch must have only one image pair per class, in order to ensure that samples from the same class will never be used as negative examples [20].
Formally, suppose the model receives, as input, a batch $B = \{b_1, b_2, \ldots, b_C\}$ composed of $C$ elements, one of each class, in which each element $b_i$ is actually a multi-view image pair. The network first extracts the features for each image pair and then uses these features to create a matrix of distances $\alpha$ between all images (Equation 1), in which the entry $\alpha_{i,j}$ holds the distance between the features of the anchor of pair $b_i$ and the features of the other-view image of pair $b_j$.
Given the matrix of distances $\alpha$, it is possible to easily calculate the distance between anchors and positive samples ($d_{ap} = \alpha_{i,i}$) and between anchors and negative examples ($d_{an} = \alpha_{i,j}, \forall\, i \neq j$), and then use those to optimize the model following the aforementioned weighted soft-margin triplet loss [15]:
$$\mathcal{L} = \ln\left(1 + e^{\gamma (d_{ap} - d_{an})}\right) \quad (2)$$
where $\gamma$ is a hyper-parameter that controls the loss convergence [15].
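To make the training objective more concrete, the following is a minimal sketch of the weighted soft-margin triplet loss with exhaustive in-batch negatives, assuming the form ln(1 + exp(γ(d_ap − d_an))) commonly attributed to [15]. The squared Euclidean distance, the averaging over triplets, and all function names are assumptions made for illustration; they are not taken from the authors' code.

```python
import math


def squared_euclidean(u, v):
    """Squared Euclidean distance between two feature vectors (assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def distance_matrix(anchor_feats, other_feats):
    """alpha[i][j]: distance between anchor i and the other-view image of pair j."""
    return [[squared_euclidean(a, o) for o in other_feats] for a in anchor_feats]


def softplus(x):
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_soft_margin_triplet_loss(alpha, gamma=10.0):
    """Average of ln(1 + exp(gamma * (d_ap - d_an))) over all in-batch triplets.

    The diagonal of alpha holds anchor-positive distances; every off-diagonal
    entry in row i is treated as a negative for anchor i (exhaustive mining).
    Averaging (rather than summing) the terms is an assumption of this sketch.
    """
    n = len(alpha)
    terms = []
    for i in range(n):
        d_ap = alpha[i][i]
        for j in range(n):
            if j != i:
                terms.append(softplus(gamma * (d_ap - alpha[i][j])))
    return sum(terms) / len(terms)
```

Since the model is optimized with each view acting as the anchor in turn, one plausible usage is to evaluate the loss on the aerial-to-ground distance matrix and on its transpose and add the two values.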
It is important to highlight two main aspects of the training procedure: (i) instead of using the exact pair of images as input, random pairs within the same class are employed in order to increase the robustness of the model to missing data (given that, during the testing phase, there will be no corresponding pair to the query image due to missing data); (ii) for each input image pair, the model is optimized considering one as an anchor and the other as a positive sample, and also vice versa. By doing this, we allow the model to be prepared for missing data from all possible domains.
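The batch construction described above (one pair per class, with the two views drawn at random within the class) can be sketched as follows. The dictionary layout and the function name are hypothetical, chosen only for illustration.

```python
import random


def sample_batch(aerial_by_class, ground_by_class, rng):
    """Build one training batch with exactly one (aerial, ground) pair per class.

    Both images of a pair are drawn independently within the class, so they
    need not depict the same physical object. Because every class appears
    once, all other pairs in the batch can safely serve as negatives.
    """
    batch = []
    for label in sorted(aerial_by_class):
        aerial = rng.choice(aerial_by_class[label])
        ground = rng.choice(ground_by_class[label])
        batch.append((label, aerial, ground))
    return batch
```

Passing an explicit `random.Random` instance keeps the sampling reproducible across runs.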
During inference, a multi-view sample with missing data (also called query) is paired with other images (from a previously established database composed of data from the missing domain) that could potentially fill this missing data gap. All pairs are then processed by the retrieval network and ranked based on their similarity (i.e., on their distance). All images and the ranking are then forwarded to the classification network to be further processed, as explained in the next section.
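The ranking step can be illustrated with a small sketch. It assumes that features for the query and for every database image have already been extracted by the retrieval network, and that similarity is measured with the squared Euclidean distance; both the metric and the function names are assumptions of this example.

```python
def rank_database(query_feat, database_feats):
    """Return database indices ordered from the closest to the farthest image."""
    dists = [
        sum((q - d) ** 2 for q, d in zip(query_feat, feat))
        for feat in database_feats
    ]
    return sorted(range(len(database_feats)), key=lambda i: dists[i])


def top_k(query_feat, database_feats, k):
    """Indices of the k database images most similar to the query."""
    return rank_database(query_feat, database_feats)[:k]
```

Because `sorted` is stable, database images at exactly the same distance keep their original order in the ranking.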

Classification
In the second step of the proposed methodology, the input image with missing data is finally classified. Towards this, first, the top-k most similar images are selected based on the aforementioned ranking. Such images are employed to fill the missing data gap of the original query image. The idea of using the top-k images is to alleviate any potential bias and improve generalization, given that using only one could produce under-representative multi-view pairs whereas using all images could generate instances composed of scenes from different classes. Query and top-k images are then processed using networks trained specifically for their domains. Predictions $\sigma$ for the top-k images are fused using the mean operation [21]:
$$\sigma_{f_2} = \frac{1}{k} \sum_{i=1}^{k} \sigma_i \quad (3)$$
where $\sigma_i$ denotes the prediction for the $i$-th retrieved image. Finally, the predictions for the query image ($\sigma_{f_1}$) and the merged predictions of the top-k scenes ($\sigma_{f_2}$) are late-fused using the product operation [21], thus producing the final classification:
$$\sigma = \sigma_{f_1} \odot \sigma_{f_2} \quad (4)$$
where $\odot$ denotes the element-wise product. Observe that both mean and product fusion strategies were selected based on previous works [22,21].
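As an illustration of the two fusion steps, the sketch below averages the class-probability vectors of the retrieved images, multiplies the result element-wise by the prediction for the query, and takes the highest-scoring class. Taking the arg-max as the final label and the function names are assumptions of this example.

```python
def mean_fusion(predictions):
    """Element-wise mean of a list of class-probability vectors."""
    k = len(predictions)
    return [sum(column) / k for column in zip(*predictions)]


def product_fusion(pred_a, pred_b):
    """Element-wise product of two class-probability vectors."""
    return [a * b for a, b in zip(pred_a, pred_b)]


def classify(query_pred, retrieved_preds):
    """Late-fuse the query prediction with the mean of the top-k predictions."""
    fused_retrieved = mean_fusion(retrieved_preds)
    scores = product_fusion(query_pred, fused_retrieved)
    label = max(range(len(scores)), key=scores.__getitem__)
    return label, scores
```

For instance, with a query prediction of [0.5, 0.4, 0.1] and retrieved predictions [0.2, 0.7, 0.1] and [0.4, 0.5, 0.1], the query alone would favour class 0, while the fused scores favour class 1.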

Implementation Details
The architecture of the retrieval network consists of a two-stream encoder, in which each encoder is actually a pre-trained SwAV [23] model with ResNet-50 [24]. Precisely, this architecture has no classification layers and is trained using the aforesaid weighted soft-margin triplet loss [15].
For the inference, instead of using the retrieval network to extract features for all database images for every input query (a costly and time-consuming process), we extract, save, and reuse the features of those database images in order to speed up the testing phase.
For the classification part, we evaluated different Convolutional Networks (VGG [25], DenseNet [26], and SKNet [27]).More details about this, as well as about the employed hyper-parameters, can be seen in Section 4.2.
Lastly, a few adjustments to the methodology can be made to apply our framework to scenarios with N views (with one of them missing). To maintain the use of the weighted soft-margin triplet loss and consider all the features from the available views, Equation 1 could use the average of these features to compute the distance against samples from the missing view. Also, the architecture of the retrieval network must be changed to an N-stream encoder to support N-view inputs. Finally, for the classification step, it is necessary to train N classifiers (one per view), and Equation 4 must be adjusted to late fuse the prediction scores of all N views (instead of 2).
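A rough sketch of the two N-view adjustments is given below: averaging the features of the available views before computing distances, and multiplying the prediction scores of all views. This is only one possible reading of the extension described above; the function names are hypothetical and the code requires Python 3.8 or later for `math.prod`.

```python
import math


def average_features(view_feats):
    """Element-wise average of the feature vectors of the available views."""
    n = len(view_feats)
    return [sum(column) / n for column in zip(*view_feats)]


def product_fusion_n(view_preds):
    """Element-wise product of the class-probability vectors of all views."""
    return [math.prod(column) for column in zip(*view_preds)]
```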

Experimental Setup
In this section, we describe the experimental setup used for the experiments. Section 4.1 presents the datasets whereas Section 4.2 describes the experimental protocol. Finally, baselines are described in Section 4.3.

Datasets
Two multi-view datasets with very distinct properties were used for the experiments in order to better evaluate the effectiveness of the proposed approach. The first dataset, called AiRound [21], is composed of 11,753 images divided into 11 classes: airport, bridge, church, forest, lake, park, river, skyscraper, stadium, statue, and tower. Some examples of these classes are presented in Figure 3. The second one, named CV-BrCT [21], comprises approximately 24k pairs of images unevenly split into 7 urban classes: apartment, house, industrial, parking lot, religious, school, and store. Samples of these classes are presented in Figure 4.
For both datasets, each multi-view sample is composed of a ground and an aerial perspective. Images for the former domain were collected from different sources (such as Google Images, Google Places, Google Street View) and have varying resolutions whereas scenes for the latter perspective were collected using Google Maps and have a fixed resolution of 500 × 500 pixels.

Experimental Protocol
For both datasets, we employed a 5-fold cross-validation protocol, in which 80% of the images are used for training, 10% for validation, and the remaining 10% for testing. Following this protocol, we simulated and evaluated two different missing data scenarios using the test set: one for aerial and one for ground.
Considering this, the retrieval network (described in Section 3.3) was trained using the following hyper-parameters: 200 epochs, batch size equal to the number of classes of the dataset (i.e., 11 for AiRound and 7 for CV-BrCT), γ (Equation 2) of 10, Adam [28] as optimizer, learning rate of 0.00001, and exponential decays of 0.9 and 0.999. Results related to this model are reported in terms of the mean Average Precision at K (mAP@K) [29], averaged over all 5 folds and reported with its corresponding standard deviation. Additionally, it is important to highlight that, during inference, this network used the validation set as the retrieval database, i.e., images of the validation set are retrieved and then paired (based on their similarity) with the test samples for the final classification.
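For reference, one common way of computing mAP@K is sketched below. It assumes that a retrieved image counts as relevant when it shares the class of the query, and it normalizes by the number of relevant images found in the top K; the exact variant used in [29] may differ, so this is an illustration rather than the evaluation code behind the reported numbers.

```python
def average_precision_at_k(relevance, k):
    """AP@K for one query, given a ranked list of booleans (True = relevant)."""
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0


def mean_average_precision_at_k(relevance_lists, k):
    """mAP@K: the mean of AP@K over all queries."""
    total = sum(average_precision_at_k(r, k) for r in relevance_lists)
    return total / len(relevance_lists)
```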
For the classification part, we evaluated three well-known architectures: VGG [25], DenseNet [26], and SKNet [27]. All networks were fine-tuned (from the ImageNet [30] dataset) for each of the domains (aerial and ground) using the following hyper-parameters: 200 epochs, early stopping with a patience of 20 epochs, batch size of 32, stochastic gradient descent as optimizer, a learning rate of 0.001, and momentum of 0.9. In this case, all obtained results are reported in terms of the average F1-Score and standard deviation among all 5 folds.

Baselines
Four techniques were considered as baselines for both datasets. The first baseline, referenced as "No Fusion", consists of using a single-view classification CNN evaluated on the data available in the test phase (without any completion). The idea of this baseline is to establish a lower bound for the other experiments. The second baseline, referenced hereafter as "Fully-Paired", performs the final classification assuming that the test set has no missing data (i.e., that it is fully paired). To do so, this baseline uses two classification CNNs to extract features from both aerial and ground domains, which are then fused (using the product fusion presented in Equation 4) to produce the final result. Analogously to the first baseline, which sets a lower bound, the idea here is to set an upper bound for further experiments.
The remaining baselines come from the literature. One of those is EmbraceNet [8], a multi-view framework that learns a multinomial distribution to select the most relevant features of each view. This distribution can be adjusted depending on the availability or absence of certain domains, thus being able to handle missing data. Based on this framework, we built a two-stream network in which the final layer is actually an EmbraceNet layer [8], capable of efficiently dealing with missing data. This network was trained using the same set of hyper-parameters used for the classification models (Section 4.2).
The last baseline is the Canonical Correlation Analysis (CCA) [3]. Such an approach learns matrices that project the features from different views into a common latent space, bringing closer the varying perspectives of the same object [3]. Using such projection matrices (learned from the training data), we can project the available view of the test set into the common latent space and then retrieve the closest scene to finally fill the missing data gap. Following the guidelines introduced in the original work [3], we first project the features extracted from the last layer before the classification using a Principal Component Analysis (PCA), and then use them as input for the CCA [3].

Results
In this section, we present and discuss the obtained results. Section 5.1 presents the results related to the retrieval part whereas Section 5.2 evaluates and compares different configurations for the classification part of the proposed framework. Finally, Section 5.3 compares the proposed framework with state-of-the-art baselines.

Retrieval Analysis
In this section, we analyze the retrieval network, a main component of the proposed framework. Precisely, Figure 5 reports the retrieval results, in terms of mAP@K [29], for both AiRound and CV-BrCT datasets.
Analyzing the results, it is possible to observe that the retrieval network tends to produce better outcomes when using aerial data as input query (i.e., when the ground view is missing). This may be justified by the fact that most aerial images tend to provide more context information than ground scenes to the retrieval model, which in turn exploits such useful data to recover similar images. Another important aspect to discuss is the fact that, in general, the top-1 image does not produce the best results in terms of mAP. This corroborates our initial analysis about the potential of using top-k (instead of top-1) images to fill the missing data gap and, consequently, produce better classification results. A more detailed discussion about the use of top-k images to fill this missing data gap is presented in the next section.

Classification Analysis
In this section, we analyze the impact of the different classification networks (VGG [25], DenseNet [26], and SKNet [27]) and investigate the influence of the ranking size (i.e., top-k) on the final performance of the framework considering two distinct missing data scenarios (described in Section 4.2), i.e., one for aerial and one for ground. Obtained results for the AiRound and CV-BrCT datasets are presented in Tables 1 and 2, respectively. For both datasets and assessed scenarios, the strategy of combining information from multiple images to fill the missing data gap produced better outcomes than using just the top retrieved image (an approach commonly exploited in the literature [6,3]). This directly corroborates our initial analysis about the potential of combining information from the top-k images to fill the missing data gap. Aside from this, for the AiRound dataset, the best result for the aerial missing scenario was produced by the SKNet model [27] whereas the best outcome for the ground missing scenario was yielded by the DenseNet network [26]. In both cases, the ranking size of 100 (i.e., the top-100 images) yielded the best outcomes. For the CV-BrCT dataset, all assessed models achieved very similar results when using a ranking size greater than or equal to 3 for the aerial missing scenario, and greater than or equal to 2 for the ground missing scenario. Given this, the simplest model, i.e., the VGG network [25], with ranking size 100 was selected and used for further experiments using this dataset.

State-of-the-art Comparison
This section compares and discusses the results obtained by the proposed framework and the state-of-the-art baselines. Observe that: (i) based on the analyses carried out in the previous sections, only the best results for each dataset and scenario are presented; (ii) the baselines were conceived using the same network selected for the classification part of the proposed framework (precisely, for the AiRound dataset, baselines are based on the SKNet model [27] for the aerial missing scenario and on the DenseNet [26] for the ground missing scenario, whereas for the CV-BrCT dataset all baselines are based on the VGG network [25]); and (iii) all results presented here were verified by a fold-by-fold paired t-test with a confidence level of 95%.
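For readers unfamiliar with the fold-by-fold paired t-test, the sketch below computes the paired t statistic from per-fold scores of two methods. With 5 folds there are 4 degrees of freedom, for which the two-sided critical value at the 95% level is approximately 2.776. Whether the authors used a one- or two-sided test is not stated, so the two-sided threshold is an assumption, as are the function names.

```python
import math
import statistics

# Two-sided critical value of Student's t for 4 degrees of freedom (5 folds).
T_CRITICAL_DF4_95 = 2.776


def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-fold differences between two methods.

    Raises ZeroDivisionError when all differences are identical, since the
    sample standard deviation is then zero.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (std_diff / math.sqrt(len(diffs)))


def is_significant(scores_a, scores_b, critical=T_CRITICAL_DF4_95):
    """True when the absolute t statistic exceeds the critical value."""
    return abs(paired_t_statistic(scores_a, scores_b)) > critical
```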

Table 3 (AiRound dataset)
Method                       Available Data             F1 Score
No Fusion (Lower bound)      Ground (Aerial Missing)    0.76 ± 0.00
EmbraceNet [8]               Ground (Aerial Missing)    0.69 ± 0.03
CCA [3]                      Ground (Aerial Missing)    0.75 ± 0.02
Ours                         Ground (Aerial Missing)    0.79 ± 0.00
No Fusion (Lower bound)      Aerial (Ground Missing)    0.83 ± 0.00
EmbraceNet [8]               Aerial (Ground Missing)    0.83 ± 0.00
CCA [3]                      Aerial (Ground Missing)    0.83 ± 0.01
Ours                         Aerial (Ground Missing)    0.85 ± 0.01
Fully-Paired (Upper bound)   Aerial + Ground            0.91 ± 0.01

Tables 3 and 4 present the results for the AiRound and CV-BrCT datasets, respectively. Overall, the proposed approach outperformed all baselines except, as expected, the upper bound one. In fact, for the CV-BrCT dataset, the difference between the results achieved by the proposed framework and the fully-paired (upper bound) baseline for the ground missing scenario is almost irrelevant. However, for all other scenarios and datasets, this difference is considerable, which shows that there is still room for improvement.

Table 4 (CV-BrCT dataset)
Method                       Available Data             F1 Score
No Fusion (Lower bound)      Ground (Aerial Missing)    0.72 ± 0.01
EmbraceNet [8]               Ground (Aerial Missing)    0.53 ± 0.04
CCA [3]                      Ground (Aerial Missing)    0.72 ± 0.02
Ours                         Ground (Aerial Missing)    0.73 ± 0.02
No Fusion (Lower bound)      Aerial (Ground Missing)    0.86 ± 0.01
EmbraceNet [8]               Aerial (Ground Missing)    0.77 ± 0.03
CCA [3]                      Aerial (Ground Missing)    0.86 ± 0.01
Ours                         Aerial (Ground Missing)    0.88 ± 0.01
Fully-Paired (Upper bound)   Aerial + Ground            0.89 ± 0.02

Aside from this, it is interesting to observe that, for both datasets, the obtained results for the aerial missing scenario are worse than the outcomes for the ground missing scenario. As previously explained, this may be justified by the fact that most aerial images tend to provide more context information (than ground scenes), thus assisting in the classification process and, consequently, yielding better results.

Conclusions
In this paper, we propose a novel framework to handle multi-view image classification with missing data.The proposed approach is composed of two main parts: (i) a retrieval one, responsible for recovering similar samples that can be used to fill the missing data gap of the input query image; and (ii) a classification one, that fuses information extracted from both the input image and the top-k retrieved scenes in order to perform the final classification.
Experiments were conducted using two multi-view aerial-ground datasets (AiRound and CV-BrCT) and considering two distinct scenarios: one in which aerial images are considered absent and another where ground scenes are considered missing. Results have shown that the proposed technique is effective and robust. Precisely, it achieved state-of-the-art results in both datasets, outperforming all baselines except for the fully-paired (upper bound) one. Additionally, experimental outcomes showed that the strategy of combining information from multiple (i.e., top-k) images to fill the missing data gap (instead of using just the top retrieved image, as commonly employed in the literature [6,3]) is remarkably effective.
As future work, we intend to evaluate the proposed approach using other datasets with more (than two) views per instance.We also would like to assess other retrieval methods as well as other classification networks.

Figure 1: Example of the impact of missing views in the multi-view image classification. A missing view actually represents an instance that cannot be used for training, i.e., it represents a severe loss of available information that could be used for the learning process.

Figure 2: Pipeline of the proposed approach applied in a two-view scenario, where aerial images are missing. First, a retrieval network is employed to rank images (of an auxiliary database composed of scenes from the same domain of the missing data) based on their potential to be paired with the input sample with missing data (Step 1). Then, the top-k retrieved images and the input data are processed and fused, generating the final classification (Step 2). Following a similar logic, this methodology can be applied to different scenarios.

Figure 5: Results, in terms of mean Average Precision (mAP) and ranking size (K), of the retrieval network for the aerial and ground missing data scenarios in AiRound and CV-BrCT datasets. The shaded areas represent the standard deviation across the folds.

Table 1: Results achieved by the proposed method for the AiRound dataset.

Table 2: Results achieved by the proposed method for the CV-BrCT dataset.

Table 3: Results of the proposed method and baselines for the AiRound dataset.

Table 4: Results of the proposed method and baselines for the CV-BrCT dataset.