Introduction
Thanks to the technology of delivering multiple satellites in a single rocket launch, thousands of Earth observation satellites have been sent into space in recent years. These satellites capture UHD aerial photos (typical resolutions over
In computer vision, dozens of shallow/deep visual categorization/annotation models have been designed to describe aerial photos with regular resolutions (typically
In practice, a high-resolution aerial photo typically contains many objects distributed in various spatial configurations, and accurately uncovering the underlying semantic features is nontrivial. The challenges include: i) computationally and spatially modeling the complicated layout of the ground objects, and ii) formulating a deep model that converts the spatially modeled features into fixed-length visual features. Moreover, spatially transforming the different layouts inside each UHD aerial image into a form usable by a traditional classification model [8] is a genuine difficulty;
The large number of objects within each UHD aerial photo makes it labor-intensive to accurately annotate all the ground objects. Thanks to recent progress in weakly-labeled feature engineering, only image-level labels are required for calculating region-level semantics. Therefore, to exploit the region-level visual semantics within a UHD aerial photo, it is necessary to uncover the corresponding weak and user-defined labels. Noticeably, such user-defined labels are subjectively defined in practice and are sometimes noisy. Establishing a noise-tolerant refinement mechanism is difficult in practice;
To build a powerful UHD aerial photo categorization framework, we have to precisely model the inherent sample distributions on the manifold. In practice, however, because of the contaminated user-defined labels, the previously calculated sample distribution is usually imperfect. What we need is a mathematical model that adaptively learns the optimal sample interaction while refining the labels. Formulating such a solvable and learnable framework requires domain experience.
In this work, we propose a deep multi-clue matrix factorization (DMCMF) framework for multi-label UHD aerial photo categorization. The core technique is an enhanced MF that hierarchically converts the graphlets within a UHD aerial photo into the corresponding binary features. Herein, the potentially noisy labels can be theoretically abandoned and the data graph can be progressively optimized. The entire pipeline of the proposed method is displayed in Fig. 1. In detail, for massive-scale UHD aerial photos, each of which carries potentially noisy user-defined labels, a succinct set of object-guided image regions is extracted in the first place. Thereafter, a rich set of spatially neighboring object patches is collected to build different graphlets, which accurately encode the various topologies inside multiple UHD aerial images. Based on this, we build a matrix factorization (MF) algorithm that is robust to label noise. The MF effectively converts each graphlet into the corresponding binary feature, so that two graphlets can be compared rapidly and mathematically. Our proposed MF optimally encodes four descriptive visual clues. By leveraging the binary vectors calculated from different graphlets, we convert the graphlets within a UHD aerial image into a kernel-induced feature vector. In this way, an effective classifier is learned to classify aerial images into the corresponding classes. Extensive quantitative comparisons with well-known deep classification models indicate the competitiveness of the learned classification model.
The pipeline of our proposed noise-robust binary matrix factorization (MF) framework for UHD aerial photo categorization.
The main contributions can be summarized as follows: 1) a million-scale set of partially mislabeled UHD aerial photos collected from 100 metropolises for validating the superiority of our method; 2) a DMCMF that collaboratively and seamlessly incorporates four clues to compute the hash codes of each graphlet; and 3) a novel UHD aerial photo categorization model that avoids noisy image-level labels and adaptively updates the data distribution.
The remainder of this article is organized as follows. In Sec. II we review the published work closely related to ours. Sec. III delineates the proposed pipeline, including graphlet construction, our enhanced MF, and the kernel-induced feature learning. Sec. IV experimentally validates the effectiveness of our method. The last section concludes.
Related Work
Dozens of computational visual models have been developed to analyze aerial photos.1 To semantically model the entire image, Zhang et al. [9] constructed a novel topological feature to model the inter-region connections inside each aerial photo, and a kernel-induced vector is calculated as the image representation for categorization. Xia et al. [10] formulated a weak learning model that semantically labels HR aerial photos at image-level. Akar [11] carefully combines a random forest with an object-level feature extractor to classify remote sensing images. Sameen et al. [12] developed a hierarchical deep architecture to calculate the multiple labels of HR aerial photos describing many downtown areas. In [13], Cheng et al. utilized a pre-trained deep CNN to classify high-resolution remote sensing images, where a domain-specific scenic picture set is leveraged to fine-tune the deep architecture. In [14], a cross-modality learning framework is proposed to collaboratively learn five deep models for categorizing aerial images, wherein pixel-level and spatial-level features are exploited complementarily. The authors of [15] designed a novel inter-attentional algorithm to learn the weights of aerial image features both horizontally and vertically. In [16], Bazi et al. formulated a vision transformer for aerial image classification, wherein the long-term contextual dependencies among regions can be intrinsically encoded.
For region-level modeling, Wang et al. [17] designed a hierarchical deep architecture for discovering attractive objects at different scales. In [18], a focal loss deep architecture is proposed that optimally discovers vehicles from aerial images. In [19], researchers developed a geographic object detection model for remote sensing images by intelligently extracting intersections as well as streets. In [20], Yu et al. integrated feature enhancement and soft label assignment into an anchor-independent object detector for aerial images. In [21], Wang et al. proposed a deep rotation-invariant detector that effectively estimates the angles of multi-scale objects inside aerial images. In [22], Chalavadi et al. proposed a parallel deep model called mSODANet that hierarchically learns contextual features from multi-scale and multi-FoV (field-of-view) ground objects. Notably, different from the above methods, our approach is bionically inspired and accurately mimics human gaze behavior.
Our Proposed Method
A. Topological Feature Engineering
Nowadays, in each UHD aerial image, we can observe tens of multi-scale ground objects. According to psychological progress [24] over the past decade, human observers attend to the foreground objects that are visually or semantically prominent when perceiving the world. Specifically, when humans try to understand a UHD aerial photo, their visual and cognitive subsystems first perceive the visually or semantically salient ground objects, e.g., an aircraft and its components, while the background regions are nearly neglected. Obviously, it is necessary to encode this perceptual experience when building a UHD aerial image classification pipeline. Herein, rapid object patch generation as well as manifold-guided active feature selection are adopted to obtain the salient image patches describing different ground objects.
In our implementation, the well-known BING [25] descriptor is deployed to capture different ground objects because of the following advantages: 1) it achieves a sufficiently high object discovery precision and speed; 2) it obtains multiple highly descriptive and representative object-level patches that effectively simulate how humans perceive different UHD aerial images; and 3) it can be generalized to new UHD aerial photo classes, so the trained categorization algorithm can be transferred to different data sets. Noticeably, even with BING, we still observe too many object-level patches within the UHD aerial images. Actually, we observe that during human visual perception, typically < 10 objects are perceived in each UHD aerial image. To mimic this, a powerful active learning [26] algorithm is utilized to select
Based on these
An example elaborating pairwise object patches that are spatially adjacent. Herein, the red object patch is spatially adjacent to the green one and the blue one, respectively. The coordinate denotes the position of each object patch in the multi-layer spatial pyramid.
Based on graph theory, the inherent patches together with their spatial distribution collaboratively determine the visual appearance. In our implementation, a graphlet can be naturally represented by a matrix
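As a concrete illustration, the graphlet construction described above can be sketched in Python. The patch descriptors, the adjacency criterion (a distance threshold between patch centers), and the function name are our assumptions, since the section does not fix these details.

```python
import numpy as np

def build_graphlet(patch_features, centers, radius):
    """Sketch: pair spatially adjacent object patches into a graphlet.

    patch_features : (n, d) array of per-patch descriptors (hypothetical)
    centers        : (n, 2) array of patch center coordinates
    radius         : two patches are linked when their centers are
                     closer than this threshold (assumed criterion)
    """
    n = len(centers)
    adj = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(centers[i] - centers[j]) < radius:
                adj[i, j] = adj[j, i] = 1   # undirected spatial adjacency
    # A graphlet is represented by the adjacency matrix paired with
    # the stacked patch features.
    return adj, np.asarray(patch_features)
```

Pairing the adjacency matrix with the patch features keeps both of the appearance and topology cues available to the downstream factorization.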
B. Our DMCMF
In our work, we have to efficiently and effectively calculate the distance between graphlets from two UHD aerial images whose image labels might be contaminated. Herein, we formulate a multi-component MF technique that optimally handles noise from the image labels. Noticeably, our MF preserves the highly informative integer feature encoded in the binary matrix. In theory, the above operation can be represented as:\begin{equation*} \min _{\mathbf {Q},\mathbf {R}} \mathcal {H}(\mathbf {U},\mathbf {QR}^{T})+\Delta (\mathbf {R},\mathbf {Q}), s.t., \mathbf {Q}\in \{-1,1\}, \tag{1}\end{equation*}
\begin{align*} &\min _{\mathbf {M},\mathbf {Q},\mathbf {R}} \mathcal {H}(\mathbf {M},\mathbf {QR}^{T})+{\mathcal {H}}_{l}(\mathbf {M},\mathbf {U})+\Theta (\mathbf {R},\mathbf {Q}), \\ & s.t. \mathbf {M}\in \{-1,1\}, \mathbf {Q}\in \{-1,1\}, \tag{2}\end{align*}
In the graphlet hashing stage, we recognize the importance of maintaining the local sample distribution [26], e.g., the spatial relationships among spatially neighboring graphlets in the feature space. Meanwhile, we can derive the hash function accordingly, which makes pairwise graphlet comparison highly scalable. Such a hash function is employed to calculate the binary hash codes, that is, \begin{align*} &\min _{\mathbf {I},\mathbf {B},g} \beta \sum _{i=1}^{n} \mathcal {H}(\mathbf {i}^{i},g(\mathbf {y}_{i})\mathbf {B})+\frac {\delta }{2}\sum _{i=1}^{n}\sum _{j=1}^{n} \mathbf {N}_{ij}||\mathbf {i}^{i}-\mathbf {i}^{j}||^{2}, \\ & s.t. \mathbf {I}\in \{-1,1\}^{n\times M}, \tag{3}\end{align*}
\begin{align*} &\min _{\mathbf {I},\mathbf {B},g} \beta \mathcal {H}(\mathbf {I},g(\mathbf {Y})\mathbf {B})+\delta \text { tr}(\mathbf {I}^{T}\mathbf {LI}), \\ & s.t. \mathbf {I}\in \{-1,1\}^{n\times M}, \tag{4}\end{align*}
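The equivalence between the pairwise smoothness term in (3) and the compact trace form in (4) follows from the standard graph Laplacian identity $\sum_{ij}\mathbf{N}_{ij}\|\mathbf{i}^{i}-\mathbf{i}^{j}\|^{2} = 2\,\mathrm{tr}(\mathbf{I}^{T}\mathbf{L}\mathbf{I})$ with $\mathbf{L}=\mathbf{D}-\mathbf{N}$. A quick numeric check with arbitrary toy matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 6, 4
I = rng.choice([-1.0, 1.0], size=(n, M))   # toy binary codes, one row per graphlet
N = rng.random((n, n))
N = (N + N.T) / 2                          # symmetric affinity graph
np.fill_diagonal(N, 0.0)
L = np.diag(N.sum(axis=1)) - N             # graph Laplacian L = D - N

# (1/2) * sum_ij N_ij ||i^i - i^j||^2  equals  tr(I^T L I)
pairwise = 0.5 * sum(N[i, j] * np.sum((I[i] - I[j]) ** 2)
                     for i in range(n) for j in range(n))
trace = np.trace(I.T @ L @ I)
assert np.isclose(pairwise, trace)
```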
To obtain sufficiently compatible MF and hash codes, it is natural to assume that the inherent geometry revealed by our designed MF and that of the hashing share a feature space. In this way, the latent space constructed by our designed MF is the same as the aforementioned Hamming space, and the potential semantics uncovered by the formulated MF with noise-free image labels is utilized to enhance the hashing model. Theoretically, we can set \begin{align*} &\min _{\mathbf {M},\mathbf {R},\mathbf {I},\mathbf {B},g} \mathcal {H}(\mathbf {M},\mathbf {IR}^{T})+{\mathcal {H}}_{l}(\mathbf {M},\mathbf {U})+\beta \mathcal {H}(\mathbf {I},g(\mathbf {Y})\mathbf {B}) \\ &\qquad \quad +\frac {\delta }{2}\text {tr}(\mathbf {I}^{T}\mathbf {LI}), \\ & s.t. \mathbf {M}\in \{-1,1\}^{n\times S}, \mathbf {I}\in \{-1,1\}^{n\times M}, \tag{5}\end{align*}
Notably, the optimization task formulated above aims to drive the hash function as well as the binary codes by leveraging the pre-constructed sample graph, which is built upon the potentially contaminated image labels. This pre-constructed sample graph remains intact during the learning stage, which is practically sub-optimal. Actually, we have to progressively adjust the sample graph during hash code learning. To this end, a sample graph updating module is also integrated into the learning model. Mathematically, to refine the possibly contaminated image labels, we expect the sample graph \begin{align*} &\min _{\mathbf {M},\mathbf {R},\mathbf {I},\mathbf {B},\mathbf {N},g} \mathcal {H}(\mathbf {M},\mathbf {IR}^{T})+{\mathcal {H}}_{l}(\mathbf {M},\mathbf {U})+\beta \mathcal {H}(\mathbf {N},\mathbf {N}_{0}) \\ &\qquad \quad +\gamma \mathcal {H}(\mathbf {I},g(\mathbf {Y})\mathbf {B})+\frac {\delta }{2}\text {tr}(\mathbf {I}^{T}\mathbf {LI})+\Theta (\mathbf {R},\mathbf {B}), \\ &\quad s.t. \mathbf {M}\in \{-1,1\}^{n\times S},\mathbf {I}\in \{-1,1\}^{n\times M},\sum \nolimits _{j=1}^{n} \mathbf {N}_{ij}=1, \tag{6}\end{align*}
To tackle (6), it is necessary to explicitly define \begin{align*} &\min _{\mathbf {M},\mathbf {R},\mathbf {I},\mathbf {B},\mathbf {N}} \frac {1}{2}||\mathbf {M}-\mathbf {IR}^{T}||_{F}^{2}+\nu ||\mathbf {M}-\mathbf {U}||_{1}+\frac {\beta }{2}||\mathbf {N}-\mathbf {N}_{0}||_{F}^{2} \\ &\qquad \qquad +\frac {\gamma }{2}||\mathbf {I}-g(\mathbf {Y})\mathbf {B}||_{F}^{2}+\frac {\delta }{2}\text {tr}(\mathbf {I}^{T}\mathbf {LI})+\frac {\mu }{2}||\mathbf {R}||_{F}^{2} \\ &\qquad \qquad +\frac {\theta }{2}||\mathbf {B}||_{F}^{2} \\ &\quad s.t. \mathbf {M}\in \{-1,1\}^{n\times S},\mathbf {I}\in \{-1,1\}^{n\times M},\sum \nolimits _{j=1}^{n} \mathbf {N}_{ij}=1, \tag{7}\end{align*}
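As a sanity-check sketch (not the authors' solver), the relaxed objective in (7) can be evaluated term by term. The embedding `g`, the matrix shapes, and the function name are assumptions for illustration:

```python
import numpy as np

def objective7(M, I, R, B, N, N0, Y, U, L, g,
               nu, beta, gamma, delta, mu, theta):
    """Evaluate the relaxed objective in (7) term by term (sketch)."""
    return (0.5 * np.linalg.norm(M - I @ R.T, 'fro') ** 2             # MF reconstruction
            + nu * np.abs(M - U).sum()                                # noise-tolerant l1 term
            + 0.5 * beta * np.linalg.norm(N - N0, 'fro') ** 2         # graph refinement
            + 0.5 * gamma * np.linalg.norm(I - g(Y) @ B, 'fro') ** 2  # hash-function fit
            + 0.5 * delta * np.trace(I.T @ L @ I)                     # Laplacian smoothness
            + 0.5 * mu * np.linalg.norm(R, 'fro') ** 2                # regularizer on R
            + 0.5 * theta * np.linalg.norm(B, 'fro') ** 2)            # regularizer on B
```

Monitoring this value across alternating updates is a simple way to verify that a solver for (7) is actually descending.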
C. Kernel Machine for Multi-Label UHD Aerial Photo Categorization
As aforementioned, many graphlets are extracted from each UHD aerial photo and are subsequently converted into binary hash codes. We observe that: 1) the graphlet numbers from different UHD aerial photos are generally inconsistent; and 2) the dimensionalities of binary hash codes calculated from graphlets with different vertices are different. Therefore, we cannot directly send them to a standard support vector machine (SVM) for visual categorization. In our implementation, a kernel-based quantization method is deployed to calculate the image-level feature, i.e., a fixed-length feature vector corresponding to a UHD aerial photo. For each UHD aerial photo, the BING [25]-based object patches are extracted to build graphlets, which are subsequently converted into binary hash codes using our DMCMF. Finally, graphlets within each UHD aerial photo are accumulated into vector \begin{equation*} \mathbf {u}_{i}\propto \text { exp}\left({-\frac {1}{AA'}\sum \nolimits _{a=1}^{A}\sum \nolimits _{b=1}^{A'}dist(\mathbf {h}_{a},\mathbf {h}_{b})}\right), \tag{8}\end{equation*}
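A minimal sketch of a single entry of the quantization in (8); we assume Hamming distance for $dist(\cdot,\cdot)$ between binary codes, and the function name is ours:

```python
import numpy as np

def quantize_entry(codes_a, codes_b):
    """Entry u_i per (8): exponentiated negative mean pairwise distance
    between two sets of binary hash codes (Hamming distance assumed)."""
    A, Ap = len(codes_a), len(codes_b)
    total = sum(np.count_nonzero(ha != hb)   # Hamming distance dist(h_a, h_b)
                for ha in codes_a for hb in codes_b)
    return np.exp(-total / (A * Ap))
```

Identical code sets yield the maximal value 1, and the entry decays smoothly as the average Hamming distance grows, which is what makes the resulting vector usable inside a kernel.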
By leveraging the vector quantized using (8), we train an SVM classifier for multi-label classification. In theory, to train an SVM distinguishing UHD aerial photos belonging to two different classes, the SVM can be mathematically represented as follows:\begin{align*} &&\hspace {-20pt}\max _{\mathbf {d}\in \mathbb {R}^{N_{ab}}} \kappa (\mathbf {d})=\sum \nolimits _{i=1}^{N_{ab}} d_{i}-\frac {1}{2}\sum \nolimits _{i=1}^{N_{ab}} \sum \nolimits _{j=1}^{N_{ab}} d_{i}d_{j}t_{i}t_{j}k(\mathbf {u}_{i},\mathbf {u}_{j}) \\ &&\hspace {-20pt}s.t. 0\leq d_{i}\leq D, \sum \nolimits _{i=1}^{N_{ab}} d_{i}t_{i}=0, \tag{9}\end{align*}
By calculating a quantized vector \begin{equation*} \text { sgn}\left({\sum \nolimits _{i=1}^{N_{ab}} d_{i}t_{i}k(\mathbf {u}_{i},\mathbf {u}^{*})+\eta }\right), \tag{10}\end{equation*}
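The decision rule in (10) can be sketched directly; the dual weights $d_i$, labels $t_i$, support vectors, and bias $\eta$ are assumed to come from solving (9), and the helper name is ours:

```python
import numpy as np

def svm_predict(u_star, support, duals, labels, eta, k=np.dot):
    """Decision rule (10): sign of the kernelized score of a test
    vector u_star; support vectors, dual weights d_i, labels t_i,
    and bias eta are assumed to come from solving (9)."""
    score = sum(d * t * k(u, u_star)
                for d, t, u in zip(duals, labels, support)) + eta
    return 1 if score >= 0 else -1
```

With a linear kernel `k=np.dot` this reduces to a sign test against a hyperplane in the quantized-vector space.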
Experimental Evaluations
Herein, we evaluate the performance of our UHD aerial photo categorization using three experiments. We first introduce our self-compiled data set, which includes 2.3 million UHD aerial images crawled from 100 well-known metropolises in different countries. Based on this, our algorithm is compared with 17 carefully-designed visual categorization models from two perspectives: accuracy and stability. Meanwhile, we carefully explain the performance advantage of our classification model. Then, we carefully evaluate each key module of our UHD aerial image categorization. Lastly, we report the categorization accuracy of our method under different parameters. Based on this, the optimal parameter settings are suggested.
After collecting the million-scale UHD aerial photos, we have to annotate them to obtain the corresponding image-level labels. Herein, 82 volunteers3 first manually annotate 14.7% of the UHD aerial photos in each metropolitan city, wherein a total of 47 different image-level labels were utilized. Afterward, we train a multi-label SVM and employ it to annotate the image-level labels of the remaining UHD aerial images. Then, the same 82 volunteers manually correct the labels calculated by the SVM. It is noticeable that some image-level labels are associated with an intolerably small number of UHD aerial photos, which makes it infeasible to train a generalizable categorization model for these labels. In our implementation, if the number of UHD aerial photos corresponding to an image-level label is smaller than 200,000, then we abandon this label. In this way, we finally obtain 18 different image-level labels. Thereafter, we notice that 99.973% of the UHD aerial photos have fewer than four image-level labels, while the remaining few UHD aerial photos have larger numbers of image-level labels (from five to 13). These UHD aerial photos usually contain a rich set of small regions (
In retrospect, one key advantage of our method is to robustly learn a categorization model from noisy image-level labels. To acquire the noisy labels for experimentation, for each category, we randomly use 60% of the UHD aerial photos to construct a training set. Based on this, we learn a multi-label categorization model, which is further leveraged to calculate the labels of the entire set of UHD aerial photos. In total, 11.3% of the UHD aerial photos are mislabeled. They are combined with the correctly labeled ones to constitute our data set.
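The label-pruning protocol described above (dropping image-level labels attached to fewer than 200,000 photos) can be sketched as follows; the data layout and the function name are our assumptions:

```python
from collections import Counter

def prune_rare_labels(photo_labels, min_count=200_000):
    """Keep only image-level labels attached to at least `min_count`
    photos, and strip the dropped labels from every photo's label set."""
    counts = Counter(l for labels in photo_labels for l in set(labels))
    kept = {l for l, c in counts.items() if c >= min_count}
    return kept, [[l for l in labels if l in kept] for labels in photo_labels]
```

Applied to the full annotation set with the 200,000 threshold, this is the step that reduces the initial 47 labels to the 18 retained ones.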
We observe that each UHD aerial photo in our data set typically takes up 200MB of storage space. Therefore, our 2.3 million UHD aerial photos require a total of 460TB of storage space. To optimally store such million-scale UHD aerial photos with a fast I/O interface, we employ the Supermicro server solutions.4 More specifically, we adopt the 4U double-sided super storage platform. The platform is installed with 36 Toshiba HDD drives, each of which has a 20TB storage space. In total, the entire storage space of our platform is 720TB and it works in RAID 0 mode. Based on this, the average sequential data reading and writing speeds are 1467MB/s and 862MB/s, respectively, on our storage platform. That means that, on average, it takes 0.137s and 0.232s to load and update each UHD aerial photo, respectively.
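The I/O timings above follow directly from the photo size and the measured throughputs; a quick back-of-the-envelope check:

```python
photo_mb = 200                        # typical UHD photo size (MB)
read_mb_s, write_mb_s = 1467, 862     # measured sequential throughputs (MB/s)

load_s = photo_mb / read_mb_s         # ~0.14 s to load one photo
save_s = photo_mb / write_mb_s        # ~0.23 s to update one photo
total_tb = 2.3e6 * photo_mb / 1e6     # 2.3M photos -> 460 TB in total
```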
1) Accuracy Comparison
In this section, we evaluate our UHD aerial photo categorization framework by comparing its effectiveness and efficiency with a rich set of baseline recognition algorithms. We first test our algorithm by comparing it with deep aerial image classification models. Thereafter, we employ state-of-the-art deep generic visual categorization algorithms for comparison.
First of all, we compare our method with seven deep visual categorization models [31], [32], [33], [34], [35], [36], [37] that intrinsically incorporate some prior knowledge of different categories of aerial photos. We notice that the source codes of [31], [32], [35], and [36] are publicly available. Based on this, we conduct a comparative study wherein the parameters are set to their defaults. For [33], [34], and [37], we re-implement them since the source codes are unavailable. We have tried our best to make the re-implemented categorization models perform similarly to the results reported in their publications.
Meanwhile, many recent deep generic visual recognition models perform impressively on categorizing aerial images. Herein, we first compare our method with ten deep generic object classification algorithms. Moreover, since UHD aerial photo categorization can be considered a sub-topic of scenery classification, we further conduct a comparative study between our method and three recently published scene classification models [38], [39], [40]. For the categorization models implemented by us, the experimental setups can be summarized as follows. For [33], we utilize ResDep-128 [41] as the backbone, which is further adapted into a multi-label variant. Apart from the fully-connected layer (whose unit number is set to 19), the remaining deep layers are fixed by the above ResDep-128 [42]. ResNet-108 [41] is employed as the backbone and stochastic gradient descent optimizes the entire network. The learning rate and the decay are fixed to 0.001 and 0.05, respectively. The network loss is the mean squared error. For [38], we retrain the object bank [43] on our refined 18 UHD aerial photo categories, wherein the average-pooling strategy is applied. We employ liblinear as the solver for the linear classifier, wherein 7-fold cross-validation is applied.
For the above 18 compared object/scene recognition algorithms, we repeatedly test each model 20 times. Accordingly, the averaged accuracies are displayed in Table 1. To quantify the stability of these categorization models, we report their standard errors simultaneously. We observe that the per-category standard errors produced by our method are significantly and consistently lower than those of its competitors, which demonstrates that our method is the most stable. In summary, the following conclusions can be made:
Our method outperforms the other seven aerial photo categorization models remarkably due to three reasons. First, these compared methods typically characterize low/medium resolution aerial photos. To facilitate deep model training, they generally resize the original aerial photo to a fixed and much smaller size (e.g., $224\times 224$) for the subsequent deep modeling. This operation is detrimental to learning an effective UHD aerial photo categorization model since the tiny but discriminative visual details are lost. Second, except for our method, none of the seven counterparts can implicitly correct the noisy image-level labels, which inevitably hurts categorization model training. Third, only our method uses graphlets to explicitly capture the complicated spatial layouts of each UHD aerial photo. They are further incorporated by a deep hashing algorithm to calculate the discriminative image kernel. Comparatively, the seven counterparts only globally/locally characterize each UHD aerial photo, wherein the informative spatial layouts among multiple aerial photo regions are neglected.
The generic object recognition algorithms perform worse than our method for three reasons. First, these generic recognition models generally handle medium-sized images typically containing under ten million pixels. They can hardly discover the tiny but discriminative regions among the hundreds of object components inside a UHD aerial photo with over 100 million pixels. This is particularly problematic when the image-level labels are contaminated. Second, our method can conveniently incorporate prior knowledge of the UHD aerial photo set, e.g., the maximum graphlet size and the category-specific object patches. In contrast, the generic object recognition models cannot encode the domain knowledge reflecting UHD aerial photos. Third, by leveraging our noise-tolerant hashing algorithm, only our method allows a fast and accurate comparison of many discriminative object parts between UHD aerial photos. The generic object recognition models simply convert each UHD aerial photo into a long feature vector for deep classification; they cannot achieve such a precise region-to-region comparison.
The three scene categorization models perform unsatisfactorily on UHD aerial photos. This is because they deeply and implicitly learn a descriptive set of scene-aware semantic categories, such as "birds" and "tables", which rarely appear in our UHD aerial photo set. Moreover, the three categorization methods are designed for sceneries captured at horizontal view angles, whereas our UHD aerial photos are captured at overhead view angles. Apparently, such a view-angle gap largely hurts the categorization accuracy.
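The accuracy and stability statistics reported above (the mean over 20 repeated runs, together with the standard error) can be computed as below; the helper name is ours:

```python
import math

def mean_and_stderr(accuracies):
    """Mean accuracy and standard error over repeated test runs."""
    n = len(accuracies)
    mean = sum(accuracies) / n
    var = sum((a - mean) ** 2 for a in accuracies) / (n - 1)  # sample variance
    return mean, math.sqrt(var / n)                           # SE = s / sqrt(n)
```

A lower standard error across the 20 runs is what the stability comparison in Table 1 measures.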
A. Ablation Study
The two key modules in our work are the DMCMF and the kernel-induced feature quantization. Herein, the effectiveness of the two modules is evaluated in our designed categorization pipeline. Each module is replaced by a degraded variant and the performances are recorded accordingly. Meanwhile, insights are provided to elaborate the underlying reasons for the observed results.
First of all, we test our key theoretical contribution, the proposed DMCMF. Specifically, we analyze the four functional components as formulated in (7). The label noise refinement component is first abandoned (S11). Mathematically, the term
Lastly, to demonstrate the usefulness of the kernel-based quantized vector calculated from each UHD aerial photo, the following experimental setups are applied. We first use an aggregation-based deep network that accumulates the predicted category labels corresponding to all the graphlets within a UHD aerial photo. These labels are subsequently combined into the final image-level category label (S31). Thereafter, we replace our adopted linear kernel with a polynomial kernel (S32) and a Gaussian radial basis function (RBF) kernel (S33), respectively. As shown in Table 2, aggregating the graphlet-level category labels severely hurts the categorization accuracy. This is because calculating the category label at graphlet-level is sometimes obscure and misleading. In practice, each graphlet occupies very few regions within each UHD aerial photo, and some regions correspond to background areas irrelevant to a particular category. Besides, both the polynomial and RBF kernels perform worse than our linear kernel. This observation demonstrates that projecting the quantized vectors onto a linear space can better separate UHD aerial photos from different categories.
B. Performance By Varying Parameters
In our work, multiple tunable parameters need to be evaluated. The first set contains the weights balancing different clues in the DMCMF framework. The second set contains the parameters influencing deep topological feature engineering. In this experiment, we test the UHD aerial image classification accuracy under different parameter settings.
To analyze the first parameter set, the default values of
UHD aerial photo categorization accuracies by varying the six parameters in the first set.
Next, we evaluate the UHD aerial photo categorization by changing the
Conclusion
Aerial image understanding is an indispensable technique in pattern recognition [44], [45], [46], [47], [48]. We propose a novel deep matrix factorization that optimally fuses multiple clues into a solvable optimization for multi-label UHD aerial photo categorization. We first extract the BING [25]-based patches describing objects or their parts. Then, multiple graphlets are built to capture the spatial configurations of the visually/semantically salient ground objects. Afterward, we propose the so-called DMCMF that effectively encodes image labels to improve our binary hashing. Lastly, the binary feature vectors are integrated into a kernel SVM to assign each UHD aerial image to multiple categories. Comprehensive experiments on the collected UHD aerial image set demonstrate our algorithm's advantages.