Introduction
Thanks to the technology of delivering multiple satellites in a single rocket launch, thousands of Earth observation satellites have been sent into space in recent years. These satellites capture UHD aerial photos (typical resolutions over
In computer vision, dozens of shallow/deep visual categorization/annotation models have been designed to describe aerial photos with regular resolutions (typically
In practice, a high-resolution aerial photo typically contains many objects distributed in various spatial configurations, and accurately uncovering the underlying semantic features is nontrivial. The challenges include: i) computationally and spatially modeling the complicated layout of the ground objects, and ii) formulating a deep model that converts the spatially modeled features into fixed-length visual features. Moreover, spatially transforming the different layouts inside each UHD aerial image into a form usable by a traditional classification model [8] is a genuine difficulty;
The large number of objects within each UHD aerial photo makes it labor-intensive to accurately annotate all the ground objects. Thanks to recent progress in weakly-labeled feature engineering, only image-level labels are required for calculating region-level semantics. Therefore, to exploit the region-level visual semantics within a UHD aerial photo, it is necessary to uncover the corresponding weak and user-defined labels. Noticeably, such user-defined labels are subjectively defined in practice and are sometimes noisy. Establishing a noise-tolerant refinement mechanism is difficult in practice;
To build a powerful UHD aerial photo categorization framework, we have to precisely model the inherent sample distributions on the manifold. In practice, however, because of the contaminated user-defined labels, the previously calculated sample distribution is usually imperfect. What we need is a mathematical model that adaptively learns the optimal sample interaction while refining the labels. Formulating such a solvable and learnable framework requires domain experience.
In this work, we propose a deep multi-clue matrix factorization (DMCMF) framework for multi-label UHD aerial photo categorization. The core technique is an enhanced MF that hierarchically converts the graphlets within a UHD aerial photo into the corresponding binary features. Herein, the potentially noisy labels can be theoretically abandoned and the data graph can be progressively optimized. The entire pipeline of the proposed method is displayed in Fig. 1. In detail, for massive-scale UHD aerial photos, each of which carries potentially noisy user-defined labels, a succinct set of object-guided image regions is extracted in the first place. Thereafter, a rich set of spatially neighboring object patches is collected to build different graphlets, which accurately encode the various topologies inside multiple UHD aerial images. Based on this, we build a matrix factorization (MF) algorithm that is robust to label noise. The MF effectively converts each graphlet into the corresponding binary feature, so that two graphlets can be compared rapidly and mathematically. Our proposed MF optimally encodes four descriptive visual clues. By leveraging the binary vectors calculated from different graphlets, we convert the graphlets within a UHD aerial image into a kernel-induced feature vector. In this way, an effective classifier is learned to classify aerial images into the corresponding classes. Extensive quantitative comparisons with well-known deep classification models indicate the competitiveness of the learned classification model.
The pipeline of our proposed noise-robust binary matrix factorization (MF) framework for UHD aerial photo categorization.
The main contributions can be summarized as follows: 1) a million-scale set of partially mislabeled UHD aerial photos collected from 100 metropolises for validating the superiority of our method; 2) a DMCMF that collaboratively and seamlessly incorporates four clues to compute the hash codes of each graphlet; and 3) a novel UHD aerial photo categorization model that avoids noisy image-level labels and adaptively updates the data distribution.
The remainder of this article is organized as follows. In Sec. II we review the published work closely related to ours. Sec. III delineates the proposed pipeline, including graphlet construction, our enhanced MF, and the kernel-induced feature learning. Sec. IV experimentally validates the effectiveness of our method. The last section concludes.
Related Work
Dozens of computational visual models have been developed to analyze aerial photos.1 To semantically model the entire image, Zhang et al. [9] constructed a novel topological feature to model the inter-region connections inside each aerial photo, and a kernel-induced vector is calculated as the image representation for categorization. Xia et al. [10] formulated a weak learning model that semantically labels HR aerial photos at image-level. Akar [11] carefully combines a random forest with an object-level feature extractor to classify remote sensing images. Sameen et al. [12] developed a hierarchical deep architecture to calculate the multiple labels of HR aerial photos describing many downtown areas. In [13], Cheng et al. utilized a pre-trained deep CNN to classify high-resolution remote sensing images, where a domain-specific scenic picture set is leveraged to fine-tune the deep architecture. In [14], a cross-modality learning framework is proposed to collaboratively learn five deep models for categorizing aerial images, wherein pixel-level and spatial-level features are exploited complementarily. The authors of [15] designed a novel inter-attentional algorithm to learn the weights of aerial image features both horizontally and vertically. In [16], Bazi et al. formulated a vision transformer for aerial image classification, wherein the long-term contextual dependencies among regions can be intrinsically encoded.
For region-level modeling, Wang et al. [17] designed a hierarchical deep architecture for discovering attractive objects at different scales. In [18], a focal loss deep architecture is proposed that optimally discovers vehicles from aerial images. In [19], researchers developed a geographic object detection model for remote sensing images by intelligently extracting intersections as well as streets. In [20], Yu et al. integrated feature enhancement and soft label assignment into an anchor-independent object detector for aerial images. In [21], Wang et al. proposed a deep rotation-invariant detector that effectively estimates the angles of multi-scale objects inside aerial images. In [22], Chalavadi et al. proposed a parallel deep model called mSODANet that hierarchically learns contextual features from multi-scale and multi-FoV (field-of-view) ground objects. Notably, different from the above methods, our approach is bionically inspired and accurately mimics human gaze behavior.
Our Proposed Method
A. Topological Feature Engineering
Nowadays, in each UHD aerial image, we can observe tens of multi-scale ground objects. According to psychological progress [24] over the past decade, human observers attend to the foreground objects that are visually or semantically prominent when perceiving the world. Specifically, when humans try to understand a UHD aerial photo, their visual and cognitive subsystems first perceive the visually or semantically salient ground objects, e.g., an aircraft and its components, while the background regions are nearly neglected. Obviously, it is necessary to encode this perceptual experience when building a UHD aerial image classification pipeline. Herein, rapid object patch generation as well as manifold-guided active feature selection are adopted to obtain the salient image patches describing different ground objects.
In our implementation, the well-known BING [25] descriptor is deployed to capture different ground objects because of the following advantages: 1) it achieves a sufficiently high object discovery precision and speed; 2) it obtains multiple highly descriptive and representative object-level patches that effectively simulate how humans perceive different UHD aerial images; and 3) it can be generalized to new UHD aerial photo classes, so the trained categorization algorithm can be transferred to different data sets. Noticeably, even with BING, we still observe too many object-level patches within the UHD aerial images. Actually, we observe that during human visual perception, typically < 10 objects are perceived in each UHD aerial image. To mimic this, a powerful active learning [26] algorithm is utilized to select
Based on these
An example elaborating pairwise object patches that are spatially adjacent. Herein, the red object patch is spatially adjacent to the green one and the blue one, respectively. The coordinate denotes the position of each object patch in the multi-layer spatial pyramid.
Based on graph theory, the inherent patches together with their spatial distribution collaboratively determine the visual appearance. In our implementation, a graphlet can be naturally represented by a matrix
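As a concrete illustration, the graphlet construction described above can be sketched in Python. The patch descriptors, the adjacency criterion (a distance threshold between patch centers), and the function name are our assumptions, since the section does not fix these details.

```python
import numpy as np

def build_graphlet(patch_features, centers, radius):
    """Sketch: pair spatially adjacent object patches into a graphlet.

    patch_features : (n, d) array of per-patch descriptors (hypothetical)
    centers        : (n, 2) array of patch center coordinates
    radius         : two patches are linked when their centers are
                     closer than this threshold (assumed criterion)
    """
    n = len(centers)
    adj = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(centers[i] - centers[j]) < radius:
                adj[i, j] = adj[j, i] = 1   # undirected spatial adjacency
    # A graphlet is represented by the adjacency matrix paired with
    # the stacked patch features.
    return adj, np.asarray(patch_features)
```

Pairing the adjacency matrix with the patch features keeps both of the appearance and topology cues available to the downstream factorization.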
B. Our DMCMF
In our work, we have to efficiently and effectively calculate the distance between graphlets from two UHD aerial images whose image labels might be contaminated. Herein, we formulate a multi-component MF technique that optimally handles noise from the image labels. Noticeably, our MF preserves the highly informative integer feature encoded in the binary matrix. In theory, the above operation can be represented as:\begin{equation*} \min _{\mathbf {Q},\mathbf {R}} \mathcal {H}(\mathbf {U},\mathbf {QR}^{T})+\Delta (\mathbf {R},\mathbf {Q}), s.t., \mathbf {Q}\in \{-1,1\}, \tag{1}\end{equation*}
\begin{align*} &\min _{\mathbf {M},\mathbf {Q},\mathbf {R}} \mathcal {H}(\mathbf {M},\mathbf {QR}^{T})+{\mathcal {H}}_{l}(\mathbf {M},\mathbf {U})+\Theta (\mathbf {R},\mathbf {Q}), \\ & s.t. \mathbf {M}\in \{-1,1\}, \mathbf {Q}\in \{-1,1\}, \tag{2}\end{align*}
In the graphlet hashing stage, we recognize the importance of maintaining the local sample distribution [26], e.g., the spatial relationships among spatially neighboring graphlets in the feature space. Meanwhile, we can derive the hash function accordingly, which makes pairwise graphlet comparison highly scalable. Such a hash function is employed to calculate the binary hash codes, that is, \begin{align*} &\min _{\mathbf {I},\mathbf {B},g} \beta \sum _{i=1}^{n} \mathcal {H}(\mathbf {i}^{i},g(\mathbf {y}_{i})\mathbf {B})+\frac {\delta }{2}\sum _{i=1}^{n}\sum _{j=1}^{n} \mathbf {N}_{ij}||\mathbf {i}^{i}-\mathbf {i}^{j}||^{2}, \\ & s.t. \mathbf {I}\in \{-1,1\}^{n\times M}, \tag{3}\end{align*}
\begin{align*} &\min _{\mathbf {I},\mathbf {B},g} \beta \mathcal {H}(\mathbf {I},g(\mathbf {Y})\mathbf {B})+\delta \text { tr}(\mathbf {I}^{T}\mathbf {LI}), \\ & s.t. \mathbf {I}\in \{-1,1\}^{n\times M}, \tag{4}\end{align*}
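The equivalence between the pairwise smoothness term in (3) and the compact trace form in (4) follows from the standard graph Laplacian identity $\sum_{ij}\mathbf{N}_{ij}\|\mathbf{i}^{i}-\mathbf{i}^{j}\|^{2} = 2\,\mathrm{tr}(\mathbf{I}^{T}\mathbf{L}\mathbf{I})$ with $\mathbf{L}=\mathbf{D}-\mathbf{N}$. A quick numeric check with arbitrary toy matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 6, 4
I = rng.choice([-1.0, 1.0], size=(n, M))   # toy binary codes, one row per graphlet
N = rng.random((n, n))
N = (N + N.T) / 2                          # symmetric affinity graph
np.fill_diagonal(N, 0.0)
L = np.diag(N.sum(axis=1)) - N             # graph Laplacian L = D - N

# (1/2) * sum_ij N_ij ||i^i - i^j||^2  equals  tr(I^T L I)
pairwise = 0.5 * sum(N[i, j] * np.sum((I[i] - I[j]) ** 2)
                     for i in range(n) for j in range(n))
trace = np.trace(I.T @ L @ I)
assert np.isclose(pairwise, trace)
```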
To obtain sufficiently compatible MF and hash codes, it is natural to assume that the inherent geometry revealed by our designed MF and that of the hashing share a feature space. In this way, the latent space constructed by our designed MF is the same as the aforementioned Hamming space, and the potential semantics uncovered by the formulated MF with noise-free image labels is utilized to enhance the hashing model. Theoretically, we can set \begin{align*} &\min _{\mathbf {M},\mathbf {R},\mathbf {I},\mathbf {B},g} \mathcal {H}(\mathbf {M},\mathbf {IR}^{T})+{\mathcal {H}}_{l}(\mathbf {M},\mathbf {U})+\beta \mathcal {H}(\mathbf {I},g(\mathbf {Y})\mathbf {B}) \\ &\qquad \quad +\frac {\delta }{2}\text {tr}(\mathbf {I}^{T}\mathbf {LI}), \\ & s.t. \mathbf {M}\in \{-1,1\}^{n\times S}, \mathbf {I}\in \{-1,1\}^{n\times M}, \tag{5}\end{align*}
Notably, the optimization task formulated above aims to drive the hash function as well as the binary codes by leveraging the pre-constructed sample graph, which is built upon the potentially contaminated image labels. This pre-constructed sample graph remains intact during the learning stage, which is practically sub-optimal. Actually, we have to progressively adjust the sample graph during hash code learning. To this end, a sample graph updating module is also integrated into the learning model. Mathematically, to refine the possibly contaminated image labels, we expect the sample graph \begin{align*} &\min _{\mathbf {M},\mathbf {R},\mathbf {I},\mathbf {B},\mathbf {N},g} \mathcal {H}(\mathbf {M},\mathbf {IR}^{T})+{\mathcal {H}}_{l}(\mathbf {M},\mathbf {U})+\beta \mathcal {H}(\mathbf {N},\mathbf {N}_{0}) \\ &\qquad \quad +\gamma \mathcal {H}(\mathbf {I},g(\mathbf {Y})\mathbf {B})+\frac {\delta }{2}\text {tr}(\mathbf {I}^{T}\mathbf {LI})+\Theta (\mathbf {R},\mathbf {B}), \\ &\quad s.t. \mathbf {M}\in \{-1,1\}^{n\times S},\mathbf {I}\in \{-1,1\}^{n\times M},\sum \nolimits _{j=1}^{n} \mathbf {N}_{ij}=1, \tag{6}\end{align*}
To tackle (6), it is necessary to explicitly define \begin{align*} &\min _{\mathbf {M},\mathbf {R},\mathbf {I},\mathbf {B},\mathbf {N}} \frac {1}{2}||\mathbf {M}-\mathbf {IR}^{T}||_{F}^{2}+\nu ||\mathbf {M}-\mathbf {U}||_{1}+\frac {\beta }{2}||\mathbf {N}-\mathbf {N}_{0}||_{F}^{2} \\ &\qquad \qquad +\frac {\gamma }{2}||\mathbf {I}-g(\mathbf {Y})\mathbf {B}||_{F}^{2}+\frac {\delta }{2}\text {tr}(\mathbf {I}^{T}\mathbf {LI})+\frac {\mu }{2}||\mathbf {R}||_{F}^{2} \\ &\qquad \qquad +\frac {\theta }{2}||\mathbf {B}||_{F}^{2} \\ &\quad s.t. \mathbf {M}\in \{-1,1\}^{n\times S},\mathbf {I}\in \{-1,1\}^{n\times M},\sum \nolimits _{j=1}^{n} \mathbf {N}_{ij}=1, \tag{7}\end{align*}
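As a sanity-check sketch (not the authors' solver), the relaxed objective in (7) can be evaluated term by term. The embedding `g`, the matrix shapes, and the function name are assumptions for illustration:

```python
import numpy as np

def objective7(M, I, R, B, N, N0, Y, U, L, g,
               nu, beta, gamma, delta, mu, theta):
    """Evaluate the relaxed objective in (7) term by term (sketch)."""
    return (0.5 * np.linalg.norm(M - I @ R.T, 'fro') ** 2             # MF reconstruction
            + nu * np.abs(M - U).sum()                                # noise-tolerant l1 term
            + 0.5 * beta * np.linalg.norm(N - N0, 'fro') ** 2         # graph refinement
            + 0.5 * gamma * np.linalg.norm(I - g(Y) @ B, 'fro') ** 2  # hash-function fit
            + 0.5 * delta * np.trace(I.T @ L @ I)                     # Laplacian smoothness
            + 0.5 * mu * np.linalg.norm(R, 'fro') ** 2                # regularizer on R
            + 0.5 * theta * np.linalg.norm(B, 'fro') ** 2)            # regularizer on B
```

Monitoring this value across alternating updates is a simple way to verify that a solver for (7) is actually descending.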
C. Kernel Machine for Multi-Label UHD Aerial Photo Categorization
As aforementioned, many graphlets are extracted from each UHD aerial photo and are subsequently converted into binary hash codes. We observe that: 1) the graphlet numbers from different UHD aerial photos are generally inconsistent; and 2) the dimensionalities of binary hash codes calculated from graphlets with different vertices are different. Therefore, we cannot directly send them to a standard support vector machine (SVM) for visual categorization. In our implementation, a kernel-based quantization method is deployed to calculate the image-level feature, i.e., a fixed-length feature vector corresponding to a UHD aerial photo. For each UHD aerial photo, the BING [25]-based object patches are extracted to build graphlets, which are subsequently converted into binary hash codes using our DMCMF. Finally, graphlets within each UHD aerial photo are accumulated into vector \begin{equation*} \mathbf {u}_{i}\propto \text { exp}\left({-\frac {1}{AA'}\sum \nolimits _{a=1}^{A}\sum \nolimits _{b=1}^{A'}dist(\mathbf {h}_{a},\mathbf {h}_{b})}\right), \tag{8}\end{equation*}
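A minimal sketch of a single entry of the quantization in (8); we assume Hamming distance for $dist(\cdot,\cdot)$ between binary codes, and the function name is ours:

```python
import numpy as np

def quantize_entry(codes_a, codes_b):
    """Entry u_i per (8): exponentiated negative mean pairwise distance
    between two sets of binary hash codes (Hamming distance assumed)."""
    A, Ap = len(codes_a), len(codes_b)
    total = sum(np.count_nonzero(ha != hb)   # Hamming distance dist(h_a, h_b)
                for ha in codes_a for hb in codes_b)
    return np.exp(-total / (A * Ap))
```

Identical code sets yield the maximal value 1, and the entry decays smoothly as the average Hamming distance grows, which is what makes the resulting vector usable inside a kernel.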
By leveraging the vector quantized using (8), we train an SVM classifier for multi-label classification. In theory, to train an SVM distinguishing UHD aerial photos belonging to two different classes, the SVM can be mathematically represented as follows:\begin{align*} &&\hspace {-20pt}\max _{\mathbf {d}\in \mathbb {R}^{N_{ab}}} \kappa (\mathbf {d})=\sum \nolimits _{i=1}^{N_{ab}} d_{i}-\frac {1}{2}\sum \nolimits _{i=1}^{N_{ab}} \sum \nolimits _{j=1}^{N_{ab}} d_{i}d_{j}t_{i}t_{j}k(\mathbf {u}_{i},\mathbf {u}_{j}) \\ &&\hspace {-20pt}s.t. 0\leq d_{i}\leq D, \sum \nolimits _{i=1}^{N_{ab}} d_{i}t_{i}=0, \tag{9}\end{align*}
By calculating a quantized vector \begin{equation*} \text { sgn}\left({\sum \nolimits _{i=1}^{N_{ab}} d_{i}t_{i}k(\mathbf {u}_{i},\mathbf {u}^{*})+\eta }\right), \tag{10}\end{equation*}
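The decision rule in (10) can be sketched directly; the dual weights $d_i$, labels $t_i$, support vectors, and bias $\eta$ are assumed to come from solving (9), and the helper name is ours:

```python
import numpy as np

def svm_predict(u_star, support, duals, labels, eta, k=np.dot):
    """Decision rule (10): sign of the kernelized score of a test
    vector u_star; support vectors, dual weights d_i, labels t_i,
    and bias eta are assumed to come from solving (9)."""
    score = sum(d * t * k(u, u_star)
                for d, t, u in zip(duals, labels, support)) + eta
    return 1 if score >= 0 else -1
```

With a linear kernel `k=np.dot` this reduces to a sign test against a hyperplane in the quantized-vector space.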
Experimental Evaluations
Herein, we evaluate the performance of our UHD aerial photo categorization using three experiments. We first introduce our self-compiled data set, which includes 2.3 million UHD aerial images crawled from 100 well-known metropolises in different countries. Based on this, our algorithm is compared with 17 carefully-designed visual categorization models from two perspectives: accuracy and stability. Meanwhile, we carefully explain the performance advantage of our classification model. Then, we carefully evaluate each key module of our UHD aerial image categorization. Lastly, we report the categorization accuracy of our method under different parameters. Based on this, the optimal parameter settings are suggested.
After collecting the million-scale UHD aerial photos, we have to annotate them to obtain the corresponding image-level labels. Herein, 82 volunteers3 first manually annotate 14.7% of the UHD aerial photos in each metropolitan city, wherein a total of 47 different image-level labels were utilized. Afterward, we train a multi-label SVM and employ it to annotate the image-level labels of the remaining UHD aerial images. Then, the same 82 volunteers manually correct the labels calculated by the SVM. It is noticeable that some image-level labels are associated with an intolerably small number of UHD aerial photos, which makes it infeasible to train a generalizable categorization model for these labels. In our implementation, if the number of UHD aerial photos corresponding to an image-level label is smaller than 200,000, then we abandon this label. In this way, we finally obtain 18 different image-level labels. Thereafter, we notice that 99.973% of the UHD aerial photos have fewer than four image-level labels, while the remaining few UHD aerial photos have larger numbers of image-level labels (from five to 13). These UHD aerial photos usually contain a rich set of small regions (
In retrospect, one key advantage of our method is to robustly learn a categorization model from noisy image-level labels. To acquire the noisy labels for experimentation, for each category, we randomly use 60% of the UHD aerial photos to construct a training set. Based on this, we learn a multi-label categorization model, which is further leveraged to calculate the labels of the entire set of UHD aerial photos. In total, 11.3% of the UHD aerial photos are mislabeled. They are combined with the correctly labeled ones to constitute our data set.
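The label-pruning protocol described above (dropping image-level labels attached to fewer than 200,000 photos) can be sketched as follows; the data layout and the function name are our assumptions:

```python
from collections import Counter

def prune_rare_labels(photo_labels, min_count=200_000):
    """Keep only image-level labels attached to at least `min_count`
    photos, and strip the dropped labels from every photo's label set."""
    counts = Counter(l for labels in photo_labels for l in set(labels))
    kept = {l for l, c in counts.items() if c >= min_count}
    return kept, [[l for l in labels if l in kept] for labels in photo_labels]
```

Applied to the full annotation set with the 200,000 threshold, this is the step that reduces the initial 47 labels to the 18 retained ones.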
We observe that each UHD aerial photo in our data set typically takes up 200MB of storage space. Therefore, our 2.3 million UHD aerial photos require a total of 460TB of storage space. To optimally store such million-scale UHD aerial photos with a fast I/O interface, we employ the Supermicro server solutions.4 More specifically, we adopt the 4U double-sided super storage platform. The platform is installed with 36 Toshiba HDD drives, each of which has a 20TB storage space. In total, the entire storage space of our platform is 720TB and it works in RAID 0 mode. Based on this, the average sequential data reading and writing speeds are 1467MB/s and 862MB/s, respectively, on our storage platform. That means that, on average, it takes 0.137s and 0.232s to load and update each UHD aerial photo, respectively.
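The I/O timings above follow directly from the photo size and the measured throughputs; a quick back-of-the-envelope check:

```python
photo_mb = 200                        # typical UHD photo size (MB)
read_mb_s, write_mb_s = 1467, 862     # measured sequential throughputs (MB/s)

load_s = photo_mb / read_mb_s         # ~0.14 s to load one photo
save_s = photo_mb / write_mb_s        # ~0.23 s to update one photo
total_tb = 2.3e6 * photo_mb / 1e6     # 2.3M photos -> 460 TB in total
```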
1) Accuracy Comparison
In this section, we evaluate our UHD aerial photo categorization framework by comparing its effectiveness and efficiency with a rich set of baseline recognition algorithms. We first test our algorithm by comparing it with deep aerial image classification models. Thereafter, we employ state-of-the-art deep generic visual categorization algorithms for comparison.
First of all, we compare our method with seven deep visual categorization models [31], [32], [33], [34], [35], [36], [37] that intrinsically incorporate some prior knowledge of different categories of aerial photos. We notice that the source codes of [31], [32], [35], and [36] are publicly available. Based on this, we conduct a comparative study wherein the parameters are set to their defaults. For [33], [34], and [37], we re-implement them since the source codes are unavailable. We have tried our best to make the re-implemented categorization models perform similarly to the results reported in their publications.
Meanwhile, many recent deep generic visual recognition models perform impressively on categorizing aerial images. Herein, we first compare our method with ten deep generic object classification algorithms. Moreover, since UHD aerial photo categorization can be considered a sub-topic of scenery classification, we further conduct a comparative study between our method and three recently published scene classification models [38], [39], [40]. For the categorization models implemented by us, the experimental setups can be summarized as follows. For [33], we utilize ResDep-128 [41] as the backbone, which is further adapted into a multi-label variant. Apart from the fully-connected layer (whose unit number is set to 19), the remaining deep layers are fixed by the above ResDep-128 [42]. ResNet-108 [41] is employed as the backbone and stochastic gradient descent optimizes the entire network. The learning rate and the decay are fixed to 0.001 and 0.05, respectively. The network loss is the mean squared error. For [38], we retrain the object bank [43] on our refined 18 UHD aerial photo categories, wherein the average-pooling strategy is applied. We employ liblinear as the solver for the linear classifier, wherein 7-fold cross-validation is applied.
For the above 18 compared object/scene recognition algorithms, we repeatedly test each model 20 times. Accordingly, the averaged accuracies are displayed in Table 1. To quantify the stability of these categorization models, we report their standard errors simultaneously. We observe that the per-category standard errors produced by our method are significantly and consistently lower than those of its competitors, which demonstrates that our method is the most stable. In summary, the following conclusions can be made:
Our method outperforms the other seven aerial photo categorization models remarkably due to three reasons. First, these compared methods typically characterize low/medium resolution aerial photos. To facilitate deep model training, they generally resize the original aerial photo to a fixed and much smaller size (e.g., $224\times 224$) for the subsequent deep modeling. This operation is detrimental to learning an effective UHD aerial photo categorization model since the tiny but discriminative visual details are lost. Second, except for our method, none of the seven counterparts can implicitly correct the noisy image-level labels, which inevitably hurts categorization model training. Third, only our method uses graphlets to explicitly capture the complicated spatial layouts of each UHD aerial photo. They are further incorporated by a deep hashing algorithm to calculate the discriminative image kernel. Comparatively, the seven counterparts only globally/locally characterize each UHD aerial photo, wherein the informative spatial layouts among multiple aerial photo regions are neglected.
The generic object recognition algorithms perform worse than our method for three reasons. First, these generic recognition models generally handle medium-sized images typically containing under ten million pixels. They can hardly discover the tiny but discriminative regions among the hundreds of object components inside a UHD aerial photo with over 100 million pixels. This is particularly problematic when the image-level labels are contaminated. Second, our method can conveniently incorporate prior knowledge of the UHD aerial photo set, e.g., the maximum graphlet size and the category-specific object patches. In contrast, the generic object recognition models cannot encode the domain knowledge reflecting UHD aerial photos. Third, by leveraging our noise-tolerant hashing algorithm, only our method allows a fast and accurate comparison of many discriminative object parts between UHD aerial photos. The generic object recognition models simply convert each UHD aerial photo into a long feature vector for deep classification; they cannot achieve such a precise region-to-region comparison.
The three scene categorization models perform unsatisfactorily on UHD aerial photos. This is because they deeply and implicitly learn a descriptive set of scene-aware semantic categories, such as "birds" and "tables", which rarely appear in our UHD aerial photo set. Moreover, the three categorization methods are designed for sceneries captured at horizontal view angles, whereas our UHD aerial photos are captured at overhead view angles. Apparently, such a view-angle gap largely hurts the categorization accuracy.
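The accuracy and stability statistics reported above (the mean over 20 repeated runs, together with the standard error) can be computed as below; the helper name is ours:

```python
import math

def mean_and_stderr(accuracies):
    """Mean accuracy and standard error over repeated test runs."""
    n = len(accuracies)
    mean = sum(accuracies) / n
    var = sum((a - mean) ** 2 for a in accuracies) / (n - 1)  # sample variance
    return mean, math.sqrt(var / n)                           # SE = s / sqrt(n)
```

A lower standard error across the 20 runs is what the stability comparison in Table 1 measures.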
A. Ablation Study
The two key modules in our work are the DMCMF and the kernel-induced feature quantization. Herein, the effectiveness of the two modules is evaluated in our designed categorization pipeline. Each module is replaced by a degraded variant and the performances are recorded accordingly. Meanwhile, insights are provided to elaborate the underlying reasons for the observed results.
First of all, we test our key theoretical contribution, the proposed DMCMF. Specifically, we analyze the four functional components as formulated in (7). The label noise refinement component is first abandoned (S11). Mathematically, the term
Lastly, to demonstrate the usefulness of the kernel-based quantized vector calculated from each UHD aerial photo, the following experimental setups are applied. We first use an aggregation-based deep network that accumulates the predicted category labels corresponding to all the graphlets within a UHD aerial photo. These labels are subsequently combined into the final image-level category label (S31). Thereafter, we replace our adopted linear kernel with a polynomial kernel (S32) and a Gaussian radial basis function (RBF) kernel (S33), respectively. As shown in Table 2, aggregating the graphlet-level category labels severely hurts the categorization accuracy. This is because calculating the category label at graphlet-level is sometimes obscure and misleading. In practice, each graphlet occupies very few regions within each UHD aerial photo, and some regions correspond to background areas irrelevant to a particular category. Besides, both the polynomial and RBF kernels perform worse than our linear kernel. This observation demonstrates that projecting the quantized vectors onto a linear space can better separate UHD aerial photos from different categories.
B. Performance By Varying Parameters
In our work, multiple tunable parameters need to be evaluated. The first set contains the weights balancing different clues in the DMCMF framework. The second set contains the parameters influencing deep topological feature engineering. In this experiment, we test the UHD aerial image classification accuracy under different parameter settings.
To analyze the first parameter set, the default values of
UHD aerial photo categorization accuracies by varying the six parameters in the first set.
Next, we evaluate the UHD aerial photo categorization by changing the
Conclusion
Aerial image understanding is an indispensable technique in pattern recognition [44], [45], [46], [47], [48]. We propose a novel deep matrix factorization that optimally fuses multiple clues into a solvable optimization for multi-label UHD aerial photo categorization. We first extract the BING [25]-based patches describing objects or their parts. Then, multiple graphlets are built to capture the spatial configurations of the visually/semantically salient ground objects. Afterward, we propose the so-called DMCMF that effectively encodes image labels to improve our binary hashing. Lastly, the binary feature vectors are integrated into a kernel SVM to assign each UHD aerial image to multiple categories. Comprehensive experiments on the collected UHD aerial image set demonstrate our algorithm's advantages.