LR Aerial Photo Categorization by Semi-Supervised Perceptual Feature Selection

Recognizing the semantic categories of low-resolution (LR) aerial photos is an indispensable technique in geoscience and remote sensing. However, it is also a challenging task in practice. In this work, a semi-supervised perceptual feature selection (SPFS) pipeline is proposed for LR aerial photo categorization, focusing on selecting high-quality perception-guided visual features. Specifically, by mimicking the human vision system, a novel low-rank model is designed to decompose each LR aerial photo into multiple visually or semantically salient foreground regions coupled with the non-salient background regions. This model can: 1) produce a gaze shifting path (GSP) simulating human gaze behavior; and 2) generate a hierarchical deep representation for each GSP. Afterward, a semi-supervised feature selection (FS) is leveraged toward a succinct set of discriminative deep GSP features, wherein only labels of LR aerial photos are required. Based on the selected features, a classifier is trained for visual categorization. Comprehensive experimental results have validated our method's advantage.


I. INTRODUCTION
Owing to the remarkable progress in carrier rockets, remote sensing, and satellite communication, hundreds of earth observation satellites have been launched since October 1957. According to their orbital altitudes, these satellites can be categorized into high-altitude (>2000km) and low-altitude (200∼2000km) ones. Distinguished from low-altitude satellites, high-altitude ones cover a comparatively larger area with a longer orbital period. Thus, the resolutions of aerial photos captured by high-altitude satellites are typically lower than those captured by low-altitude ones. In practice, effectively understanding the semantic categories of these LR aerial photos is a useful technique in many computer vision tasks. For example, by periodically monitoring the geographical distribution of animals, forests, and swamps from an LR aerial photo, biodiversity and wildlife trends can be well tracked. This is significant for keeping habitats inside their sanctuaries, especially for endangered animals like pandas. Moreover, to optimize the planned path for long-haul driverless trucks, we have to accurately recognize the semantic categories of a variety of regions inside each LR aerial photo, based on which the shortest path between locations can be rapidly and dynamically calculated.
The associate editor coordinating the review of this manuscript and approving it for publication was Yizhang Jiang.
In computer vision, multiple categorization/annotation models have been designed to characterize aerial photos with mid/high resolutions (spatial resolution ≤10m). Representative work includes: 1) shallow/deep-learning-based object localization using weak labels [55], [56]; 2) graph models to enhance semantic propagation for aerial photo labeling [5], [6], [7]; and 3) carefully-designed hierarchical architectures for visual segmentation toward aerial photos [8], [9], [10]. As far as we know, however, the existing approaches cannot effectively encode LR aerial photos for two reasons:
• Typically, there are tens of foreground objects within each LR aerial photo, as shown on the top of Fig. 1. To calculate the semantics of an LR aerial photo, we expect a bionic model that simulates how humans perceive the foreground salient regions. Actually, building a deep model that can simultaneously extract the visually/semantically salient regions and engineer the deep features for these extracted regions is non-trivial. Potential challenges include: 1) determining the sequence in which humans observe the extracted salient regions (e.g., the path displayed in Fig. 1), 2) refining the contaminated labels of the training LR aerial photos, and 3) transferring image-level semantic labels into multiple regions inside an LR aerial photo;
• Compared to high-resolution (HR) aerial photos, LR ones usually have an inferior image quality, as they are more sensitive to a variety of uncontrollable factors, e.g., the varying weather/lighting conditions and possible communication interference. This brings a limited number of labeled LR aerial photos, coupled with a rich set of labeled HR ones. Thus, we expect a semi-supervised feature selector that is trained on partially-labeled LR aerial photos, which is a non-trivial task. Potential difficulties include how to uncover the underlying correlations among LR and HR aerial photos in the feature space.
In this work, an SPFS framework is formulated that adopts the deeply-learned perceptual experiences from HR aerial photos to enhance LR aerial photo categorization. Given a considerable quantity of HR and LR aerial photos, part of which are unlabeled, we first project their internal regions onto the feature space constructed by collaboratively discovering the visual and semantic channels. Afterward, to mimic human visual perception, a deep low-rank model is designed to decompose each LR aerial photo into a sequence of visually/semantically salient foreground regions, i.e., a gaze shifting path (GSP), coupled with the non-salient backgrounds, wherein the deep representation for each GSP is calculated simultaneously. Next, an SPFS algorithm selects a concise set of high-quality discriminative features shared between LR and HR aerial photos, wherein only a small fraction of labeled samples is required. Besides, SPFS can optimally preserve the graph structure of LR/HR aerial photos during feature selection. Finally, the selected features are integrated into a kernel SVM for LR aerial photo categorization.

II. RELATED WORK

A. SEMANTICALLY MODELING AERIAL PHOTOS
Dozens of computational models were developed to analyze aerial photos. For visual modeling at image-level, Zhang et al. [57] constructed a novel topological feature to model the inter-region connection inside each aerial photo, from which a kernel-induced vector is calculated as the image representation for categorization. Xia et al. [59] formulated a weak model that semantically labels HR aerial photos at image-level. Akar et al. [60] proposed the so-called random forest and object-level feature extractor to classify each aerial image. The authors of [62] developed a hierarchical CNN architecture for identifying the multiple labels of HR aerial photos describing many downtown areas. In [58], the authors utilized a deep model to classify remote sensing images, wherein a domain-specific scenic picture set is leveraged to fine-tune the deep architecture. In [43], a cross-modality learning framework is proposed to collaboratively learn five deep models for categorizing aerial images, wherein pixel-level and spatial-level features are exploited complementarily. Researchers [11] designed a multi-resolution model to learn the weights of aerial image features both horizontally and vertically. In [64], Bazi et al. formulated a vision transformer for aerial image classification, wherein the long-term contextual dependencies among regions can be intrinsically encoded.
For region-level modeling, Wang et al. [4] proposed a deep learning model for discovering salient objects in each aerial image. In [1], a focal loss deep architecture is proposed that optimally discovers vehicles from aerial images. In [63], the authors developed a learning model toward aerial photos by intelligently extracting intersections and streets. In [18], Yu et al. integrated feature enhancement and soft label assignment into an anchor-independent object detector toward aerial images. In [19], Wang et al. proposed a deep rotation-invariant detector that effectively estimates the angles of multi-scale objects inside aerial images. In [54], Chalavadi et al. proposed a parallel deep model called mSODANet that hierarchically learns contextual features from multi-scale and multi-FoV (field-of-view) ground objects.

B. SUPERVISED FEATURE SELECTION (FS)
In supervised FS, each feature's discrimination is quantified by its correlation with the labels. Nie et al. [15] formulated an effective FS algorithm by optimizing an objective function based on an $l_{1,2}$-norm regularization. A fast and incremental FS framework particularly designed for high-dimensional features was formulated by [16]. Gui et al. proposed an attention-guided feature scoring algorithm in a supervised setting. Based on an elaborately-designed smooth hinge loss, a sparsity-regularized model was proposed to obtain a subset of discriminative features. In [17], an $l_{1,2}$-norm coupled with an exclusive lasso was incorporated for FS, wherein the redundant and contaminated features can be optimally abandoned. An effective measure was proposed for identifying discriminative features. In [14], Ahadzadeh et al. proposed a double-stage FS based on particle swarm optimization toward high-dimensional features. Stage one globally removes low-quality features while stage two locally searches for the highly discriminative ones. Noticeably, the above FS methods handle features in the original space, whereas practically the samples may be distributed in the high-order kernel space. Song et al. [30] designed a kernel-induced FS to maximize the correlation between the selected features and labels. In [31], Masaeli et al. proposed an HSIC-based implicit FS algorithm (a.k.a. feature transformation) using an $l_1/l_\infty$-norm regularizer. Further, Yamada et al. [32] formulated the novel dual augmented Lagrangian in order to search for a global optimum. Researchers [33] proposed a kernel-induced feature selector that effectively acquires the subset of covariates that is most discriminative. In [34], Leng et al. extracted the features of both palmprints with the 2D discrete cosine transform for constructing a dual-source space, wherein highly discriminative coefficients are optimally preserved for visual retrieval. Moreover, in [35], the standard cancelable palmprint coding is upgraded to 2D space. The so-called perpendicular orientation transposition and multi-orientation score level fusion collaboratively enhance the 2D cancelable palmprint codes.

III. OUR PROPOSED METHOD
An overview of our method is presented in Fig. 2. Our method involves three key components: the deep low-rank algorithm for GSP calculation, the semi-supervised perceptual feature selection, and the SVM training. The interconnections between these components are annotated by the blue arrows.

A. DEEP LOW-RANK ALGORITHM FOR GSP LEARNING
In practice, there are multiple fine-grained objects inside each LR aerial photo. Biological studies [2] have shown that humans usually attend to a few salient objects in the visual cognition process. In our scenario, to understand each LR aerial photo, we typically first attend to the salient ground regions, wherein the background regions are kept almost unprocessed. Such human visual perceptual behavior is informative for categorizing LR aerial photos. Herein, we propose a deep low-rank algorithm that sequentially selects salient image patches to construct gaze shifting paths (GSPs), wherein the corresponding deep features can be jointly engineered.
The theory of human visual perception indicates the high correlation (self-representativeness) of the non-salient background image patches inside each scenery. In contrast, the foreground salient image patches are almost uncorrelated. This observation motivates us to decompose the feature matrix $X \in \mathbb{R}^{T \times N}$ of each LR aerial photo into a salient and a non-salient part, where N counts the image patches within each LR aerial photo and T is their feature dimensionality. $Y \in \mathbb{R}^{T \times N}$ preserves the feature columns corresponding to the non-salient background image patches (the other columns are all zeros). $E \in \mathbb{R}^{T \times N}$ represents the feature columns corresponding to the salient image patches (the other columns are all zeros). Aiming at a unique solution, multiple constraints are imposed on Y and E. In our work, two observations are made. First, only a small fraction of image patches within each LR aerial photo are salient and will be processed in detail by the human vision system. Mathematically, this means that E is a sparse matrix. Second, the high correlation of the non-salient background image patches indicates that Y is a low-rank matrix. Based on these, we select the salient image patches by seamlessly integrating a sparsity and a low-rankness constraint into the objective (1), where $\|\cdot\|_{*}$ is the matrix nuclear norm representing a convex approximation to the matrix rank function, $l_1(E)$ quantifies the sparsity of E, $f(\Upsilon, X)$ selects non-salient background image patches from each LR aerial photo, and $l_2(Y, f(\Upsilon, X))$ penalizes the loss of non-salient background image patch selection. A regularizer on ϒ is also included. α, β, and γ are positive coefficients balancing the trade-off among terms. More concretely, $l_1(\cdot)$ is defined to ensure a highly sparse E. Practically, each element in matrix Y is nonnegative. Herein, we set $l_2(a, b) = (a - b)^2/2$ to calculate the image patch selection error, which upgrades the objective function (2) accordingly.

124740 VOLUME 11, 2023. Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

To precisely select the non-salient background image patches inside each LR aerial photo, we formulate a deep semantic model $f(\Upsilon, X)$. It includes L layers of linear/nonlinear transformations. The deep representation from the top layer is denoted by $h(X_i)$, where $X_i$ is the T-dimensional column feature vector of the i-th image patch. Meanwhile, each layer's output is utilized as the input of the next layer. In the corresponding layer-wise formulation, $\varphi(\cdot)$ denotes the activation function and $g_l(\cdot)$ the l-th layer's output, while $Z_l$ and $\xi_l$ represent the transformation matrix and the bias of the l-th layer, respectively. The first layer's input is $X_i$, from which the first layer's output is calculated. We want the deeply-learned feature $h(X_i)$ to be sufficiently discriminative for selecting the non-salient background image patches. Without loss of generality, we adopt a linear mapping function for this selection process, parameterized by the parameter set ϒ. To mitigate overfitting, we design a regularizer on ϒ to penalize model complexity. By leveraging the definitions in (3), (8), and (9), the objective function (4) can be upgraded into (10). The optimization in (10) is non-convex over all the variables. In our implementation, we follow the iterative algorithm in [3] to solve it.

Thereafter, denoting $Y^{*}$ as the solution of (10), the saliency score $s(X_i)$ of the i-th image patch in an LR aerial photo is calculated from $E^{*} = X - Y^{*}$, where $E^{*}(:, i)$ denotes the i-th column of $E^{*}$. A larger $s(X_i)$ means that the i-th image patch is more visually/semantically salient. Given an LR aerial photo, we sequentially link the top K salient image patches to constitute its gaze shifting path (GSP). Thereby, the deep GSP representation is obtained by sequentially concatenating the deep features of its constituent K image patches.
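The GSP construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the low-rank part `Y_star` is assumed to come from an external solver, and the column-wise l2 norm is an assumed choice for the saliency score $s(X_i)$.

```python
import numpy as np

def gaze_shifting_path(X, Y_star, K):
    """Rank patches by saliency and return the indices of the top-K patches.

    X      : (T, N) feature matrix, one column per image patch.
    Y_star : (T, N) low-rank (non-salient) part returned by a solver.
    K      : number of patches linked into the GSP.
    """
    E_star = X - Y_star                       # sparse (salient) residual, E* = X - Y*
    # ASSUMPTION: the saliency score is the l2 norm of each column of E*.
    scores = np.linalg.norm(E_star, axis=0)
    # Stable sort on negated scores: most salient first, ties keep patch order.
    order = np.argsort(-scores, kind="stable")[:K]
    return order, scores

def gsp_representation(H, order):
    """Concatenate the deep features (columns of H) along the path order."""
    return np.concatenate([H[:, i] for i in order])
```

Here `H` stands for any (feature dimension × N) matrix of per-patch deep features; the resulting vector has K times the per-patch feature dimension.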
During the deep low-rank model learning, the loss function is the objective function in (10). The number of epochs is 200, the learning rate is set to 0.005, and the entire deep network is pre-trained using ResNet-152 [36].
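For orientation, the terms described in this subsection can be collected into one expression. The following is a reconstruction sketch rather than the exact formula in (10): the pairing of α, β, and γ with the individual terms, the symbol $\Omega$ for the regularizer on ϒ, the Frobenius norm for the patch-selection loss, and the convention $g_0(X_i) = X_i$ are all assumptions. The constraint follows from $E^{*} = X - Y^{*}$, and the squared loss follows from $l_2(a, b) = (a - b)^2/2$.

```latex
\min_{Y,\,E,\,\Upsilon}\;
  \|Y\|_{*}
  + \alpha\,\|E\|_{1}
  + \frac{\beta}{2}\,\bigl\|Y - f(\Upsilon, X)\bigr\|_{F}^{2}
  + \gamma\,\Omega(\Upsilon)
\quad \text{s.t.} \quad X = Y + E,
\qquad
g_{l}(X_i) = \varphi\bigl(Z_{l}\, g_{l-1}(X_i) + \xi_{l}\bigr),
\quad g_{0}(X_i) = X_i,\quad h(X_i) = g_{L}(X_i).
```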
Practically, the above deep GSP features might be inadequately discriminative. Herein, we expect to further obtain a subset of deep GSP features to enhance the subsequent visual categorization. Semi-supervised FS obtains high-quality features by uncovering the binary relationships among labeled and unlabeled LR/HR aerial photos, which is suitable for our objective.
Without loss of generality, we assume that all the LR aerial photos are unlabeled while all the HR ones are labeled. Denoting $Q \in \mathbb{R}^{D \times C}$ as the projection matrix for FS, a general FS can be formulated as minimizing an error function whose first term calculates the loss and whose second term represents the regularizer.
We define the affinity graph E, wherein each entry $E_{ij}$ indicates the similarity between $h_i$ and $h_j$. Herein, we simply set $E_{ij} = 1$ if $h_i$ and $h_j$ are K nearest neighbors, and $E_{ij} = 0$ otherwise. The K nearest neighbors are found by the standard KNN algorithm [66], which searches the K nearest samples to a reference one in the feature space, wherein K is determined by users. In our implementation, we set K = 5. We set F as a diagonal matrix with $F_{ii} = \sum_{j=1}^{N} E_{ij}$. Afterward, we set $T = F - E$ as the graph Laplacian.
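The graph construction above can be sketched as follows. This is a minimal illustration, assuming one deep GSP feature vector per row of `H`; the function name is hypothetical, and symmetrizing the affinity matrix is an assumption, since the text does not state how one-sided neighbor relations are handled.

```python
import numpy as np

def knn_graph_laplacian(H, k=5):
    """Build the binary KNN affinity graph E, degree matrix F, and Laplacian T = F - E.

    H : (N, D) array, one feature vector per row; requires k <= N - 1.
    """
    n = H.shape[0]
    # Pairwise squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq = np.sum(H ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * (H @ H.T)
    np.fill_diagonal(dist, np.inf)            # a sample is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]
    E = np.zeros((n, n))
    E[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    # ASSUMPTION: E_ij = 1 if either sample is among the other's k neighbors.
    E = np.maximum(E, E.T)
    F = np.diag(E.sum(axis=1))                # F_ii = sum_j E_ij
    T = F - E                                 # graph Laplacian
    return E, F, T
```

By construction every row of `T` sums to zero, which is a quick sanity check on any Laplacian built this way.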
To optimally exploit all the samples, we define a predicted label matrix $P \in \mathbb{R}^{N \times C}$ for the entire training set by leveraging transductive classification [65], where $p_i \in \mathbb{R}^{C}$ is the predicted category label of sample $x_i$. In our solution, we enforce P to maximally satisfy the smoothness with respect to both the ground-truth category labels and the affinity graph. Mathematically, P is calculated using an objective function in which the term on $X^{T}Q - P$ denotes the loss function and R(Q) functions as a regularizer penalizing the projection matrix Q for optimal FS; σ ∈ [0, 1] and τ ∈ [0, 1] weight the loss function and the regularizer, respectively.
Due to the high sparsity and robustness brought by its non-convexity, the $l_{1,p}$-matrix norm (p ∈ (0, 1]) is applied to define the regularizer R(Q) in our SPFS framework. In our implementation, we set p = 1/2. The solution of (14) is detailed in [21]. In practice, if we set p to a different value, then the optimization might be non-convex and we cannot obtain a globally optimal solution.

C. KERNEL-INDUCED FEATURE VECTOR FOR CATEGORIZATION
It is noticeable that the selected deep GSP features may be distributed in the high-order kernel feature space. Herein, a kernel-induced quantization method is employed to calculate each LR aerial photo's representation. For an LR aerial image, the object patches are extracted to build its GSP, which are simultaneously converted into deep GSP features for SPFS. Then, the selected deep GSP feature from the i-th LR aerial photo is accumulated into a kernel-induced vector $v_i$, where N counts the training LR/HR aerial photos and N′ counts the testing LR aerial photos. The j-th element of $v_i$ is calculated from a distance term $d(\cdot, \cdot)$ that computes the Euclidean distance between pairwise selected deep GSP features. Given N training and N′ testing aerial photos, we can obtain an N × N kernel matrix at the training stage and an N × N′ kernel matrix at the testing stage. The first matrix is utilized to learn a classifier for LR aerial photo classification, while the second one is employed for testing.
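The two kernel matrices can be sketched as below. The exponential form exp(−d) is an assumption suggested by the negated Euclidean distance in the text, not a confirmed formula, and the function name is hypothetical.

```python
import numpy as np

def kernel_matrix(A, B):
    """K[i, j] = exp(-||A_i - B_j||_2) for row-wise feature vectors.

    ASSUMPTION: an exponential of the negated Euclidean distance.
    A : (n_a, D) selected deep GSP features, B : (n_b, D) likewise.
    """
    diff = A[:, None, :] - B[None, :, :]          # (n_a, n_b, D) pairwise differences
    dist = np.sqrt(np.sum(diff ** 2, axis=2))     # Euclidean distances
    return np.exp(-dist)

# Toy data standing in for N = 2 training and N' = 1 testing photos.
train_feats = np.array([[0.0, 0.0], [3.0, 4.0]])
test_feats = np.array([[0.0, 0.0]])
K_train = kernel_matrix(train_feats, train_feats)   # N x N, used for training
K_test = kernel_matrix(train_feats, test_feats)     # N x N', used for testing
```

If a precomputed-kernel SVM such as scikit-learn's `SVC(kernel="precomputed")` were used, `K_train` would be passed to `fit`, and the transpose of `K_test` to `predict`, since that API expects one row per test sample.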

IV. EXPERIMENTS
We validate our LR aerial photo categorization using four experiments. We first introduce our self-compiled image set, which includes >3.7 million LR/HR aerial photos collected from the top 100 metropolises on different continents. Based on this, we compare our approach with 17 state-of-the-art deep categorization models from three perspectives: accuracy, stability, and time consumption. Thereafter, we evaluate our categorization accuracy by adjusting the multiple inherent parameters, based on which the optimal parameters are suggested. Lastly, we design an ablation study to evaluate each key module in our SPFS-based LR aerial photo categorization pipeline. We also visualize a set of attractive image patches selected by our SPFS-based FS.

A. KERNEL-INDUCED FEATURE VECTOR FOR CATEGORIZATION
To comprehensively evaluate our categorization model, we have to experiment on a massive-scale LR/HR aerial photo set covering many categories. To the best of our knowledge, however, there is no such data set. We therefore spent enormous effort compiling a huge data set containing over 3.6 million LR/HR aerial photos. The sources of these LR/HR aerial photos are Google/Apple/Bing Maps, for which we designed crawler software that spent 4310 hours searching and downloading LR/HR aerial photos. More specifically, we use the names of the 100 most popular metropolitan cities (as detailed in Fig. 3) throughout the world as the keywords to search Google/Apple/Bing Maps. In total, there are 46 cities from North America, 38 from Europe, ten from Asia, four from Oceania, and two from South America. Subsequently, we crop LR/HR aerial photos from the cached maps, wherein the typical resolutions of HR aerial photos are between 5K × 5K and 22K × 22K. In our implementation, we restrict the upper bound of the HR aerial photos' resolution to 22K × 22K. Meanwhile, the resolutions of LR aerial photos are between 0.35K × 0.35K and 2K × 2K. We adopt these settings because: 1) we want each HR aerial photo to be associated with mostly four categories, 2) we enforce that there is at most 5% overlapping area between any pair of LR/HR aerial photos, and 3) too few pixels inside an LR aerial photo will make it technically impossible to perceive its semantics. During our data set compilation, we notice that a few LR/HR aerial photos are blurred due to bad weather or sensitive military regions, as exemplified in Fig. 4.
Actually, our method focuses on discovering object patches with different scales and subsequently learning deep perceptual features for visual categorization. Practically, bad weather will decrease the visibility of LR/HR aerial photos and in turn hurt the fairness of the categorization accuracy comparison. Therefore, we abandon LR/HR aerial photos in which 20% or more of the pixels are unclear, wherein the clearness is measured by the blur estimation algorithm proposed by Tong et al. [47]. To quantitatively show the effectiveness of the above refining process, we use the IQA (image quality assessment) algorithm [48] to calculate the quality scores of LR/HR aerial photos in our data set. As reported in Fig. 5, over 74% of our refined LR/HR aerial photos are scored over 0.7.
After collecting the million-scale LR/HR aerial photos, we have to annotate them to obtain the corresponding category labels. Herein, 106 volunteers first manually annotate 23.8% of the HR aerial photos in each metropolitan city, wherein a total of 47 different category labels were utilized. Afterward, we train a multi-label SVM and employ it to annotate the category labels of the remaining LR/HR aerial images. Then, the same 106 volunteers manually correct the labels calculated by the SVM. It is noticeable that multiple category labels are associated with an intolerably small number of LR/HR aerial photos. This makes it infeasible to train a generalizable categorization model for these category labels. In our implementation, if the number of LR/HR aerial photos corresponding to a category label is smaller than 200,000, then we abandon this label. In this way, we finally obtain 18 different category labels, as detailed in Table 1. Thereafter, we notice that 99.983% of the LR/HR aerial photos have fewer than four category labels, while the very few remaining LR/HR aerial photos have larger numbers of category labels (from five to 15). These LR/HR aerial photos usually contain a rich set of small regions (< 200 × 200) that are possibly contaminated. Thus we simply abandon them. Lastly, we order all the LR/HR aerial photos by their file names. All the HR aerial photos are employed for training. For each category, the first half of the LR aerial photos constitutes the training set while the rest are employed for testing.
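The label filtering and the train/test split described above can be sketched as follows. The function names and the `(file_name, category)` record layout are hypothetical, and giving the extra photo of an odd-sized category to the test set is an assumption.

```python
from collections import defaultdict

def frequent_labels(label_counts, min_count=200_000):
    """Keep category labels backed by at least `min_count` photos."""
    return {label for label, count in label_counts.items() if count >= min_count}

def split_lr_photos(lr_photos):
    """Split LR photos per category: first half (by file name) trains, rest tests.

    lr_photos : iterable of (file_name, category) pairs.
    """
    by_category = defaultdict(list)
    for name, category in sorted(lr_photos):       # order by file name
        by_category[category].append(name)
    train, test = [], []
    for category, names in by_category.items():
        half = len(names) // 2   # ASSUMPTION: odd counts give the extra photo to test
        train += [(n, category) for n in names[:half]]
        test += [(n, category) for n in names[half:]]
    return train, test
```

All HR photos would simply be appended to the training list, as stated in the text.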

B. COMPARATIVE STUDY

1) ACCURACY COMPARISON
Herein, we test our LR aerial image classification by comparing its effectiveness and efficiency against a set of competitors. We first conduct a comparative study with seven deep categorization models [23], [24], [25], [26], [27], [28], [29] that intrinsically encode some prior knowledge of different aerial photo categories. We notice that the source codes of [23], [24], [27], and [28] are publicly available. Thereby, we run them with their default parameter settings. For [25], [26], and [29], the source codes are unavailable to our knowledge, so we re-implement them. We have tried our best to make the re-implemented models perform similarly to the results reported in their publications.
Moreover, we compare our method with [67] and [68]. We observe that our method outperforms those aerial image classifiers not specifically designed for LR aerial photo categorization. Besides, [67] and [68] cannot encode auxiliary information from HR aerial photos. Thus their performances are inferior.
For the categorization models implemented by ourselves, the experimental setups are briefed as follows. In [25], we utilize ResNet-152 [36] as the backbone, which is subsequently upgraded into a multi-label variant. Except for the last fully-connected layer (whose unit number is fixed at 13), the remaining layers are pre-trained using the ResNet learned from ImageNet [53]. For [26], the weights in the 1536-D LSTM layer are initialized using random numbers. For [29], the domain adaptation is implemented from RSSCN7 [28] to our compiled LR & HR aerial photo set. ResNet101V2 [36] is employed as the backbone and stochastic gradient descent optimizes the entire deep model. The network loss is calculated by the mean squared error. For [22], we retrain the object bank [51] based on our refined 18 LR/HR aerial photo categories, wherein the average-pooling strategy is applied. We employ LIBLINEAR to solve the SVM classifier, wherein 7-fold cross validation is utilized.
For the above 18 compared object/scene classification algorithms, we repeatedly test each model ten times, and the results are displayed in Table 2. To quantify the stability of these categorization models, we report their standard errors simultaneously. We observe that the per-category standard errors produced by our method are significantly and consistently lower than those of its competitors. This demonstrates that our method is the most stable. In summary, the following conclusions can be made: 1) Our method outperforms the other aerial photo categorization models remarkably due to three reasons.
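As a small worked example of the stability measure, the mean accuracy and its standard error over repeated runs can be computed as below. Interpreting "standard error" as the standard error of the mean (sample standard deviation divided by the square root of the number of runs) is an assumption, and the accuracy values are made up for illustration.

```python
import math
import statistics

def mean_and_standard_error(accuracies):
    """Return (mean, standard error of the mean) for repeated-run accuracies."""
    mean = statistics.fmean(accuracies)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    sem = statistics.stdev(accuracies) / math.sqrt(len(accuracies))
    return mean, sem

# Hypothetical per-category accuracies from two repeated runs.
mean_acc, sem_acc = mean_and_standard_error([0.8, 0.9])
```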

2) TRAINING/TESTING TIME COMPARISON
It is generally acknowledged that time consumption is a key criterion reflecting the performance of a classification algorithm. Therefore, we report the training and testing time of the aforementioned 18 aerial photo categorization models.
As shown in Table 3, during training, only two baseline models are faster than our pipeline. This is because the architectures of [45], [52] are much simpler than ours. Meanwhile, we observe that the per-category accuracies of [45], [52] are noticeably lower than ours. For the testing time comparison, our method runs significantly faster than all the baseline methods.
Notably, unlike model training, which can be conducted offline, a short testing time is more valuable to many time-sensitive AI systems, such as weather forecasting and automatic navigation.
Our LR aerial photo categorization pipeline involves three key modules: 1) GSP learning using the low-rank algorithm (m1), 2) CPKP-based FS (m2), and 3) feature classification for category labels (m3). During training, the time consumed by each module is: 9h12m (m1), 10h11m (m2), and 3h58m (m3). During testing, the time cost of each module is: 77ms (m1), 3ms (m2), and 12ms (m3). We observe that most of the training time is spent on module 1, and practically this can be accelerated by Nvidia GPUs.
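The per-module figures quoted above can be summed to obtain end-to-end costs. The short sketch below only performs that arithmetic; the helper name is hypothetical.

```python
import re

def to_minutes(duration):
    """Convert a string such as '9h12m' into minutes."""
    hours, minutes = re.fullmatch(r"(\d+)h(\d+)m", duration).groups()
    return int(hours) * 60 + int(minutes)

training = ["9h12m", "10h11m", "3h58m"]   # per-module training times from the text
testing_ms = [77, 3, 12]                  # per-module testing times in milliseconds

total_train = sum(to_minutes(t) for t in training)
total_train_text = f"{total_train // 60}h{total_train % 60:02d}m"
total_test_ms = sum(testing_ms)
```

With these inputs the three modules together take 23h21m to train and 92 ms per test photo.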

C. EVALUATION BY TUNING PARAMETERS
There are two sets of tunable parameters to be evaluated. The first set comprises the weights balancing multiple attributes in the low-rank algorithm, i.e., α, β, and γ, as well as the deep layer number L. The second set includes the polynomial kernel degree Q and the target dimensionality V for CPKP-based FS. Herein, we report the LR aerial photo categorization accuracy obtained by adjusting the two sets of parameters.
To analyze the first set of parameters, we set the default values of α, β, γ, and L to 0.3, 0.1, 0.15, and 7, respectively. In our implementation, the default values are determined by 10-fold cross validation. Herein, the validation set contains 54000 samples, constituted by selecting 3000 LR aerial photos from each category. More concretely, we tune each of α, β, and γ from 0.05 to one with a step of 0.05, and all the possible parameter combinations are enumerated to test the LR aerial photo categorization. The parameter combination receiving the highest categorization accuracy is reported as the default values. Based on this, we adjust one of the three parameters while keeping the others unchanged. Each parameter is increased from 0.05 to one with a step of 0.01, wherein the corresponding categorization accuracy is reported. As shown by the three curves on the top of Fig. 6, the accuracy under each of the three parameters first increases steadily and then peaks. Afterward, it decreases to a low level. Such regular trends indicate the feasibility of tuning the three parameters toward an optimal level in practice. As shown at the bottom of Fig. 6, the accuracy increases steadily when L is increased from one to 7. Thereafter, the accuracy remains stable. We notice that a deeper categorization model has more parameters to be learned, which may cause model overfitting in practice. Thus we set L = 7.
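The exhaustive search over α, β, and γ can be sketched as follows. With a step of 0.05 each weight takes 20 values, so 20³ = 8000 combinations are evaluated. The function names are hypothetical, and `evaluate` stands in for training the pipeline and measuring validation accuracy.

```python
import itertools

def grid_values(start=0.05, stop=1.0, step=0.05):
    """Return start, start + step, ..., stop, built from integer steps to avoid float drift."""
    count = round((stop - start) / step) + 1
    return [round(start + i * step, 10) for i in range(count)]

def exhaustive_search(evaluate, values):
    """Try every (alpha, beta, gamma) combination and return the best one."""
    best_combo, best_acc = None, float("-inf")
    for combo in itertools.product(values, repeat=3):
        acc = evaluate(*combo)
        if acc > best_acc:
            best_combo, best_acc = combo, acc
    return best_combo, best_acc

# Toy stand-in for the validation accuracy, peaked at the defaults reported in the text.
def toy_accuracy(alpha, beta, gamma):
    return -((alpha - 0.3) ** 2 + (beta - 0.1) ** 2 + (gamma - 0.15) ** 2)

best, _ = exhaustive_search(toy_accuracy, grid_values())
```

The finer one-at-a-time sweep corresponds to `grid_values(step=0.01)`, which yields 96 values per parameter.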

D. ABLATION STUDY
As aforementioned, our method comprises two key modules: 1) GSP learning using the low-rank algorithm, and 2) semi-supervised FS. Herein, we test the importance of these modules in our LR aerial photo categorization pipeline. Specifically, each module is replaced by a different one, and the resulting performance decrement/increment is presented. Insights are also provided to explain the underlying reasons for the observed results. First, to evaluate the effectiveness of the low-rank algorithm, two experimental settings are deployed. We first abandon the sparse constraint term $\|E\|_1$ in (10) (marked by "S11"). Afterward, we abandon the regularizer on ϒ in (10) (marked by "S12"). We report the variation of categorization accuracy in Table 4. Herein, the intersection of column "Si" and row "Oj" denotes the setup "Sij". Noticeably, abandoning the regularizer converts the deep feature learning into a shallow one, and a shallow feature engineering module causes a sharp performance decrement. Also, removing the sparse constraint greatly decreases the accuracy. This observation shows the necessity of mitigating the overfitting of our designed low-rank algorithm. Next, to evaluate the performance of the geometry-preserving FS, we remove this function and use the full feature set for LR aerial image categorization (S21). Then, we remove the loss term weighted by σ, which is built on $X^{T}Q - P$ (S22), and the regularizer term $\tau R(Q)$ (S23), respectively, and report the categorization accuracies. As shown in Table 4, abandoning the FS module causes the largest categorization accuracy drop. This demonstrates the importance of feature selection in LR aerial image categorization.

V. CONCLUSION
Recognizing aerial images is an indispensable application of deep neural networks [9], [37], [38], [39], [40], [41], [62]. We proposed a novel LR aerial photo categorization pipeline, wherein deep perceptual features are extracted and refined by propagating the prior knowledge of HR aerial photos into LR ones. Our work includes three key modules: 1) a deep low-rank algorithm that learns deep features from LR/HR aerial images; 2) a novel SPFS-based FS that selects high-quality features; and 3) a kernel SVM that is learned from the selected features. Experiments have shown the competitiveness of our approach.
One shortcoming of our categorization pipeline is that the feature selection is conducted in the original feature space, whereas practically the deep GSP features might be distributed in the nonlinear high-order feature space. In the future, we plan to design a high-order feature selection algorithm to further enhance the quality of the selected features.

FIGURE 1. Top: salient aerial image patches sequentially observed by humans (marked by path A → • • • → G) as well as the blurred playground (marked by red dashed box). Bottom: three HR aerial photos capture sub-regions of the LR aerial photo (the middle details the blurred playground inside the LR aerial photo).

FIGURE 2. The pipeline of the LR aerial photo categorization by our designed SPFS framework. Our method first projects regions from HR/LR aerial photos into the feature space, based on which the deep low-rank algorithm is used to extract GSPs and generate the deep features accordingly. Then the SPFS is leveraged to select highly discriminative features, which are subsequently fed into the multi-class SVM for visual categorization.
We denote X as the feature matrix of the D-dimensional deep GSP features from both LR and HR aerial photos during training. The first M rows correspond to the M labeled HR aerial photos while the succeeding rows correspond to the unlabeled LR ones. N is the total number of training samples. Similarly, we denote $L = [y_1, \cdots, y_M, y_{M+1}, \cdots, y_N] \in \{0, 1\}^{N \times C}$ as the label matrix of the training LR/HR aerial photos, wherein C counts the semantic categories. Herein, $y_{ij}$ represents the j-th category label of $y_i$ (1 ≤ j ≤ C). We set $y_{ij} = 1$ if the i-th sample belongs to the j-th category, and $y_{ij} = 0$ otherwise. Meanwhile, if the i-th sample is unlabeled, we simply set $y_i$ as a C-dimensional row vector with all zeros.
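The label matrix described above can be assembled as in the following minimal sketch. The function name and the input layout (one set of category indices per labeled HR photo) are hypothetical.

```python
import numpy as np

def build_label_matrix(hr_labels, num_unlabeled, num_categories):
    """Build the N x C label matrix: multi-hot rows for labeled samples, zero rows otherwise.

    hr_labels      : list of M collections of category indices (labeled HR photos).
    num_unlabeled  : number of unlabeled LR photos appended after the labeled rows.
    num_categories : C, the number of semantic categories.
    """
    M = len(hr_labels)
    L = np.zeros((M + num_unlabeled, num_categories), dtype=int)
    for i, categories in enumerate(hr_labels):
        L[i, list(categories)] = 1     # y_ij = 1 if sample i belongs to category j
    return L

# Two labeled HR photos followed by two unlabeled LR photos, with C = 3.
L_matrix = build_label_matrix([{0, 2}, {1}], num_unlabeled=2, num_categories=3)
```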

FIGURE 3. The number of HR/LR aerial images selected by us.

FIGURE 5. Statistics of LR/HR aerial photos with different quality scores in our compiled LR/HR aerial photo set.

TABLE 1. The selected 18 categories and the corresponding LR & HR aerial photo numbers.

TABLE 2. Accuracies with Standard Errors of the 19 Categorization Models (Experiments are repeated 20 times).

TABLE 3. Training/testing Time of the 18 Categorization Models (Each Bold Number Represents the Best Result).