Perceptual Low-Rank Learning and Geometry-Preserving Feature Selection for Categorizing High-Resolution Aerial Photos

Recognizing the multiple categories of a high-resolution (HR) aerial photo is an indispensable technique in geoscience and remote sensing. In this work, a perceptual low-rank algorithm combined with geometry-preserving feature selection (FS) is proposed for categorizing HR aerial photos. The theory of human visual perception indicates that, for each scenery, the non-salient background regions are highly correlated, whereas the visually/semantically salient foreground regions are almost uncorrelated. Motivated by this, we design a novel low-rank algorithm that seeks a sparse set of visually/semantically salient foreground image patches. These patches are sequentially linked into a so-called gaze shifting path (GSP), i.e., a path reflecting gaze movement, to mimic the human vision system. Afterward, a geometry-preserving FS algorithm is proposed to select highly discriminative features from the aforementioned gaze features, wherein a classifier can be trained simultaneously. Comprehensive experimental validation on our Internet-scale image set has shown the superiority of the proposed method.


I. INTRODUCTION
Thanks to the technology of delivering several satellites by a single rocket launch, many earth observation satellites have been launched in the past decades. These satellites capture HR aerial images containing ground objects with sophisticated spatial structures. Understanding the semantics of the ground objects by exploiting the inherent spatial structures has become a useful tool in many artificial intelligence applications.
In image processing, plenty of image/video classification/parsing models have been designed to encode aerial photos. Important work includes: 1) multiple instance learning/convolutional neural network-based object localization using weak labels; 2) graph models for semantically exploiting aerial photographs; and 3) well-designed deep models to semantically annotate aerial photographs. Nevertheless, as far as we know, the existing models are all sub-optimal for characterizing HR aerial photos because of the following reasons:
• Each HR photo usually has many ground objects with complicated spatial layouts. Intelligently exploiting their underlying semantics is difficult. The inherent challenges include: i) discovering the visually/semantically salient ground objects according to human visual perception (as exemplified by the circles on the left of Fig. 1), and ii) designing a model that converts the discovered salient objects into a fixed-length feature vector, which can be utilized for the subsequent feature classification;
• Biological studies have shown that humans sequentially perceive different regions within each scenery. As shown on the right of Fig. 1, humans will first attend to the upper central residential area, then shift their gaze to the right one, and so on. In practice, the path reflecting human gaze movement is highly descriptive for categorizing HR aerial photos. But designing a principled model that extracts gaze shifting paths (GSPs) from HR aerial photos with different spatial layouts remains unsolved.
To handle these challenges, we propose a biologically inspired HR aerial photo categorization framework. Our key contribution is a low-rank algorithm associated with a geometry-preserving feature selector that jointly: 1) extracts multiple visually/semantically salient image patches sequentially to constitute a GSP for each HR aerial photo, and 2) obtains a subset of highly discriminative GSP features for the subsequent visual categorization. More specifically, as elaborated in Fig. 2, by collecting a considerable quantity of HR aerial images, we first project their internal regions onto the feature space constructed by exploiting the visual and semantic channels collaboratively. Thereafter, to mimic human visual perception, a low-rank model is designed to decompose each HR aerial photo into a sequence of visually/semantically salient foreground image patches coupled with the non-salient background ones. Accordingly, the saliency value of each salient image patch can be calculated, which guides the GSP feature generation. Toward a subset of high-quality GSP features, we further propose a geometry-preserving FS algorithm to obtain highly discriminative GSP features, coupled with a classifier trained from the selected GSP features. Noticeably, our designed feature selector can maximally preserve the sample distribution in the feature space during FS. According to manifold learning theory [44], this attribute is significant for ensuring the discrimination of the selected features. The classifier is finally utilized to calculate the category labels of each HR aerial photo. Experimental comparison with over ten state-of-the-art shallow/deep categorization models has demonstrated the superiority of the proposed approach.
(The associate editor coordinating the review of this manuscript and approving it for publication was Mohammad Shorif Uddin.)
In summary, our method has two main contributions: 1) a novel low-rank algorithm that extracts GSPs from each HR aerial photo and engineers their visual features simultaneously, and 2) a geometry-preserving FS algorithm that obtains highly discriminative GSP features and trains the classifier for HR aerial photo categorization.

II. RELATED WORK
In the literature, dozens of computational aerial image models have been developed to analyze aerial photos. Some models operate at image-level. Zhang et al. [2] constructed a novel topological feature to model the inter-region connections inside each aerial photo, and a kernel-induced vector is calculated as the image representation for categorization. Xia et al. [3] formulated a novel weak model which can semantically label HR aerial photos at image-level. Akar et al. [4] seamlessly combined the so-called rotation forest and an object-level feature extractor to categorize a rich set of aerial images into different classes. The authors of [5] developed a hierarchical deep architecture to recognize the multiple labels of HR aerial photos describing many downtown areas. In [6], researchers utilized a hierarchical and multi-layer deep model to classify HR aerial photos. A domain-specific scenic picture set is leveraged to fine-tune the deep architecture. In [7], a cross-modality learning framework is proposed to collaboratively learn five deep models for categorizing aerial images, wherein pixel-level and spatial-level features are exploited complementarily. Cai and Wei [8] proposed a cross-attention mechanism to learn the weights of aerial image features both horizontally and vertically. In [9], Bazi et al. formulated a vision transformer for aerial image classification, wherein the long-term contextual dependencies among regions can be intrinsically encoded. Although impressive performance has been achieved by the above methods, they cannot handle HR aerial photo categorization effectively for three reasons: 1) the region-level visual features are particularly informative for HR aerial photo modeling, but they cannot be well encoded; 2) these methods cannot explicitly incorporate human gaze behavior into the categorization model, so the predicted semantic labels might be inconsistent with human visual cognition; and 3) these methods are usually insufficiently fast since a large number of highly time-consuming features have to be extracted.
For region-level modeling, the authors of [10] designed an enhanced and multi-layer neural network to discover multi-scale attractive objects within an aerial image. In [11], a focal loss deep architecture is proposed that optimally discovers vehicles from aerial images. In [12], researchers developed a novel object localization algorithm for remote sensing images. It intelligently extracts intersections as well as streets. In [13], Yu et al. integrated feature enhancement and soft label assignment into an anchor-independent object detector for aerial images. In [14], Wang et al. proposed a deep rotation-invariant detector that effectively estimates the angles of multi-scale objects inside aerial images. In [15], Chalavadi et al. proposed a parallel deep model called mSODANet that hierarchically learns contextual features from multi-scale and multi-FoV (field-of-view) ground objects. Notably, compared to image-level modeling, region-level models can exploit the regional features to facilitate HR aerial photo categorization. But there are still some shortcomings: 1) the aforementioned region-level models are generally dataset-specific and cannot be conveniently applied across different datasets. Practically, however, we need a principled region-level image model that is applicable across multiple image sets; and 2) human visual perception fails to be efficiently encoded by these models. Actually, we want an HR aerial photo processing system that can rapidly recognize each HR aerial photo.
In machine learning, the low-rank algorithm [16] has been pervasively used to seek a succinct set of bases for representing large-scale sample sets, i.e., each sample can be represented by a linear combination of the bases. The low-rank algorithm can be used in applications like information retrieval, recommendation systems [17], and feature extraction [17]. In our work, we use low-rank approximation to represent all the regions within each aerial image by a set of visually/semantically salient regions. This can be deemed a novel visual feature extractor. Meanwhile, geometry-preserving feature selection (FS) attempts to obtain a few highly discriminative features from the original high-dimensional ones. During the FS process, the geometric distribution among samples is maximally preserved. This technique is widely used for face/speech recognition [18] and image retrieval [19]. Typically, geometry-preserving FS can significantly enhance the efficiency of AI systems by reducing the number of extracted features.

III. OUR PROPOSED METHOD

A. LOW-RANK ALGORITHM FOR GSP LEARNING
In practice, there are multiple fine-grained objects inside each HR aerial photo. Biological studies [20], [21], [41] have shown that observers practically attend to a succinct set of salient objects. In our scenario, when humans perceive an HR aerial photo, their eyes will first fixate on the attractive ground regions. Meanwhile, the unattractive background regions are left almost unprocessed. Such human visual perceptual behavior is informative for categorizing HR aerial photos. Herein, we propose a low-rank algorithm that sequentially selects salient image patches to construct gaze shifting paths (GSPs), and the corresponding visual features can be jointly engineered.
The theory of human visual perception indicates that the non-salient background image patches inside each scenery are highly correlated (self-representative). In contrast, the foreground salient image patches are almost uncorrelated. This observation motivates us to decompose the feature matrix X ∈ R^{T×N} of each HR aerial photo into a non-salient part and a salient part:
X = Y + E,    (1)
where N counts the image patches within each HR aerial photo and T is their feature dimensionality. Y ∈ R^{T×N} preserves the feature columns corresponding to the non-salient background image patches (the other columns are all zeros), and E ∈ R^{T×N} holds the feature columns corresponding to the salient image patches (the other columns are all zeros). To obtain a unique solution indicating the salient image patches, some criteria are needed to constrain Y and E. In our work, two observations are made. First, only a small fraction of the image patches within each HR aerial photo are salient and will be processed in detail by the human vision system. Mathematically, this means that E is a sparse matrix. Second, the high correlation of the non-salient background image patches indicates that Y is a low-rank matrix. Based on these, we select the salient image patches by integrating a sparsity constraint and a low-rankness constraint into (1), which gives objective function (2). In (2), ||·||_* is the matrix nuclear norm, a convex approximation to the matrix rank function; l_1(E) quantifies the sparsity of E; f(·, X) selects non-salient background image patches from each HR aerial photo and contains the inherent parameters; l_2(Y, f(·, X)) penalizes the loss of non-salient background image patch selection; and a further term serves as a regularizer. α, β, and γ are parameters measuring the importance of these terms. More concretely, l_1(·) is defined in (3) so as to ensure a highly sparse E. Noticeably, each entry of Y is nonnegative. Herein, we set l_2(a, b) = (a − b)^2/2 to calculate the image patch selection error. Thereby, objective function (2) can be upgraded into (4). Objective (4) is a non-convex optimization over all the variables. In our implementation, we follow the iterative algorithm in [42] to solve it. Thereafter, denoting Y* as the optimal solution of (4), the saliency score s(X_i) of the i-th image patch in an HR aerial photo is calculated by (5) from E* = X − Y*, where E*(:, i) denotes the i-th column of E*.
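For orientation, the decomposition and the objective described above can be written compactly. The block below is only a sketch assembled from the term descriptions in the text, not the authors' stated formulas: the symbols Θ (the parameters of f) and Ω (the regularizer), the pairing of α, β, and γ with the last three terms, and the column-norm form of the saliency score are all assumptions of this example.

```latex
% Sketch only: \Theta, \Omega, the weight-to-term pairing, and the
% column-norm saliency score are assumptions of this example.
\begin{align}
  X &= Y + E, \\
  \min_{Y, E, \Theta}\; & \|Y\|_* + \alpha\, l_1(E)
      + \beta\, l_2\bigl(Y, f(\Theta, X)\bigr) + \gamma\, \Omega(\Theta)
      \quad \text{s.t. } X = Y + E, \\
  l_1(E) &= \|E\|_1, \qquad l_2(a, b) = \tfrac{1}{2}(a - b)^2, \\
  s(X_i) &= \|E^*(:, i)\|_2, \qquad E^* = X - Y^*.
\end{align}
```

Under this reading, the nuclear norm pushes Y toward low rank, the l_1 term pushes E toward sparsity, and a patch is salient when its column of E* carries a large residual.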
The GSP learning is given as follows. A larger s(X_i) in Eq. (5) means that the i-th image patch is more visually/semantically salient. Given an HR aerial photo, we sequentially link the top P salient image patches to constitute its gaze shifting path (GSP). Accordingly, the visual feature of the GSP is obtained by sequentially concatenating the visual features of its constituent P image patches. In the following, the GSP feature from the i-th training HR aerial photo is denoted by g_i.
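The GSP construction step can be illustrated with a minimal Python sketch. It assumes that Y* has already been obtained by some solver, that the saliency score of a patch is the l2 norm of its column in E* = X − Y*, and that patches are linked in order of decreasing saliency; the function names `saliency_scores` and `build_gsp` are hypothetical.

```python
import math

def saliency_scores(X, Y_star):
    """Column-wise saliency from the sparse residual E* = X - Y*.

    X and Y_star are T x N matrices stored as lists of rows.
    Assumption: the score of patch i is the l2 norm of column i of E*.
    """
    T, N = len(X), len(X[0])
    scores = []
    for i in range(N):
        col = [X[t][i] - Y_star[t][i] for t in range(T)]
        scores.append(math.sqrt(sum(v * v for v in col)))
    return scores

def build_gsp(X, Y_star, P):
    """Return (patch indices, concatenated feature) for the top-P patches."""
    scores = saliency_scores(X, Y_star)
    # Assumption: patches are linked in order of decreasing saliency.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])[:P]
    feature = []
    for i in order:
        feature.extend(row[i] for row in X)  # append column i of X
    return order, feature

# Toy example: T = 2 feature dimensions, N = 4 patches.
X = [[1.0, 5.0, 1.0, 3.0],
     [1.0, 0.0, 1.0, 4.0]]
Y_star = [[1.0, 1.0, 1.0, 1.0],
          [1.0, 0.0, 1.0, 1.0]]
path, g = build_gsp(X, Y_star, P=2)
```

Because every GSP contains exactly P patches of T dimensions each, the concatenated feature always has length P·T, which is what makes it usable as a fixed-length input to the feature selector.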

B. GEOMETRY-PRESERVING FEATURE SELECTION

1) SAMPLES GEOMETRY PRESERVATION
Inspired by the recent progress in manifold learning, the self-expressive model is leveraged to preserve the sample distribution during feature selection (FS). The self-expressive model hypothesizes that all the samples are distributed on a combination of subspaces. Mathematically, each sample can be linearly represented by a constrained combination of the other samples, i.e., G = GT and diag(T) = 0. Herein, G is a matrix consisting of the GSP features from N training samples, and T denotes the matrix containing the self-reconstruction parameters. Practically, we notice that the samples might be contaminated. Thus, the self-expressive model can be upgraded into G = GT + J, wherein J is the error matrix. Based on these, the general form of the self-expressive model is given as:
min_{T,J} τ_1 ||T||_u + τ_2 ||J||_v   s.t. G = GT + J, cons(T, J),    (6)
where ||·||_u and ||·||_v denote two pre-specified matrix norms, τ_1 and τ_2 are the two corresponding nonnegative weights, and cons(T, J) represents the constraints on T and J.
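To make the roles of T and J concrete, the following minimal sketch evaluates the contaminated self-expressive model G = GT + J for a given T. It is not a solver: T is supplied by hand, both norms are taken to be the entry-wise l1 norm, and the helper names are hypothetical.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def l1_norm(M):
    """Entry-wise l1 norm of a matrix."""
    return sum(abs(v) for row in M for v in row)

def self_expressive_residual(G, T):
    """Return J = G - G T after checking the diag(T) = 0 constraint."""
    if any(T[i][i] != 0 for i in range(len(T))):
        raise ValueError("diag(T) must be zero: a sample may not represent itself")
    GT = matmul(G, T)
    return [[g - r for g, r in zip(g_row, r_row)] for g_row, r_row in zip(G, GT)]

# Columns of G are the GSP features of N = 3 samples (feature dimension 2).
G = [[1.0, 2.0, 3.0],
     [0.0, 1.0, 1.0]]
# Sample 3 is reconstructed as sample 1 + sample 2; the others are left unexplained.
T = [[0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0]]
J = self_expressive_residual(G, T)
tau1, tau2 = 1.0, 1.0  # illustrative weights, not values from the paper
objective = tau1 * l1_norm(T) + tau2 * l1_norm(J)
```

The third column of J is zero because the third sample lies exactly in the span of the first two, whereas the first two samples are charged in full to the error term. A solver would trade these two costs off through τ_1 and τ_2.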

2) OBJECTIVE FUNCTION OF OUR FS
We denote K as a matrix projecting the original GSP features into a low-dimensional space. In practice, K is constrained to be a column-wise sparse matrix for FS. We assume that if samples g_i and g_j are from the same category, then the low-dimensional selected features Kg_i and Kg_j should be close and the weight H_ij should be large. Herein, H = |T| + |T^T| denotes the weight matrix over all the samples. Meanwhile, if samples g_i and g_j are from different categories, then Kg_i and Kg_j should be far apart and the weight H_ij should be close to zero. Mathematically, the above observations can be formulated into objective function (7), where ⊙ is the Hadamard product, l_i denotes the category labels of the i-th sample, L is a matrix comprising the category labels of all the training samples, and α is a trade-off parameter between zero and one.
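The weight matrix H = |T| + |T^T| can be computed directly from the self-reconstruction matrix, as in the minimal sketch below. The Laplacian helper uses the usual degree-minus-weights form with row sums as degrees, which is an assumption of this example; both function names are hypothetical.

```python
def affinity_from_T(T):
    """H = |T| + |T^T|, a symmetric nonnegative weight matrix."""
    n = len(T)
    return [[abs(T[i][j]) + abs(T[j][i]) for j in range(n)] for i in range(n)]

def graph_laplacian(H):
    """Assumption: the usual form diag(row sums of H) - H."""
    n = len(H)
    return [[(sum(H[i]) if i == j else 0.0) - H[i][j] for j in range(n)]
            for i in range(n)]

# Toy self-reconstruction matrix for three samples (zero diagonal).
T = [[0.0, 0.5, 0.0],
     [-0.25, 0.0, 0.0],
     [0.0, 0.75, 0.0]]
H = affinity_from_T(T)
R = graph_laplacian(H)
```

Taking absolute values and adding the transpose makes H symmetric even when T is not, so sample i is linked to sample j whenever either one helps to reconstruct the other.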
By combining (6) and (7), the objective function of our FS can be reorganized into (8), which is minimized over K, L, T, and J. In our method, the l_1-norm is employed for both ||T||_u and ||J||_v. The l_12-norm ensures the column-wise sparsity of K. Based on the constraints detailed above in (6), objective function (8) can be updated into (9), where τ_3 denotes another nonnegative weight. In our implementation, the solution is based on [43].
Based on the H calculated from T in (9), the category labels l* of a new sample are derived by solving an arg-min problem, where R = diag(H I_{N+1}) − H is the graph Laplacian matrix, I_{N+1} is an (N + 1) × (N + 1)-sized identity matrix, and

IV. EXPERIMENTAL RESULTS AND ANALYSIS
We validate the effectiveness and efficiency of our HR aerial photo categorization using four experiments. We first introduce our self-compiled data set, which includes million-scale LR (low-resolution) and HR aerial photos collected from the top 100 metropolises on different continents. Subsequently, we compare our approach with 17 state-of-the-art deep categorization models from three perspectives: accuracy, stability, and time consumption. Then, we evaluate our categorization accuracy by adjusting the multiple inherent parameters, based on which the optimal parameters are suggested. Lastly, we design an ablation study to evaluate each key module in our HR aerial photo categorization pipeline.

A. DATA SET DESCRIPTION
To comprehensively evaluate our categorization model, we have to experiment on a massive-scale LR&HR aerial photo set from many categories.
in Fig. 3) throughout the world as the keywords to search Google/Apple/Bing Maps. In total, there are 46 cities from North America, 38 from Europe, ten from Asia, four from Oceania, and two from South America. Subsequently, we crop LR&HR aerial photos from the cached maps, wherein the typical resolutions of HR aerial photos are between 5K × 5K and 22K × 22K. In our implementation, we restrict the upper bound of the HR aerial photos' resolution to 22K × 22K. Meanwhile, the resolutions of LR aerial photos are between 0.35K × 0.35K and 2K × 2K. We adopt these settings because: 1) we want each HR aerial photo to be associated with at most four categories, 2) we enforce that any pair of LR&HR aerial photos overlaps by at most 5% of their areas, and 3) too few pixels inside an LR aerial photo will make it technically infeasible to perceive its semantics.
During our data set compilation, we notice that a few LR&HR aerial photos are blurred due to bad weather or sensitive military regions, as exemplified in Fig. 4. Our method focuses on discovering object patches with different scales and subsequently learning visual perceptual features for visual categorization. Practically, bad weather will inevitably decrease the visibility of LR&HR aerial photos and in turn hurt the fairness of the accuracy comparison. Therefore, we abandon LR&HR aerial photos in which 20% of the pixels are unclear, wherein the clearness is measured by the blur estimation algorithm proposed by Tong et al. [22]. To quantitatively show the effectiveness of the above refining process, we use the IQA (image quality assessment) algorithm [23] to calculate the quality score of each LR&HR aerial photo in our data set. More specifically, the single image quality assessment module in [23] is adopted, and the quality score used in our implementation is a normalized image quality score. We manually select the K best-quality HR aerial images and record their sharpness scores. Then, for a new HR aerial image with sharpness score Q_new, its normalized quality score is calculated from Q_new and these K reference scores. As reported in Fig. 5, over 74% of our refined LR&HR aerial photos are scored over 0.7.
After collecting the million-scale LR&HR aerial photos, we have to annotate them to obtain the corresponding category labels. Herein, 106 volunteers first manually annotate 23.8% of the HR aerial photos in each metropolitan city, wherein a total of 47 different category labels are utilized. Afterward, we train a multi-label SVM and employ it to annotate the category labels of the remaining LR&HR aerial images. Then, the same 106 volunteers manually correct the labels calculated by the SVM. It is noticeable that multiple category labels are associated with an intolerably small number of LR&HR aerial photos. This makes it infeasible to train a generalizable categorization model for these category labels. In our implementation, if the number of LR&HR aerial photos corresponding to a category label is smaller than 200,000, then we abandon this label. In this way, we finally obtain 18 different category labels as detailed in Table 1. Thereafter, we notice that 99.983% of the LR&HR aerial photos have at most four category labels, while the very few remaining LR&HR aerial photos have larger numbers of category labels (from five to 15). These LR&HR aerial photos usually contain a rich set of small regions (< 200 × 200) that are possibly contaminated. Thus we simply abandon them. Lastly, we order all the LR&HR aerial photos by their file names. The HR aerial photos are employed for training and testing: for each category, the first half of the HR aerial photos constitutes the training set while the rest are employed for testing. All the LR aerial photos are employed for model validation, since manually detecting objects on LR aerial photos is much more convenient than on the HR ones.
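The curation steps above (label pruning, removal of photos with too many labels, and the file-name-ordered split) can be summarized by the following minimal sketch. The tuple layout, the function name `filter_and_split`, the rule that photos left without any label are dropped, and the rounding down of "the first half" are assumptions of this example.

```python
from collections import Counter

def filter_and_split(photos, min_per_label=200_000, max_labels=4):
    """photos: list of (file_name, set_of_labels, is_hr) tuples."""
    # 1) Keep only labels attached to at least `min_per_label` photos.
    counts = Counter(label for _, labels, _ in photos for label in labels)
    kept = {label for label, c in counts.items() if c >= min_per_label}
    # 2) Drop photos that carry more than `max_labels` of the kept labels.
    cleaned = []
    for name, labels, is_hr in photos:
        labels = labels & kept
        # Assumption: photos left without any label are dropped as well.
        if 0 < len(labels) <= max_labels:
            cleaned.append((name, labels, is_hr))
    # 3) Order by file name; LR photos go to validation, HR photos are
    #    split per category into a first half (train) and the rest (test).
    cleaned.sort(key=lambda p: p[0])
    validation = [name for name, _, is_hr in cleaned if not is_hr]
    train, test = {}, {}
    for label in sorted(kept):
        hr = [name for name, labels, is_hr in cleaned if is_hr and label in labels]
        half = len(hr) // 2  # assumption: odd counts round down
        train[label], test[label] = hr[:half], hr[half:]
    return train, test, validation

# Toy data with a small threshold so the effect is visible.
photos = [
    ("b.png", {"road"}, True),
    ("a.png", {"road", "park"}, True),
    ("c.png", {"road"}, False),
    ("d.png", {"road", "lake"}, True),
    ("e.png", {"road"}, True),
]
train, test, validation = filter_and_split(photos, min_per_label=2)
```

Since the split is made per category and a photo may carry several labels, the same HR photo can fall in the training half of one category and the testing half of another; the text does not say how such cases are resolved.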

B. PERFORMANCE COMPARISON
Herein, our method is compared with seven deep categorization models [24], [25], [26], [27], [28], [29], [30] that intrinsically encode some prior knowledge of different aerial photo categories. We notice that the source codes of [24], [25], [28], and [29] are publicly available. Thereby, we conduct a comparative study wherein the parameters are kept at their default settings. For [26], [27], and [30], the source codes are unavailable to our knowledge. For this reason, these baseline methods are re-implemented by a software programmer. We tried our best to make the re-implemented models perform similarly to the results reported in their publications. Nowadays, many deep generic recognition models perform impressively on categorizing aerial photos. Herein, our method is also compared with multiple deep generic object recognition models: the pyramid pool-CNN (S-CNN) [31], CNet [32], discrimination filtering bank algorithm (DFBA) [33], C-RNN [34], multi-label graph convolutional network (MLT) [35], semantic-specific graph model (SGM) [36], and multi-label transformer model (MTM) [37]. Furthermore, since HR aerial photo categorization can be deemed a sub-topic of scenery classification, we additionally compare with three well-known scenery classification models [38], [39], [40].
For the above baseline object/scene recognition algorithms, each model is tested multiple times and the results are displayed in Table 2. As shown, our method achieves the best per-category accuracies on all 18 categories. To quantify the stability of these categorization models, we simultaneously report their standard deviations. We observe that the per-category standard deviations produced by our method are significantly and consistently lower than those of its competitors. This demonstrates that our method is the most stable.
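The stability measure used here is simply the spread of accuracy over repeated runs. A minimal sketch with made-up accuracies for two hypothetical models follows; the sample standard deviation is used, which is an assumption since the text does not say which estimator was applied.

```python
from statistics import mean, stdev

def summarize_runs(runs):
    """Map each model name to (mean accuracy, sample standard deviation)."""
    return {name: (mean(acc), stdev(acc)) for name, acc in runs.items()}

# Made-up accuracies from three repeated runs of two hypothetical models.
runs = {
    "model_a": [0.70, 0.74, 0.72],
    "model_b": [0.60, 0.80, 0.70],
}
summary = summarize_runs(runs)
# The most stable model is the one with the smallest deviation.
most_stable = min(summary, key=lambda name: summary[name][1])
```

In this toy data the two models have similar mean accuracy, yet `model_a` varies far less between runs, which is the sense in which a lower deviation indicates a more stable model.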

1) TRAINING/TESTING TIME COMPARISON
It is generally acknowledged that time consumption is a key criterion reflecting the performance of a categorization model. Herein, we report the training and testing time of the aforementioned 18 aerial photo categorization models. As shown in Table 3, during training, only two baseline models are faster than our pipeline. This is because the architectures of [31] and [35] are much simpler than ours. Meanwhile, we observe that the per-category accuracies of [31] and [35] are noticeably lower than ours. For the testing time comparison, our method can be conducted at a significantly faster speed than the baseline methods.
Notably, distinguished from model training that can be conducted offline, outstanding testing time is comparably more valuable to many time-sensitive AI systems, such as weather forecasting and automatic navigation.
Our HR aerial photo categorization pipeline involves two key modules: 1) GSP learning using the low-rank algorithm, and 2) geometry-preserving FS. During training, the time consumed by each module is 9h12m (module 1) and 1h51m (module 2). During testing, the time cost of each module is 71ms (module 1) and 11ms (module 2). We observe that most of the training time is spent on module 1, and in practice this can be accelerated by Nvidia GPUs.

C. PARAMETER ANALYSIS
We first evaluate the three weights in our low-rank algorithm. The default values of parameters α, β, γ and L are fixed to 0.3, 0.1, and 0.15, respectively. In our implementation, the default values are determined by 10-fold cross validation. The validation set contains 54000 samples, constituted by selecting 3000 HR aerial photos from each of the 18 categories. More concretely, we tune each of α, β, and γ from zero to one, and all the possible parameter combinations are enumerated to test the HR aerial photo categorization. The parameter combination receiving the highest categorization accuracy is reported as the default values. Based on this, we adjust one of the three parameters while keeping the others unchanged. Each parameter is increased from zero to one, and we report the accuracy accordingly. As shown by the three curves on the left of Fig. 6, the best performances are achieved when α = 0.1, β = 0.15, and γ = 0.3.
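The two-stage tuning protocol described above (an exhaustive search over all combinations, followed by one-at-a-time sweeps) can be sketched as follows. The accuracy function is a hypothetical stand-in for the cross-validated categorization accuracy, and the grid step of 0.05 is an assumption, since the text only states that each weight ranges from zero to one.

```python
from itertools import product

def grid_search(evaluate, grid):
    """Try every (alpha, beta, gamma) on the grid; return the best triple."""
    return max(product(grid, repeat=3), key=lambda p: evaluate(*p))

def sweep_one(evaluate, best, index, grid):
    """Vary one parameter over the grid while the other two stay at `best`."""
    curve = []
    for value in grid:
        params = list(best)
        params[index] = value
        curve.append((value, evaluate(*params)))
    return curve

# Hypothetical stand-in for cross-validated accuracy, peaking at (0.1, 0.15, 0.3).
def toy_accuracy(alpha, beta, gamma):
    return 1.0 - (alpha - 0.1) ** 2 - (beta - 0.15) ** 2 - (gamma - 0.3) ** 2

grid = [i / 20 for i in range(21)]  # 0.0, 0.05, ..., 1.0 (step size is an assumption)
best = grid_search(toy_accuracy, grid)
alpha_curve = sweep_one(toy_accuracy, best, 0, grid)
```

The exhaustive stage costs one evaluation per grid triple (21^3 here), which is why the one-at-a-time sweeps are the cheaper way to produce sensitivity curves once a good operating point is known.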
To evaluate the influences of τ_1, τ_2, and τ_3 in our geometry-preserving FS, we set τ_1 = τ_2 and then tune τ_1 and τ_3, following the experimental settings described above. The initial values of τ_1 and τ_3 are both set to 0.45. As shown by the two curves in Fig. 6, the highest performance is observed when τ_1 = 0.4 and τ_3 = 0.6.

D. ABLATION STUDY
As aforementioned, our method comprises two key modules: 1) GSP learning using the low-rank algorithm, and 2) geometry-preserving FS. Herein, we test the importance of these modules in our HR aerial photo categorization pipeline. Specifically, each module is replaced by a different one, and the performance decrement/increment is presented. Insights are also provided to explain the underlying reasons for the observed results.
In the first place, to evaluate the effectiveness of the low-rank algorithm, two experimental settings are deployed. We first abandon the sparse constraint term ||E||_1 in (4) (marked by ''S11''). Afterward, we abandon the regularizer term in (4) (marked by ''S12''). We report the variation of categorization accuracy in Table 4. Herein, the intersection of column ''Si'' and row ''Oj'' denotes the setup ''Sij''. Noticeably, a shallow feature engineering module will cause a performance decrement. Also, removing the regularizer will greatly decrease the accuracy. This observation shows the necessity of mitigating the overfitting of our designed low-rank algorithm. Next, to evaluate the performance of the geometry-preserving FS, we remove τ_1 ||T||_1, τ_2 ||J||_1, and τ_3 ||K||_12, respectively. As shown in Table 4, abandoning the geometry-preserving term causes the largest drop in categorization accuracy. This demonstrates the importance of maintaining the sample distribution in FS.

V. CONCLUSION
Recognizing aerial images is an indispensable task in geoscience and remote sensing [45], [46], [47], [48], [49]. We proposed a novel HR aerial photo categorization model, wherein the key components are a low-rank algorithm and a geometry-preserving FS. The comparative study on our compiled million-level HR aerial photo set has shown the competitiveness of our method.
One limitation of our work is that the low-rank algorithm has a shallow architecture. Currently, deep models are pervasively used in visual categorization since they can produce more descriptive features. In the future, we plan to upgrade our low-rank algorithm into a deep architecture toward a more descriptive feature extractor.

FIGURE 1. Left: human gaze shifting paths (GSPs) from five observers (arrows with different colors); right: a GSP calculated using the proposed method.

FIGURE 2. An overview of our proposed HR aerial photo categorization pipeline.

FIGURE 3. The statistics of LR&HR aerial images collected from the 100 metropolitan cities.

TABLE 1. The selected 18 categories and the corresponding LR&HR aerial photo numbers.

FIGURE 5. Statistics of LR&HR aerial photos with different quality scores in our compiled LR&HR aerial photo set.
To the best of our knowledge, however, there is no such data set in the literature.
In this work, we spent enormous effort to compile a huge data set containing over 3.6 million LR&HR aerial photos. The sources of these LR&HR aerial photos are Google/Apple/Bing Maps, based on which we designed crawler software that spent 4310 hours searching and downloading LR&HR aerial photos. Specifically, we use the names of the 100 most popular metropolitan cities (as detailed
112432 VOLUME 11, 2023

TABLE 2. Performances with deviations of the aforementioned image recognizers (the highest accuracies are shown in bold numbers).

TABLE 3. Training/testing time of the 18 categorization models.

TABLE 4. HR aerial photo categorization accuracy variation.