High-Resolution Aerial Photo Categorization Model by Cross-Resolution Perceptual Experiences Transfer

There are thousands of observation satellites orbiting the earth, each of which captures massive-scale photographs covering millions of square kilometers every day. In practice, these aerial photos have high resolution and usually contain tens to hundreds of ground objects (e.g., vehicles and rooftops). Understanding the categories of a rich variety of high-resolution aerial photos is an indispensable technique for many applications, such as intelligent transportation, natural disaster prediction, and smart agriculture. In this work, we propose a cross-resolution perceptual experiences transfer framework for categorizing high-resolution aerial photos, focusing on leveraging the perceptual features of low-resolution aerial photos to enhance the feature selection of high-resolution ones. More specifically, we first construct a gaze shifting path that mimics human visual perception of both low-resolution and high-resolution aerial photos, from which the corresponding deep gaze shifting path features are engineered. Afterward, a kernel-induced feature selection algorithm is formulated to obtain a succinct set of deep gaze shifting path features discriminative across low- and high-resolution aerial photos. Based on the selected features, the labels of low- and high-resolution aerial photos are collaboratively utilized to train a linear classifier for categorizing high-resolution ones. Extensive comparative studies have validated the superiority of our method.


I. INTRODUCTION
Owing to the ability to deliver multiple satellites with a single rocket, many earth observation satellites have been launched since 1980. As we know, high-resolution aerial photos (typical resolutions over 5K × 5K) containing ground objects with sophisticated spatial interactions are routinely captured by these satellites. Semantically understanding these ground objects as well as their inherent spatial topologies is an important technology in many state-of-the-art AI systems.
As an example, we can spatially parse the distribution of different animals and forests, and then intelligently understand the movement trends of wildlife. Such an application is informative for maintaining habitats in sanctuaries, especially for endangered animals.
The associate editor coordinating the review of this manuscript and approving it for publication was Andrea F. Abate .
In geoscience and remote sensing, researchers have designed many visual annotation or classification models to characterize aerial images with normal resolutions (typically 800 × 800 ∼ 2K × 2K). Plenty of experiments and modern AI systems have demonstrated their superior performance and convenience. Nevertheless, in practice, the previous models cannot effectively encode high-resolution aerial photos for the following reasons: 1) Typically, there exists a rich set of multi-scale foreground objects inside a high-resolution aerial photo, as shown in Fig. 1. To calculate the semantics of a high-resolution aerial photo, we expect a bionic model that simulates the process by which humans perceive the foreground salient regions. Actually, building a deep model that can simultaneously extract the visually/semantically salient regions and engineer the deep features for these extracted regions is non-trivial. 2) Toward an efficient and interpretable image model for semantic understanding, we want high-quality features shared between high- and low-resolution aerial images. However, instead of lying in the original feature space, the shared discriminative features may be distributed in a high-order feature space, which may be unexpectedly high-dimensional. This makes conventional feature selection over the high-order feature space computationally intractable. We design a new cross-resolution perceptual experiences transfer framework that adopts the deeply-learned perceptual experiences of low-resolution aerial images to facilitate categorizing high-resolution ones. An overview of our high-resolution aerial photo categorization is presented in Fig. 2.
Utilizing a considerable quantity of high-resolution and low-resolution aerial photos, a machine learning algorithm is used to detect the salient regions, based on which the gaze shifting paths are generated and the deep features are calculated. Aiming at a concise set of discriminative features shared between high- and low-resolution aerial images, we explicitly map the deep gaze shifting path features onto a high-order, kernel-induced feature space. To inherit the perceptual knowledge of low-resolution aerial photos, a feature selection algorithm is developed to jointly 1) minimize the marginal/conditional distribution discrepancy between high-resolution and low-resolution aerial photos, and 2) maximize the linear classification accuracy. Based on the selected features, both labeled high-resolution and low-resolution aerial photos are employed to train the classifier. This mitigates the sample insufficiency problem, which may cause the classifier to overfit during high-resolution aerial photo categorization. A comparative study with 17 image recognition models has demonstrated the advantage of our method.

II. RELATED WORK
Dozens of image recognition models have been developed to analyze aerial photos. For image-level modeling, Chalavadi et al. [34] constructed a novel topological feature to model the inter-region connections inside each aerial photo, and a kernel-induced vector is calculated as the image representation for categorization. The authors of [35] presented a weakly-supervised model that semantically labels high-resolution aerial photos at the image level. The authors of [36] proposed to combine the so-called random forest and a semantics-aware feature extractor to classify each aerial photo into multiple categories. Akar et al. [37] developed a hierarchical CNN architecture for annotating the multiple labels of high-resolution aerial photos describing many downtown areas. Cai and Wei [5] proposed a cross-attention mechanism to learn the weights of aerial image features both horizontally and vertically. Costea et al. [39] formulated a vision transformer for aerial image classification, wherein the long-term contextual dependencies among regions can be intrinsically encoded.
For region-level modeling, Pan et al. [4] formulated a novel deep neural network for discovering multi-scale salient objects within each aerial photo. In [1], a focal-loss deep architecture is proposed that optimally discovers vehicles in aerial images. Sameen et al. [38] developed a geolocalization model for aerial photos by intelligently extracting intersections and streets. Wang et al. [8] integrated feature enhancement and soft label assignment into an anchor-independent object detector for aerial images. Yu et al. [9] proposed a deep rotation-invariant detector that effectively estimates the angles of multi-scale objects inside aerial images. The authors of [31] proposed a parallel deep model called mSODANet that hierarchically learns contextual features from multi-scale and multi-FoV (field-of-view) ground objects. Notably, different from the above methods, our approach is bionically inspired and accurately mimics human gaze behavior.

III. OUR PROPOSED METHOD
A. DEEP GAZE SHIFTING PATH LEARNING
There are hundreds of objects and their parts in each high-resolution aerial photo. According to recent biological and psychological studies [2], humans typically attend to a succinct set of visually prominent objects during visual perception. When humans perceive a high-resolution image, the human vision system perceives the foreground salient objects first, such as an aircraft and its components. Meanwhile, the remaining background typically remains unprocessed in practice. We therefore incorporate such human visual perceptual experience into the high-resolution aerial photo categorization task. Herein, rapid object part extraction coupled with a novel active learning paradigm is deployed to detect the foreground salient objects.
The well-known BING [7] operator is leveraged as the object descriptor. By applying the BING operator, we obtain a rich set of object patches inside a high-resolution aerial photo. Actually, humans usually attend to very few objects within each scene. To mimic this, we use an effective active learning method [6] to sequentially find K representative object patches from each high-resolution aerial photo. It encodes the following attributes: 1) the high-resolution aerial photo's spatial features, and 2) the object patches' semantic labels.
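The sequential selection of K representative patches can be sketched as a simple greedy procedure. The farthest-point strategy below is a hypothetical stand-in for the active learning step of [6], and `feats` is assumed to hold one deep feature vector per BING proposal:

```python
import numpy as np

def select_representative_patches(feats, K):
    """Greedily pick K mutually diverse ("representative") patches by
    farthest-point sampling over the patch features (N x D array)."""
    # seed with the patch closest to the feature mean (a rough "most typical" pick)
    seed = int(np.argmin(np.linalg.norm(feats - feats.mean(axis=0), axis=1)))
    chosen = [seed]
    dist = np.linalg.norm(feats - feats[seed], axis=1)
    for _ in range(K - 1):
        nxt = int(np.argmax(dist))            # patch farthest from current picks
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(feats - feats[nxt], axis=1))
    return chosen

# toy demo: four patch features, pick K = 3
demo = select_representative_patches(
    np.array([[0., 0.], [0., 0.1], [10., 0.], [0., 10.]]), K=3)
```

The greedy max-min criterion is only one plausible notion of representativeness; the actual active learner additionally uses the semantic labels mentioned above.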
Based on the sequentially selected K object patches, each gaze shifting path is constructed by connecting the K object patches. The constituent object patches and their spatial interactions simultaneously contribute to the gaze shifting path's appearance. Herein, given a K-sized gaze shifting path, we represent it by the matrix G = [G1, G2], where G1 is a K × T matrix whose T columns describe the CNN feature of each image patch within the gaze shifting path, and G2 is the K × K matrix indicating node linkage. Toward a simple yet effective feature, matrix G is row-wise concatenated into a long feature vector u of dimensionality K(K + T).
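Concretely, the path descriptor can be assembled as below. The horizontal concatenation G = [G1, G2] is an assumption, but it is the arrangement consistent with the stated K(K + T) dimensionality of u:

```python
import numpy as np

K, T = 5, 8                               # K patches on the path, T-dim CNN feature
G1 = np.random.rand(K, T)                 # K x T: CNN feature of each patch
G2 = np.eye(K, k=1) + np.eye(K, k=-1)     # K x K: node linkage (chain adjacency here)
G = np.hstack([G1, G2])                   # K x (T + K) path matrix
u = G.reshape(-1)                         # row-wise concatenation into vector u
assert u.shape == (K * (K + T),)          # the K(K+T)-dimensional path feature
```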

B. CROSS-RESOLUTION PERCEPTUAL EXPERIENCES TRANSFER
Theoretically, the extracted deep gaze shifting path features are usually distributed in a high-dimensional, high-order feature space. Comparatively, the number of labeled high-resolution aerial photos is relatively small. This inevitably causes the curse of dimensionality, which in turn hurts high-resolution aerial photo categorization. To handle this problem, a cross-resolution perceptual experiences transfer framework is formulated to select a succinct set of highly discriminative features shared between high-resolution and low-resolution aerial photos. Thereby, the selected features from high-resolution and low-resolution aerial photos can be collaboratively utilized to train the categorization model. In a word, cross-resolution perceptual experiences transfer can simultaneously reduce the feature dimensionality and increase the number of training samples, based on which the curse of dimensionality can be mitigated substantially.

1) FEATURE MAPPING BY APPROXIMATING POLYNOMIAL KERNEL
The polynomial kernel can be mathematically represented as

k(u, u′) = (u⊤u′ + 1)^Q, (1)

where Q denotes the degree. Such a kernel is composed of features whose monomial degree is no larger than Q. This can be further represented as

k(u, u′) = Σ_{q=0}^{Q} C^q_Q Σ_e Π_{j=1}^{q} u_{e_j} u′_{e_j}, (2)

where e ∈ {1, · · · , K(K + T)}^q enumerates all selections of q coordinates of u, and K(K + T) is the dimensionality of the deep gaze shifting path feature. By leveraging the multinomial theorem, (2) can be reorganized into

k(u, u′) = ⟨ϕ(u), ϕ(u′)⟩, (3)

where each coordinate of the explicit map ϕ(u) is a (multinomially weighted) distinct monomial of u with degree at most Q. For degree Q, there are a total of S = C^Q_{K(K+T)+Q} candidate features for feature selection, where the operator C^j_i counts the combinations of selecting j items from i items.
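The explicit map can be enumerated directly for small dimensionalities. The sketch below lists each monomial of degree at most Q by its index multiset and checks the candidate-feature count S = C^Q_{d+Q} quoted above (the multinomial coefficients are omitted for brevity):

```python
from itertools import combinations_with_replacement
from math import comb

def monomial_index_sets(d, Q):
    """All distinct monomials of degree <= Q over d variables, each given as a
    sorted tuple of variable indices (one coordinate of the explicit feature map)."""
    feats = []
    for q in range(Q + 1):
        feats.extend(combinations_with_replacement(range(d), q))
    return feats

d, Q = 6, 3                               # toy dimensionality and degree
feats = monomial_index_sets(d, Q)
assert len(feats) == comb(d + Q, Q)       # S = C(d + Q, Q) candidate features
```

The count grows combinatorially in both d and Q, which is exactly why selecting a small subset of these features, rather than materializing all of them, is necessary.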

2) OBJECTIVE FUNCTION OF FEATURE SELECTION
By leveraging the above explicit feature map, the deep gaze shifting path features engineered from high-resolution and low-resolution aerial photos can be represented by {(ϕ(u^H_i), r^H_i)}_{i=1}^{M_H} and {(ϕ(u^L_i), r^L_i)}_{i=1}^{M_L} respectively, where M_H and M_L count the high-resolution and low-resolution aerial photos respectively, and r^H and r^L denote the category labels of the high-resolution and low-resolution aerial photos respectively. Herein, a novel feature selection algorithm is proposed to select features discriminative for both high-resolution and low-resolution aerial photos.
We denote the high-resolution aerial photos as {(u^H_i, r^H_i)}_{i=1}^{M_H}, where u^H_i denotes the K(K + T)-dimensional deep gaze shifting path feature and r^H_i the corresponding category label. We denote by U^H and r^H the deep gaze shifting path features and labels of the entire set of high-resolution aerial photos, and analogously U^L and r^L for the low-resolution ones. Let p_H(U^H) and p_L(U^L) be the marginal distributions of U^H and U^L. The objective of our feature selection is to select an optimal feature set that predicts the labels {r^H_i}_{i=1}^{M_H} from the input high-resolution aerial photos {u^H_i}_{i=1}^{M_H} under the assumptions p_H(u^H) ≠ p_L(u^L) and q_H(r^H | u^H) ≠ q_L(r^L | u^L). It is reasonable to assume that there exists a binary indicator s ∈ {0, 1}^S such that p(ϕ(u^H) ⊙ s) ≈ p(ϕ(u^L) ⊙ s) and p(r^H | ϕ(u^H) ⊙ s) ≈ p(r^L | ϕ(u^L) ⊙ s), where ⊙ denotes the element-wise (Hadamard) product. Our target is to learn the indicator s. Since we practically have insufficient high-resolution aerial photos, s cannot be effectively learned in isolation due to overfitting. Therefore, we propose to learn the binary indicator s and a linear classifier H jointly, so as to satisfy the following three criteria: 1) the distance between the marginal distributions p(ϕ(u^H) ⊙ s) and p(ϕ(u^L) ⊙ s) is sufficiently small; 2) ϕ(u^H) ⊙ s and ϕ(u^L) ⊙ s preserve the discriminative dimensions of the deep gaze shifting path features ϕ(U^H) and ϕ(U^L), so that p(r^H | ϕ(u^H) ⊙ s) ≈ p(r^L | ϕ(u^L) ⊙ s); and 3) the learned classifier H, applied to the masked features ϕ(u) ⊙ s, can optimally categorize the training low-resolution aerial photos ϕ(u^L). These criteria can be mathematically represented as follows: 1) Marginal distribution discrepancy minimization: Given the polynomial-kernel-based feature mapping ϕ(u) induced by (3), we aim to minimize the marginal
distribution discrepancy by feature selection. This can be formulated as

min_{s ∈ S} ∥ (1/M_H) Σ_{i=1}^{M_H} ϕ(u^H_i) ⊙ s − (1/M_L) Σ_{i=1}^{M_L} ϕ(u^L_i) ⊙ s ∥²_F,

where ∥·∥²_F denotes the squared Frobenius norm, and the binary indicator's domain is S = {s | s ∈ {0, 1}^S, ∥s∥_0 ≤ A}, with A the maximum number of selected features.
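Criterion 1 amounts to comparing masked feature means across the two domains. A minimal sketch, assuming `phiH` and `phiL` stack the mapped features row-wise and `s` is the binary indicator:

```python
import numpy as np

def marginal_discrepancy(phiH, phiL, s):
    """Squared Frobenius distance between the mean masked features of the
    high-resolution (M_H x S) and low-resolution (M_L x S) photo sets."""
    diff = (phiH * s).mean(axis=0) - (phiL * s).mean(axis=0)
    return float(np.sum(diff ** 2))

# toy check: masked means [2, 3] vs. [3, 3] differ by 1 in one dimension
gap = marginal_discrepancy(np.array([[1., 2.], [3., 4.]]),
                           np.array([[2., 2.], [4., 4.]]),
                           np.array([1, 1]))
```

Zeroing entries of `s` removes the corresponding dimensions from the comparison, which is how the selection indicator controls the discrepancy.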
2) Conditional distribution discrepancy minimization: Practically, the posterior probabilities q_H(r^H | u^H) and q_L(r^L | u^L) have complicated forms. Instead, we utilize the class-conditional distributions q_H(u^H | r^H = b) and q_L(u^L | r^L = b). More specifically, we first calculate the conditional distribution distance between the high-resolution and low-resolution aerial photos labeled by b ∈ {1, · · · , B}. Thereafter, we attempt to minimize the conditional distribution discrepancy

min_{s ∈ S} Σ_{b=1}^{B} ∥ (1/M^H_b) Σ_{u^H_i ∈ U^H_b} ϕ(u^H_i) ⊙ s − (1/M^L_b) Σ_{u^L_i ∈ U^L_b} ϕ(u^L_i) ⊙ s ∥²_F,

where U^H_b and U^L_b denote the high-resolution and low-resolution aerial photos with category label b, and M^H_b and M^L_b count their numbers. 3) Empirical error minimization: As mentioned, we expect the selected features not only to minimize the distribution difference, but also to be succinctly discriminative for visual categorization. Toward a succinct set of discriminative features, the third criterion is to minimize the empirical error. In our implementation, the One-vs-All coding of error-correcting output codes (ECOC) [4] is employed, and the empirical error on both the high-resolution and low-resolution aerial photos is minimized, i.e.,

min_{s ∈ S, H} ∥ (Φ(U^H) ⊙ s) H − R^H ∥²_F + ∥ (Φ(U^L) ⊙ s) H − R^L ∥²_F,

where Φ(U) stacks the mapped features row-wise, H is the linear classifier, and R^H and R^L are the ECOC coding matrices of the labels. By combining the above criteria, the final objective function is the weighted combination of the three terms above, minimized jointly over s ∈ S and the classifier H. This objective function is NP-hard due to the combinatorial integral constraints on s. Herein, we adopt an efficient solution as detailed in [40].
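The class-conditional counterpart (criterion 2) averages within each category before comparing. A sketch under the same assumptions as the feature matrices above, with labels `rH`, `rL` in 1..B:

```python
import numpy as np

def conditional_discrepancy(phiH, rH, phiL, rL, s, B):
    """Sum over categories b of the squared distance between the per-class
    masked feature means of the two resolution domains."""
    total = 0.0
    for b in range(1, B + 1):
        mH = (phiH[rH == b] * s).mean(axis=0)   # class-b mean, high-resolution
        mL = (phiL[rL == b] * s).mean(axis=0)   # class-b mean, low-resolution
        total += float(np.sum((mH - mL) ** 2))
    return total

# toy check: one class, class means [2, 0] vs. [4, 0] -> discrepancy 4
gap = conditional_discrepancy(np.array([[1., 0.], [3., 0.]]), np.array([1, 1]),
                              np.array([[4., 0.]]), np.array([1]),
                              np.array([1, 1]), B=1)
```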

IV. EXPERIMENTAL RESULTS AND ANALYSIS
A. COMPARATIVE STUDY
In this experiment, we evaluate our high-resolution aerial photo categorization method by comparing its effectiveness and efficiency with a number of counterparts. We first compare our method with deep architectures tailored for aerial photo categorization. Then, our method is compared with multiple state-of-the-art deep generic object/scene recognition models. The experimental data set is from [35].
First, we compare our method with seven deep categorization models [14], [15], [16], [17], [18], [19], [20] that intrinsically encode prior knowledge of different aerial photo categories. We notice that the source codes of [14], [15], [18], and [19] are publicly available; for these, we conduct the comparative study with the default parameter settings. For [16], [17], and [20], the source codes are unavailable to our knowledge, so we re-implemented them ourselves in Python. We have tried our best to make the re-implemented models perform similarly to the results reported in their publications. Nowadays, many deep generic recognition models perform impressively on categorizing aerial photos. In this experiment, we also compare our method with seven deep generic object categorization models: the spatial pyramid pooling CNN (SPP-CNN) [33], CleanNet [11], discriminative filter bank (DFB) [8], multi-layer CNN-RNN (ML-CRNN) [12], multi-label graph convolutional network (ML-GCN) [29], semantic-specific graph (SSG) [30], and multi-label transformer (MLT) [31]. Furthermore, since high-resolution aerial photo categorization can be deemed a sub-topic of scenery classification, we additionally compare our method with three well-known scenery classification models [13], [26], [28]. Among these, only the source codes of [13] are unavailable; thus we re-implemented it in C++.
For the above 18 compared object/scene categorization models, we repeatedly test each model ten times, and the average accuracies are displayed in Table 1. Our method performs the best, as expected. To quantify the stability of these categorization models, we also report their standard errors. 1) Our method outperforms the other aerial photo categorization models remarkably for three reasons. First, to facilitate deep model training, our competitors typically resize each original aerial photo to a fixed and much smaller size (e.g., 128 × 128) for the subsequent hierarchical feature engineering. This hurts the learning of a high-resolution aerial photo categorization model, since many tiny but discriminative visual details are lost. Second, except for our method, none of the seven counterparts can select high-quality features by leveraging discriminative information from low-resolution aerial photos. Third, only our method generates gaze shifting paths that sequentially capture the semantics of high-resolution aerial photos as perceived by humans. These are further incorporated into a CPKP-based feature selection for calculating category labels. Comparatively, the seven counterparts only globally/locally characterize each high-resolution aerial photo, wherein the perceptual visual features are neglected. 2) The seven generic object recognition algorithms perform worse than ours for three reasons. First, these generic recognition models generally handle medium-sized images typically containing tens of salient objects. They can hardly discover the tiny but discriminative regions inside each high-resolution aerial photo. Second, our method can flexibly incorporate the prior knowledge of low-resolution aerial photos. Contrastively, the seven generic object recognition models cannot encode such information. Third, by leveraging our
CPKP-based feature selection, our method can dynamically abandon indiscriminative regions, whereas the seven generic object recognition models lack this ability. 3) The three scene categorization models perform unsatisfactorily on high-resolution aerial photos. This is because they deeply and implicitly learn a descriptive set of scene-aware semantic categories, such as ''birds'' and ''tables'', which infrequently appear in our high-resolution aerial photo set. Moreover, the three categorization methods successfully handle sceneries captured at horizontal view angles, but our collected high-resolution aerial photos are captured at overhead view angles. Apparently, such a view-angle gap decreases the categorization accuracy.
To quantitatively analyze the importance of cross-resolution perceptual experiences transfer (CPET), we set the number of low-resolution aerial photos to zero, meaning that no perceptual information from low-resolution aerial photos is utilized. We notice that the average categorization accuracy is reduced by 5.443%, which clearly shows the importance of cross-resolution perceptual experiences transfer.
It is generally acknowledged that time consumption is a key criterion reflecting the performance of a categorization model. Herein, we report the training and testing time of the aforementioned 18 categorization models. As shown in Table 2, during training, only two baseline categorization models outperform our pipeline. This is because the architectures of [29] and [33] are much simpler than ours. Simultaneously, we observe that the per-category accuracies of [29] and [33] are both about 5% lower than ours. For the testing time comparison, our method runs at a much faster speed than all the baseline methods.

B. PARAMETER ANALYSIS
We evaluate high-resolution aerial photo categorization by varying the polynomial kernel degree Q and the target dimensionality V for the cross-resolution perceptual experiences transfer-based feature selection. We first fix V, tune Q from one to five, and report the high-resolution aerial photo categorization accuracy. We observe that the highest accuracy is achieved when Q = 2. Meanwhile, we observe that the number of candidate features increases to 321,402,081 when Q = 5. Based on these observations, we prefer a small Q in practice. Subsequently, we fix Q = 2 and tune V from one to 100. Noticeably, the highest categorization accuracy is achieved when V = 15. This demonstrates that a succinct set of high-quality features is sufficiently descriptive for distinguishing different high-resolution aerial photo categories.
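The quoted candidate count can be reproduced combinatorially. The feature dimensionality d = K(K + T) = 128 below is an inferred value, not stated explicitly in this section; it is chosen because it makes the count S = C^Q_{d+Q} match the reported number:

```python
from math import comb

d, Q = 128, 5          # d = K(K+T): assumed path-feature dimensionality
S = comb(d + Q, Q)     # S = C(d + Q, Q) candidate features
assert S == 321402081  # the count quoted for Q = 5
```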

FIGURE 1. Pairwise high-resolution aerial photos with their gaze shifting paths.

FIGURE 2. Categorizing high-resolution aerial photos by leveraging cross-resolution perceptual experiences transfer.
U^H_b and U^L_b denote the high-resolution and low-resolution aerial photos with category label b; M^H_b and M^L_b count their numbers, respectively.

TABLE 1. Accuracies with standard errors of the 18 categorization models (we repeat each experiment 20 times and report the average accuracies; each bold number represents the best result).
