Brain-Inspired Remote Sensing Interpretation: A Comprehensive Survey

Brain-inspired algorithms have become a new trend in next-generation artificial intelligence. Through research on brain science, the intelligence of remote sensing algorithms can be effectively improved. This article summarizes and analyzes the essential properties of brain cognitive learning and the recent advances in remote sensing interpretation. First, this article introduces the structural composition and the properties of the brain. Then, five representative brain-inspired algorithms are studied, including multiscale geometry analysis, compressed sensing, attention mechanism, reinforcement learning, and transfer learning. Next, this article summarizes the data types of remote sensing, the development of typical applications of remote sensing interpretation, and the implementations of remote sensing, including datasets, software, and hardware. Finally, the top ten open problems and the future direction of brain-inspired remote sensing interpretation are discussed. This work aims to comprehensively review the brain mechanisms and the development of remote sensing and to motivate future research on brain-inspired remote sensing interpretation.


I. INTRODUCTION
REMOTE sensing is a technology that observes and detects objects on the Earth using sensors mounted on aircraft or satellites [1], [2]. It is a noncontact, long-distance detection technology that began in the 1960s [3]. It uses visible light, infrared, and other electromagnetic waves radiated or reflected by the target to perceive and identify the target at a long distance. The data obtained by remote sensing technology enhance the ability of human beings to study the Earth [4]. At the same time, remote sensing applications span many fields. Remote sensing is widely used in various military and civilian areas, such as satellite surveillance, land and resources survey, land use and land cover, urban dynamic change monitoring, meteorological monitoring, environmental assessment and monitoring, and disaster investigation and evaluation. This dramatically expands the impact of remote sensing on human production and life [5].
Nowadays, we face many challenges in remote sensing interpretation. First, due to the rapid growth of unmanned aerial vehicle (UAV) and satellite technology in recent years, the amount of data has increased dramatically [6]. The spectral, spatial, and temporal dimensionalities of the data require more computing resources [7]. In addition, large, labeled datasets in remote sensing are not easily obtained. This restricts the use of larger models to improve the accuracy of the algorithms. Last but not least, the interpretability of algorithms is necessary for remote sensing interpretation [8].
In recent years, artificial intelligence technology has improved the accuracy and efficiency of remote sensing interpretation. It has enabled automatic feature extraction, parameter learning, and classification.
Artificial intelligence aims to study and develop computer algorithms that can handle tasks requiring human intelligence. Its development is closely related to brain science [9]. Brain science studies the structure, function, and operation mechanism of the biological brain to understand how the brain processes information, mines knowledge, and makes decisions. Artificial intelligence draws inspiration from brain science and designs intelligent algorithms.
In 1943, neuroscientist W.S. McCulloch and mathematician W. Pitts established the MP model, an abstract and simplified model constructed according to the structure and working principle of biological neurons. The so-called "simulated brain" was born [10]. In 1949, Hebbian learning was proposed. This algorithm is inspired by the dynamics of biological nervous systems. According to the study, a synapse between two neurons is strengthened when the neurons on either side of the synapse (input and output) have highly correlated outputs. Hebbian learning borrows this property and increases the weight between two highly correlated neurons during the learning process [11]. In 1958, the perceptron was proposed to model the way information is stored and organized in the brain [12]. In 1983, physicist John Hopfield proposed a neural network for associative memory called the Hopfield network [13]. In 2006, Geoffrey Hinton proposed a multilayer neural network for reducing the dimensionality of data, which opened the curtain of deep learning research [14].
The research on artificial intelligence is closely related to the brain. These algorithms, inspired by the structure and characteristics of the brain, continue to promote the development of artificial intelligence. Artificial intelligence is also constantly looking for new inspiration from biological brains.

A. Motivation
In recent years, as more and more diverse neural networks have been proposed, people have paid more attention to the design of brain-inspired algorithms, and many reviews of brain-inspired algorithms have been published. Hassabis et al. [15] analyzed the historical interaction between artificial intelligence and neuroscience fields, providing new perspectives to develop artificial intelligence. Yang et al. [16] provided a comprehensive review of the research of brain-inspired artificial intelligence and its related engineering techniques. Strisciuglio and Petkov [17] focused on the relationship between research in neuroscience and advances in computer vision. Simeone et al. [18] organized a special section to introduce machine learning (ML) and signal processing algorithms for brain-inspired computing. Fan et al. [9] researched new brain imaging techniques to explore the secrets of brain science and built brain dynamic connectivity maps. Jiao et al. [19] discussed the main problems and applications of bio-inspired computation and recognition, introducing algorithm implementation, model simulation, and practical application of parameter setting. Tianyuan et al. [20] introduced the relationship between artificial intelligence and neuroscience, the research status of brain-inspired intelligence, and the profound influence of artificial intelligence in other fields.
The characteristics of the brain and brain-inspired algorithms are worth discussing. The brain-inspired algorithms are developed according to the research on the latest brain characteristics and improve performance, efficiency, and interpretability. This will provide a new perspective for remote sensing interpretation. In this review, we mainly investigate the features of the brain and introduce the related brain-inspired algorithms. In addition, the interpretation (data types and main applications of remote sensing) and implementation (public datasets, software, and hardware) are presented. We attempt to summarize the characteristics of the brain and discuss remote sensing tasks to provide readers with new perspectives on remote sensing data analysis and promote the design of brain-inspired algorithms. The main contributions of the present review can be summarized as follows.
1) We provide a comprehensive survey of brain structure and summarize the brain properties as sparsity, learning mechanism, selectivity, directionality, plasticity, and diversity.
2) This survey investigates five essential applications in remote sensing data interpretation, including object classification, target detection, change detection, video tracking, and 3-D reconstruction. These methods cover image tasks in remote sensing, as well as video and point cloud data developed in recent years.
3) The public datasets and an overview of related software and hardware are summarized.
4) Current challenges and future research directions are presented.
The rest of this article is organized as follows (as shown in Fig. 1). The basic structure and characteristics of the brain and the brain-inspired algorithms are presented in Sections II and III. In Section IV, the data types, such as optical images, radar images, airborne light detection and ranging (LiDAR), and remote sensing videos, are summarized. In addition, the latest advances in the five applications of remote sensing are presented. In Section V, the public datasets, software platforms, and hardware resources required to implement the algorithms are discussed. We discuss the future challenges and directions of combining brain mechanisms with remote sensing interpretation in Section VI. Finally, Section VII concludes this article.

A. Biological Structure of the Brain
The brain is the principal organ in the central nervous system. It is mainly composed of the cerebral cortex, cerebellum, diencephalon, and brainstem. Among them, the cerebral cortex is the most advanced part of conscious thinking and sensory processing, and it is also the main part of the brain. It has the ability to recognize, represent, and learn. It contains four functional areas: temporal lobe, occipital lobe, frontal lobe, and parietal lobe [21]. The specific division is shown in Fig. 2.
1) Occipital lobe: It is the visual processing center of the brain, including low-level visuospatial processing (position, spatial frequency), color discrimination, and motion perception.
2) Temporal lobe: It is responsible for processing sensory input using visual memory, language, and emotional connections to derive higher level information.
3) Parietal lobe: It can process various sensory information, including touch, smell, taste, etc. It is also related to language and memory.
4) Frontal lobe: It is the most advanced part of brain development and has advanced cognitive functions. It is mainly responsible for the processes of movement, cognition, and thinking. It is capable of tasks, such as attention, judgment, thinking, analysis, calculation, and planning, and is related to human needs and emotions.
The cerebral cortex facilitates the development and computation of neural networks. Brain perception and cognition are the biological basis, providing new ideas for the efficient and accurate realization of artificial intelligence perception and understanding. Unfortunately, these natural biological properties are not fully considered in current neural network designs. Therefore, brain-inspired modeling and algorithm research is significant and can further promote the development of a new generation of artificial intelligence.

B. Biological Properties of the Brain
The research on the biological properties of the brain has opened a new window for brain-inspired remote sensing. Analyzing the biological properties of the human learning mechanism can help us establish a variety of algorithms to simulate the brain. Understanding the brain mechanism has recently been a significant new development trend in the international academic community. For the perception and cognition of knowledge, the brain mainly has biological properties, such as sparsity, learning mechanism, selectivity, directionality, plasticity, and diversity.
1) Sparsity: The biological brain, especially the human brain, is a hierarchical, sparse, and periodic structure [22], as shown in Fig. 3. Sparsity plays an important role in biological brains. Olshausen and Field [23] presented the neuron sparse coding theory. In 2007, Huber et al. [24] and Houweling and Brecht [25] tested the hypothesis of "sparse coding" of neurons with rat experiments. The processing of scene information by the biological retina is sparse, which makes learning more efficient. In the brain's primary visual cortex (V1), researchers in computational neuroscience believe that sparse coding is the main way of image representation in the visual system. The neurons in the V1 are also sparse in the dynamic processing and computation of information. Simultaneously, the neurons in the V4 area realize the representation of visual information through sparse coding. The higher the level, the larger the receptive field, that is, the information processing is from a local to a larger area. When the level is low, the area processed by the receptive field is smaller, and the sparsity is stronger, and vice versa.
2) Learning Mechanism: The human brain is good at rapid cross-task learning and generalized cognition. In 2011, Tenenbaum et al. [26] pointed out that the brain has a strong ability for abstract representation and can learn generalized knowledge from a small amount of data. In the brain, the region responsible for cognition and learning is mainly the hippocampus. Cells in the hippocampus are interconnected into networks, each of which is defined by a more abstract grid of cells. Based on these abstract templates for expressing relationships and symbols, it is easy for the brain to directly apply the existing abstract templates and recombine them to understand new things when receiving external environmental stimuli or tasks.
The human brain stores a vast amount of knowledge about the world that underlies language, thought, and reasoning. There are two kinds of knowledge representation in the human brain, sensory and language derived. The ability to form memories is a key to learning and knowledge accumulation. In 2020, Josselyn and Tonegawa [27] explored evidence of engram cells as the basis of memory (especially in rodents), investigating how new information is integrated into existing knowledge memory.
3) Selectivity: Roelfsema [28] pointed out that the brain has the ability to pay attention to special things and autonomously control the attention area in a new environment.
Selective attention modulates neuronal activity in nearly all brain structures responsible for visual processing, including ventral pathways (from V1 through extrastriate cortex (V2-V4) to inferotemporal cortex), dorsal pathways (from V1 to V2 to the middle and medial temporal lobes and parietal lobes responsible for motor information processing), prefrontal lobes, and subcortical nuclei, such as the lateral geniculate body, superior colliculus, pulvinar, dorsomedial thalamus, reticular nucleus of the thalamus, striatum, and substantia nigra reticularis.
At the same time, the brain receives a large amount of information. However, it cannot process all the information entering the system with the same degree of priority. Only some information can be filtered and processed through selective attention and enter consciousness. For example, the primary visual cortex can generate visual saliency maps in the very early stages of visual information processing to guide the distribution of spatial selective attention, regulate sensory input, and improve people's perception and behavior. In addition, selective attention has various regulatory effects on the neural representation of target stimuli, such as enhancing neuronal firing and firing synchronization, enhancing neuronal selectivity, enhancing neuronal signal-to-noise ratio, and moving and reducing neuronal receptive fields. Therefore, selective attention is a deeply sophisticated cognitive process that always coordinates the brain's cognitive processing.
4) Directionality: In 1971, O'Keefe discovered in the course of experiments that there are "place cells" in the hippocampus that can record location information and can be selectively activated to give specific locations a special identity. From the mid-1980s to the early 1990s, "head direction cells" were discovered, which determine the orientation of the head and mark orientation with selective excitation. The "grid cells" that can delineate a plane coordinate system were also discovered; they can record the position information generated during movement. These cells cooperate with each other to create a 2-D map in the brain, the material basis for cognitive maps. In 2015, Finkelstein et al. [29] pointed out that there are azimuthal and oblique angle cells in the brain that can perceive direction and position information.

5) Plasticity: The brain changes its internal neural mechanisms in response to the needs of the external environment. In other words, the brain is constantly assimilating and accommodating, so the brain has plasticity [30]. Brain plasticity refers to the ability of the brain to be modified by environment and experience. It can be divided into structural plasticity and functional plasticity. The structural plasticity of the brain means that synapses and neurons within the brain can establish new connections under the influence of learning and experience, thereby affecting the behavior of individuals. It includes neuronal plasticity and synaptic plasticity. Functional plasticity means that, through learning and training, the function of a representative area of the brain can be taken over by adjacent brain areas. It is also manifested in the partial recovery of brain function in patients with brain injury after learning and training. Brain plasticity is closely related to learning and memory.
6) Diversity: The diversity of neurons is the basis for the complex and delicate functions of the brain. In 2021, Berg et al. [31] used techniques, such as patch clamp, to reveal the richness of neuronal types in the cortex. In 2021, Yao et al. [32] constructed a cell-type atlas of the mouse primary motor cortex, characterized more than 56 neuron types, and analyzed the developmental mechanism of the diversity of interneurons in the human brain. They also discovered the interneuron precursor cell types that exist specifically in the human brain and revealed the richness and diversity of human brain interneurons compared with other species.

III. THEORY OF THE BRAIN-INSPIRED ALGORITHMS
In this section, we discuss related brain-inspired theories from the perspective of brain properties. First, multiscale geometric analysis and compressed sensing (CS) have been extensively studied due to sparsity in the brain. The attention characteristics inspired the combination of attention mechanism and deep neural network to create SENet [33], nonlocal [34], transformer [35], and other networks. The training of artificial intelligence algorithms is enriched by reinforcement learning and transfer learning, which draw on the brain's natural learning process. This section starts from the abovementioned brain-inspired algorithms and relates them to algorithms in remote sensing to provide readers with new ideas for combining the two.

A. Compressed Sensing
CS is a breakthrough theory for information acquisition. When the sampling rate is substantially lower than the "Nyquist" sampling rate, CS can still accurately reconstruct sparse signals with high probability. It gets discrete samples of the signal with random sampling and reconstructs them using a nonlinear reconstruction technique. Its core idea is mainly based on the sparse structure of the signal and the uncorrelated characteristics of the signal [36], [37], [38].
The sampling method of CS is a simple operation correlating a signal with a particular set of waveforms. These waveforms are independent in the sparse space. The CS method can directly obtain compressed samples through the time domain transformation of the signal, which reduces the redundant information in the signal sampling process. An optimization algorithm is required to recover the original signal from the compressed samples. It is an underdetermined linear inverse problem where the signal is known to be sparse. Therefore, the prerequisites for realizing CS are that the signal is sparse in some transform domain (for example, the frequency domain) and that a random subsampling mechanism is adopted.
CS has two important operations to satisfy the above conditions: sparse representation and compressed observation. Sparse representation expresses complex signals as uncorrelated sparse signals. Compressed observation implements the random subsampling. Together with signal reconstruction, they constitute the three parts of the CS framework. The sparseness of signals is the premise of CS, the compressed observation theory is its basis, and the reconstruction models and techniques are its main components [19].
The sampling and reconstruction processes of CS are shown in Fig. 4. In general, a complex signal can be represented as sparse coefficients, which satisfies the prerequisite of CS. Then, the observation signal is obtained by sampling with an observation matrix. During the reconstruction process, the observation signal and sensing matrix are known. A reconstruction algorithm is adopted to reconstruct the sparse coefficients. Finally, the complex signal can be recovered from the sparse coefficients.
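The sampling and reconstruction steps described above can be written compactly as follows. This is a sketch in commonly used notation, not notation taken from this article: the symbols $x$ (original signal of length $N$), $\Psi$ (sparsifying basis or dictionary), $s$ (sparse coefficients with at most $K$ nonzeros), $\Phi$ (observation matrix with $M$ rows), $\Theta$ (sensing matrix), and $y$ (observation signal) are all assumptions introduced here for illustration.

```latex
\begin{aligned}
x &= \Psi s, \qquad \|s\|_0 \le K \ll N, \\
y &= \Phi x = \Phi \Psi s = \Theta s, \qquad \Phi \in \mathbb{R}^{M \times N},\; M < N, \\
\hat{s} &= \arg\min_{s} \|s\|_0 \quad \text{subject to} \quad y = \Theta s, \qquad \hat{x} = \Psi \hat{s}.
\end{aligned}
```

The first line is the sparse representation, the second is the compressed observation, and the third is the reconstruction: the sparse coefficients are estimated from the known $y$ and $\Theta$, and the signal is then recovered from them.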
1) Sparse Representation: The concept of sparse representation was first proposed in 1959 by Hubel and Wiesel [39] in their study of cellular receptive fields in the striate cortex of cats. The experimental results established a precedent for sparse representation by showing that the receptive fields of cells in the "primary visual cortex" may provide a sparse response to visual perception information. In 1969, a sparse representation model based on Hebbian local learning principles was proposed [40]. The construction of the associative mechanism in the network structure benefits from the sparse representation's ability to maximize memory capacity. Houweling and Brecht [25] conducted biological visual neurophysiological experiments that effectively supported the hypothesis of sparse neural coding.
According to the type of sparse matrix, the sparse representation methods of signals can be divided into the following three types: orthogonal transform basis method, multiscale geometric analysis method, and overcomplete dictionary method. To cover more signal types, the concept of the dictionary is proposed. Compared with the complete dictionary, the representation of the signal under the overcomplete dictionary is more sparse. The study of dictionary learning has grown in popularity in signal processing. There are two main ways to construct overcomplete dictionaries: using predefined analysis dictionaries (Heaviside, Gabor, Dirac, Fourier, and Wavelet dictionaries) or using dictionary learning algorithms (K-means, K-SVD algorithm, maximum likelihood estimation, and shift-invariant dictionary learning) [41].
2) Compression Measurement Matrix: The research focus of compressed observation theory is using a few nonadaptive observations to obtain enough signal information for reconstruction. Commonly used Gaussian random matrices and Bernoulli matrices belong to the category of random measurement matrices. Such matrices have high reconstruction accuracy but require large storage space and incur high time complexity. Deterministic measurement matrices not only save storage space compared with random measurement matrices, but also make it relatively easy to verify whether they meet the Restricted Isometry Property criteria [42]. In addition, some deterministic measurement matrices can be obtained by applying a special structure, so that corresponding fast algorithms can be designed to enhance the effectiveness of reconstruction. Partial Fourier matrices, structured measurement matrices, and partial Hadamard matrices are commonly used as deterministic matrices.
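As an illustration of the matrix families named above, the following minimal sketch builds a Gaussian matrix, a Bernoulli matrix, and a partial Hadamard matrix using only the Python standard library. The function names, the 1/sqrt(m) scaling, and the choice to pass the selected Hadamard rows explicitly are assumptions made for this example, not conventions taken from the cited works.

```python
import math
import random


def gaussian_matrix(m, n, rng):
    """Random m x n matrix with i.i.d. N(0, 1/m) entries (scaling is an assumption)."""
    sigma = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, sigma) for _ in range(n)] for _ in range(m)]


def bernoulli_matrix(m, n, rng):
    """Random m x n matrix with entries +1/sqrt(m) or -1/sqrt(m), each with probability 1/2."""
    scale = 1.0 / math.sqrt(m)
    return [[rng.choice((-scale, scale)) for _ in range(n)] for _ in range(m)]


def hadamard(n):
    """n x n Hadamard matrix by the Sylvester construction; n must be a power of two."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def partial_hadamard(n, rows):
    """Keep only the given rows of the n x n Hadamard matrix, scaled by 1/sqrt(len(rows))."""
    h = hadamard(n)
    scale = 1.0 / math.sqrt(len(rows))
    return [[h[r][c] * scale for c in range(n)] for r in rows]


rng = random.Random(0)           # fixed seed so the random matrices are reproducible
phi_gauss = gaussian_matrix(4, 8, rng)
phi_bern = bernoulli_matrix(4, 8, rng)
phi_had = partial_hadamard(8, [1, 2, 5, 6])
```

The random matrices must be stored entry by entry, whereas the partial Hadamard matrix is fully described by its size and row indices, which is the storage advantage the paragraph refers to.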
3) Sparse Reconstruction: Sparse reconstruction is an essential part of recovering the signal in CS. It needs to obtain the original signal from the compressed observation of the signal. Greedy, relaxation, and natural computation methods are commonly used to solve the reconstruction problem.
The greedy method is an essential class of algorithms for solving sparse signal reconstruction problems. It proceeds iteratively, selecting a locally optimal component at each step to approach the final solution gradually.
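One widely used greedy method is orthogonal matching pursuit (OMP); the article does not name a specific greedy algorithm, so its choice here is an assumption made for illustration. The sketch below uses only the Python standard library, and the names `theta` (sensing matrix as a list of rows), `y` (observation), and `sparsity` (number of nonzeros to select) are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve(a, b):
    """Solve the small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x


def omp(theta, y, sparsity):
    """Orthogonal matching pursuit: greedily build the support of a sparse solution of y = theta s."""
    m, n = len(theta), len(theta[0])
    cols = [[theta[i][j] for i in range(m)] for j in range(n)]
    support, coef = [], []
    residual = list(y)
    for _ in range(sparsity):
        # Greedy step: pick the unused column most correlated with the current residual.
        best = max((k for k in range(n) if k not in support),
                   key=lambda k: abs(dot(cols[k], residual)))
        support.append(best)
        # Re-fit all selected coefficients by least squares (normal equations).
        gram = [[dot(cols[a], cols[b]) for b in support] for a in support]
        rhs = [dot(cols[a], y) for a in support]
        coef = solve(gram, rhs)
        residual = [y[i] - sum(c * cols[k][i] for c, k in zip(coef, support)) for i in range(m)]
    s = [0.0] * n
    for c, k in zip(coef, support):
        s[k] = c
    return s


# Toy problem: 3 observations, 4 unknowns, true coefficients (3, 0, -1, 0).
theta = [[1.0, 0.0, 0.0, 0.6],
         [0.0, 1.0, 0.0, 0.8],
         [0.0, 0.0, 1.0, 0.0]]
y = [3.0, 0.0, -1.0]
s_hat = omp(theta, y, sparsity=2)
```

Each iteration adds one atom to the support and then re-estimates all selected coefficients, which is what "approaching the final solution gradually" amounts to in this family of methods.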
The convex relaxation reconstruction method has been widely studied and applied. It uses the $\ell_1$ norm to approximate the $\ell_0$ norm, which turns the nonconvex optimization problem into a convex one. The resulting convex reconstruction model is much easier to solve.
Evolutionary algorithms have self-organization, self-adaptation, and self-learning capabilities. They can solve various complex problems that are difficult for traditional computing methods, without requiring complex reasoning calculations.
In CS theory, signal sampling and compression are performed simultaneously, discarding much redundant data during high-speed sampling. This dramatically reduces the sampling rate and computational cost of the sensor. Signal reconstruction, the key to CS theory, is in essence an NP-hard problem: the original optimization problem of CS is nonconvex and combinatorial. An evolutionary algorithm can be used to learn the optimal combination of atoms in the dictionary, and this combination can then be used to reconstruct the image. Exploiting the advantages of evolutionary algorithms in this way increases the flexibility and adaptability of compressive sensing reconstruction algorithms.
In remote sensing, CS is usually adopted to compress data. For example, hyperspectral images (HSIs) have high spectral resolution, which poses a great challenge to data storage and transmission [43]. Wang et al. [43] proposed a CS algorithm based on spectral unmixing. It samples the HSIs both spatially and spectrally and jointly optimizes the endmember extraction and abundance estimation. Xue et al. [44] designed a nonlocal tensor sparse and low-rank regularization approach for HSI compressive sensing reconstruction. A subspace-based nonlocal tensor ring decomposition method was proposed for HSI compressive sensing reconstruction [45]. Furthermore, Ghahremani et al. [46] leveraged compressive sensing to pan-sharpen low-resolution multispectral data with high-resolution panchromatic data.
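A minimal sketch of one convex relaxation solver is given below. It assumes the unconstrained form 0.5 * ||y - theta s||^2 + lam * ||s||_1 and solves it with the iterative soft-thresholding algorithm (ISTA); both the problem form and the solver are assumptions chosen for illustration, since the article does not single out a particular one. The names `theta`, `y`, `lam`, and `iterations` are hypothetical.

```python
def soft_threshold(v, tau):
    """Proximal operator of tau * |v|: shrink v toward zero by tau."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def ista(theta, y, lam=0.1, iterations=1000):
    """Minimise 0.5 * ||y - theta s||_2^2 + lam * ||s||_1 by iterative soft thresholding."""
    m, n = len(theta), len(theta[0])
    # The squared Frobenius norm upper-bounds the largest eigenvalue of theta^T theta,
    # so its reciprocal is a safe (conservative) gradient step size.
    step = 1.0 / sum(v * v for row in theta for v in row)
    s = [0.0] * n
    for _ in range(iterations):
        residual = [y[i] - sum(theta[i][j] * s[j] for j in range(n)) for i in range(m)]
        # Gradient of the quadratic data-fit term with respect to s.
        grad = [-sum(theta[i][j] * residual[i] for i in range(m)) for j in range(n)]
        # Gradient step followed by the l1 proximal (shrinkage) step.
        s = [soft_threshold(s[j] - step * grad[j], step * lam) for j in range(n)]
    return s


# Toy problem: 3 observations, 4 unknowns, true coefficients (3, 0, -1, 0).
theta = [[1.0, 0.0, 0.0, 0.6],
         [0.0, 1.0, 0.0, 0.8],
         [0.0, 0.0, 1.0, 0.0]]
y = [3.0, 0.0, -1.0]
s_hat = ista(theta, y, lam=0.1)
```

On this toy problem the recovered support matches the true one, while the nonzero values are shrunk slightly toward zero by the penalty weight `lam`, which is the usual price of replacing the nonconvex sparsity count with a convex surrogate.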

B. Multiscale Geometry Analysis
Neuroscientists have shown that the receptive fields of the mammalian visual cortex have local, directional, and band-pass characteristics [47]. Neurons capture only part of the critical details in natural scenes. Multiscale geometric analysis likewise uses base functions to capture local details of the signal. The base functions have rectangular supports, which allows them to approximate a singular curve with the fewest coefficients and to fully exploit the geometric regularity of the original function. At the same time, the orientation of the support of each base function reflects the directionality of multiscale geometric analysis.
Multiscale geometric analysis originated from wavelet analysis and goes beyond it [48]. Wavelet analysis has achieved great success in various applications. It can represent 1-D signals more sparsely than Fourier analysis. However, in the 2-D or higher dimensional case, wavelet analysis can only form separable wavelets with limited directions, so it cannot achieve the optimal representation of high-dimensional signals. Multiscale geometric analysis is designed to solve this problem [49]. Fig. 5 compares the contour representation obtained with wavelet analysis and with multiscale geometric analysis. Multiscale geometric analysis captures the 2-D contour with a sparser representation.
Multiscale geometric analysis methods for images fall into two categories: adaptive and nonadaptive. The adaptive approach often starts with edge detection and uses the edge information to approximate the original function accurately. In fact, it is a combination of edge detection and image representation, such as Bandelet [51] and Wedgelet [52]. Nonadaptive methods do not use the geometric features of the image as a prior but directly decompose the image on a set of fixed base functions, eliminating the dependence on the image's structure. Representative algorithms are Ridgelet [53], Curvelet [54], and Contourlet [55].
The effort of fusing multiscale geometric analysis with neural networks is also growing with the emergence of deep learning. Contourlet CNN [56] was proposed to extract sparse and efficient representations of images. The contourlet transform (CT) is first used to extract the spectral features of the image, which are then fused with the spatial features extracted by the CNN. Chen et al. [57] proposed ContourletNet to implement rain removal. It utilizes the multiscale, multidirectional, and hierarchical characteristics of CT to design a hierarchical multidirectional network, extracting multiple directional subbands and semantic subbands of different scales. The neural contourlet network [58] utilizes the CT to capture the geometric information of the spatial domain in the scene for depth estimation.
In remote sensing data analysis, the multiscale geometric analysis also plays an important role. For unsupervised change detection in SAR images, Zhang et al. [59] proposed adaptive contourlet fusion clustering. Aiming at the characteristics of polarimetric SAR, Li et al. [60] proposed a complex contourlet-CNN for PolSAR image classification. The method uses CT to help complex CNN capture abstract features of specific directions and frequency bands and can retrieve the region and direction information corresponding to the extracted features. Gao et al. [61] proposed a multiscale curvelet scattering network to improve the multiscale directional information of the scattering process.

C. Attention Mechanism
Selectivity is a core mechanism of the brain. Humans can quickly eliminate distractions and capture important information. Drawing on this mechanism, attention has become a significant component of neural network architecture. It has several uses in computer vision, statistical learning, speech recognition, and natural language processing.
The attention mechanism has received widespread interest for two reasons. On the one hand, it simulates a mechanism of the human brain. On the other hand, visualizing the attention maps partially explains the neural network's behavior and helps enhance the model's efficacy.
The recurrent neural network (RNN) was the first neural network to employ the attention mechanism, as part of the RNN encoder-decoder framework for encoding long input sentences [62]. In recent years, as attention variants have multiplied, the mechanism has steadily entered the field of computer vision. Deep learning and visual attention techniques have been successfully combined in several studies. The main goal of the attention mechanism in computer vision is to train the model to concentrate on significant details while dismissing irrelevant ones. Current attention methods can be divided into spatial attention and channel attention (as shown in Fig. 6).
The fundamental idea behind the attention mechanism in the spatial domain is to apply the appropriate spatial transformation to the spatial domain information. It helps the neural network extract important information from the images. Each layer of a convolutional neural network will output a feature map. For convolutional neural networks to perform spatial attention, a weight matrix must be learned for each pixel in the feature map [63]. The weight matrix will be multiplied by the feature map to balance the influence of each pixel.
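The per-pixel weighting described above can be sketched as follows, using only the Python standard library. This is a simplified illustration rather than the method of [63]: the weight for each pixel is a sigmoid of a weighted sum of the channel-wise mean and maximum at that pixel, and the scalars `w_mean`, `w_max`, and `bias` are hypothetical stand-ins for parameters that a real network would learn (typically with a small convolution).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def spatial_attention(feature_map, w_mean=1.0, w_max=1.0, bias=0.0):
    """Reweight feature_map[c][i][j] with one weight per pixel (i, j).

    Returns the reweighted feature map and the H x W weight matrix.
    """
    channels = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    weights = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            # Summarise the pixel across channels, then squash the score to (0, 1).
            values = [feature_map[c][i][j] for c in range(channels)]
            score = w_mean * sum(values) / channels + w_max * max(values) + bias
            weights[i][j] = sigmoid(score)
    # The same H x W weight matrix multiplies every channel.
    out = [[[feature_map[c][i][j] * weights[i][j] for j in range(width)]
            for i in range(height)]
           for c in range(channels)]
    return out, weights


# Two channels, one row, two pixels: the second pixel has stronger responses.
fmap = [[[0.0, 2.0]],
        [[0.0, 4.0]]]
attended, weight_matrix = spatial_attention(fmap)
```

Pixels with strong responses receive weights close to 1 and are kept almost unchanged, while weak pixels are attenuated.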
The fundamental concept of channel-based attention is to suppress invalid or weakly contributing features and highlight the effective features to improve performance. This is done by learning the feature weights on the channel domain through the network [33]. In particular, it automatically determines the relevance of each feature channel through learning and then strengthens beneficial features and suppresses features that are not useful for the present task. Usually, pure channel-based attention applies the same weight across the spatial dimension. That is, the information in each channel is directly global average pooled, and the local information in the channel is ignored.
The attention mechanism can efficiently enhance the target features in various remote sensing applications while simultaneously resolving the issue of redundant features in remote sensing data. In HSI classification, it is difficult for traditional convolutional neural networks to extract local features of HSIs. To strengthen the learning of local key features in the spatial and spectral domains of HSIs, the ResNet-based method in [64] introduces an HSI feature extraction method based on spatial-spectral attention on top of a convolutional network. It computes a mask that identifies the features required for classification and improves the representation ability of hyperspectral features. In remote sensing image instance segmentation, Zhang et al. [65] proposed a semantic attention module that uses additional segmentation supervision for attention, so that the activation values of instances against complex, noisy remote sensing backgrounds are significantly improved.
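The squeeze-and-reweight idea in the paragraph above can be sketched as follows, using only the Python standard library. It is a simplification of SENet-style channel attention [33]: a single linear layer followed by a sigmoid replaces the two-layer bottleneck, and the matrix `fc_weights` is a hypothetical stand-in for learned parameters.

```python
import math


def channel_attention(feature_map, fc_weights):
    """Reweight feature_map[c][i][j] with one weight per channel.

    fc_weights is a C x C matrix standing in for the learned gating layer.
    Returns the reweighted feature map and the list of channel weights.
    """
    # Squeeze: global average pooling collapses each channel to a single number.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]
    # Excitation: a linear layer followed by a sigmoid yields one gate per channel.
    gates = [1.0 / (1.0 + math.exp(-sum(w * p for w, p in zip(w_row, pooled))))
             for w_row in fc_weights]
    # Scale: every spatial position in a channel shares the same gate.
    out = [[[v * g for v in row] for row in ch] for ch, g in zip(feature_map, gates)]
    return out, gates


# Two 2 x 2 channels: the first is silent, the second responds uniformly.
fmap = [[[0.0, 0.0], [0.0, 0.0]],
        [[2.0, 2.0], [2.0, 2.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]
attended, channel_weights = channel_attention(fmap, identity)
```

Because the gate is computed from the pooled value alone, all positions inside a channel are scaled identically, which is exactly the loss of local information noted in the text.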

D. Reinforcement Learning
The way humans acquire knowledge is affected by the environment and by historical experience. This learning process reflects the plasticity of the brain. To simulate this property, the learning process of reinforcement learning is designed as an interaction between an agent and an environment. The agent learns by performing different actions and obtaining different rewards in the simulated environment [67]. Deep reinforcement learning integrates the powerful perception ability of deep learning, for example in vision, with the decision-making ability of reinforcement learning, and realizes end-to-end learning. The emergence of deep reinforcement learning has made reinforcement learning truly practical and able to solve complex problems in real-world scenarios [68], [69].
Unlike supervised and unsupervised learning, reinforcement learning addresses how the agent should act in the environment to obtain the maximum cumulative reward. <A, S, R, P> is the classic quadruple in reinforcement learning. A represents all the agent's actions. S is the state of the world that the agent can perceive. R is a real value representing reward or punishment. P is the world the agent interacts with, known as the model. Specifically, the strategy (policy) refers to the choice of action the agent makes when it is in state S. The reward signal defines the goal of the agent's learning. The value function judges whether the reward obtained in the interaction is good or bad. The model is a simulation of the natural world, and it models how the environment reacts after the agent samples it. In reinforcement learning, the agent completes a task by interacting with the environment through actions and observing the resulting rewards.
In remote sensing, reinforcement learning determines sequential actions by maximizing cumulative feature rewards through interaction with the environment. In particular, when only a few labeled pixels are available, reinforcement learning can achieve relatively high accuracy without using any labeled training dataset. This makes it well suited for remote sensing tasks with little data, as in SPRL [66]. As shown in Fig. 7, SPRL adopts a reinforcement learning-based method for polarimetric synthetic aperture radar (PolSAR) data classification. The pixels are set to "state" and "work" according to reinforcement learning, and their "action" is modified by interacting with the "environment." A spatially polarized "reward" function is designed from the local neighborhood to explore spatial and polarization information for more accurate classification. This results in a self-evolving, model-free classifier that follows a simple principle and is robust to speckle noise in the data. By interacting with the environment, SPRL networks can achieve high classification accuracy when only a few labeled pixels are available.
Similarly, for few-shot remote sensing data, an enhanced deep Q-network technique for classifying PolSAR images was put forth [70]. It can provide valuable data by interacting with agents in a greedy manner. In the network, multilayer feature images and classification actions correspond to the environment states and the agent actions, respectively. Model predictions are rewarded under certain conditions, and the agent receives feedback from an annotated sample set.
To detect dense ships against a complex background, Fu et al. [71] proposed a ship rotation detection model based on feature fusion pyramid network-based deep reinforcement learning (FFPN-RL), which applies deep reinforcement learning to the tilted ship detection task. Angle prediction is made through three actions of the action set. Using different rotation angles in the action set makes it possible to achieve higher prediction accuracy and reduce the number of decision-making actions. The reward function encourages or penalizes the angle-predicting agent according to the selected actions. The agent accumulates experience from these rewards, learns from them, and ultimately chooses the appropriate action in each decision. As a result, the detection network can produce inclined rectangular boxes for ships more efficiently.
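To make the quadruple concrete, the following sketch runs tabular Q-learning on a toy chain of states. Q-learning and the chain environment are assumptions chosen for illustration; the text above does not prescribe a specific algorithm. Here A is {left, right}, S is the position on the chain, R is 1 when the last state is reached and 0 otherwise, and P is the deterministic movement rule.

```python
import numpy as np

def q_learning_chain(n_states=5, episodes=300, alpha=0.5, gamma=0.9,
                     epsilon=0.2, seed=0):
    """Tabular Q-learning on a chain; the episode ends at the last state."""
    rng = np.random.default_rng(seed)
    q = np.zeros((n_states, 2))          # actions: 0 = left, 1 = right
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            if rng.random() < epsilon:   # explore with probability epsilon
                a = int(rng.integers(2))
            else:                        # otherwise act greedily, ties broken at random
                best = np.flatnonzero(q[s] == q[s].max())
                a = int(rng.choice(best))
            # P: deterministic transition; moving left from state 0 stays in state 0.
            s_next = max(s - 1, 0) if a == 0 else s + 1
            # R: reward only when the goal state is reached.
            r = 1.0 if s_next == n_states - 1 else 0.0
            # Move the estimate toward reward plus discounted best future value.
            q[s, a] += alpha * (r + gamma * q[s_next].max() - q[s, a])
            s = s_next
    return q

q_table = q_learning_chain()
greedy_policy = np.argmax(q_table[:-1], axis=1)   # learned action in each non-terminal state
```

The learned table plays the role of the value function, and the greedy choice over it is the strategy; in this toy setting the agent learns to move right in every non-terminal state.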

E. Transfer Learning
As an essential ML method, transfer learning has been widely studied. It simulates the human learning ability of "inferring others": it transfers knowledge learned in the past to new tasks and reduces the cost of learning them [72]. In addition, transfer learning can train supervised ML methods using only part of the labeled data, reducing the dependence on large amounts of labeled data [73]. The primary trend in current transfer learning development is to use a large amount of labeled classification data to pretrain a benchmark network and then use a small amount of labeled data to fine-tune the network for different tasks.
As shown in Fig. 8, the core idea of transfer learning is to apply knowledge gained from one problem to a different but related problem. When performing transfer learning, two points are important: the constraints imposed by the pretrained model and the choice of an appropriate learning rate. For example, using pretrained networks may limit the architectures that can be used with new datasets.
A lower learning rate is usually used for the fine-tuned weights of the convolutional network than for the randomly initialized ones. A good classifier can be trained using the source domain data. However, the source domain model cannot classify the target domain data well because of subtle differences between the source and target domain data. A commonly used remedy is to align the feature distributions of the target domain and the source domain data. After alignment, the target domain data can be classified using the model trained with the source domain data.
Domain adaptation [74] is a special type of transfer learning in which the data distributions of the source and target domains differ, but the two tasks are the same. Domain adaptation is currently a significant research hotspot in transfer learning. Its task is to learn a mapping that simultaneously maps the source and target domains to a common feature space, so that the composite mapping learned only in the source domain is very close to the one learned only in the target domain.
At present, many studies combine transfer learning with remote sensing data. Xie et al. [75] proposed a transfer learning strategy that leverages nighttime light intensity to train a fully convolutional CNN model to predict nighttime lights from daytime images. The learned features are helpful for poverty prediction. Chen et al. [76] used a single deep convolutional neural network and limited training samples to perform transfer learning and improve the detection accuracy of aircraft in remote sensing data. A change detection-driven transfer learning method was proposed to update land-cover maps using image time series [77]. The method aims to leverage the existing knowledge of the source domain to define a reliable training set for the target domain. This is achieved by applying an unsupervised change detection method to the target and source domains and initializing the target domain training set by migrating the detected class labels of unchanged training samples from the source domain to the target domain.
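One simple way to align feature distributions is to match the per-feature mean and standard deviation of the target domain to those of the source domain. The sketch below shows this moment matching under the assumption that features are already extracted as rows of a matrix; it is a deliberately simple stand-in for the alignment methods in the literature, not the method of any cited work, and the function name is hypothetical.

```python
import numpy as np

def align_to_source(source_feats, target_feats, eps=1e-8):
    """Shift and scale target features so that their per-feature mean and
    standard deviation match those of the source features.

    Both inputs have shape (n_samples, n_features).
    """
    src_mean, src_std = source_feats.mean(axis=0), source_feats.std(axis=0)
    tgt_mean, tgt_std = target_feats.mean(axis=0), target_feats.std(axis=0)
    standardized = (target_feats - tgt_mean) / (tgt_std + eps)   # zero mean, unit std
    return standardized * src_std + src_mean                     # re-express in source statistics

rng = np.random.default_rng(0)
source = rng.normal(loc=0.0, scale=1.0, size=(500, 4))   # toy source-domain features
target = rng.normal(loc=2.0, scale=3.0, size=(300, 4))   # shifted, rescaled target domain
aligned = align_to_source(source, target)
```

After this step, a classifier trained on `source` sees target features with comparable statistics. Only the first two moments are matched, so more complex distribution differences remain.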

A. Data Types of Remote Sensing
As artificial intelligence has advanced, it has been used in more and more applications with impressive results [78]. The field of remote sensing is no exception [79]. The intelligent interpretation of remote sensing is a crucial research topic in many areas, including environmental monitoring, land resources [80], crop monitoring [81] and yield estimation, forest carbon sink estimation [82], and national defense security [83]. Intelligent remote sensing interpretation is also an important requirement for national strategic development [8].
Remote sensing images are films or photos that record the intensity of the electromagnetic waves of various ground objects, and they are mainly divided into aerial photos [84] and satellite photos [85]. Remote sensing imaging methods mainly include aerial photography, aerial scanning, and microwave radar. Remote sensing can be broadly separated into active and passive remote sensing according to the detection technique [86]. According to the spectral range captured by the sensor, it is divided into ultraviolet remote sensing, visible light remote sensing, infrared remote sensing, microwave remote sensing, and multiband remote sensing [87]. This section summarizes the widely studied types of existing remote sensing data (as shown in Fig. 9), including optical remote sensing images [88], radar images [89], LiDAR point cloud data [90], and remote sensing videos [91].
1) Optical Images: Optical images are a kind of remote sensing data that obtain target information on different spectra by dividing the radiation of objects into several narrower spectral bands. The same objects have similar spectral characteristics [92], whereas different objects radiate different amounts of energy in each band.
According to the number of captured spectral bands and the narrowness of the spectral bands, optical images can be roughly classified into three types: panchromatic, multispectral, and hyperspectral [93]. Generally, most satellites can take panchromatic and multispectral images.
Panchromatic images: Panchromatic images have only one grayscale image band, i.e., the brightness of a particular pixel is proportional to the pixel value. The pixel value is related to the intensity of solar radiation reflected by the target. Panchromatic images generally have a high spatial resolution but contain little spectral information [94].
Multispectral images: Multispectral imagery usually refers to three to ten spectral bands expressed in pixels. Each band can be acquired using a remote sensing radiometer [95]. An image with both a fine ground sampling distance (GSD), i.e., high spatial resolution, and abundant spectral information can be generated by properly fusing the panchromatic image with the multispectral image.
HSIs: Hyperspectral data contain very narrow bands (10-20 nm) [96], and HSIs may have thousands of bands. Imaging spectrometers are often required to acquire each band of hyperspectral data. Compared with high-resolution multispectral images, HSIs have high spectral resolution and abundant bands. They contain rich radiation, spatial, and spectral information [97] and are a comprehensive carrier of various details. HSIs have been used extensively in feature mapping and resource exploration [98]. Unlike standard RGB images, HSIs have many channels, and this rich band information often contains richer features. Bands can be selected according to the sensitivity of different ground objects to different bands in order to highlight certain objects [99].
2) Radar Images: Radar is an active microwave remote sensor that emits microwave radiation and receives electromagnetic waves reflected from a target [100]. The radar imaging system mainly includes five parts: a pulse generator, transmitter, radar antenna, receiver, and recorder. The pulse generator generates a high-power FM signal and repeatedly emits microwave pulses of a specific wavelength at a particular time interval through the transmitter. Commonly used radar images can be divided into synthetic aperture radar (SAR) and PolSAR.
SAR: SAR [101] is an active microwave imaging device. It forms a virtual radar antenna through the movement of the flight carrier, thereby obtaining radar images with high azimuth resolution. According to the platform, SAR can be divided into airborne and spaceborne SAR. Both have their advantages and uses. Airborne SAR has higher resolution, whereas spaceborne SAR can observe a wider area for a long time, offers a global macroscopic view, and observes periodically. Its cost is also lower than that of airborne SAR, so spaceborne SAR has been widely used. According to whether synthetic aperture processing is performed, imaging radar can be divided into real aperture radar (RAR) and SAR [102], [103] [as shown in Fig. 10(a)].
Real aperture imaging radar transmits a very narrow pulsed radio beam to the side of the platform (the range direction), perpendicular to the traveling direction of the aircraft (the azimuth direction). The beam irradiates a long, narrow ground strip perpendicular to the flight direction. Then, the radar antenna switches to the receiving state and receives the backscattered wave reflected from the target [104], [105]. As the vehicle travels, the emitted beam scans the surface strip by strip along the direction of flight, and the radar image is created line by line [106].
The resolution of radar images includes distance resolution and azimuth resolution. Distance resolution refers to the resolution perpendicular to the flight direction, and azimuth resolution refers to the resolution along the flight direction [107]. The distance resolution is mainly determined by the pulse signal emitted by the radar system: the shorter the pulse duration, the higher the distance resolution. However, if the pulse width is too small, the transmission power decreases and the signal-to-noise ratio of the reflected pulse decreases as well, so the two requirements conflict [108].
The basic principle of SAR is to use a small antenna as a single radiating unit and move it continuously along a straight line. The pulses reflected from the same target are received at different positions and processed by correlation, which yields a higher image resolution [108]. In the distance direction, SAR is the same as RAR and uses pulse compression to improve the resolution. In the azimuth direction, the resolution is improved by the principle of synthetic aperture [109]. While the position of the radiating element changes constantly, the received signals are recorded and processed to obtain the same effect as observation with a virtual antenna (the synthetic aperture) that is longer than the actual antenna.
By transmitting electromagnetic pulses and receiving target echoes for coherent imaging, SAR can acquire multipolarization, multiband, high-resolution images at all times of day and in all weather. It obtains backscattering information of ground objects to accomplish Earth observation tasks. Unlike optical and infrared remote sensing technologies, SAR belongs to microwave remote sensing [110]. It can not only obtain information on the Earth's surface, such as topography and landforms, but also penetrate the surface to obtain underground and concealed information, and it can acquire high-resolution ground data in harsh environments.
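The trade-off between pulse duration and distance resolution can be made concrete with the standard relation for an unmodulated pulse, in which the slant-range resolution equals c times the pulse duration divided by 2. The sketch below is illustrative only; the function names and the example peak power are assumptions, not values from the cited works.

```python
SPEED_OF_LIGHT = 299_792_458.0  # m/s

def slant_range_resolution(pulse_duration_s):
    """Distance (slant-range) resolution of an unmodulated pulse: c * tau / 2."""
    return SPEED_OF_LIGHT * pulse_duration_s / 2.0

def pulse_energy(peak_power_w, pulse_duration_s):
    """Energy carried by one rectangular pulse; it shrinks with the pulse duration."""
    return peak_power_w * pulse_duration_s

# Halving the pulse duration halves the resolution cell (better resolution)
# but also halves the transmitted energy, which lowers the echo signal-to-noise ratio.
for tau in (1e-6, 0.5e-6):
    print(tau, slant_range_resolution(tau), pulse_energy(1_000.0, tau))
```

This conflict is what pulse compression addresses: a long frequency-modulated pulse keeps the energy high while achieving the resolution of a much shorter pulse.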
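The gain from the synthetic aperture can be illustrated with the usual textbook approximations: the azimuth resolution of a real aperture is about the wavelength times the slant range divided by the antenna length, whereas that of a focused stripmap SAR is about half the antenna length, independent of range. The numbers below are hypothetical and chosen only to show orders of magnitude.

```python
def rar_azimuth_resolution(wavelength_m, slant_range_m, antenna_length_m):
    """Real aperture: beamwidth (wavelength / antenna length) times slant range."""
    return wavelength_m * slant_range_m / antenna_length_m

def synthetic_aperture_length(wavelength_m, slant_range_m, antenna_length_m):
    """Flight distance over which a target stays inside the real antenna beam."""
    return wavelength_m * slant_range_m / antenna_length_m

def sar_azimuth_resolution(antenna_length_m):
    """Focused stripmap SAR approximation: half the real antenna length."""
    return antenna_length_m / 2.0

# Hypothetical spaceborne geometry: 5 cm wavelength, 800 km slant range, 10 m antenna.
wavelength, slant_range, antenna = 0.05, 800e3, 10.0
print(rar_azimuth_resolution(wavelength, slant_range, antenna))    # kilometer scale
print(synthetic_aperture_length(wavelength, slant_range, antenna))
print(sar_azimuth_resolution(antenna))                             # meter scale
```

Under these assumptions the synthetic aperture is several kilometers long, which is why the processed image behaves as if it had been observed with an antenna far longer than the physical one.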
PolSAR: PolSAR system [111] is developed based on the single-channel SAR system, which can provide multidimensional remote sensing information of targets. Compared with traditional single-channel SAR, polarimetric SAR not only utilizes the amplitude, phase, and frequency characteristics of target scattered echoes but also utilizes its polarization characteristics [112]. For example, the L-band with a longer wavelength can penetrate forests and surface vegetation coverage. It can be used in the military to discover hidden targets in jungles or shallowly buried surfaces [113].
By sending and receiving electromagnetic waves with various polarizations, PolSAR measures the polarization scattering properties of ground objects and builds up the polarization scattering matrix. The polarization of electromagnetic waves is sensitive to the target's physical characteristics, such as surface roughness, dielectric constant, geometry, and orientation. Thus, the polarization scattering matrix includes abundant target information.
PolSAR obtains polarization scattering matrices by measuring the scattered echoes in each resolution unit on the ground [114]. The amplitude and phase properties of the target scattered echoes can be completely described using these polarization scattering matrices.
When the electric field of the electromagnetic wave is parallel to the scattering surface, the wave is called a horizontally (H) polarized wave. Similarly, a wave whose electric field is perpendicular to it is called a vertically (V) polarized wave. Therefore, PolSAR has four polarization modes, determined by the polarization directions of the transmitting and receiving antennas.
As shown in Fig. 10(b), there are four polarization combinations: VV, HH, VH, and HV. For example, VV polarization, namely vertical transmission/vertical reception, indicates that the polarized SAR transmitting antenna transmits vertical electromagnetic waves, and the receiving antenna also accepts vertical electromagnetic waves. By obtaining four basic polarization combinations (HH, HV, VH, and VV polarizations) [115], the received power value of the antenna in all possible polarization states can be accurately calculated.
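The calculation of received power for an arbitrary polarization state can be sketched as follows. The sketch assumes the common convention that the received voltage is proportional to h_r^T S h_t, where h_t and h_r are unit polarization vectors in the (H, V) basis and the rows of S index the receive polarization; the matrix entries are made-up numbers, and constant radar-equation factors are omitted.

```python
import numpy as np

def received_power(scattering_matrix, tx_pol, rx_pol):
    """Relative received power |h_r^T S h_t|^2 for given transmit/receive polarizations.

    scattering_matrix is [[S_HH, S_HV], [S_VH, S_VV]] with rows indexing the
    receive polarization (an assumed convention).
    """
    tx = np.asarray(tx_pol, dtype=complex)
    rx = np.asarray(rx_pol, dtype=complex)
    tx = tx / np.linalg.norm(tx)   # normalize to unit polarization vectors
    rx = rx / np.linalg.norm(rx)
    return float(np.abs(rx @ scattering_matrix @ tx) ** 2)

# Hypothetical scattering matrix of one resolution cell.
S = np.array([[1.0 + 0.0j, 0.2j],
              [0.2j, -0.5 + 0.0j]])
H, V = [1, 0], [0, 1]

p_hh = received_power(S, H, H)             # equals |S_HH|^2
p_vh = received_power(S, H, V)             # transmit H, receive V
p_45 = received_power(S, [1, 1], [1, 1])   # 45-degree linear polarization, synthesized
```

The last line shows the point made above: once the four basic combinations are measured, the response to a polarization that was never physically transmitted can be computed from S.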
In recent decades, PolSAR technology has developed rapidly, and its wide application has also received increasing attention [116]. At the same time, people's demands for SAR are growing, and they want to obtain images of the same target in several frequency bands, polarizations, and viewpoints. In addition, SAR miniaturization is also significant due to the need for military unmanned reconnaissance aircraft. Nowadays, PolSAR is one of the most sophisticated sensors used in remote sensing. It has many practical applications and importance in civil and military fields.
3) LiDAR Point Cloud Data: LiDAR determines the relative distance between the scanner and the object by measuring the signal travel time [119]. Compared with the data obtained by traditional photogrammetry, point cloud data can reflect terrain information more accurately. The data collected by airborne LiDAR are a series of discrete 3-D points with irregular spatial distribution, called "point cloud." As shown in Fig. 11, airborne LiDAR systems mainly include laser scanners, inertial navigation systems (INS) [120], and dynamic differential GPS receivers. The laser scanner measures the distance from the launch point of the laser to the ground target. The inertial navigation system uses the inertial measurement unit (IMU) [121] to measure the attitude parameters of the aircraft's central optical axis scanning device. The dynamic differential GPS receiver is used to determine the spatial location of the launch point of the LiDAR.
After the airborne LiDAR system completes the laser scanning, the data obtained include the position, orientation, and laser scanning distance [122]. Among them, the position and orientation include differential GPS and IMU information. These data record the information of each laser pulse, including position, azimuth/angle, distance, time, intensity, echo, and other data obtained by the system during flight. The X, Y, and Z coordinates of the laser point in the WGS84 coordinate system can be calculated. These discrete points with precise 3-D coordinates are called the LiDAR point cloud [123].
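How position, orientation, and laser distance combine into a 3-D point can be sketched in a simplified setting. The sketch assumes a local Cartesian frame with the z axis pointing up instead of WGS84, a beam that points straight down at zero scan angle and is deflected across track, and a roll-pitch-yaw rotation applied in Z-Y-X order; lever arms, boresight offsets, and atmospheric effects are ignored, and all names are hypothetical.

```python
import numpy as np

SPEED_OF_LIGHT = 299_792_458.0  # m/s

def lidar_range(round_trip_time_s):
    """Range from the two-way travel time of the laser pulse."""
    return SPEED_OF_LIGHT * round_trip_time_s / 2.0

def rotation_matrix(roll, pitch, yaw):
    """Body-to-local rotation built as Rz(yaw) @ Ry(pitch) @ Rx(roll), angles in radians."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    return rz @ ry @ rx

def laser_point(position, roll, pitch, yaw, scan_angle, round_trip_time_s):
    """3-D coordinates of the laser footprint in the local frame."""
    # Beam direction in the body frame: nadir at scan_angle = 0, tilted across track otherwise.
    beam = np.array([0.0, np.sin(scan_angle), -np.cos(scan_angle)])
    offset = rotation_matrix(roll, pitch, yaw) @ (lidar_range(round_trip_time_s) * beam)
    return np.asarray(position, dtype=float) + offset

# Platform 1000 m above flat ground, level flight, 30-degree scan angle, 1000 m range.
t = 2 * 1000.0 / SPEED_OF_LIGHT
point = laser_point([100.0, 200.0, 1000.0], 0.0, 0.0, 0.0, np.deg2rad(30.0), t)
```

Repeating this computation for every recorded pulse produces the discrete, irregularly distributed points that form the point cloud.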
The 3-D LiDAR point cloud data include information, such as the spatial 3-D coordinates of the point, echo intensity, echo times, and scanning angle [124]. In practical applications, the information frequently employed is the point cloud geometry, laser intensity, and laser echo data returned by emitted laser pulses. The laser echo signal is produced when a laser pulse is fired from a laser scanner and is then reflected or scattered by a ground point. The airborne LiDAR system may offer not only the 3-D coordinates of the target point but also intensity information of the laser echo signal [125]. Due to the different reflection characteristics of each material to the laser signal, the point cloud data can easily distinguish the boundaries of different objects for object classification.
4) Remote Sensing Videos: Remote sensing video [126] is usually divided into satellite video and UAV video according to the platform that carries the sensor. Satellite video is a kind of onboard video; it generally refers to video acquired by satellites in fields related to the research and exploration of space. UAV video is video captured by a UAV. Remote sensing videos are illustrated in Fig. 12.
Satellite Videos: Satellite imagery refers to images of ground target areas obtained by a satellite platform that carries an imaging payload.
Satellite videos [127] can continuously image the target area for a long time, providing dynamic information and enabling long-term, real-time dynamic monitoring. The camera is mounted on a microsatellite platform and consists of a telescopic objective lens, an area array focal plane detector, and an electronic processing circuit [128]. The telescopic objective lens images the ground scene within the 2-D field of view onto the image plane; after photoelectric conversion by the area array detector located at the image plane and processing by the electronic circuit, the remote sensing image of the ground scene is obtained. When the shutter that controls the exposure is opened, the light from the ground scene is transmitted through the atmosphere and reaches the camera's entrance pupil. The telescopic objective focuses it onto the area array focal plane detector to obtain one video frame of the target. As the satellite platform flies in orbit, there is relative movement between the camera and the ground scene. When the shutter is opened again, another frame of the target is obtained. This cycle repeats to form the frame push process. In frame push imaging, the exposure time is often longer than the integration time corresponding to a single pixel. The captured image is therefore prone to displacement in the along-track direction (image movement), and the image easily becomes blurred. An image movement compensation device, such as a reaction wheel or gyroscope, can be used to adjust the camera attitude to eliminate or reduce the impact of image movement. After multiframe image compression, frame alignment, and other software processing, a continuous dynamic video is finally formed.
As a new method of acquiring image data for Earth observation, satellite remote sensing video can be applied to large-scale dynamic target change monitoring and the analysis of instantaneous target characteristics [129]. It reduces the time interval between adjacent image frames by adopting the "image recording" method for a specific area, which not only achieves large-scale coverage but also makes up for the limitation of the revisit period of traditional satellites. Compared with conventional remote sensing satellites, the target observation area of satellite remote sensing video is small, but the timeliness is good [130]. It can realize fixed-point and fixed-range remote sensing monitoring in small areas, which gives it unique application advantages in some major engineering fields. For example, it can track the progress and construction of major projects and provide real-time video information support for assessing their impact on the surrounding ecological environment.
Compared with traditional video surveillance image data, satellite remote sensing image data have the following challenges [131].
1) During satellite remote sensing imaging, the slow movement of the sensor displaces buildings, trees, and other targets in the image, resulting in many false moving targets and a more complicated background.
2) Due to the limited spatial resolution of satellite remote sensing imaging, the target occupies only a few to a dozen pixels in the image, and its contrast with the background is low, so more detailed information about the target cannot be obtained.
3) In satellite videos, factors such as illumination change and shadow movement lead to dynamic changes in the background. Due to the low resolution, these dynamic changes are more likely to be regarded as moving targets, causing false alarms. Directly applying traditional moving target detection methods to satellite videos therefore results in false detections.
UAV Videos: A UAV [132] is an unmanned aerial vehicle. With the improvement of hardware performance and the development of image processing algorithms, research on UAV vision has become a hotspot. Because UAVs are not limited by geographical restrictions, they can obtain large-scale, multiangle, high-resolution data. They play an increasingly important role in target tracking, image stitching, power line inspection, island monitoring, coastline inspection, postdisaster monitoring, and river flood season monitoring [132].
Apart from takeoff and landing, the flight state of the UAV can be roughly divided into the hovering state and the cruising state, and the videos obtained in these two states have different characteristics. In the hovering state, the drone can shoot relatively stable video, but the rotation of the wing and the influence of external wind still cause the picture to shake, resulting in irregular motion of the video background. The cruising state refers to the translational flight of the UAV, forward or backward. In video shot in this state, the image has a large offset over a short period, and in addition to the moving target, the background also moves considerably.
Compared with satellite videos, UAV-borne image data has the following advantages.
1) It makes up for the lack of timeliness and maneuverability of satellite remote sensing and ordinary aerial remote sensing, as well as the lack of regional information caused by limitations such as weather conditions and time [133].
2) Drone images have high resolution, and high-resolution panoramic images of the flight area can be obtained. In contrast, because of the long shooting distance of satellites, the resolution and accuracy of their images cannot meet such requirements.
3) The UAV system has a low cost of use and simple maintenance and operation [134].
4) The UAV system can quickly acquire visible light and infrared imagery at medium and low altitudes, conduct fast, real-time ground inspection and monitoring, and record the current image status objectively and directly [135].
Compared with relatively stable camera equipment, such as surveillance cameras on roads and in shopping malls, the high mobility of drones frees data collection from geographical limits. UAVs are more flexible and have unique advantages in resource and environmental monitoring, forest fire monitoring, and rescue command in areas that vehicles and people cannot reach. However, compared with image data obtained by aerial cameras carried by airships at high altitudes, satellites, etc., moving target detection with UAVs is more challenging. Table I lists the characteristics of UAV videos compared with satellite videos.
In general, video data contain richer information than individual images in terms of content or time [136]. In particular, satellites gradually begin to develop video functions, significantly expanding the source of video data.

B. Applications of Remote Sensing
Brain-inspired remote sensing interpretation is applied to all aspects of remote sensing data processing, effectively handling the complicated and diverse data of remote sensing. In this section, we summarize the recent development of five applications: land-cover classification, change detection, target detection, object tracking, and 3-D reconstruction.
1) Land-Cover Classification: Land-cover classification, which is also known as semantic segmentation in natural image processing, is one of the most basic image analysis tasks in remote sensing. It assigns a category to each pixel in the image, thereby achieving an understanding of the image content.
In 2015, Long et al. [137] first proposed fully convolutional networks (FCN) for semantic segmentation tasks. The FCN replaces all the fully connected layers in the neural network with convolutional layers, yielding a network composed entirely of convolutional layers. Because the FCN fails to make good use of multiscale features, Ronneberger et al. [138] proposed the U-Net network in the same year. U-Net uses skip connections to make full use of the multiscale features generated during downsampling and obtains excellent segmentation results. Moreover, in 2017, Badrinarayanan et al. [139] proposed the SegNet network based on the U-Net network. The network performs nonlinear upsampling in the decoder using the pooling indices computed in the max-pooling step of the corresponding encoder. In the same year, Huang et al. proposed DenseNet [140]. Since both U-Net and SegNet fail to fully utilize the local neighborhood information around pixels, DenseNet connects each layer to every other layer in a feedforward manner, so that the layers close to the input and the output are linked by shorter connections that recover information lost during convolution. Also, Chen et al. [141] proposed the DeepLabV3+ network. The DeepLab network utilizes atrous spatial pyramid pooling (ASPP), in which multiscale local receptive fields of pixels are fused while reducing resolution.
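The index-based upsampling used by SegNet can be illustrated on a single-channel map. The sketch below is a minimal NumPy version of 2 x 2 max pooling that records where each maximum came from, followed by the corresponding unpooling; it is not the SegNet implementation, and the function names are hypothetical. The input height and width are assumed to be even.

```python
import numpy as np

def max_pool_2x2_with_indices(x):
    """2 x 2 max pooling that also returns the position (0..3) of each maximum."""
    h, w = x.shape
    # Group pixels into non-overlapping 2 x 2 blocks, flattened to length 4.
    blocks = x.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3).reshape(h // 2, w // 2, 4)
    return blocks.max(axis=2), blocks.argmax(axis=2)

def max_unpool_2x2(pooled, indices):
    """Place each pooled value back at its recorded position; other pixels stay zero."""
    ph, pw = pooled.shape
    blocks = np.zeros((ph, pw, 4), dtype=pooled.dtype)
    rows, cols = np.indices((ph, pw))
    blocks[rows, cols, indices] = pooled
    # Invert the block grouping used in the pooling step.
    return blocks.reshape(ph, pw, 2, 2).transpose(0, 2, 1, 3).reshape(ph * 2, pw * 2)

x = np.array([[1., 2., 0., 0.],
              [3., 4., 0., 5.],
              [9., 0., 1., 1.],
              [0., 0., 1., 7.]])
pooled, indices = max_pool_2x2_with_indices(x)   # encoder side
restored = max_unpool_2x2(pooled, indices)       # decoder side: sparse map, same size as x
```

The restored map is sparse but keeps each maximum at its original location, which preserves boundary positions that plain interpolation would blur.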
Compared with natural images, remote sensing images have the following characteristics.
1) The size of the same class objects varies widely, and the problem of size change needs to be solved. 2) Due to the fact that satellites shoot the ground at high altitudes, the obtained images are very wide in scope. The object occupies very few pixels, which generates the problem of sample imbalance. 3) When shooting in a large area, the same class of objects show a variety of different appearance because of weather, light, and other natural conditions. 4) Large-scale shooting is usually accompanied by low resolution, which makes each semantic region lacks morphological contour information. These characteristics of remote sensing images impose higher requirements for land-cover classification.
Land-cover classification can be roughly divided into objectbased and pixel-based methods. The object-based method divides the image into regions and classifies the regions according to the feature of the whole region. While the pixel-based method does not need region division and directly uses the characteristics of the pixels to classify directly. Due to the heterogeneity of medium-and low-resolution remote sensing images [as shown in Fig. 13(a) and 13(b)], each pixel is considered to be mixed and may contain more than one semantic category. Therefore, pixel-based classification methods are usually ineffective for medium-resolution and low-resolution remote sensing images, whereas object-based methods can achieve coarse image segmentation by classifying regions. There is less category mixing in high-resolution images, as shown in Fig. 13(c), where each pixel represents the characteristics of this area. Compared with the region-based method, the pixel-based method can give full  play to the characteristics of the pixel itself and perform the segmentation task more successfully.
Object-based classification: The basic processing unit of the object-based classification method is the segment, that is, a group of multiple pixels with the same attributes that is treated as one object. Unlike pixel-based classification methods, object-based methods divide remote sensing images into separate regions and evaluate their characteristics using spatial and spectral features. Object-based methods are also closer to the human visual understanding process: they understand semantic information by considering the different properties and spatial arrangements of these objects and then intuitively identify objects, rather than individual pixels, in images. Currently, object-based classification is also used in archaeology, exploration of glacial landforms, wetland mapping, and other applications. Object-based methods usually consist of three main parts: image segmentation, object feature extraction, and object classification, as shown in Fig. 14(a). Image segmentation, the first step, divides the remote sensing image into multiple homogeneous segments with segmentation algorithms, such as edge-based segmentation and region-based segmentation. Object feature extraction makes up for the shortcomings of pixel-based methods by extracting features such as shape, texture, and spectrum for each object. Finally, in object classification, the objects are classified by a classifier in their feature space.
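The feature extraction and classification steps of the object-based pipeline can be sketched as follows. This is only an illustration under simplifying assumptions: the segment map is taken as given (any segmentation algorithm could produce it), the only object feature is the mean spectrum, and the classifier is a nearest-prototype rule. The names `segment_mean_features` and `classify_objects` are hypothetical.

```python
import numpy as np

def segment_mean_features(image, segments):
    """Mean spectrum per segment. image: (H, W, B); segments: (H, W) ints in [0, S)."""
    flat_seg = segments.ravel()
    n_seg = int(flat_seg.max()) + 1
    counts = np.bincount(flat_seg, minlength=n_seg).astype(float)
    feats = np.zeros((n_seg, image.shape[2]))
    for b in range(image.shape[2]):
        # Sum of band b inside each segment, divided by the segment size.
        sums = np.bincount(flat_seg, weights=image[..., b].ravel(), minlength=n_seg)
        feats[:, b] = sums / np.maximum(counts, 1.0)
    return feats

def classify_objects(image, segments, prototypes):
    """Label each segment by its nearest class prototype and paint a label map."""
    feats = segment_mean_features(image, segments)
    dists = np.linalg.norm(feats[:, None, :] - prototypes[None, :, :], axis=2)
    seg_labels = dists.argmin(axis=1)
    # Every pixel inherits the label of the segment it belongs to.
    return seg_labels[segments]
```

Because the decision is made per segment, a single noisy pixel inside an otherwise homogeneous region does not change the label of that region, which is the main practical difference from a per-pixel classifier.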
In recent years, how to integrate deep learning with object-based land-cover classification has attracted the attention of many scholars. Zhang et al. [142] proposed an object-based convolutional neural network (OCNN) method for land use classification. OCNN first segments remote sensing images into linearly shaped objects and general objects and then sends them into the neural network for analysis. Timilsina et al. [143] presented a method that combines object-based postclassification refinement with CNNs: it takes optical and SAR data as input, uses the CNN to obtain coarse results, and then refines the coarse results with spatial, texture, and context features extracted with the help of object-based image analysis (OBIA). Zhang et al. [144] proposed a multilevel context-guided classification method (MLCG-OCNN) for high-resolution remote sensing images. Instead of using object and context blocks as input, MLCG-OCNN accurately identifies objects using high-level features learned from spectral patterns, geometric features, and object-level contextual information. The classification results for each object are then improved with pixel-level contextual guidance. Papadomanolaki et al. [145] introduced a novel object-based deep learning system that incorporates anisotropic diffusion data preprocessing and an extra loss to integrate object-based priors.
Pixel-based classification: Pixel-based approaches employ image pixels as the basic unit of analysis, and individual pixels are labeled as a single semantic category, such as vegetation, buildings, vehicles, or roads [as shown in Fig. 14(b)]. Early methods based on pixel-by-pixel classification mainly adopted k-means, support vector machines, neural networks, and other methods. With the improvement of remote sensing imaging technology, the resolution of remote sensing images has been greatly improved. The pixel-based method completes the segmentation task by clustering the pixels with similar features into the same category and assigning a category through the pixels' features.
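As a minimal illustration of the early clustering-based approach, the sketch below groups pixels by spectral similarity with a plain k-means loop. It is not taken from any cited work; the function names and the brightness-based deterministic initialization are assumptions of this example, and the dense distance matrix is only practical for small images.

```python
import numpy as np

def assign(pixels, centers):
    """Index of the nearest center for every pixel spectrum."""
    d = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def kmeans_pixels(image, k, n_iter=20):
    """Cluster the pixels of an (H, W, B) image into k spectral groups."""
    pixels = image.reshape(-1, image.shape[-1]).astype(float)
    # Deterministic initialization (an assumption of this sketch): sort pixels
    # by overall brightness and take k evenly spaced ones as starting centers.
    order = np.argsort(pixels.sum(axis=1), kind="stable")
    picks = order[np.linspace(0, len(pixels) - 1, k).astype(int)]
    centers = pixels[picks].copy()
    for _ in range(n_iter):
        labels = assign(pixels, centers)
        for c in range(k):
            members = pixels[labels == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    labels = assign(pixels, centers)
    return labels.reshape(image.shape[:2]), centers
```

The clusters carry no semantic names; in practice each cluster would still have to be mapped to a land-cover category, which is one reason supervised classifiers replaced this scheme.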
Peng et al. [146] proposed the cross fusion net (CFNet) based on U-Net. CFNet fuses the multiscale features by concatenation before prediction. In addition, the network includes a channel attention refinement module to select informative features and a cross fusion module to enlarge the receptive field of the low-level feature maps, which improves the segmentation accuracy of small-scale objects. Heidler et al. [147] proposed HED-UNet, which exploits the multiscale features generated in the decoding process to provide features for both semantic prediction and boundary prediction tasks.
Liu et al. [148] constructed a module based on the atrous convolution in the DeepLabv3+ network, which can arbitrarily control the depth, width, group, and step of the module with different dilation rates to make full use of local features. Peng et al. [149] ran multiscale convolution kernels in parallel to make full use of the local information of each pixel and adopted dense skip connections to mitigate the loss of high-level image features caused by the low-pass filtering nature of convolution. Shang et al. [150] combined atrous convolutions with different dilation rates, global information, and self-information to extract multiscale contextual information and address the problem of object size discrepancy in remote sensing images. Wang et al. [151] proposed a dual-channel spectral-spatial fusion capsule generative adversarial network (DcCapsGAN) for HSI classification. DcCapsGAN uses a capsule and generative adversarial network structure to overcome the limited number of training samples for high-dimensional features and to exploit spectral-spatial information more effectively.
A novel spectral-spatial transformer-M, which assembles spatial attention and extracts spectral features, was proposed to improve performance for pixels located on the boundaries between land-cover categories [152]. Wang et al. [153] proposed UNetFormer to model both global and local information for efficient semantic segmentation, achieving up to 322.4 FPS with a 512 × 512 input. Inspired by the multiscale vision transformer, He et al. [154] proposed a cross-spectral vision transformer to extract pixelwise multiscale features and enhance local details between neighboring spectral bands for HSI classification.
2) Change Detection: Remote sensing change detection (RSCD) refers to extracting and identifying the differences between multitemporal images of the same geographical area [155], [156]. As shown in Fig. 15, RSCD typically consists of preprocessing the remote sensing images (alignment, correction, noise reduction, etc.), selecting a suitable change detection method, and evaluating the results. Weismiller et al. [157] first performed change detection for coastal environments, and since then a large number of studies have been conducted on RSCD. Nowadays, RSCD plays an active part in a variety of applications, including urbanization monitoring [158], damage assessment [159], and environmental monitoring [160]. According to the analysis unit, the existing RSCD methods are classified into pixel-based, object-based, and scene-based methods, each of which has its own advantages and shortcomings [161]. In recent years, new approaches have also been developed that combine these analysis units to better extract change information.
Pixel-based change detection: Since the pixel is the most basic unit of a remote sensing image, early RSCD methods mainly employed algebraic operations to evaluate every pixel of the given remote sensing images, such as the image difference method [162] and the regression analysis method [163], [164]. RSCD can also be undertaken by means of pixel transformation, such as principal component analysis [165] and change vector analysis [164]. In pixel transformation methods, remote sensing images are transformed, combined with spatial projections, and converted into different mathematical spaces for analysis in order to further optimize various features. Due to the unpredictability of high-frequency components in high-resolution remote sensing images and the errors of geometric alignment and radiometric correction in preprocessing, traditional pixel-based methods can hardly be applied to high-resolution remote sensing images [166]. Therefore, traditional pixel-based methods are typically adopted for low- and medium-resolution images [167], [168]. In addition, pixel classification change detection is another pixel-based approach, which obtains the change matrix of an image by comparing two postclassification images; the matrix reflects the change information in the study area [164]. Such methods include postclassification comparison, unsupervised change detection methods, and artificial neural network-based methods [167]. However, supervised approaches suffer from the difficulty of selecting high-quality datasets, whereas unsupervised approaches encounter difficulties in recognizing and labeling change objects and in selecting the number of clusters [156], [164].
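The classical pixel-based operations described above can be sketched in a few lines. The example is a simplified illustration and not the procedure of any cited work: it assumes the two images are already co-registered, uses only the magnitude of the change vector, and applies a mean-plus-standard-deviation threshold, which is one common heuristic chosen here for brevity. All function names are hypothetical.

```python
import numpy as np

def change_vector_magnitude(img_t1, img_t2):
    """Per-pixel length of the spectral difference vector between two (H, W, B) images."""
    diff = img_t2.astype(float) - img_t1.astype(float)
    return np.sqrt((diff ** 2).sum(axis=-1))

def threshold_changes(magnitude, n_std=2.0):
    """Flag pixels whose change magnitude is unusually large (heuristic threshold)."""
    return magnitude > magnitude.mean() + n_std * magnitude.std()

def change_matrix(labels_t1, labels_t2, n_classes):
    """Postclassification comparison: entry (i, j) counts pixels going from class i to j."""
    idx = labels_t1.ravel() * n_classes + labels_t2.ravel()
    counts = np.bincount(idx, minlength=n_classes ** 2)
    return counts.reshape(n_classes, n_classes)
```

The diagonal of the change matrix counts unchanged pixels, and the off-diagonal entries give the "from-to" transitions that simple differencing cannot provide.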
In recent years, the rise of deep learning has led to a large number of deep learning-based semantic segmentation methods being applied to pixel-based change detection, which has greatly eased the abovementioned difficulties. For example, Wang et al. [169] introduced a hybrid affinity matrix with fused subpixel representation and proposed a convolutional neural network framework for RSCD. Daudt et al. [170] used an FCN to perform change detection on multitemporal Earth observation images. As it has been shown that obtaining contextual information in multitemporal images and combining multiscale features of change regions enables effective prediction of fine changes and improves the accuracy of change detection [171], research works combining multiscale features have been proposed. For example, Chen et al. [172] designed a multiscale feature convolution unit combined with deep siamese convolutional networks for supervised and unsupervised change detection. Moreover, aiming at further feature and information fusion, Zheng et al. [171] designed a cross-layer convolutional neural network (CLNet), which aggregates multilevel contextual information and multiscale features through two parallel branches.
Since CNN-based methods are poor at capturing long-range spatial dependencies, the transformer has also been introduced to remote sensing change detection. Chen et al. [173] proposed the bitemporal image transformer (BIT), which expresses the bitemporal images as a few semantic tokens and models the context in the compact token-based space-time with a transformer encoder. The learned context-rich tokens are then fed back into the pixel space by a transformer decoder to enhance the original pixel-level features. A pure transformer network with a siamese U-shaped structure has also been proposed to solve change detection problems [174]. In addition, some scholars have introduced graph convolutional networks [155], GANs, and DBNs into pixel-based change detection [156].
Apart from RSCD based on optical imaging, change detection in SAR images has also received attention from scholars. Because local pixels in SAR images are coherent, it is critical to reduce the impact of speckle noise while preserving image details. To address this challenge, as shown in Fig. 16, Zhang et al. [59] presented an adaptive contourlet fusion clustering algorithm together with a new FGFCM-based fast nonlocal clustering (FNLC) algorithm for SAR change detection, which leverages the change and invariant information from ratio difference images. Specifically, the contourlet fusion method first decomposes the two input ratio images with the contourlet transform (CT), which yields multiresolution and multidirectional decomposition coefficients. Then, different fusion rules are employed to fuse the low- and high-frequency coefficients of the input images, respectively. Finally, the fused coefficients are passed through the inverse contourlet transform to obtain the fused image. In addition, the proposed FNLC method classifies the changed and unchanged areas in the fused image, improving noise suppression in SAR change detection.
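For readers unfamiliar with ratio difference images, the sketch below computes two commonly used ratio operators for a pair of co-registered SAR intensity images. It is an illustration only: whether these are exactly the two ratio images fused in [59] is an assumption, the contourlet fusion itself is not reproduced, and the function names, the box-filter radius, and the `eps` guard are choices made for this example.

```python
import numpy as np

def log_ratio_image(sar_t1, sar_t2, eps=1e-6):
    """Absolute log-ratio; multiplicative speckle becomes additive after the log."""
    a = sar_t1.astype(float) + eps
    b = sar_t2.astype(float) + eps
    return np.abs(np.log(b / a))

def local_mean(img, radius=1):
    """Box-filter mean over a (2 * radius + 1)^2 window with edge replication."""
    padded = np.pad(img.astype(float), radius, mode="edge")
    size = 2 * radius + 1
    out = np.zeros(img.shape, dtype=float)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (size * size)

def mean_ratio_image(sar_t1, sar_t2, radius=1, eps=1e-6):
    """Mean-ratio operator: 0 where local means agree, close to 1 where they differ."""
    m1 = local_mean(sar_t1, radius) + eps
    m2 = local_mean(sar_t2, radius) + eps
    return 1.0 - np.minimum(m1 / m2, m2 / m1)
```

Both operators depend only on the ratio of the two acquisitions, so a common gain change between the images has almost no effect on the result, unlike plain subtraction.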
Object-based change detection (OBCD): Similar to object-based classification, the analysis unit of OBCD is the object in images. Chen et al. [176] defined OBCD as a process of applying object-based analysis to identify differences in geographic objects at different times. Typically, it consists of the following steps: creating homogeneous regions (i.e., image objects) on the basis of image segmentation, extracting change information, and identifying change areas. The OBCD method is highly sensitive to the segmentation algorithm adopted and tends to disregard semantic information as well as interobject information [177]. In addition, the selection of the scale parameter (SP), which controls the object size, is a fundamental step in OBCD. Traditional object generation methods based on mathematical approaches fail to solve these difficulties, whereas, with the rapid growth of deep learning, recent OBCD methods have solved them to some extent.
In the process of object segmentation, both insufficient and excessive segmentation lead to features that fail to reflect the real world and may produce useless objects, which can degrade performance [164]. The emergence of deep learning has made it possible to further fuse spatial features. Wang et al. [178] presented a change detection method combining multiple feature integration methods, showing that multiple object features yield higher accuracy in object-based methods with different segmentation scales and classifiers. In addition, superpixel segmentation methods are widely used to extract objects. Zhang et al. [177] proposed a superpixel enhanced CD network (ESCNet) for very-high-resolution (VHR) images to extract object information with a superpixel segmentation network. To further exploit the contextual information among objects, Zhan et al. [179] presented an unsupervised scale-driven network for VHR images with a multiscale decision fusion strategy. The network identifies change regions by fusing the change detection results obtained at various scales from SVM-based classification. It also makes full use of the spatial contextual information of image objects. Zhang et al. [180] introduced the GCN model to remote sensing OBCD and constructed graph neural networks over objects to obtain contextual information between neighboring objects, enhancing performance and computational efficiency.
Bounding box selection is another object-based approach. Among such methods, object detection algorithms, such as Faster R-CNN, are widely used; they treat the "changed regions" in the image as detection objects and the "unchanged regions" as background [161]. Zhang et al. [181] proposed a single-stage change detection model with a dual correlated attention-guided detector to enhance robustness. The input images are sent to a weight-sharing backbone to extract features at different scales. A dual correlated attention module then refines the change-related features from the channel and spatial aspects and suppresses the uncorrelated features. Han et al. [182] proposed dual regions of interest networks, consisting of three functional blocks: a feature extraction network, a change proposal network, and a different judgment network, to improve feature representation and achieve better change discrimination. Priyanto et al. [183] applied Faster R-CNN as a feature extractor to detect and monitor the changing number of floating net cages in fisheries and marine areas.
Furthermore, since pixel-based and object-based methods each have their own advantages, many scholars have combined them to achieve better performance. Lu et al. [184] proposed an unsupervised algorithm-level change detection fusion scheme that applies OBCD to improve the accuracy of traditional pixel-based change detection algorithms. Ji et al. [175] employed Mask R-CNN and MS-FCN to extract building features. As shown in Fig. 17, the building extraction network outputs object- and pixel-level building change maps and feeds them to a self-trained building change detection network to compute building change maps. Han et al. [185] suggested a weighted Dempster-Shafer theory fusion method that generates object-based change detection results by combining multiple pixel-based change detection results.
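To illustrate how evidence from several change maps can be fused with Dempster-Shafer theory, the following sketch implements Dempster's rule of combination on the two-hypothesis frame {changed, unchanged}, plus classical Shafer discounting as one possible way of weighting a source. It is not the weighting scheme of [185], whose details are not given here; the function names and the mass layout `(changed, unchanged, uncertain)` are assumptions of this example.

```python
import numpy as np

def discount(mass, weight):
    """Shafer discounting: move a share (1 - weight) of the belief to ignorance."""
    m_c, m_u, m_t = mass
    return weight * m_c, weight * m_u, 1.0 - weight + weight * m_t

def dempster_combine(mass_a, mass_b, eps=1e-12):
    """Dempster's rule for masses (changed, unchanged, uncertain); scalars or arrays."""
    a_c, a_u, a_t = mass_a
    b_c, b_u, b_t = mass_b
    # Conflict: one source says "changed" while the other says "unchanged".
    conflict = a_c * b_u + a_u * b_c
    norm = np.maximum(1.0 - conflict, eps)
    m_c = (a_c * b_c + a_c * b_t + a_t * b_c) / norm
    m_u = (a_u * b_u + a_u * b_t + a_t * b_u) / norm
    m_t = (a_t * b_t) / norm
    return m_c, m_u, m_t
```

For instance, combining the masses (0.6, 0.1, 0.3) and (0.7, 0.1, 0.2) raises the belief in "changed" above either input, since two partially uncertain sources that agree reinforce each other. Passing arrays instead of scalars applies the same rule independently at every pixel or object.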
Scene-based change detection: Remote sensing scene-level change detection (SLSCD) aims to analyze and identify land use changes in multitemporal remote sensing images of the same area from a semantic perspective [161]. Unlike pixel-based and object-based methods, SLSCD assigns land use/cover labels, e.g., industrial or residential area, to image scenes. SLSCD is mainly deployed to analyze change at the semantic level, i.e., the shift in ground cover type, and is no longer focused solely on the question of whether the ground state has changed. A number of approaches have been proposed, which are broadly classified into traditional methods and deep learning methods.
Before the surge of deep learning, approaches utilizing handcrafted features were proposed successively, such as the scale-invariant feature transform and bag of visual words (BOVW) models. Wu et al. [186] presented an SLSCD framework based on the BOVW model and a classification-based approach to extract semantic change information, in which scene images are represented by the word frequencies of three kinds of multitemporal learned dictionaries. To further exploit temporal information and compensate for the weakness of handcrafted features, some scholars introduced unsupervised methods to SLSCD. Wu et al. [187] proposed an unsupervised method based on kernel slow feature analysis (KSFA) and postclassification fusion, which combines independent scene classification with the change probability to identify scene changes and recognize transition types. Du et al. [188] proposed a latent Dirichlet allocation and multivariate alteration detection method for unsupervised scene change detection.
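The word-frequency representation used by BOVW models can be sketched as follows. This is a generic illustration rather than the framework of [186]: the codebook is assumed to be given (for example, learned beforehand by clustering local descriptors), and the chi-square distance used to compare two scenes is an assumption of this example. The function names are hypothetical.

```python
import numpy as np

def bovw_histogram(descriptors, codebook):
    """Normalized visual-word frequencies. descriptors: (N, D); codebook: (K, D)."""
    d = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    # Each local descriptor votes for its nearest visual word.
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)

def chi_square_distance(h1, h2, eps=1e-12):
    """Chi-square distance between two word-frequency histograms."""
    return 0.5 * float(np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))
```

Two acquisitions of the same scene can then be compared through their histograms, with a large distance hinting at a change in scene content; a classifier on the histograms would be needed to name the transition type.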
As a large number of annotated remote sensing scene samples have been acquired, the abovementioned traditional methods show low robustness on large-scale datasets, and the overall pipeline of some traditional methods cannot be jointly optimized [189]. Following the growth of deep learning, a number of researchers have introduced deep learning to SLSCD to overcome these difficulties. Wang et al. [190] proposed a scene change detection network named DCCANet. DCCANet extracts convolutional features through a CNN and uses deep canonical correlation analysis (DCCA) to learn nonlinear transformations of the two views of data, which enhances the correlation between the multitemporal images and yields highly correlated features.
3) Target Detection: The research of remote sensing image target detection has broad application prospects. It can be used to monitor the traffic conditions of important areas [191], such as roads, ports, and airports, and to detect aircraft in airports [192], vehicles on roads [193], and ships in ports [194]. However, owing to the complex information in remote sensing images and the small size of targets, detection methods designed for natural images cannot achieve good results on remote sensing images. Therefore, a large number of methods have been proposed for object detection in remote sensing image interpretation. Object detection determines whether there are object instances of a defined class in the input and, if so, returns the spatial location, extent, and class of each object through a bounding box [195]. With the development of deep learning, thanks to the powerful semantic representation ability of deep features extracted by neural networks, the performance of target detection has improved rapidly. Generally, deep learning-based object detection methods are divided into two categories: two-stage detection frameworks and one-stage detection frameworks [196]. The difference between them is shown in Fig. 18.
Two-stage detection frameworks: The two-stage detector first generates region proposals and then classifies the candidate boxes. For object detection in remote sensing images, besides the limited number of training samples, the biggest challenge is how to deal effectively with variations in object rotation [5]. Li et al. [197] constructed a region proposal network with additional multiangle anchors and a local contextual feature fusion network to better extract the rotation and appearance blur features of spatial objects in remote sensing images. In addition to directly extending classic two-stage detectors, such as R-CNN and Faster R-CNN, many scholars have proposed other two-stage methods according to the characteristics of remote sensing images. Zou et al. [199] designed SVDNet based on a singular value decomposition algorithm and adopted a feature pooling operation and a linear SVM classifier for ship verification. Bai et al. [198] proposed an object detection method based on time-frequency analysis for large-scale remote sensing images with complex backgrounds. They utilized wavelet decomposition for time-frequency transformation, which was then combined with deep learning for feature optimization. A feature optimization method based on deep reinforcement learning was proposed to select the main time-frequency channels. In addition, a discrete wavelet multiscale attention mechanism was designed to enable the detector to focus on object regions instead of the background, effectively extracting multiscale and multidirectional features from remote sensing images (as shown in Fig. 19).
Object detection has come a long way recently. However, the widely adopted horizontal bounding box representation is not suitable for the multioriented objects that are ubiquitous in aerial images and scene text. Xu et al. [200] proposed a simple and effective framework to detect multioriented objects (as shown in Fig. 20). Instead of directly regressing the four vertices, it glides each vertex of the horizontal bounding box along its corresponding edge to accurately describe a multioriented object. Zhou et al. [201] proposed a transformer-based correlation learning detector. It fully leverages the position information and the correlation among objects to predict rotated bounding boxes for dense objects in remote sensing images.
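The vertex-gliding representation can be illustrated by decoding a quadrilateral from a horizontal box and four gliding ratios. The sketch below is an approximation of the idea in [200] under an assumed convention: image coordinates with the y axis pointing down, one vertex per edge in the order top, right, bottom, left, and each ratio measured from the preceding box corner in clockwise order. The function names are hypothetical, and the additional factor that the original method uses to decide between horizontal and oriented outputs is omitted.

```python
import numpy as np

def decode_gliding_vertices(box, alphas):
    """Quadrilateral from a horizontal box (x_min, y_min, x_max, y_max) and 4 ratios in [0, 1]."""
    x_min, y_min, x_max, y_max = box
    a1, a2, a3, a4 = alphas
    w, h = x_max - x_min, y_max - y_min
    return np.array([
        [x_min + a1 * w, y_min],   # glides along the top edge from the top-left corner
        [x_max, y_min + a2 * h],   # glides along the right edge from the top-right corner
        [x_max - a3 * w, y_max],   # glides along the bottom edge from the bottom-right corner
        [x_min, y_max - a4 * h],   # glides along the left edge from the bottom-left corner
    ], dtype=float)

def polygon_area(vertices):
    """Shoelace formula for a simple polygon given as an (N, 2) array."""
    x, y = vertices[:, 0], vertices[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
```

With all ratios equal to zero the quadrilateral coincides with the horizontal box, and with all ratios equal to 0.5 it becomes a diamond covering half of the box area, so only four bounded scalars are regressed on top of a standard horizontal detection.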

One-stage detection frameworks: The one-stage detection framework does not generate region proposals and obtains prediction results directly from the input. Liu et al. [202] adopted the YOLOv2 architecture as the basic network for ship detection and proposed a remote sensing ship detection framework for ships in arbitrary orientations. Based on RetinaNet, Yang et al. [203] proposed the R3Det detector for the detection of rotated objects. The strategy combines the high recall rate of horizontal anchors with the adaptability of rotated anchors to dense scenes and achieves feature alignment using a dedicated feature refinement module. Wu et al. [204] proposed the optical remote sensing imagery detector (ORSIm detector), which uses spatial frequency channel features, fast feature channel scaling, and other techniques to robustly handle complex object deformation in images.
Drawing on the idea of SSD [205], Ma et al. [206] presented an end-to-end scale-aware target detection framework for multicategory target detection, addressing challenges such as large differences in the size of geospatial objects and the dense distribution of geospatial objects in the same complex scene. The framework consists of a feature separation and remerging module, an offset error correction module, and a target saliency enhancement module. The feature separation and remerging module aims to eliminate the salient information of larger objects in the shallow feature map and highlight the features of small objects. Then, the effective detail features of larger targets are passed to the deep feature map to alleviate feature confusion between multiscale targets. The offset error correction module corrects the inconsistency of the feature space layout between multilayer feature maps through the proposed offset loss function. The target saliency enhancement module enhances the target features of interest and suppresses background information through the proposed membership function. Finally, detection is performed on the multiscale feature maps containing fine target features to obtain better detection performance (as shown in Fig. 21).
To address the challenge of complex backgrounds in remote sensing image target detection, Zhang et al. [207] proposed a foreground-aware remote sensing image target detection model, which enhances the foreground awareness of the detector from the perspectives of feature relationship learning and network optimization. The method enhances the discriminative ability of foreground regions in feature maps by building a foreground relation learning module and introduces a foreground anchor loss function to make the network focus on the optimization of foreground anchors. A dual network structure based on the transformer architecture has also been proposed to hierarchically embed local features into global representations for object detection in remote sensing [208].
4) Object Tracking: Video object tracking is a fundamental prerequisite for scene content analysis and high-level vision tasks. As shown in Fig. 22, object tracking is the process of detecting and tracking objects in an image sequence, during which the object is specified in the first frame and then detected and tracked in the subsequent frames of the video [209], [210]. The main purpose of object tracking in the field of remote sensing is to track objects of interest in optical satellite video, aerial video, and UAV video. Remote sensing object tracking is used in intelligent traffic flow monitoring [211], environmental monitoring [212], UAV detection [213], etc. In this section, we focus on the recently emerging object tracking algorithms for satellite videos. Object tracking in satellite video is very different from tracking in natural video. First, satellites observe a wide area, usually covering several thousand square kilometers in a single video. Taking Jilin-1 as an example, its resolution is about 1 m, so each video frame measures several thousand by several thousand pixels.
Second, in remote sensing videos, objects of interest are often only about a dozen pixels in size with few appearance features, and it is difficult to accurately distinguish objects by appearance in complex scenes. Therefore, when designing remote sensing object tracking networks, it is necessary to compensate for the scarce appearance features with other cues, such as motion models.
Single-object tracking: Generally, generative methods and discriminative methods are the two mainstream single-object tracking frameworks. 1) Generative models: Generative tracking methods typically learn a model representing the object in the current frame. In the next frame, the candidate that is most similar to the object is selected as the tracking result. The model maximizes the similarity or minimizes the corresponding reconstruction error [214], [215]. The object models of early generative algorithms include the Gaussian mixture model, the Bayesian network model, the Markov model, etc. Wang et al. [216] proposed a coarse-to-fine high-resolution ship tracking method, in which a constrained template matching method was introduced. Frost and Tapamo [217] presented a ship tracking model with shape priors, using level set segmentation to improve the detection performance. Although generative techniques are effective in the majority of the aforementioned scenarios, most current approaches only pay attention to the characteristics of the object itself and ignore its correlation with the environment or other nonobjects. As a result, since 2010, researchers have paid more attention to discriminative approaches. 2) Discriminative models: Discriminative tracking methods usually treat tracking as a binary classification problem of distinguishing the object from the background, thereby selecting the object [215]. Currently, discriminative methods represented by the correlation filter and deep learning have achieved satisfactory results and are widely used.
The tracker based on the correlation filter extracts object features at the object position in the first frame of the video and trains a correlation filter on them. In subsequent frames, the extracted features are Fourier transformed, multiplied by the correlation filter, and then inverse Fourier transformed to obtain the response, which improves computational efficiency [218]. Du et al. [219] used the kernelized correlation filter (KCF) tracker [220], a classical correlation filtering algorithm, for remote sensing video object tracking. According to the characteristics of remote sensing images, KCF is combined with the three-frame difference method to obtain more accurate tracking results. Shao et al. [221] combined KCF and optical flow to propose a VCF tracker for satellite video object tracking.
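The Fourier-domain procedure described above can be sketched with a linear, single-sample correlation filter. This is deliberately simpler than KCF [220], which uses a kernelized formulation: the closed-form ridge solution below follows the well-known MOSSE-style filter, works on raw single-channel patches, and omits windowing, feature extraction, and online updates. The function names and the regularization value are assumptions of this example.

```python
import numpy as np

def gaussian_target(shape, sigma=2.0):
    """Desired response: a Gaussian peak at the patch center."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = h // 2, w // 2
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))

def train_filter(patch, target, lam=1e-3):
    """Closed-form filter in the Fourier domain: H* = G conj(F) / (F conj(F) + lam)."""
    F = np.fft.fft2(patch)
    G = np.fft.fft2(target)
    return G * np.conj(F) / (F * np.conj(F) + lam)

def locate(filter_conj, patch):
    """Correlate via elementwise multiplication and return the response peak (row, col)."""
    response = np.real(np.fft.ifft2(np.fft.fft2(patch) * filter_conj))
    peak = np.unravel_index(np.argmax(response), response.shape)
    return tuple(int(v) for v in peak)
```

If the object moves by a few pixels between frames, the response peak moves by the same offset from the patch center, so the displacement is read off directly; the whole search costs only a few FFTs, which is the efficiency gain mentioned above.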
Since the object lacks appearance features, VCF uses the optical flow map as the velocity feature map of the object and uses KCF to track the object on the velocity features. In addition, an inertial mechanism is designed to adaptively prevent tracking drift by exploiting the characteristics of object motion. A correlation filter-based dual-flow tracker has been proposed to explore spatial-spectral feature fusion and a motion model for small object tracking [222]. Fu et al. [223] proposed a DRCF tracker based on a double regularization strategy to suppress the detrimental boundary effect in DCF-based visual object tracking and enhance the discriminative power of the filter. Xuan et al. [224] introduced a rotation-adaptive correlation filter tracking algorithm that estimates the rotation angle of the object to address the tracking instability caused by object rotation. From the perspective of features, Liu et al. added deep VGG features to handcrafted features to represent the object and extended the correlation filter tracking method for satellite video. An occlusion judgment index is proposed, and the motion trajectory is used to compensate for occlusion.
However, correlation filter-based trackers tend to use handcrafted features, which often face challenges when the object is small and the background is complex. Deep learning techniques provide a new research direction. A deep learning-based object tracking framework extracts features from the region of interest around the position predicted in the previous frame and then builds a deep network-based discriminative model to obtain the tracking result for the object in the current frame.
Compared with the fixed object positioning scheme of correlation filtering, a deep network acquires its localization ability through learning, which makes the algorithm more flexible. The most straightforward way to use deep learning is to apply a pretrained model directly to remote sensing video tracking. For example, Hu et al. [202] proposed a CRAM network that combines deep learning with the optical flow method. Appearance features and motion features are extracted from the optical images and the optical flow images, respectively, to alleviate the tracking drift problem. Feng et al. [225] combined SiamRPN++, a classical Siamese network algorithm, with a clustering-based frame difference method and put forward CDF-SiamRPN++. In CDF-SiamRPN++, the difference map between adjacent frames is partitioned by clustering, which effectively suppresses the interference of environmental noise and retains effective motion information. Shao et al. [226] presented the HRSiam tracker, which combines high-resolution feature extraction by HRNet with the SiamRPN tracker. Since HRNet performs feature extraction and multiscale feature fusion in parallel while maintaining high resolution, applying the extracted high-resolution features to SiamRPN leads to a powerful small-object tracking capability.
Song et al. [227] also proposed a tracker based on SiamRPN++. The tracker integrates spatial and channel attention to improve tracking accuracy. Li et al. [91] proposed a CRFPF module that establishes parallel branches to extract multiscale features, and designed a collaborative attention learning mechanism to learn the relevant information that enhances the saliency of the objects. Also, an MBLT tracker is proposed to learn the motion and background of the object [228]. The tracking process of MBLT is shown in Fig. 23. First, the DCF tracker generates raw tracking results. Second, a prediction network based on FCN estimates the location probability. Third, a feasible region is segmented by FLICM. Finally, the results of the abovementioned three modules are combined to predict the tracking results. To exploit the learning ability of the neural network, deep reinforcement learning is also introduced to track objects in satellite videos. Cui et al. [218] proposed an action decision-occlusion handling network to leverage the occlusion information and drive actions under occlusion.
Multiobject tracking: Compared with single-object tracking, multiobject tracking in remote sensing video has greater application prospects. In military use, it allows suspicious objects to be monitored continuously and enemy intelligence to be obtained; in civilian use, it can monitor traffic flow for statistical analysis and provide data support for urban management. In remote sensing video, the objects of multiobject tracking fall into three categories: aircraft, ships, and vehicles. Because aircraft are large and ships moving on the sea surface are sparse and rarely occluded, few papers have performed multiobject tracking for these two types of objects. He et al. [229] designed an algorithm for two types of objects in satellite video, ships and aircraft. This algorithm models multiobject tracking from a multitask learning perspective as a graph information inference process. Through the spatiotemporal relationship module of the graph, the algorithm can mine the potential higher order relationships in the graph.
Compared with the tracking of aircraft and ships, the multiobject tracking of vehicles has received extensive attention. Xiao et al. [230] formulated the tracking problem as relational graph matching. A joint probability relational graph method is proposed to integrate the road map and the motion of the vehicle to obtain high detection and tracking accuracy in wide-area videos. Zhang et al. [231] proposed a two-step global data association algorithm: first, the local trajectories of the vehicles are generated, and then the local trajectories are merged into global trajectories. The trajectory association model defines a trajectory transition matrix based on Kalman filtering to link trajectories separated by larger time intervals. At the same time, a double-layer k-shortest-path optimization method is used to obtain an approximately optimal solution to the association problem.
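A Kalman-based trajectory transition matrix can be sketched as follows. This is an illustrative simplification, not the model of [231]: it assumes a constant-velocity state, unit measurement noise, and a plain nearest-cost match in place of the double-layer k-shortest-path optimization. The tracklet dictionary layout and all names are hypothetical.

```python
import numpy as np

def predict(x, P, steps, q=0.01):
    """Repeat the Kalman predict step over a gap; state x = [px, py, vx, vy]."""
    F = np.array([[1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    Q = q * np.eye(4)                        # process noise (assumed value)
    for _ in range(steps):
        x = F @ x
        P = F @ P @ F.T + Q
    return x, P

def transition_cost(tracklets, max_gap=30):
    """cost[i, j]: Mahalanobis distance between tracklet i extrapolated to the
    start frame of tracklet j and the observed start position of tracklet j."""
    n = len(tracklets)
    cost = np.full((n, n), np.inf)
    H = np.eye(2, 4)                         # only the position is observed
    for i, a in enumerate(tracklets):
        for j, b in enumerate(tracklets):
            gap = b["start_frame"] - a["end_frame"]
            if i == j or gap <= 0 or gap > max_gap:
                continue                     # j must start after i ends
            x, P = predict(a["end_state"], a["end_cov"], gap)
            S = H @ P @ H.T + np.eye(2)      # innovation covariance
            r = b["start_pos"] - H @ x
            cost[i, j] = float(np.sqrt(r @ np.linalg.solve(S, r)))
    return cost

def tracklet(start_frame, start_pos, end_frame, end_state):
    return {"start_frame": start_frame, "start_pos": np.array(start_pos, float),
            "end_frame": end_frame, "end_state": np.array(end_state, float),
            "end_cov": np.eye(4)}

tracklets = [
    tracklet(0, [80, 50], 10, [100, 50, 2, 0]),     # lost after frame 10
    tracklet(20, [121, 50.5], 40, [160, 50, 2, 0]), # plausible continuation
    tracklet(20, [60, 90], 35, [70, 95, 1, 0]),     # unrelated vehicle
]
cost = transition_cost(tracklets)
best = int(np.argmin(cost[0]))               # tracklet most likely to continue tracklet 0
```

Because the predicted covariance grows with the gap, longer interruptions tolerate larger position errors, which is what allows trajectories separated by longer intervals to be linked.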
Ahmadi et al. [232] applied background subtraction to detect moving vehicles and estimate the trajectory, speed, and other information of each vehicle. Zhang et al. [233] also used background subtraction to detect moving vehicles and applied dynamic association methods to match the objects. Ao et al. [234] established a local noise model to distinguish vehicle objects through an exponential probability distribution. Jie et al. [235] proposed a cross-frame keypoint detection network (CKDNet) and a spatiotemporal motion information guided tracking network. CKDNet assists the detection of keypoints by collecting complementary information between frames and efficiently tracks densely arranged vehicles by building a two-branch long short-term memory. Wu et al. [236] used slow features (SFs) and motion features to guide multiobject tracking, in which bounding box proposal-guided NMS modules based on SFs enhance the detection of regions of interest.
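Background subtraction for moving-vehicle detection and speed estimation can be sketched on synthetic data. The sketch assumes a static background estimated by a temporal median and a fixed threshold; it is not the specific pipeline of [232] or [233], and the frame sizes and names are illustrative.

```python
import numpy as np

def detect_moving(frames, thresh=25.0):
    """Median background subtraction over a frame stack of shape (T, H, W)."""
    background = np.median(frames, axis=0)
    return np.abs(frames - background) > thresh

# Synthetic sequence: flat background with one bright 2x2 "vehicle" moving right.
T, H, W = 9, 20, 40
frames = np.full((T, H, W), 50.0)
for t in range(T):
    frames[t, 10:12, 4 * t:4 * t + 2] = 200.0

masks = detect_moving(frames)

# Trajectory as the per-frame centroid of the foreground mask (row, column).
centroids = [np.argwhere(m).mean(axis=0) for m in masks]

# Mean horizontal speed in pixels per frame.
speed = float(np.diff([c[1] for c in centroids]).mean())
```

Multiplying the pixel speed by the ground sampling distance and the frame rate would convert it to a ground speed; both values depend on the sensor and are not assumed here.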

5) 3-D Reconstruction in Remote Sensing: 3-D reconstruction is a fundamental challenge in a wide range of remote sensing applications [237]. In this section, remote sensing 3-D reconstruction is mainly investigated according to data sources and stereo matching algorithms, as shown in Fig. 24.
Different source data-based remote sensing reconstruction: According to the data source, existing remote sensing 3-D reconstruction can be divided into optical satellite-based, LiDAR-based, and UAV aerial photography-based reconstruction methods [238], [239], [240]. Digital surface modeling and 3-D reconstruction based on optical satellite technology are also called visual stereo mapping. It mainly uses optical remote sensing satellites to perform high-precision stereoscopic observations of the ground to obtain ground models. Similar to optical satellite imagery, the ground imagery acquired by UAV aerial photography is also visual. It has been demonstrated to be an efficient and reliable tool for generating high-precision reconstructions and models of topographic and historical landscape structures [241]. Langhammer et al. [241] used drones to obtain images of abandoned landscapes built for wood flow and then performed 3-D reconstruction, which is of great significance for water resource management.
For the 3-D reconstruction of optical images, some self-supervised techniques minimize the distance between the 2-D projection of the reconstruction result and the input image. Some unsupervised methods are based on generative adversarial networks to reconstruct 3-D shapes. By contrast, remote sensing data based on LiDAR scanning have high resolution and strong reliability. The seamless and accurate elevation data obtained by LiDAR have many applications in the Earth sciences. A LiDAR device produces point cloud data. Each point contains 3-D coordinate information and sometimes includes color information, reflection intensity information, echo frequency information, etc. In short, the digital elevation model, digital surface model, and digital orthophoto that can be generated from LiDAR are used in various areas, such as urban 3-D modeling, natural disaster assessment, and resource survey.
In addition, the interferometric SAR tomography technology is also used to invert the scattering intensity of ground objects at different heights on the vertical ground, so as to perform 3-D radar imaging. Tomography technology makes it possible to reconstruct the vertical elevation and direction structure of ground objects and has great application potential in terrain mapping, forest parameter estimation, 3-D modeling of urban buildings, and imaging of historical relics.
Stereo matching: Stereo matching is significant for 3-D reconstruction and is broadly applicable, so it has become a research hotspot of 3-D reconstruction. The general process of stereo matching is shown in Fig. 25. After the image is preprocessed, the global method (the path on the right in Fig. 25) uses global information to find the optimal disparity result for each pixel so that the overall matching cost is minimized [242]. This step is called disparity optimization. The disparity calculation in the local stereo matching algorithm is generally simpler: the winner-take-all (WTA) strategy is used to search for the disparity directly. Both methods require disparity postprocessing. After the disparity map is initially obtained, its results are checked, and possible matching errors are found and corrected. Common disparity postprocessing methods include left-right consistency detection, occlusion filling, and weighted median filtering. Disparity calculation has become one of the research focuses in existing stereo matching. Depth and disparity can be directly converted to each other, so depth estimation has also become a research hotspot of stereo matching.
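The local pipeline described here, WTA disparity search followed by left-right consistency detection, can be sketched as follows. The sketch assumes a rectified image pair and a per-pixel absolute-difference cost without window aggregation, which practical local methods would add. The synthetic pair and all names are illustrative.

```python
import numpy as np

def cost_volumes(left, right, max_disp):
    """Absolute-difference costs with the left and the right image as reference.
    A left pixel at column x matches the right pixel at column x - d."""
    H, W = left.shape
    cost_l = np.full((max_disp + 1, H, W), np.inf)
    cost_r = np.full((max_disp + 1, H, W), np.inf)
    for d in range(max_disp + 1):
        diff = np.abs(left[:, d:] - right[:, :W - d])
        cost_l[d, :, d:] = diff          # valid where x - d >= 0
        cost_r[d, :, :W - d] = diff      # valid where x + d < W
    return cost_l, cost_r

def winner_take_all(cost):
    """Pick, for every pixel, the disparity with the lowest matching cost."""
    return np.argmin(cost, axis=0)

def left_right_check(disp_l, disp_r, tol=1):
    """Mark left pixels whose disparity agrees with that of their right match."""
    W = disp_l.shape[1]
    x_right = np.clip(np.arange(W)[None, :] - disp_l, 0, W - 1)
    return np.abs(disp_l - np.take_along_axis(disp_r, x_right, axis=1)) <= tol

# Synthetic rectified pair with a constant true disparity of 4 pixels.
rng = np.random.default_rng(0)
H, W, true_d = 8, 48, 4
left = rng.random((H, W))
right = rng.random((H, W))
right[:, :W - true_d] = left[:, true_d:]

cost_l, cost_r = cost_volumes(left, right, max_disp=8)
disp_l, disp_r = winner_take_all(cost_l), winner_take_all(cost_r)
valid = left_right_check(disp_l, disp_r)
```

Pixels that fail the check, typically in occluded regions or near the image border, would then be handled by occlusion filling and filtering.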
Depth estimation: Depth estimation uses RGB images from one or multiple viewing angles to estimate the distance of each pixel in the image relative to the shooting source. It is a critical step in scene reconstruction and understanding and is part of 3-D reconstruction. Apart from the costly approach of obtaining depth point clouds with LiDAR or with the reflection of structured light on the object's surface, the most common traditional depth estimation methods are monocular and binocular ranging. By comparison, monocular ranging is computationally complex and less accurate than binocular ranging, and it is often used when conditions are challenging. Deep learning has also continued to advance depth estimation methods.
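For binocular ranging with a rectified frame-camera pair, depth and disparity are related by Z = f · B / d, where f is the focal length in pixels, B is the baseline, and d is the disparity in pixels. The sketch below applies this relation under the pinhole assumption; pushbroom satellite sensors need a sensor-specific model instead. The numerical values are illustrative.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    """Convert disparity (pixels) to depth (metres) with Z = f * B / d.
    Pixels with non-positive disparity are assigned infinite depth."""
    d = np.asarray(disparity, dtype=float)
    depth = np.full(d.shape, np.inf)
    np.divide(focal_px * baseline_m, d, out=depth, where=d > 0)
    return depth

depth = disparity_to_depth([[8.0, 4.0], [0.0, 2.0]], focal_px=800.0, baseline_m=0.5)
```

Because depth is inversely proportional to disparity, a fixed disparity error produces a larger depth error for distant points than for near ones.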
1) Monocular and binocular disparity estimation: Depth estimation methods are mainly divided into monocular and binocular estimation. There are many deep learning-based monocular ranging methods. For example, Facil et al. [243] proposed CAM-Convs, a convolution that takes the camera parameters into account so that the neural network can learn calibration-aware patterns. Wang et al. [244] exploited motion, one of the most important cues of the human visual system, and employed an RNN trained with multiview image reprojection to improve monocular depth estimation. Tosi et al. [333] proposed monoResMatch, a deep learning framework that infers depth from a single input image by synthesizing features from a different viewpoint, horizontally aligned with the input image, and performing stereo matching between the two cues. The survey in [245] reviews deep learning-based binocular depth estimation and compares 16 methods from 2018 and 2019, including GANet [246], PSMNet [247], and SegStereo [248]. In recent years, some more advanced methods have also appeared, such as PlaneMVS [249], NerfingMVS [250], and those in [251] and [252]. The architecture overview of PSMNet is shown in Fig. 26. It is a typical model for binocular disparity estimation. The left and right images are the model's input. A CNN serves as the feature extraction module, together with a spatial pyramid pooling (SPP) module for feature harvesting. Then, the extracted features are concatenated as the input of the cost volume module. Finally, a 3-D CNN with upsampling and regression modules is designed for cost volume regularization and disparity regression.
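The disparity regression step at the end of such a network can be sketched in isolation. PSMNet regresses disparity as the probability-weighted sum of candidate disparities, with probabilities given by a softmax over the negated cost. The sketch below applies that operation to a toy cost volume with NumPy. The quadratic cost is an assumption standing in for the output of the 3-D CNN.

```python
import numpy as np

def disparity_regression(cost):
    """Soft-argmin over a cost volume of shape (D, H, W); lower cost is better.
    Returns the expected disparity per pixel, which can be sub-pixel."""
    logits = -cost
    logits = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=0, keepdims=True)
    candidates = np.arange(cost.shape[0], dtype=float)[:, None, None]
    return (prob * candidates).sum(axis=0)

# Toy cost volume: a quadratic bowl centred on a true disparity of 5.3 pixels.
D, H, W = 16, 2, 3
candidates = np.arange(D, dtype=float)[:, None, None]
cost = 0.5 * (candidates - 5.3) ** 2 * np.ones((1, H, W))

soft = disparity_regression(cost)      # close to 5.3 at every pixel
hard = np.argmin(cost, axis=0)         # winner-take-all gives the integer 5
```

Unlike the hard argmin, the weighted sum is differentiable, so the whole network can be trained end to end with a regression loss on the disparity.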
Several studies address disparity estimation in remote sensing [253], [254], [255]. Among them, Yu et al. [253] mainly used the 2-D discrete wavelet transform to enhance the local invariant features of the existing weighted α-shape (WαSH) method. It is used for remote sensing images with less affine distortion and less noise. Experiments show that it can effectively alleviate the image matching problems caused by geometric distortion and radiometric distortion in stereo remote sensing images. In addition, a novel edge-aware bidirectional pyramid stereo matching network is suggested in [254] to enhance performance in textureless regions while preserving the primary structure. It can effectively solve the problem of poor disparity estimation accuracy caused by areas occluded by high-rise buildings and by textureless areas. Jia et al. [255] used a CNN to match remote sensing stereo images of featureless areas, such as the lunar surface.
2) Single-view and multiview 3-D reconstruction: From the perspective of view, existing related algorithms can be divided into single-view and multiview 3-D reconstruction methods. Single-view 3-D reconstruction refers to the 3-D reconstruction of an image or target given a single image. The majority of single-view 3-D understanding techniques currently in use employ an encoder-decoder structure, where the encoder converts the input image into a latent representation and the decoder must engage in complex analysis of the 3-D structure of the output space [256]. Although single-view 3-D reconstruction can generate different 3-D results (such as point clouds or meshes), it can also handle many disordered images. Remote sensing image reconstruction in three dimensions is essential for tracking changes to the Earth's surface. In [257] and [258], the authors sequentially predict depth from a given image and then estimate a single-view spherical map from the depth under the same view.
However, single-view 3-D reconstruction results usually lack completeness and accuracy, especially when there are obstacles or occluded areas. Multiview 3-D reconstruction alleviates and solves the abovementioned problems to a certain extent. There are two main types of multiview reconstruction: one is to reconstruct stationary objects from images of two or more views, and the other is to reconstruct 3-D shapes of moving objects from video or multiple frames [259]. Multiview reconstruction is flexible and scalable, which can be adapted to large-scale scenarios. Of course, there are still numerous obstacles to overcome in order to accurately rebuild multiview depth maps in urban landscapes, such as the presence of repeating textures and texture-poor places. To address the aforementioned issues, Hu et al. [260] proposed a multiview 3-D reconstruction (IMGTR) method based on image triangles. Rupnik et al. [261] proposed to generate high-quality digital surface models by combining many depth maps that were calculated using a dense image matching method. It performs well at reconstructing surface discontinuities, repeating patterns, and nontextured surfaces.

A. Public Datasets
Deep learning algorithms have demonstrated excellent performance in various fields. This is inseparable from the use of large amounts of finely labeled data for neural network training. Researchers need labeled data to develop algorithms for different applications. Commonly used remote sensing datasets are summarized in Table II and categorized according to the tasks for which they are mainly applied.

B. Software Platforms
In recent years, Earth observation technology has developed tremendously, and large-scale remote sensing data are stored, recorded, and opened for free use by society and researchers [262], [263], [264]. However, traditional remote sensing interpretation methods require users to download and process data on local computers. For example, image processing platforms, such as the Environment for Visualizing Images (ENVI), can perform image enhancement, orthorectification, data fusion and transformation, knowledge-based decision tree classification, and other functions on the image after the user obtains the data. This platform is offline software installed on a single machine that assists people in data preprocessing and simple image recognition tasks [265]. As data volumes grow, storing and interpreting data locally faces enormous challenges in computing power.
Platforms for remote sensing applications have started to move toward the cloud as the Internet has grown [266], [267]. The remote sensing platform deployed in the cloud has the following characteristics.
1) The cloud platform can provide abundant storage and computing resources, so users can efficiently process large-scale remote sensing data.
2) Computation-intensive tasks are performed on cloud servers, reducing the computing power requirements of the user's computer and lowering the threshold for software use.
3) Users can access the platform from any device that can access the Internet and perform tasks, such as remote sensing image processing and analysis, anytime and anywhere.
4) By accessing the platform through web pages, users can obtain the latest data and platform updates at any time and use the latest algorithms to process the latest data, which improves work efficiency.
The abovementioned advantages are not available in traditional image processing tools. Therefore, various research institutes and companies have invested in constructing remote sensing cloud platforms. These platforms can perform interpretation services in the cloud without downloading the data locally. They integrate various tools and applications to provide users with a complete data acquisition and processing solution. From the perspective of usage, the mainstream platforms can be classified into two types. One is the remote sensing data cloud platform for professional users with programming tools. This type of platform, such as Google Earth Engine (GEE), requires users to use the provided application programming interfaces (APIs) for data manipulation and processing. Through various flexible APIs, professionals can customize functions and algorithms for their own needs. The other type is the remote sensing data cloud platform for ordinary users. This type of platform, such as the Remote Sensing Data Intelligent Interpretation Platform and SenseEarth, further encapsulates data and algorithms. Users only need to select or upload data in the corresponding format and select the interpretation task. The platform then automatically runs the algorithms and visualizes the data and results. Through simple and convenient operation, ordinary practitioners can also interpret remote sensing data, which benefits the civilian promotion of remote sensing technology. In this section, we select the GEE platform and the Remote Sensing Data Intelligent Interpretation Platform for introduction and show the specific characteristics of these two types of platforms, respectively.
1) Google Earth Engine: GEE is a remote sensing interpretation cloud platform launched by Google in 2010. This platform is one of the most popular big data geographic information processing platforms. The platform provides users free services to discover, analyze, and visualize big geospatial data based on Google's computing infrastructure.
In GEE, different third-party network applications can be implemented through the interfaces provided by the platform. For researchers who use the GEE platform, it is essential to use the API provided by the platform. GEE provides APIs in two languages, JavaScript and Python, to meet the needs of most programmers. Through different APIs, users can easily access data, use various applications provided by GEE, and view the running results in real time. The platform is divided into three parts: Data catalog and Explorer, Code editor, and Timelapse.
Data catalog and explorer: The data catalog contains a significant amount of geospatial data. It collects numerous publicly accessible satellite images, including the Landsat, MODIS, and Sentinel images, as well as numerous atmospheric, meteorological, and vector datasets. The datasets cover various satellite and aerial systems for optical imagery, environmental variables, weather and climate forecasting, land cover, and socioeconomic data. Fig. 27 is the data catalog content page for the data captured by the multispectral instrument of the Sentinel-2 satellite. This content page shows visual thumbnails of the data, the time when the data are available, the dataset provider, the API used to access the data, and a detailed description of the data. Users can browse the data catalog, select the required dataset according to the dataset description, and use the provided API to obtain the data. The data can then be quickly visualized via the Explorer provided by GEE.
Code editing platform: The code editing platform is GEE's main platform for data acquisition, processing, analysis, and visualization. As shown in Fig. 28, the code editor is mainly divided into four functional blocks: visualization area, script manager, code editor, and information bar.
At the bottom of the page is the visualization area of the code editing platform. This is the main area for user interaction and for data and result visualization. This area uses the world map as the base map to provide basic geographic location information. The data and code analysis results are displayed by stacking multiple layers. Users can drag and zoom the results in the visualization area, mark a position by clicking, and so on. The location information of the marker is displayed on the Inspector page of the information bar. On the left-hand side of the page is the script manager, which stores scripts edited by users and sample scripts provided by the GEE platform. Through the manager, users can select or delete their scripts. The sample scripts provided by GEE cover image acquisition, preprocessing, visualization, drawing, etc., and include demos, such as classification, climate modeling, and terrain visualization, to provide users with complete code usage demonstrations.
In the middle of the page is the code editor area. Because the code runs on the cloud platform infrastructure provided by Google, users editing JavaScript and Python code do not need to consider the code running environment. After writing the code, users can run it directly by clicking the "Run" button at the top of the page.
On the right-hand side of the page is the information window. The information window includes the Inspector, the Console, etc. The Inspector displays information about the user's markers on the map. The Console displays the printed output of the running code.
From simple mathematical operations to sophisticated image processing and ML functions, a wide range of operations is available on the platform.
By writing code, users can fully utilize the functions of the GEE platform. The GEE platform provides rich data and APIs, which are the main reasons for its widespread use. However, since it is free and open to the public, computationally intensive tasks, such as deep learning, cannot be widely supported. Users are limited to varying degrees in training models, acquiring data, and designing new methods and functions.
Timelapse: Based on nearly 40 years of data stored on the GEE platform, the Timelapse project generates a zoomable video of the entire globe. The project stitches one image per year into a video for each region, showing people how the Earth changes in time and space. The project presents realistic records of natural and human activities, such as glacier melting, bushfires, and urban development.
2) Remote Sensing Data Intelligent Interpretation Platform: Different from Google Earth Engine in Section V-B1, "Remote Sensing Data Intelligent Interpretation Platform" is designed to meet practitioners' need to interpret remote sensing data. By encapsulating the relevant functional blocks, users can straightforwardly operate the platform. With the help of artificial intelligence algorithms, the platform integrates available blocks, such as data interpretation, data management, and scene application, which realizes algorithm processing automation and data interpretation results visualization. The platform can perform real-time extraction and identification of target information from full-modal remote sensing data, such as panchromatic, visible, multispectral, hyperspectral, SAR images, and satellite videos. Currently, the platform has opened four primary functions, including land-cover classification, object detection and recognition, element change detection, and intelligent video interpretation, which offers practitioners technical assistance for processing data from remote sensing.
As shown in Fig. 29, the system architecture of the "Remote Sensing Data Intelligent Interpretation Platform" comprises three parts: data storage layer, platform service layer, and platform operation layer.
The data storage layer mainly contains user, configuration, and image data. User data record relevant information about users. Configuration data record relevant information about remote sensing data. Image data include public datasets provided by the platform and private datasets uploaded by users that are only visible to their owners. Image data are all stored in the cloud, which significantly reduces the pressure of user data storage and can quickly provide data support for interpretation tasks.
The platform service layer mainly includes three parts: data management service, data interpretation service, and task management service. The data management service manages the user data, image data, and interpretation results. It operates on the data in the cloud according to the instructions of the platform operation layer. The data interpretation service integrates various artificial intelligence algorithms. It receives the interpretation tasks through the platform operation layer and then efficiently completes tasks, such as land-cover classification, target detection, change detection, and video target tracking. The task management service is mainly responsible for data retrieval, parameter transfer, and task scheduling. When the user creates multiple tasks through the platform operation layer, the service needs to schedule the tasks and provide them with the corresponding initialization parameters and image data.
The platform operation layer mainly consists of user authentication, user data, and task processing operations. User authentication operations use the user information stored in the cloud for authentication and give users operation privileges. User data operations can read and modify user data, image data, and interpretation results in the data storage layer. The task processing operation is mainly responsible for assigning tasks to the platform service layer and providing feedback on exception information and log information from the platform service layer.
Based on the abovementioned architecture, the platform contains two critical systems: the User Interaction System (UIS) and Data Interpretation System (DIS).
The client is mainly the interface between the user and the platform. Users can access the client through a browser to perform data uploading, browsing, interpretation task execution, and analysis and display the result. The server is responsible for data storage management and performing different interpretation tasks.
UIS: The UIS is the core system for users to interact with the platform. Users can use the Internet to access web pages at any time and enter the UIS to perform interpretation tasks after logging in and authenticating. Fig. 30 shows the system operation page after login. The system operation page is divided into four areas: task module, data list, data display area, and function module. In the task module, users can choose the type of task they want to perform. In the data list, public datasets and privately uploaded remote sensing images are displayed as thumbnails. The data display area displays the remote sensing images selected by the user and the corresponding interpretation results in real time. Users can drag and zoom in this area for data browsing. The function module provides users with functions, such as "image transparency selection," "visualization channel selection," "image zooming," "interpretation result selection," and so on.
DIS: As the core of the Remote Sensing Data Intelligent Interpretation Platform, the DIS is mainly responsible for intelligently interpreting remote sensing images, efficiently and accurately mining the useful information of remote sensing images, and providing users with real-time analysis services of remote sensing data. The platform contains four primary tasks: land-cover classification, object detection and recognition, element change detection, and intelligent video interpretation. Each task is divided into subtasks according to the target type and data source, such as SAR, visible, multispectral, and hyperspectral. Land-cover classification is divided into road classification, water classification, building classification, and land-cover classification. Object detection and recognition is divided into aircraft, bridge, and ship detection. Intelligent video interpretation is divided into single-target tracking, multitarget tracking, and moving target detection. The available tasks are given in Table III. Fig. 31 shows the interpretation results of some tasks, such as change detection of SAR, ship detection of SAR, water classification of HSI, object tracking, and multiaircraft tracking.
TABLE III: AVAILABLE TASKS IN BIG DATA INTELLIGENT INTERPRETATION PLATFORM

C. Hardware Systems
In conventional research, researchers usually use multiple graphics processing units (GPUs) or computer clusters for algorithm research [302], [303], but they ignore the constraints of energy consumption and computing resources. Although many algorithms can achieve excellent results under GPU acceleration, they are still far from meeting the requirements of actual industry. Many complex models cannot be deployed on small devices or computed in real time, which are the main problems troubling many engineers.
In applying remote sensing algorithms, the research and development of hardware systems are more urgent. Currently, most remote sensing algorithms are run at ground computing stations, which significantly limits the application of remote sensing technology and the complete mining of remote sensing data. The main existing requirements are divided into three points.
1) Real time: Real time can also be called nondelay, which requires equipment to have a fixed processing time when processing data to ensure stable processing of data streams.
2) Data volume: The amount of data captured by remote sensing satellites is significant, and not all data can be sent to the ground for processing. This requires hardware devices that can be mounted on aircraft and satellites for processing and that transmit only essential data to improve data utilization efficiency.
3) Power consumption: Airborne and satellite-based devices require low power consumption because they rely on batteries and other limited power supplies. Low power consumption can prolong operating time.
Therefore, this section summarizes the mainstream hardware platforms and selects field programmable gate array (FPGA) devices, which are easy to develop, computationally stable, and low power, for further research.
1) Classification of Hardware Systems: All chips capable of running AI algorithms, including CPUs, can be called AI chips. In the traditional von Neumann structure, each instruction executed by the CPU needs to read data from memory and operate on the data according to that instruction [304]. As a result, the CPU is responsible not only for data operations but also for executing commands, such as memory reading, instruction analysis, and branching. However, most AI algorithms, especially deep learning algorithms, usually require a lot of data processing. When the CPU executes such an algorithm, it is limited to serial execution and spends a lot of time reading and analyzing data and instructions. This is why the CPU is not well suited to the parallel processing of intensive data, and such algorithms cannot fully utilize the chip's potential. Therefore, computing is usually performed heterogeneously, combining a CPU and a computing card. The CPU performs data reading and other operations on the data, and the computing card implements large-scale and intensive mathematical calculations. Generally speaking, AI chips refer to chips that are different from CPUs and are specially designed for acceleration according to the characteristics of artificial intelligence algorithms. According to the technical architecture, AI chips can be divided into GPUs, application-specific integrated circuits (ASICs), FPGAs, and neuromorphic computing chips [305] (as shown in Fig. 32).
GPU: The GPU has a relatively straightforward architectural design. As a result of the majority of transistors forming several dedicated circuits and pipelines, the GPU outperforms the CPU in terms of computation performance. The GPU also has strong floating-point computing capabilities, which can help deep learning algorithms overcome the computing pressure and release the full potential of AI. GPU development has reached a relatively mature stage at this time. GPUs are being used by businesses, such as Google, Facebook, Microsoft, Twitter, and Baidu, to analyze image, video, and audio assets to improve search engines and image intelligence software. In addition, GPU is appropriate for various industries, such as VR/AR and unmanned driving. But GPUs also have some limitations. Training and inference are the two phases of the deep learning algorithmic process. The GPU platform is a productive platform for training algorithms. However, when processing a single input for inference, the benefits of parallel computing cannot be completely realized. The GPU also consumes a lot of power and cannot work independently. A CPU is required to schedule it to work.
ASIC: The ASIC is a customized chip designed to meet a particular requirement. Its customized features improve the performance-to-power ratio for high-performance, low-power mobile applications and bring advantages in reliability and integration. Google's TPU, Cambrian Chips, Horizon's BPU, and Amazon's Inferentia are all ASIC chips. ASIC devices are well suited to artificial intelligence applications.
First, the fully customized circuit of an ASIC can boost performance. Google's TPU is 30 to 80 times quicker than CPU and GPU solutions while using less power and space. Second, downstream demand encourages the specialization of artificial intelligence chips. Because of real-time requirements and the privacy of training data, the computing in many application scenarios cannot rely wholly on the cloud and must be supported by local software and hardware. However, the long design cycle of an ASIC cannot keep pace with the rapid advancement of algorithms, which restricts its use.
FPGA: FPGA stands for "field programmable gate array." Two characteristics stand out when comparing the FPGA with the CPU. First, the FPGA does not carry the overhead of the CPU's memory and control structure, so data reading is quicker. Second, it uses less energy because it does not need to fetch and decode instructions. The FPGA also differs from the GPU: thanks to its pipeline parallelism and data-parallel processing capabilities, it provides more pronounced efficiency improvements in specific applications. The FPGA is frequently employed in the inference phase of deep learning algorithms because it is well suited to processing data on a hardware pipeline and delivers excellent performance. In addition, the FPGA offers advantages over the ASIC in design flexibility and development speed. Modified algorithms can be deployed on the FPGA without redesigning the circuit. Because of this flexibility and performance, it frequently replaces the ASIC in various industries.
Neuromorphic computing chip: A neuromorphic computing chip is a circuit that simulates the computing mechanism of the brain from a structural perspective. This technology is still in the development stage, and its research can be divided into two levels. The first is the neural network level, which corresponds to neuromorphic architectures and processors. Its memory, CPU, and communication components are fully integrated, and information is processed locally, eliminating the usual speed bottleneck between computer memory and the CPU. Neurons can communicate with one another readily and swiftly, and they activate simultaneously as long as they receive pulses (action potentials) from other neurons. The TrueNorth chip from IBM and the Tianjic chip from Tsinghua serve as examples. The second is the level of neurons and synapses, where the innovation lies in the components themselves. For instance, the world's first artificial stochastic phase-change neurons, capable of high-speed unsupervised learning, were produced by the IBM Zurich Research Center [306]. Although neuromorphic computing chips are not yet fully developed and remain some distance from large-scale application, they have the potential to revolutionize computer architecture.
2) FPGA Structure and Advantages: As early as the 1960s, Gerald Estrin proposed the concept of reconfigurable computing, but it was not until 1985 that Xilinx introduced the first FPGA chips. Although the parallelism and power consumption of the FPGA platform are excellent, the platform has not received much attention because of its high reconfiguration cost and complicated programming. Although FPGAs are more difficult to develop for than GPUs and CPUs under the von Neumann architecture, they still have many advantages, which are discussed in five aspects [307]. According to the parallel mode (Fig. 33), FPGA implementations can be divided into three categories: independent parallel processing of data blocks with serial internal calculations; processing of the data as a whole with parallel internal calculations; and parallel processing of data blocks with parallel internal calculations.
(1) Data parallel, calculation serial: This mode suits cases in which the data blocks are only weakly correlated and can be processed independently, while the operations of each step depend causally on one another. Remote sensing images usually observe a specific area with large image width and high data volume. Data parallelism with serial computation is a popular parallel technique that is basic and easily scalable. Li et al. [308] deployed a large-scale, real-time remote sensing tree canopy detection algorithm on an FPGA by dividing the original large-scale scene data into small blocks. They optimized and adjusted the original method, which is based on a local maximum filter, to reduce FPGA resource utilization, reduce idle cycles, and balance the utilization of different resources. Ortiz et al. [309] proposed a parallel endmember extraction method for on-orbit HSIs based on the Fast UNmixing algorithm. This method divides the original HSI into fixed-size subimages and iteratively extracts endmembers from the subimages. The technique can be applied broadly across computing settings and scales well with respect to processing performance and energy efficiency. In addition, the block-based partition scheme provides higher fault tolerance, which suits remote sensing satellite environments with strong space radiation and vulnerable hardware. González et al. [310] implemented the ATGP-OSP target detection method, which is based on the orthogonal projection operator, on an FPGA. That work analyzes the orthogonal projection operator, in which the matrix inversion can be highly parallelized by the Gauss-Jordan elimination method. A memory access module is designed into the system, and prefetching reduces the delay of input and output communication and improves operating efficiency. Báscones et al. [311] applied low-complexity predictive lossy compression to HSI compression. The image is processed in parallel in blocks, and the iterative optimization process of each spectral channel is highly streamlined. A large number of FIFOs are used, which significantly reduces DSP usage at the expense of slightly more memory, and the HSI is compressed in real time while satisfying several quality requirements.
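The "data parallel, calculation serial" pattern described above can be illustrated in software. The following is a minimal sketch, not the implementation of any cited work: the tile size, the per-tile maximum search, the function names, and the use of a thread pool as a stand-in for independent FPGA processing units are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_blocks(image, block):
    """Cut a 2-D list into non-overlapping tiles; edge tiles may be smaller."""
    tiles = []
    for r in range(0, len(image), block):
        for c in range(0, len(image[0]), block):
            tile = [row[c:c + block] for row in image[r:r + block]]
            tiles.append(((r, c), tile))
    return tiles

def block_maximum(tile):
    """Serial computation inside one tile: scan pixels one after another."""
    best, best_pos = tile[0][0], (0, 0)
    for i, row in enumerate(tile):
        for j, value in enumerate(row):
            if value > best:
                best, best_pos = value, (i, j)
    return best, best_pos

def process_image(image, block=2, workers=4):
    """Process tiles independently (data parallel) and merge the results."""
    tiles = split_into_blocks(image, block)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of the input tiles.
        results = list(pool.map(lambda t: block_maximum(t[1]), tiles))
    # Convert tile-local positions back to global image coordinates.
    return [((r + i, c + j), value)
            for ((r, c), _), (value, (i, j)) in zip(tiles, results)]
```

Because each tile is handled without reference to its neighbors, a failed tile could in principle be recomputed on its own, which is consistent with the fault-tolerance argument made for block-based partitioning.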
(2) Data are processed as a whole, and the internal calculation is parallel: This mode applies when the data blocks are related to one another but the operations of each step can be performed independently. González et al. [312] proposed an FPGA implementation of the pixel purity index (PPI). The endmember string projections in the PPI method are independent of one another and can be executed simultaneously, so the method is very suitable for parallel processing. In addition, the dot product in the endmember string projection can also be computed pixel by pixel; that is, data parallelism can be realized. However, because this would require additional computing resources to process intermediate results and would lengthen the clock cycle, only the dot products between each endmember string and pixel are performed in parallel.
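To make the independence of the projections concrete, the following is a minimal software sketch of the general PPI idea, assuming the commonly described formulation in which pixels are projected onto random vectors and the extreme pixels of each projection are counted. It is not the FPGA design of [312]; the function name, the Gaussian random vectors, and the parameter defaults are illustrative assumptions.

```python
import random

def pixel_purity_index(pixels, num_vectors=100, seed=0):
    """Count how often each pixel is an extreme of a random projection.

    pixels: list of spectra, each a list of band values.
    Returns one count per pixel; high counts suggest spectrally pure pixels.
    """
    rng = random.Random(seed)
    bands = len(pixels[0])
    counts = [0] * len(pixels)
    for _ in range(num_vectors):
        # Each random vector is independent of the others, so these
        # iterations could run in parallel on separate hardware units.
        vector = [rng.gauss(0.0, 1.0) for _ in range(bands)]
        # Each dot product is also independent across pixels.
        projections = [sum(p * v for p, v in zip(pixel, vector))
                       for pixel in pixels]
        counts[projections.index(max(projections))] += 1
        counts[projections.index(min(projections))] += 1
    return counts
```

In this sketch, mixed pixels that lie strictly inside the convex hull of the pure spectra are never selected, so their counts stay at zero.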
(3) Data are divided into blocks, and operations are parallelized: This mode can theoretically utilize computing resources most efficiently. However, the cost of distributing and integrating data must be considered in practical applications. Lei et al. [313] further analyzed the ATGP method based on data parallelism and proposed a vectorization method for the operator matrix. The vector projections are calculated in parallel so that the operator is updated in a single step, which significantly reduces computation time. The execution of convolutional neural networks also exhibits a high degree of parallelism: pixels at different locations can be processed in parallel, and standard convolutional layers contain multiple filters. Because of hardware limitations, however, it is impossible to exploit all parallel modes fully. Therefore, the authors in [314], [315], and [316] divided the filters into multiple groups. During computation, each group of filters is moved along the channel dimension, and intermediate results are stored in an accumulation buffer until the last channel yields the convolution result at the current position.
Moreover, the channel-wise convolution described above is carried out simultaneously at different pixels, and the result is the feature map produced by the current convolution layer. Zhang et al. [317] proposed an independent dual-channel DDR hierarchical storage scheme for storing and reading weight parameters and feature data. The scheme uses ping-pong buffering to avoid conflicts between storing the output feature maps of each layer and accessing the input feature maps.
This improves processing efficiency on the hardware by resolving the mismatch between FPGA storage and bandwidth in the parallel implementation of CNNs. Zhang et al. [318] proposed a three-level memory access architecture comprising off-chip memory, on-chip buffers, and local storage. The CNN's parameters are stored in off-chip memory, and the convolution processing engine receives image data from the input buffer. Because hardware logic and memory resources are limited, there is no way to set up enough hardware modules to compute an entire layer at once. Each convolutional layer often has several convolution processing engines, each with a local memory for storing intermediate results.
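The filter-grouping strategy described for [314], [315], and [316] can be sketched as a software loop nest. This is a hedged illustration rather than any cited design: the function name, the `group_size` parameter, the stride-1 valid convolution, and the dictionary used as an accumulation buffer are assumptions chosen to keep the example small.

```python
def conv2d_grouped(feature, filters, group_size):
    """Valid, stride-1 convolution computed one filter group at a time.

    feature: feature[c][y][x]; filters: filters[f][c][ky][kx] (square kernels).
    """
    channels = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    k = len(filters[0][0])
    out_h, out_w = height - k + 1, width - k + 1
    output = [[[0.0] * out_w for _ in range(out_h)] for _ in filters]
    for start in range(0, len(filters), group_size):
        group = range(start, min(start + group_size, len(filters)))
        for y in range(out_h):
            for x in range(out_w):
                # Accumulation buffer: one partial sum per filter in the group.
                acc = {f: 0.0 for f in group}
                for c in range(channels):  # move along the channel dimension
                    for f in group:        # filters in a group could run in parallel
                        acc[f] += sum(feature[c][y + i][x + j] * filters[f][c][i][j]
                                      for i in range(k) for j in range(k))
                # After the last channel, the buffer holds the final results.
                for f in group:
                    output[f][y][x] = acc[f]
    return output
```

The group size only changes how much work is scheduled together, not the numerical result, which is why it can be tuned to the available hardware resources.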

VI. TOP TEN OPEN PROBLEMS
The application of deep neural networks in remote sensing has become a major trend. However, modern deep learning still has many unsolved problems. Since humans can handle all kinds of complex tasks dynamically, brain-inspired algorithms offer a new research paradigm. Studying the properties of the brain can effectively compensate for the current shortcomings of deep learning. By reviewing the brain properties and the current development of remote sensing image interpretation, we summarize ten future research directions and challenges.

A. How to Design Brain-Inspired Algorithms That Mimic Brain Structure?
The structure of the human brain is hierarchical, sparse, and periodic. At present, the algorithms designed in the field of remote sensing follow a fixed structure. For example, convolution is widely used in image processing tasks to extract features, which amounts to a simple simulation of the low-level layers of human vision. In addition, the connections of neural networks are dense, whereas in the human brain the low-level visual layers are sparse. Neural network designs can partly meet task requirements, but they are still far from the structure of the brain.
The spiking neural network [319] is a neural network that more closely simulates the structure of the human brain. Incoming information accumulates on each neuron, which realizes the activation and inhibition of signals. Because this structure is closer to that of the human brain, it realizes sparse connections in information processing. The capsule network [320] also models neurons, representing the pose information of features through vectors.
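The accumulation-and-firing behavior of a spiking neuron can be illustrated with a leaky integrate-and-fire model, a widely used simplification. The sketch below is an assumption-laden illustration and not the specific model of [319]; the function name and the `threshold`, `leak`, and `reset` values are hypothetical defaults.

```python
def simulate_lif(input_current, threshold=1.0, leak=0.9, reset=0.0):
    """Simulate one leaky integrate-and-fire neuron over discrete time steps.

    Positive inputs excite the neuron, negative inputs inhibit it.
    Returns a list of 0/1 values, one per time step (1 means a spike).
    """
    potential = 0.0
    spikes = []
    for current in input_current:
        # The membrane potential decays (leaks) and accumulates the new input.
        potential = leak * potential + current
        if potential >= threshold:
            spikes.append(1)   # fire a spike
            potential = reset  # and return to the resting value
        else:
            spikes.append(0)
    return spikes
```

With a constant input of 0.5 per step, this neuron stays silent for two steps and fires on the third, so its output is sparse in time even though the input is dense.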
These algorithms that mimic brain structure have been extensively studied on natural data. However, due to the complex characteristics of remote sensing data, brain-inspired algorithms still need further exploration in remote sensing.

B. Interpretability of Brain-Inspired Remote Sensing Algorithms
Currently, using neural networks to improve the accuracy and efficiency of algorithms is the mainstream approach. However, the inner mechanisms of neural networks and the choice of parameters have not been well studied. As a result, the outputs of these algorithms are not completely credible or reliable in real environments. Therefore, the core of brain-inspired remote sensing research is to mimic the cognition, perception, and other abilities of the human brain in order to propose algorithms with high interpretability.
There is very little research on the interpretability of existing remote sensing algorithms. Hong et al. [321] discussed the development of interpretable hyperspectral artificial intelligence algorithms from the perspective of nonconvex modeling optimization. Many shallow algorithms can be explained by combining them with physical knowledge. However, interpretability research on deep algorithms remains a very difficult problem. Guo et al. [322] used the interpretable CNN framework [323] to prune the network. This class of methods adds additional losses to the filters in the network to achieve interpretable learning for different classes. In addition, the transformer uses attention to build the neural network. Like our brains, it shows the ability to successfully handle a disordered flow of information [324]. Furthermore, its attention maps also provide a degree of interpretability.
These studies can improve the interpretability of algorithms to a certain extent. Future remote sensing algorithms still need to incorporate brain properties and physical knowledge to improve interpretability.

C. Constructing the Causal Reasoning Ability of Brain-Inspired Remote Sensing Algorithms
The brain is a complex, intelligent structure using knowledge and facts to reason and make conclusions. It makes inferences about things based on perceptions acquired by different organs. These abilities all boil down to causal reasoning. As an emerging theory, causal inference has gradually formed its theoretical system to guide the algorithm design of artificial intelligence.
Currently, many researchers are also trying to add reasoning ability to algorithms for remote sensing data interpretation. Mou et al. [325] designed a spatial correlation module to construct long-range correlations among objects in the scene. This module provides relation-enhanced feature representations that improve the accuracy of semantic segmentation. Cao et al. [326] also tried to model and reason about global relational information. Their method improves the performance of HSI denoising from the perspective of spatial pixels and channels. The relational reasoning network [327] was proposed for salient object detection in optical remote sensing images. These methods all focus on designing a network structure that constructs relationships between feature channels and thereby reasons about the data. Brain-inspired algorithms based on reasoning are therefore still at an early stage.
Deep learning now needs to move from data-based to knowledge-based approaches. As an essential way to utilize knowledge, causal inference is a focus of brain-inspired algorithm research. Causal inference has three hierarchical levels: association, intervention, and counterfactual. These levels formalize the reasoning and decision-making of the human brain. Combining them with remote sensing data interpretation tasks will effectively improve the performance of those tasks and the interpretability of remote sensing algorithms.

D. Generalization Ability of Remote Sensing Algorithms
Remote sensing data have diverse and complex characteristics, but current algorithms can only handle a single dataset. Even for the same task, a model cannot be applied to data captured at different ground sample distances (GSDs), spectral resolutions, and times. Repeatedly training models to adapt to different data is therefore a waste of resources. The human brain has strong learning and generalization capabilities. By imitating its learning and memory capabilities, we can design dynamic networks that learn from and utilize a variety of data, improving the transferability of algorithms across data.
At the same time, remote sensing image interpretation involves various tasks, such as classification, detection, and tracking. Most algorithms are designed to deal with a single task, yet these tasks are correlated. The brain can use knowledge of related tasks to assist the interpretation of the current task, thereby improving accuracy and speed. For example, knowledge of the relationship between planes and airports can help us ignore irrelevant areas and rapidly localize the planes. Fusing these tasks requires a unified brain-inspired remote sensing framework that performs joint learning of multiple tasks and simulates the way humans utilize information, so that the tasks complement one another.
From another perspective, the remote sensing data collected are always a small subset of the entire Earth. In the open world, the performance of algorithms degrades and is still difficult to estimate. The human brain can recognize that an object is of an unknown type. For unknown objects or categories, it can express uncertainty about the result so that different strategies can be applied to uncertain data. Such uncertainty estimation is of great significance in the practical use of remote sensing algorithms. In the natural image domain, there have been many studies on open-set data. Such algorithms can identify unknown samples and separate them into unknown classes [328], [329]. Therefore, open-set algorithm design is also an important part of brain-inspired remote sensing. It requires algorithms to face open-world data outside the training set with the abilities of self-adaptation, self-induction, and self-learning, and to deal with uncertain results. Their judgments should also yield reasonable results according to the geographical conditions of different regions.
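One simple way to attach an uncertainty decision to a classifier is to threshold the maximum softmax probability and report low-confidence inputs as unknown. The sketch below illustrates this baseline only; it is not the method of [328] or [329], and the function names, class names, and threshold value are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_open_set(logits, class_names, threshold=0.7):
    """Return (label, confidence); label is 'unknown' when confidence is low."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    if confidence < threshold:
        # The model is not sure enough: defer to a different strategy.
        return "unknown", confidence
    return class_names[best], confidence
```

A downstream system could route the samples labeled as unknown to a human analyst or to a later labeling round, which matches the idea of applying different strategies to uncertain data.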

E. How to Implement a Remote Sensing Algorithm With Temporal Memory and Self-Learning?
The observation of remote sensing information is a continuous process. Satellites capture images with a certain periodicity, and regularly revisiting local areas yields a series of temporal observations. Existing remote sensing algorithms usually consider only the interpretation of a single image or obtain the changed area from two images. However, geographic information forms a time series and changes continuously, and interpreting a single image cannot predict future changes. Therefore, building memory and autonomous learning into algorithms is a direction for future brain-inspired remote sensing. Brain-inspired algorithms that memorize and learn from continuous data make it possible to predict future situations. According to the prediction results, we can dynamically adjust the capture frequency of satellites over different areas, observe high-risk areas more intensively, and improve the disaster early-warning ability of remote sensing algorithms.

F. How to Utilize Large-Scale Unlabeled Remote Sensing Data?
With the increasing number of satellites, we have acquired a large amount of remote sensing data. However, modern deep learning algorithms rely on massive amounts of labeled data for supervised training, which requires a lot of manpower and resources. To utilize the large amount of unlabeled data, semisupervised and self-supervised learning have become a new research trend.
Semisupervised learning combines supervised and unsupervised learning. It uses a small amount of labeled data to train a basic model, which is then used to exploit a large amount of unlabeled data. Self-supervised learning uses the consistency among multiple views of the data to train the network. It constructs multiple views of a single target by random augmentation or other strategies and achieves considerable performance.
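View consistency is often enforced with a contrastive objective, in which the embedding of one view should be closer to the other view of the same target than to views of other targets. The following is a minimal sketch of one such objective (an InfoNCE-style loss for a single anchor), assuming precomputed embedding vectors; the function names and the temperature value are illustrative and not drawn from the surveyed works.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Contrastive loss for one anchor: low when the positive view is the
    most similar candidate, high when a negative is more similar."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability assigned to the positive pair.
    return log_sum - logits[0]
```

In a remote sensing setting, the anchor and the positive could be embeddings of the same location from two sensors or two augmentations, while negatives come from other locations.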
In the field of remote sensing, multisource data naturally constitute a multiview representation of a target, meeting the needs of unsupervised and self-supervised learning. When using unlabeled data, interference from natural factors, such as cloud occlusion, and multisource data matching errors also needs to be considered.

G. How to Integrate Multimodal Dynamic Data for Interpretation?
In order to monitor the Earth comprehensively, satellites carry sensors with various GSD and imaging methods. The diverse data collected by these sensors bring great challenges to the design of algorithms.
At present, the main approach is to use data fusion algorithms that exploit multimodal data to improve model performance, which has been widely studied. The fusion of grayscale images and HSIs is a typical example. Grayscale images have high spatial resolution (fine GSD) but contain only a single band, whereas HSIs have high spectral resolution but low spatial resolution (coarse GSD). Therefore, these two kinds of data complement each other well.
Data fusion can effectively improve the performance of algorithms. With the improvement of imaging technology, dynamic data, such as optical satellite videos and SAR remote sensing videos, have also been developed. In the future, how to fuse dynamic multimodal data will be a problem deserving of study.

H. Big Model of Remote Sensing
With the development of deep learning, Big Models have demonstrated an unprecedented ability to understand and create, breaking the limitation that traditional AI can only handle a single task and bringing humans one step closer to the goal of general artificial intelligence. In 2020, OpenAI released the pretrained model GPT-3 [330] with 175 billion parameters. It can not only write articles, answer questions, and translate but also carry on multiround dialogue, write code, and perform mathematical calculations. However, there are still many technical difficulties in making a Big Model versatile across all modalities and all tasks. At the same time, because of limited computing resources, its training and application are quite challenging.
There are few studies on Big Models for remote sensing. Using the reasoning ability of a Big Model, it is possible to fully mine various remote sensing data and connect various tasks. The goal of establishing a Big Model for remote sensing is to solve the problem of fusing and utilizing remote sensing data captured in different modalities, at different GSDs, and at different times, and to cover a series of remote sensing applications, such as classification, detection, and tracking.
The emergence of Big Models has changed our understanding of algorithms. However, their expensive computation is not practical at this time, so how to use Big Models is a crucial problem for remote sensing. In the future, knowledge distillation, model pruning, and other techniques can be used to transfer the understanding learned by a Big Model into a small model for specific tasks, thereby improving the generalization ability of specialized models.
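Knowledge distillation is commonly formulated as matching the softened output distribution of a large teacher with that of a small student. The sketch below shows this standard temperature-scaled divergence for a single sample; the function names and the temperature value are assumptions for illustration, and a practical system would combine this term with a supervised loss.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions.

    A higher temperature spreads probability over more classes, exposing
    the teacher's knowledge about how classes relate to one another.
    """
    p = softmax([t / temperature for t in teacher_logits])
    q = softmax([s / temperature for s in student_logits])
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge, so minimizing it pulls the small model toward the behavior of the large one.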

I. Security of Remote Sensing Algorithm During Training and Inference
Nowadays, more and more data are leveraged to train large models, and the security of remote sensing algorithms has become an issue worthy of attention. It can be divided into two aspects. On the one hand, a large amount of remote sensing data from different regions is needed for training to improve the generalization ability of models. Because of the particular nature of remote sensing data, much of it contains sensitive information related to countries or companies. Many studies have shown that a network may leak data during training [331]. Therefore, it is urgent to study how to ensure data security during training and to realize federated learning for multiparty training on remote sensing data.
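In federated learning, each party trains on its own data and shares only model parameters, which a coordinator then aggregates. The following is a minimal sketch of the widely used weighted-averaging aggregation step, assuming each model is a flat list of parameters; the function name and data layout are hypothetical, and the sketch does not by itself guarantee privacy.

```python
def federated_average(client_weights, client_sizes):
    """Aggregate client models into one global model.

    client_weights: one parameter list per client (all of equal length).
    client_sizes: number of local training samples held by each client.
    Only parameters are exchanged; raw images never leave the clients.
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(weights[i] * size for weights, size in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]
```

Weighting by sample count lets parties with more data contribute proportionally more to the global model, while parties with sensitive imagery keep that imagery local.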
On the other hand, the ability to resist external attacks during forward inference also needs attention. In natural scenarios, many studies of attacks on neural networks have shown that fixed neural networks are prone to misjudgment under minor perturbations. The same situation exists in remote sensing algorithms. If such an attack were mounted against remote sensing algorithms that automate decision-making, it would have serious consequences. Small perturbations do not affect the human brain's judgment of objects. Therefore, remote sensing algorithms need to simulate the memory and associative abilities of the human brain to achieve robustness to attacks.
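The sensitivity to small perturbations can be demonstrated even on a linear classifier. The sketch below applies a sign-of-gradient perturbation (in the style of the fast gradient sign method) to a hypothetical logistic model; the weights, input, and perturbation size are made-up values chosen only to show a decision flipping, and the function names are illustrative.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def score(weights, bias, x):
    """Linear decision score; positive means class +1, negative class -1."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def sign_gradient_attack(weights, x, label, epsilon):
    """Perturb each input value by at most epsilon to increase the loss.

    For a logistic loss on a linear model, the gradient with respect to x
    has the sign of -label * w, so the attack steps in that direction.
    """
    return [xi - epsilon * label * sign(w) for xi, w in zip(x, weights)]
```

As a usage example under these assumptions, with weights `[1.0, -2.0, 0.5]`, zero bias, and input `[0.2, -0.1, 0.2]`, the original score is positive, while a perturbation of at most 0.2 per value makes it negative and flips the decision.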

J. Brain-Inspired Remote Sensing Software and Hardware Systems
As the commercial satellite industry has matured, remote sensing data interpretation has become more than just a need of professionals. Most existing remote sensing data platform software requires professionals to design and operate corresponding algorithms for different tasks and data, which restricts the widespread civilian use of remote sensing algorithms. Therefore, remote sensing data interpretation software needs algorithms that cover a variety of tasks and apply to different data, and it must also be easy to use. A remote sensing software system designed to meet these requirements can provide a comprehensive interpretation of data through simple operations. Users can choose to view the interpretation results of tasks, such as object classification and target detection, for any category according to actual needs.
In terms of hardware systems, on-orbit processing of data can more effectively improve data utilization and save data transmission bandwidth. In this review, we introduced the FPGA and its application in remote sensing. To run algorithms directly on aircraft or satellites, we can deploy them on space-grade FPGAs to ensure the stability of the system in extreme environments. However, the computing power and extremely low power consumption of neuromorphic computing chips are even more promising. For example, TianjicX [332] has realized a cat-and-mouse game experiment with ultra-low power consumption and low delay. The total dynamic power consumption of the chip in the experiment is only 0.6 W. When remote sensing algorithms run on neuromorphic computing chips, an on-orbit satellite can process data in real time with ultra-low power consumption. Only the data with research value will be transmitted back to the ground after preprocessing, which improves the efficiency of data collection and analysis. However, research on neuromorphic computing chips is still in its infancy, and remote sensing algorithms need further research before they can be deployed on such chips. The chips themselves also need further research to improve their stability in space and thus meet the needs of on-orbit data analysis.

VII. CONCLUSION
In this survey, we systematically discussed brain-inspired algorithms in remote sensing. We first summarized the structure and properties of the brain. These properties cover six aspects: sparsity, learning, selectivity, directionality, plasticity, and diversity, which can guide readers to think about brain-inspired remote sensing interpretation algorithms from the perspective of these characteristics. Further, we summarized the data types and the development of five tasks in remote sensing, i.e., object classification, object detection, change detection, object tracking, and 3-D reconstruction. The public datasets, software platforms, and hardware systems were also discussed. Brain-inspired algorithms in remote sensing are still not fully explored, and their further development will help us overcome future challenges.