GAN-Based Satellite Imaging: A Survey on Techniques and Applications

Satellite image analysis is widely used in many real-time applications, from agriculture to the military. Because Generative Adversarial Networks (GANs) are applied across so many areas of satellite imaging, a comprehensive review of this area is needed. This paper takes the first step in this direction by categorizing GAN-based satellite imaging research using seven considerations. We discuss not only the challenges but also future research trends and directions. Among the major findings, we observe increasing componentization and modularization of GANs so that they can be used as elements of larger systems. In addition to the GAN types specific to each application, we present the deep neural network architectures used as the generator structure. Finally, we summarize the results and evaluate the significant impact of GANs on improving performance compared to traditional approaches.


I. INTRODUCTION
The associate editor coordinating the review of this manuscript and approving it for publication was Gang Mei.
Compared to ground-view image analysis, satellite imaging [1], [2] is in some ways more challenging, not least because of the limited availability of datasets and the cost of data collection. For these reasons, existing high-resolution satellite datasets are task-specific and cover only a limited number of cities [3]. The high cost of acquiring and annotating satellite images makes synthetic training data a reasonable answer to the need for large quantities of training data. As a result, Generative Adversarial Networks (GANs) are widely used for satellite image synthesis in defense and non-defense areas, and substantial work has been done to explore this area. Data augmentation is the primary application of GANs for improving classification scores in many imaging domains. However, the application of GANs [2], [4], [5], [6] in satellite imaging is not limited to data augmentation. A set of satellite imaging applications with GAN-based solutions already exists in the imaging research community. GAN applications in satellite imaging are so diverse that they need to be organized into categories, each with several functionalities. To the best of our knowledge, no attempt has been made to survey previous relevant research in this area of applications. In this paper, we take the first step in this direction by assigning GAN applications in satellite imaging to seven different categories. Afterward, we review each category's functionalities, challenges, results, and datasets. Our contributions are as follows.
• We categorize GAN applications in satellite imaging into seven different areas: augmentation, segmentation, localization, translation, object detection, image reconstruction, and surveillance.
• We provide an in-depth review of GAN functionalities (for each application), including change detection and prediction (surveillance), cloud removal (image reconstruction), object enhancement (object detection), translation (map synthesis), cross-view synthesis (localization), road-building extraction (segmentation), and image-object augmentation (data augmentation).
• We present the most important GAN types, deep architectures, datasets, challenges, statistics, evaluation metrics, and result summaries related to each functionality.
• We not only summarize the results but also evaluate the significant impact of GANs using paired and unpaired t-tests.
• We discuss the future trends and research directions in GAN-based satellite imaging.
The rest of this paper is organized as follows. Section II presents the research methodology. Section III introduces the classification of GAN-based research in the field of satellite imaging. Section IV reviews the GAN functionalities corresponding to each category. Section V introduces the deep neural networks used for the generator architecture. Section VI summarizes the challenges, and Section VII presents the evaluation metrics. Section VIII reviews the results summary. Section IX presents likely future research trends in this area, and finally, Section X concludes the paper.

II. METHODOLOGY
This research provides analysis and classification of publications related to GAN-based satellite imaging. It identifies key applications, functional datasets, techniques, GAN types, and challenges. To collect the related research, many published papers were considered from different resources, including IEEE, Springer, Elsevier, etc. The following methodology was used for the paper selection: 1) search Google Scholar and Google Patents; 2) use defined keywords to find papers with a potential connection to GAN-based imaging; 3) gather a list of candidate papers from steps 1 and 2; 4) remove any sources that are not research studies; 5) remove any sources in which GANs have not been employed as a part of the proposed solution; 6) remove any sources that proposed an adversarial approach but not specifically a GAN-based method; and 7) classify papers with respect to their research applications. The challenge of collecting relevant material is that many important published papers cannot be found by a simple search for "GAN-based satellite imaging". To find a wide range of related works, we narrowed down the main keyword to more specific keyword phrases such as "GAN-based semantic segmentation", "GAN-based cross-view synthesis", etc. Figure 1 shows the flowchart of our research method, and Figure 3 shows the percentage of related works in terms of venue and publication year.

A. DOMAIN OF DATA
The ability of satellites to capture extraordinary amounts of data in spatial, temporal, and spectral dimensions has enabled researchers to develop many algorithms to analyze and extract meaningful information useful for a variety of downstream applications. This research can be classified based on the mentioned dimensions as shown in Figure 2.

III. CLASSIFICATION OF GAN-BASED SATELLITE IMAGING RESEARCH
In this section, we categorize GAN-based satellite imaging research into seven categories based on published papers. Figure 4 shows the seven categories with the related functionalities and corresponding GAN types used in previous research.

A. SATELLITE IMAGE SEGMENTATION
Segmentation has found widespread application in the domain of satellite imaging, including but not limited to image augmentation, object detection, change detection, geolocalization, and cross-view image synthesis [7]. Dimensionality reduction in satellite images via segmentation has many applications in satellite image processing, including road extraction; building extraction, which is important for urban planning; and climate change detection, which is essential in sustainable development and forest preservation research. Road and building extraction and land cover classification in satellite image analysis are based on the semantic segmentation task [8], which is the process of associating each pixel of an image with a class label [9]. According to [8], there are big differences between satellite imagery and the everyday pictures found in datasets such as PASCAL VOC2012 [10] and Microsoft COCO [11]. Satellite imagery assumes a bird's-eye view acquisition, and thus objects lie within a flat 2D plane, and every pixel in a satellite image has semantic meaning. In contrast, the PASCAL VOC2012 dataset assumes a human-level point of view, and its images thus mainly consist of meaningless backgrounds with a few foreground objects of interest [9]. On the other hand, tasks like building extraction have their own challenges, including the different appearances, shapes, materials, and surroundings of buildings in different cities. These challenges make it difficult to apply a model trained on one city to other cities. For this reason, no generalizable model yet exists that can achieve the desired accuracy across different satellite images. As a consequence, a trade-off exists between accuracy and generalization in all types of ML models trained on satellite images [12].

B. LOCALIZATION
Training a model to generate realistic scenes has always been a challenging task in computer vision, especially when translating between images taken from drastically different views. This is mainly because extracting semantic information across views is not trivial [13]. Another type of knowledge extraction from satellite images is geo-localization. According to [14], the core task in geo-localization is to determine the real-world geographic location (e.g., latitude and longitude) of a given input image. This task, in turn, has applications in scene localization for social media, unmanned driving, navigation, and augmented reality.

C. IMAGE TRANSLATION
Creating maps is one of the most important tasks from a commercial value perspective, and is valuable to companies in different areas, from ride-sharing and food delivery to military, intelligence, and international security [15]. However, it is a very expensive and time-consuming process [16]. Generative models can address this problem by learning the patterns that relate the input image to the output image (which is called image translation), enabling the conversion of satellite images to the corresponding maps. Image-to-image translation techniques such as Conditional Generative Adversarial Networks (CGANs) are used to generate the corresponding human-readable map for a given region [16].

D. OBJECT DETECTION
GANs have been widely used to enhance the style or appearance of satellite images by transferring the images into a target domain [17]. In general, objects in satellite images suffer from low resolution and insufficient color information due to (often widespread) distortion. Because of this, the detection of weak objects in satellite imagery remains a challenge [17]. According to [17], appearances and qualities of remote sensing images [18] are affected by different atmospheric conditions, quality of sensors, and radiometric calibrations. As a consequence, the generalization of a deep learning or other machine learning model would be compromised in the absence of image enhancement which can improve the visual effects of remote sensing images [18].
VOLUME 10, 2022

E. IMAGE RECONSTRUCTION
Removing clouds from high-resolution satellite images is an essential pre-processing step, since weather conditions inevitably affect such images. According to [19], clouds in satellite images can be assigned to three categories: thin clouds, thick clouds, and cloud shadows. All three are preferably removed in the pre-processing step. We discuss cloud removal in more detail in the next section.

F. SURVEILLANCE
From monitoring changes in land cover to agricultural surveillance, sequences of satellite images treated as time-series data offer researchers a rich source of information. Other applications include urban expansion analysis, coastal/riparian change detection, and flood risk assessment [19]. In this regard, GANs can play a significant role in synthesizing forecast satellite images over a specified time range.

G. DATA AUGMENTATION
A large number of training samples is vital for supervised semantic segmentation. In the absence of sufficient instances of objects belonging to each class, the trained model will not be able to correctly learn the characteristics of the target objects. This will result in poor performance or even failure of semantic segmentation tasks. To tackle this problem, GAN-based data augmentation has been used more frequently in recent years [28]. Abady et al. [29] used DCGANs to generate multispectral images of the kind taken by aerial devices or remote sensing satellites. Multispectral images are important because they provide additional information that the naked eye cannot detect. Huang et al. [28] proposed an object-level remote sensing image augmentation approach based on U-Net-based Generative Adversarial Networks. Howe et al. [30] proposed a two-step data augmentation method. First, they used progressive GANs [12] to generate synthetic segmentation masks. Second, they translated the masks into synthetic satellite images using conditional GANs.

IV. FUNCTIONALITIES OF SATELLITE IMAGING
This section provides an in-depth review of GAN-based functionalities in each of the categories discussed earlier.

A. OBJECT EXTRACTION
Image segmentation is one of the most important stages in geospatial information system (GIS) analysis. Elements of GIS analysis include rare object detection, cross-view image synthesis, change detection, and urban-infrastructure expansion analysis [25]. In the case of satellite image processing, previous works have focused on three types of segmentation tasks: road extraction, building extraction, and general segmentation, which aims to extract not only roads and buildings but other objects like cars and airplanes. Table 1 summarizes the most important previous works in GAN-based satellite image segmentation.

1) ROAD EXTRACTION
According to [23], roads act as a fundamental unit for many geographic information system applications, such as vehicle navigation, traffic management, and emergency response. They are also essential elements of military surveying and mapping. Traditional road network extraction is done manually and requires massive effort and human resources, whereas aerial images provide a rich source of information about the ground cover. Given high-resolution satellite images, ML-based road extraction has become the first choice in satellite image processing [23], and GAN-based methods are increasingly used in this direction. In general, by taking advantage of CGANs, the road extraction task becomes an image-to-image translation task [20], [21]. Shi et al. [20] used an encoder-decoder architecture in the generator and added an entropy loss term to the loss function. The main contribution of their work is the use of SegNet [31] as the generator architecture to ensure the consistent resolution required in the presence of complex occlusions like cars and trees. However, pixel-level road extraction needs a large amount of memory. To address this problem, Costea et al. [21] used a graph representation for roads. They used a two-stage framework to extract roads, in which two GANs were first used to detect roads and intersections. The best covering road graph was then found by applying a smoothing-based graph optimization procedure. Both methods chose an encoder-decoder architecture for the generator, which limits the generator's ability to produce fine detail [23]. To address this problem, Zhang et al. [23] proposed an improved GAN architecture with two advantages over previous works: (i) a simpler architecture than the two-stage method of [21], and (ii) the use of a content-based loss function to make sure that the results are accurate. One common problem of the aforementioned methods is low performance on imperfect road structures. To address this problem, Abdollahi et al. [27] proposed a GAN-based deep learning approach for road segmentation from high-resolution aerial imagery, with a modified U-Net model (MUNet) in the generative part of the GAN and edge-preserving filtering as the pre-processing phase. To alleviate overfitting, Hu et al. [3] proposed a diversity-sensitive loss that forces the generator to produce different synthetic images. Inspired by SinGAN [32], they also proposed conditional-SinGAN (cSinGAN) to restrict the synthesized images to follow the desired scene structures described by the mask.

2) BUILDING EXTRACTION
Accurate building extraction using semantic segmentation from high-resolution images is used in applications like urban planning, updating of geospatial databases, and disaster management. Building extraction is especially challenging due to obstacles such as cars, vegetation cover, and tree shadows in the satellite images [25], and most non-adversarial methods exhibit poor performance in building segmentation. To tackle these issues, Aung et al. [24] applied a conditional Generative Adversarial Network (CGAN) to extract building footprints from GeoEye images of Yangon city, Myanmar. The main contribution of their research is an analysis of the performance of pix2pix with different hyper-parameters to find the best configuration. However, they did not evaluate the flexibility of the pix2pix architecture in the face of occlusion. Abdollahi et al. [25] showed the failure of traditional conditional GANs to handle the heterogeneous occlusions in remote sensing imagery. To obtain a non-noisy segmentation map with high spatial contiguity, they used SegNet with BConvLSTMs as the generator of the proposed GAN model. The advantage of this architecture is that a set of convolutional filters coupled with hyperbolic tangent functions helps the model learn the structure of the data.

3) WEAK OBJECT EXTRACTION
Because labeling satellite images for segmentation is cumbersome, research rarely focuses on extracting multiple objects, which requires a highly representative and diverse labeled training set. Desai and Ghose [7] proposed an active learning-based sampling strategy to overcome the challenge of labeling a highly representative set of training data. Active learning has been used before in semantic segmentation to detect the most informative patches of the input image. Desai and Ghose, however, used active learning as a sampling strategy to find the most informative images in the given dataset.
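The image-level sampling idea can be illustrated with a minimal sketch. It assumes, which the source does not state, that informativeness is scored by the mean per-pixel predictive entropy of a segmentation model; the function names and the toy probability maps are hypothetical and are not taken from [7].

```python
import math

def pixel_entropy(probs):
    """Shannon entropy (in nats) of one pixel's class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def image_score(prob_map):
    """Mean per-pixel entropy of an image; higher means more uncertain."""
    pixels = [px for row in prob_map for px in row]
    return sum(pixel_entropy(px) for px in pixels) / len(pixels)

def select_most_informative(prob_maps, k):
    """Return the indices of the k images with the highest uncertainty score."""
    ranked = sorted(range(len(prob_maps)),
                    key=lambda i: image_score(prob_maps[i]),
                    reverse=True)
    return ranked[:k]

# Toy pool: three 1x2 "images" with two classes per pixel (hypothetical values).
pool = [
    [[[0.99, 0.01], [0.98, 0.02]]],  # confident predictions
    [[[0.50, 0.50], [0.60, 0.40]]],  # very uncertain predictions
    [[[0.90, 0.10], [0.70, 0.30]]],  # moderately uncertain predictions
]
selected = select_most_informative(pool, k=2)
```

Only the selected images would then be sent for annotation, which is what distinguishes image-level sampling from the patch-level use of active learning mentioned above.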

B. CROSS-VIEW IMAGE SYNTHESIS
View synthesis is a long-standing problem in computer vision [13]. This task is more challenging when views are drastically different, as in aerial to ground view synthesis, due to little overlap and the existence of occluded objects. The generation of street views from given satellite or aerial images is thus an attractive alternative, since the acquisition of street-view images is rather expensive and regular updates are required to capture changes. Generating street-view images is vital in application areas such as virtual or mixed reality, realistic simulations and gaming, viewpoint interpolation or cross-view matching, exploration of remote places, and strategic ground planning in emergency and intelligence operations. Satellite images, on the other hand, are generally much more widely available than street-view images since they are regularly captured, easier to obtain, and have significantly better earth coverage [36]. Conversely, ground view to aerial view translation can be applied to tasks like image localization in social media, unmanned driving, navigation, and augmented reality [14]. The main difficulty is that the aerial view of an object (e.g., a building) reveals very little about its shape and color in the street view. In other words, two objects that are similar in one view may look quite different in another. As a consequence, the generation process is generally more challenging when the scene contains multiple objects than when it contains a single object on a uniform background, mainly because of underlying sources of variation such as occlusions and shadows [13]. According to [13], some of the challenges of cross-view image synthesis tasks are as follows.
• Information in aerial images is too noisy and less informative compared to street-view images since street-view images contain more details about objects (e.g., houses, roads, trees) than aerial images.
• Aerial images and their corresponding street-view images might differ because of transient objects like cars and passengers.
• Houses that are different in street view look similar in aerial view because of similar rooftops.
• Road edges are often occluded by dense vegetation and contortion in aerial view.
While single street-view image synthesis has recently been investigated [13], [38], panoramic view generation methods are better suited to continuous viewpoint synthesis around a given location, since they are built upon geometrically consistent image sequences with constraints on the correspondence between frame pixels [36]. Table 2 summarizes the most important research in the field of cross-view image synthesis.

C. MAP SYNTHESIS
Map generation is a very expensive and time-consuming process with commercial value to companies in multiple sectors of the economy [15]. It is also challenging to generate maps quickly and efficiently for emergency rescue operations after events such as earthquakes, fire disasters, or tsunamis [44]. In this section, we review the most important research in this area, as tabulated in Table 3. pix2pix [40] was the first type of conditional GAN used for map synthesis given paired training data. However, obtaining paired training data can be costly and cumbersome, especially for map synthesis. To address this problem, Zhu et al. [41] proposed CycleGAN, which can learn to translate between domains without paired input-output examples. While CycleGAN relies on the L1 norm to calculate the cycle-consistency loss, Ganguli et al. [15] observed that the L2 norm performs better for map synthesis. They also proposed GeoGAN, a model that takes a satellite image with a specified zoom level and resolution as input and synthesizes the corresponding human-readable map for that location. However, the conversion becomes more challenging when some objects are hardly visible in the satellite images (e.g., an underpass, or a route with a similar color to its environment). Therefore, a purely image-based GAN framework is not sufficient for this specific satellite-to-map image conversion task. To overcome the above obstacles, Zhang et al. [42] proposed an enhanced GAN model that generates improved-quality map images using GPS coordinates as additional knowledge. In addition to the circularity constraint, Song et al. [44] integrated a geometrical consistency constraint into the whole architecture to reduce the translation's semantic distortions. They proposed a novel unsupervised domain mapping framework called MapGen-GAN for the quick transformation of remote sensing images to maps in emergency response scenarios. Andrade and Fernandes [43] used CGANs to convert historical maps into satellite view images.
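For reference, the L1 cycle-consistency term discussed above can be written as follows. The symbols are assumptions introduced here for illustration: $G$ maps satellite images $x$ to maps, and $F$ maps maps $y$ back to satellite images.

```latex
\mathcal{L}_{\mathrm{cyc}}(G, F) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\lVert F(G(x)) - x \rVert_1\big]
  + \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\big[\lVert G(F(y)) - y \rVert_1\big]
```

The L2 variant reported in [15] would replace $\lVert\cdot\rVert_1$ with $\lVert\cdot\rVert_2$ in both terms; whether the norm is squared there is not stated in this survey, so that detail is left open.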

D. IMAGE ENHANCEMENT
Many object detection methods obtain high detection scores on high-resolution aerial images. However, in some cases, objects like airplanes or vehicles may be "weak", a term meaning that they are occluded by other objects. As a consequence, details are often lost in the absence of sufficient color information [18]. Li et al. [17] used GAN-based image-level domain adaptation to transfer the style of the target image to a new space with a similar distribution to the source image space in order to improve the detection scores. Table 4 shows a summary of research related to satellite object detection. Gao et al. [18] proposed a detection-guided CycleGAN (DE-CycleGAN) to enhance weak targets for accurate vehicle detection in the absence of paired images. Although Enhanced Super-Resolution GAN (ESRGAN) [53] showed remarkable image enhancement performance, its reconstructed images usually miss high-frequency edge information. To address this problem, Rabbi et al. [48] proposed a new edge-enhanced super-resolution GAN (EESRGAN) together with a detector network whose loss is back-propagated into the EESRGAN in an end-to-end manner to improve the detection performance.

E. CLOUD REMOVAL
Clouds can degrade the quality of satellite remote-sensing images. With the prevalence of deep learning techniques in recent years, a variety of techniques have been tried for cloud removal. Image restoration, or the removal of certain objects such as rain and snow, has been widely applied before. Compared to other deep learning approaches, generative modeling has proven to be a more effective method for recovering missing information based on a learned distribution. This section reviews the GAN-based cloud removal research, as tabulated in Table 5. In the first research in this direction, Enomoto et al. [50] extended the input channels of CGANs to be compatible with multispectral images in order to remove clouds from visible-light RGB satellite images. However, they rely on a single cloudy image instead of multiple cloudy images taken at different times. As a consequence, generated images often lack detail and specificity in partially occluded cloudy regions. To overcome this issue, Uzkent et al. [51] collected two new paired datasets from publicly available Sentinel-2 satellite images [54], one of which includes a cloud-free image of a given location together with corresponding cloudy images.

F. CHANGE DETECTION AND PREDICTION
Computational surveillance of satellite images refers to change detection or change prediction over the same geographical area at different periods. According to [56], it is widely used in disaster assessment [63], environmental monitoring [64], and urban expansion [65], among other applications. For example, Boulila et al. [61] evaluated the performance of pix2pix and Dual-GAN in predicting urban expansion in the three largest cities in Saudi Arabia. Supervised and unsupervised techniques are both commonly used in change detection. Supervised methods usually cast change detection as a classification problem that assigns each pixel to one of two classes, while unsupervised methods usually identify changes via thresholding or clustering. Although it is expensive to obtain large amounts of annotated data, the quality of synthesized change maps is higher in supervised approaches. High-level features are crucial for extracting a semantic change map that is invariant to distracting factors such as noise and scale variations. Table 6 summarizes the research in this area.
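The thresholding route mentioned above can be sketched in a few lines. This is a generic illustration on raw pixel differences, not a method from any of the surveyed papers; the images, the threshold value, and the function names are hypothetical, and the two images are assumed to be co-registered grayscale arrays of the same size.

```python
def difference_image(img_t1, img_t2):
    """Absolute per-pixel difference of two co-registered grayscale images."""
    return [[abs(a - b) for a, b in zip(row1, row2)]
            for row1, row2 in zip(img_t1, img_t2)]

def threshold_change_map(diff, threshold):
    """Binary change map: 1 where the difference exceeds the threshold, else 0."""
    return [[1 if d > threshold else 0 for d in row] for row in diff]

# Two toy 3x3 acquisitions of the same area at different times.
img_t1 = [[10, 12, 11], [10, 200, 13], [9, 11, 10]]
img_t2 = [[11, 12, 10], [90, 205, 12], [9, 80, 10]]

diff = difference_image(img_t1, img_t2)
change_map = threshold_change_map(diff, threshold=30)
```

Working directly on pixel differences is sensitive to noise and illumination, which is one reason the text stresses high-level features for semantic change maps.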

G. CLASSIFICATION
Classification is one of the pillars of ML-based surveillance tasks. Perez et al. [55] trained a WGAN [66] for semi-supervised poverty prediction given a limited set of labeled data. In this setting, the task of the discriminator is not only to distinguish fake instances from real ones but also to predict the correct class of unlabeled training data.
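One common way to give a discriminator this double role is to let it output K + 1 scores, K for the real classes and one extra for "fake". The sketch below illustrates that generic formulation only; it is an assumption made here and is not necessarily how [55] combines classification with the WGAN objective. The logits and function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def discriminator_decision(logits):
    """Interpret K+1 logits: the first K are real classes, the last is 'fake'."""
    probs = softmax(logits)
    p_fake = probs[-1]
    class_probs = probs[:-1]
    predicted_class = max(range(len(class_probs)), key=class_probs.__getitem__)
    return {"p_real": 1.0 - p_fake, "predicted_class": predicted_class}

# Three real classes plus one 'fake' logit (hypothetical values).
decision = discriminator_decision([2.0, 0.5, -1.0, 0.1])
```

With such a head, labeled samples train the class outputs, while unlabeled and generated samples only constrain the real-versus-fake split.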

H. CHANGE MAP EXTRACTION
Change map extraction is the most frequently used functionality in GAN-based surveillance of satellite images. Traditional GANs learn a mapping from a random noise vector $z \sim p_z(z)$ to the output image. Given two input images $X_{t_1}$ and $X_{t_2}$ and an input noise variable $z$, the change detection GAN (CDGAN) can be modeled as inferring the change map from the joint distribution $p(X_{t_1}, X_{t_2}, z)$. It can be rewritten as a conditional density estimation model $p(CM \mid X_{t_1}, X_{t_2}, z)$, where $CM$ is the inferred change map. Hou et al. [56] employed CGANs with W-Net as the generator for change detection in satellite images. However, change detection based on generative models that see only the pixel distributions of the input images proved to be vulnerable to noise and temporary occlusions like clouds. To tackle this problem, Gong et al. [67] used Convolutional Neural Networks to transform the input images into feature images. Afterward, they applied CGANs to the feature images to infer the change map.
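Under this conditional view, a standard conditional GAN objective would take the form below. This is a sketch that assumes reference change maps $CM$ are available for training and that the discriminator $D$ sees the image pair together with a change map; the exact losses used in [56] and [67] may differ.

```latex
\min_G \max_D \;
  \mathbb{E}_{X_{t_1}, X_{t_2}, CM}\big[\log D(X_{t_1}, X_{t_2}, CM)\big]
  + \mathbb{E}_{X_{t_1}, X_{t_2}, z}\big[\log\big(1 - D(X_{t_1}, X_{t_2}, G(X_{t_1}, X_{t_2}, z))\big)\big]
```

Here $G(X_{t_1}, X_{t_2}, z)$ plays the role of a sample from $p(CM \mid X_{t_1}, X_{t_2}, z)$.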

I. OUTLIER DETECTION
GAN-based anomaly detection has proven to be successful even in high-dimensional non-image data [68]. In the field of satellite imaging, an outlier detector GAN was originally proposed by [62] to address the change detection problem. In this research, Jian et al. [62] used a generator to synthesize unchanged data, while the discriminator was responsible for distinguishing between changed and unchanged data.

J. TIME SERIES IMAGE SYNTHESIS
Generating a sequence of images is crucial to predicting the weather by synthesizing sequences of cloud images. Xu et al. [57] proposed a Generative Adversarial Network-Long Short-Term Memory (GAN-LSTM) model for satellite image prediction that combines the generative ability of the GAN with the forecasting ability of the LSTM network. Their proposed training process is divided into two steps. First, a GAN model is trained on the real satellite image data. Second, the parameters of the generator are kept fixed, and an LSTM network is attached and trained to produce the generator inputs. The LSTM network tries to capture the implicit features that contain the evolution information about the clouds. Finally, the generator translates the implicit features received from the trained LSTM network into the final predictions.

K. SPATIAL ATTENTION MAP GENERATION
Spatial Attention Maps are used to learn and focus more on the important information rather than learning non-useful background information. SpA-GAN is a type of GAN in which the generative network is a Spatial Attentive Network (SPANet) [19]. SPANet is used to discover and extract attention maps from the input feature maps. The output attention map is an image in which each pixel value indicates the importance of that pixel and how much attention should be allocated to the pixel. The larger the value, the more attention should be given. It indicates the spatial distribution of the target objects, which can guide the subsequent steps for the removal or extraction of that object.
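How a per-pixel attention map can steer later processing is shown in the minimal sketch below. It assumes that the map is applied as an element-wise weight on a same-sized feature map, which is one common use of attention and not necessarily the exact mechanism inside SPANet; the values and the function name are hypothetical.

```python
def apply_attention(feature_map, attention_map):
    """Element-wise product of a feature map and a same-sized attention map."""
    return [[f * a for f, a in zip(f_row, a_row)]
            for f_row, a_row in zip(feature_map, attention_map)]

# Toy 2x2 feature map and attention weights in [0, 1].
features = [[4.0, 2.0], [6.0, 8.0]]
attention = [[0.0, 0.5], [1.0, 0.25]]

attended = apply_attention(features, attention)
```

Pixels with weight 0 are suppressed entirely, while pixels with weight 1 pass through unchanged, so larger attention values leave more of the signal for the following removal or extraction step.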

L. OBJECT ENHANCEMENT
The goal of image enhancement is to improve the performance of weak object detection by enhancing image quality. This is addressed via two different methodologies [18]: (i) image preprocessing, such as denoising [69], image sharpening [70], and histogram equalization; and (ii) supervised methods that provide external information, such as super-resolution (SR) [71], high dynamic range (HDR) [72], and salience enhancement [73]. In one of the rare GAN-based works in this area, Gao et al. [18] used an enhanced CycleGAN called DE-CycleGAN for vehicular image enhancement by image-to-image translation.

M. OBJECT AUGMENTATION
The goal of object augmentation is to insert a particular object into an already available satellite image. The advantage of object augmentation is that we can augment a limited number of objects to different positions of many available satellite images. Martinson et al. [49] obtained a range of 3D models for three different categories of objects. Blender3D was next used to generate images from 3D models that contain a specified viewing angle, lighting condition, and shadow. Finally, the acquired object was merged with a designated satellite image via CycleGAN. Table 7 shows the list of GANs adopted to be used for each functionality.

V. GENERATOR ARCHITECTURES
In this section, we review different deep neural network architectures used as generators in satellite imaging tasks.

A. U-NET
U-Net [74], [75] is one of the well-known deep neural network architectures, with a symmetrical structure that resembles the letter U. U-Net includes iterative 3 × 3 convolution and max-pooling layers, followed by copy and crop operations on the output of the last layer. This is considered one of the factors making U-Net so successful in semantic segmentation. In this regard, skip connections play a very important role in U-Nets by passing information from the down-sampling blocks to the corresponding up-sampling blocks. In the GAN context, U-Nets have been used in both discriminator [76] and generator [40] architectures. Unlike in traditional GAN models, the source of randomness in U-Net-based generators does not come from a latent space. Instead, dropout layers are used as a source of randomness during both training and inference. In the object augmentation architecture proposed by [28], ship objects are randomly selected from an existing library (extracted from the training set). They are then merged with background and land, and finally the U-Net generator is used as a translator to synthesize a realistic image. According to [77], U-Nets have two limitations: (i) finding the optimal network depth requires an extensive architecture search, and (ii) aggregated features from the skip connections have fixed semantic scales at the decoder sub-networks, leading to an inflexible feature fusion scheme.
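The "copy and crop" step can be made concrete by tracing spatial sizes through the network. The sketch below assumes the configuration of the original U-Net design: two unpadded 3 × 3 convolutions per level, 2 × 2 max-pooling, and 2 × 2 up-convolutions over four levels. GAN generators based on U-Net often use padded convolutions instead, in which case no cropping is needed; the function name is hypothetical.

```python
def unet_sizes(input_size, depth=4):
    """Trace spatial sizes through an unpadded U-Net with `depth` levels.

    Returns the output size and, per decoder level, a pair
    (skip_feature_size, size_it_is_cropped_to).
    """
    size = input_size
    skips = []
    for _ in range(depth):
        size -= 4                  # two unpadded 3x3 convs remove 2 pixels each
        skips.append(size)         # these features are copied to the decoder
        if size % 2:
            raise ValueError("size must be even before 2x2 max-pooling")
        size //= 2                 # 2x2 max-pooling halves the size
    size -= 4                      # two convolutions at the bottleneck
    crops = []
    for skip in reversed(skips):
        size *= 2                  # 2x2 up-convolution doubles the size
        crops.append((skip, size)) # skip features are center-cropped to `size`
        size -= 4                  # two unpadded 3x3 convs after concatenation
    return size, crops

output_size, crops = unet_sizes(572)
```

For a 572 × 572 input this gives a 388 × 388 output, and each skip feature map must be cropped (for example from 64 to 56 at the deepest level) before it can be concatenated with the up-sampled features.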

B. W-NET
Although U-Net has proved successful in supervised semantic segmentation, sufficient pixel-level labels are difficult to collect in many applications. To address this problem, W-Net [78] was designed for unsupervised change detection. W-Net is a dual-branch network that accepts two images as input and extracts features from them independently. It then generates change maps from the extracted difference features. The W-Net architecture consists of two concatenated U-Net architectures. The first U-Net acts as an encoder that outputs the image's segmentation, and the second is a decoder that reconstructs the image from this segmentation. In the W-Net proposed by [56] for change detection in satellite images, the encoder contains four convolutional blocks, each consisting of one traditional convolutional layer with stride 1 and one strided convolutional layer with stride 2. The decoder contains four deconvolutional blocks, each consisting of one deconvolutional layer with stride 1 and one deconvolutional layer with stride 2. Compared to a single-branch architecture, the W-Net architecture can realize the feature expression of each image and reduce information loss.

C. MUNet
MUNet [79] is a modified version of U-Net used by Abdollahi et al. [27] for road extraction. MUNet includes two corresponding arms, a contracting (downsampling) encoder and an expanding (upsampling) decoder, with skip connections that append every upsampled feature map at the decoder to the corresponding encoder feature map of the same spatial resolution. Compared to U-Net, the changes in the MUNet architecture include the introduction of batch normalization, the use of the ReLU activation function in the decoder but Leaky ReLU in the encoder, and the elimination of the pooling layer. The proposed MUNet requires neither long computation time nor a large training dataset. However, it is sensitive to occlusion by trees and shadows, which leads to false negative pixel predictions in the road extraction task.

D. LSTM
LSTM [80] is a type of recurrent neural network proven successful for learning long-range dependencies in a sequence of instances. The building block of LSTM is the cell state, denoted by C_t, which saves the state information. C_t is controlled by three gates: (i) f_t to decide when to forget C_t, (ii) i_t to decide when to keep or override C_t, and (iii) o_t to decide whether the latest cell output C_t will be propagated to the final state h_t. In GAN-LSTM [57], where the generator of the GAN is attached to the output of the LSTM network, the spatiotemporal relationship can be learned without complex atmospheric modeling. In the model proposed by Xu et al. [57], h_t represents the evolutionary information derived from the sequence of satellite images. Their model combines the generating ability of GAN with the ability of LSTM [81] to extract the evolutionary information from the image sequences in order to extrapolate the cloud's motion.
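The gating logic described above can be sketched as a single LSTM step. This is a scalar toy version written under stated assumptions: the standard LSTM update equations are used, the candidate value `g_t` and all weights in `weights` are illustrative, and nothing here reproduces the configuration of [57], which operates on image sequences rather than scalars.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, w):
    """One scalar LSTM step; `w` maps a gate name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = w[name]
        return w_x * x_t + w_h * h_prev + b

    f_t = sigmoid(pre("f"))    # forget gate: how much of the old cell state to keep
    i_t = sigmoid(pre("i"))    # input gate: how much of the candidate to write
    o_t = sigmoid(pre("o"))    # output gate: how much of the cell state to expose
    g_t = math.tanh(pre("g"))  # candidate cell value
    c_t = f_t * c_prev + i_t * g_t
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t

# Hypothetical weights and a short input sequence (assumed values).
weights = {"f": (0.5, 0.1, 0.0), "i": (0.6, -0.2, 0.0),
           "o": (0.4, 0.3, 0.0), "g": (0.9, 0.2, 0.0)}
h, c = 0.0, 0.0
for x in [0.1, 0.4, 0.8]:
    h, c = lstm_step(x, h, c, weights)
```

After the loop, `h` plays the role of h_t, the summary of the sequence that a downstream generator could consume.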

E. BI-DIRECTIONAL CONVOLUTIONAL LSTM
Bi-directional convolutional LSTM (Bi-CLSTM) [82] is a sequence processing model that consists of two LSTMs: one taking the input in a forward direction and the other in a backward direction [83]. Bi-CLSTM was used by Abdollahi et al. [25] to obtain a noise-free segmentation map with many details that describe the boundary information. Their method uses SegNet [31] with Bi-CLSTM as an encoder-decoder model for the generator part of the proposed GAN. The role of the Bi-CLSTM is to mix the encoded and decoded features rather than using a simple concatenation. As a result, the obtained feature maps contain both semantic and local information.

F. SPATIAL ATTENTION NET
Spatial Attention Net (SpA-Net) [84] is used to extract a spatial attention map, which helps to focus on the most informative part of the image. Heng Pan [19] proposed SpA-GAN for cloud removal, in which the generator is a SpA-Net. In the proposed generator, the input image first passes through three standard residual blocks to extract features, then through four spatial attentive blocks (SAB) to identify clouds progressively in four stages, and finally through two residual blocks to reconstruct a clean background. The proposed generator was tested on removing thin clouds from a single optical remote sensing image. However, there are two main challenges, according to [85]: (i) the traditional SpA-Net cannot consider the remote relationship between image blocks, and (ii) it fails to preserve the details of the original image in the face of serious occlusion caused by thick clouds.

G. BRANCHED ResNet
Branched ResNet was proposed by [51] as the generator of spatiotemporal generator networks (STGAN) to extract features from multiple images taken at the same location and generate a single cloud-free image. The proposed Branched ResNet is based on individual encoder-decoder pipelines. In this architecture, two convolutional layers with stride 2 are employed to downsample the feature map. These are followed by nine residual convolutional layers, and then by two convolutional layers with stride 1/2 (fractionally strided convolutions) to upsample the feature map. The encoder-decoder architecture was inspired by [86] and was previously shown to perform well in CycleGAN [87].
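The effect of this layer sequence on the feature-map resolution can be traced with a short sketch. It assumes padding chosen so that each stride-2 layer exactly halves the spatial size and each stride-1/2 layer exactly doubles it, and an input size divisible by 4; the function name `trace_spatial_size` and the 256-pixel input are illustrative and not taken from [51].

```python
def trace_spatial_size(size, n_down=2, n_residual=9, n_up=2):
    """Return (stage, size) pairs for one encoder-decoder pipeline."""
    trace = [("input", size)]
    for k in range(n_down):
        size //= 2  # stride-2 convolution halves the resolution
        trace.append((f"down{k + 1}", size))
    for k in range(n_residual):
        trace.append((f"res{k + 1}", size))  # residual layers keep the resolution
    for k in range(n_up):
        size *= 2  # stride-1/2 (fractionally strided) convolution doubles it
        trace.append((f"up{k + 1}", size))
    return trace

trace = trace_spatial_size(256)
```

Under these assumptions, the nine residual layers all operate at one quarter of the input resolution, and the output returns to the input size.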

H. EDGE ENHANCED NET
Due to the noisy nature of some aerial scenes, image enhancement is crucial for object detection in satellite images. One of the deep architectures used as a generator is the Edge-Enhanced Network (EEN) [88]. The EEN removes noise and enhances the edges extracted from an image. First, a Laplacian operator [89] is used to extract edges from the input image. Afterward, the edge information is enhanced by passing it through convolutional blocks, residual-in-residual dense blocks (RRDB), and upsampling blocks. A mask branch with sigmoid activation removes edge noise, as described in [90]. Finally, the edges extracted by the Laplacian operator are subtracted from the input image and the enhanced edges are added in their place [48]. Table 8 tabulates the deep neural networks used as generators in different satellite imaging functionalities.
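The edge extraction and recombination steps of the EEN can be sketched in pure Python. The 4-neighbour Laplacian kernel is an assumed choice, since the kernel is not specified here, and the fixed gain used for `enhanced` is a hypothetical stand-in for the learned RRDB and mask branches of [88]; only the subtract-then-add recombination follows the description above.

```python
# 4-neighbour Laplacian kernel (an assumed choice; the source does not name one).
LAPLACIAN = [[0, 1, 0],
             [1, -4, 1],
             [0, 1, 0]]

def laplacian_edges(image):
    """Apply the kernel to a 2D list; border pixels are left at zero."""
    rows, cols = len(image), len(image[0])
    edges = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            edges[r][c] = float(sum(
                LAPLACIAN[i][j] * image[r + i - 1][c + j - 1]
                for i in range(3) for j in range(3)
            ))
    return edges

def recombine(image, edges, enhanced_edges):
    """Subtract the raw Laplacian edges from the input and add the enhanced ones."""
    return [
        [p - e + q for p, e, q in zip(img_row, edge_row, enh_row)]
        for img_row, edge_row, enh_row in zip(image, edges, enhanced_edges)
    ]

# Toy 5x5 image with a single bright pixel (assumed data).
image = [[0.0] * 5 for _ in range(5)]
image[2][2] = 1.0
edges = laplacian_edges(image)
# Hypothetical stand-in for the learned enhancement branch: a fixed gain.
enhanced = [[2.0 * e for e in row] for row in edges]
output = recombine(image, edges, enhanced)
```

If the enhancement branch returned the edges unchanged, `recombine` would reproduce the input image, which makes the role of the learned branch explicit.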

VI. SUMMARY
In this section, we summarize the covered research with a focus on challenges and statistics.

A. CHALLENGES
In this section, we review the most important challenges in GAN-based satellite imaging.

1) OCCLUSION PROBLEM
Weak object detection is a significant challenge in satellite imaging due to the lack of sufficient color information masked by barriers such as vegetation cover, shadows, overlapping, and interlacing sheltering [25]. In addition, complicated backgrounds and occlusions are created by the spatial and spectral overlapping of roads with other regions such as parking lots and buildings [27]. Occlusion often leads to poor unsupervised segmentation, and outlier detection is used to mitigate the consequences.

2) GENERALIZATION PROBLEM
Training segmentation models is a challenge in satellite imaging because rooftops look similar to other objects such as cars. Building shape, building material, and surrounding land cover vary widely from city to city. As a consequence, transferring models between cities may result in significant inaccuracy. For this reason, no generalizable model yet exists that can accurately detect the objects in all satellite images, and a trade-off exists between accuracy and generalization [12]. Image augmentation is regarded as a data-driven solution to this problem.

3) THE CHALLENGE OF HIGH-RESOLUTION IMAGE SYNTHESIS
Synthesizing high-resolution images requires a strong pixel-level mask which is very hard to collect in satellite images [3]. In the absence of such accurate masks [12], GAN-generated satellite images are easy to discriminate from real ones. Image enhancement can be helpful, especially in improving the quality of weak objects.

4) THE CHALLENGE OF MULTIMODAL DATA
According to [22], one of the key challenges in satellite imaging is to integrate the information acquired with different spatial resolutions, spectral bands, and imaging modes from sensors mounted on satellites, aircraft, and ground platforms. The final goal is to reach a representation that contains more detailed information than each of the individual sources. Transfer learning is expected to play a key role in this context by extracting the best set of representative features from multimodal data.

5) SPARSE AREAS PROBLEM
Focusing on a limited set of objects of interest leaves the majority of unlabeled areas sparse. As a consequence of a sparse ground-truth mask, unwanted artifacts and blurry objects may appear in unlabeled areas, which is a significant challenge for satellite image synthesis [22]. The characteristics of the ''noise'' in these sparse areas, however, may be a means of asserting the origin of synthetic images. That is, characteristic noise may arise from specific GAN architectures.

6) RARE OBJECT DETECTION CHALLENGE
Rare object detection is another challenge in satellite imaging. For objects that already have well-supported labeled training data, there are many existing methods to train a model. However, in the case of rare objects, it is not easy to build a reliable model [49]. Object augmentation can play a positive role in addressing this problem via embedding the rare objects in different locations of the scene. Figure 5 shows the challenges-consequences-solutions diagram.

B. STATISTICS
In this section, we summarize some facts in GAN-based satellite imaging research:
• Segmentation as a backbone of other applications like object detection is a heavily researched topic, with 23% of total research as shown in Figure 6.
• Change detection has 68% of all the research in the surveillance category.
• Traditional CGANs and CycleGANs are the most frequently used GAN types in satellite imaging.
• Google Maps, UC Merced, CVUSA, and CVACT are among the most frequently used datasets in GAN-based satellite imaging, as shown in Figure 7.
• Unsupervised GANs are the most frequently used GANs, comprising 59% of all research, as shown in Figure 8.

VII. EVALUATION METRICS
In this section, we review the most frequently used metrics to evaluate the GAN-based satellite imaging systems. The description of mentioned metrics is as follows.
• The missed alarm rate (MAR), false alarm rate (FAR), and overall error rate (OER) are quantitative measures that are used in change map extraction [62].
• False negative (FN) denotes the number of pixels wrongly classified as unchanged, false positive (FP) represents the number of pixels wrongly classified as changed, true negative (TN) means the number of pixels correctly classified as unchanged, and true positive (TP) is the number of pixels correctly classified as changed. The kappa metric [67] is usually used for measuring classification performance, with a higher kappa value indicating better performance, and PCC denotes the percentage of correct classifications.
• Two standard metrics used in measuring image similarity and degradation are peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) [52]. PSNR is derived from the mean-squared error (MSE), that is, the average squared difference between corresponding pixels in two images.
• The realism and diversity of the synthesized images are measured by Inception Score (IS) [42], Top-k prediction accuracy, Frechet Inception Distance (FID) score [42] and KL divergence [59].
• The pixel-wise semantic consistency of the synthesized images is measured using mean Intersection-over-Union (mIoU). IoU [22] refers to the number of common pixels (intersection) between the target and prediction masks divided by the union of existing pixels across both masks and is, in effect, a Jaccard similarity.
Table 9 and Table 10 summarize the evaluation metrics used in GAN-based satellite imaging.
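Several of the metrics above can be computed directly from pixel counts and pixel values, as in the following sketch. The formulas for MAR, FAR, OER, PCC, and kappa follow definitions commonly used for change detection and are an assumption here, since individual papers may normalize differently; the function names, the 8-bit peak value of 255, and the flat-list image representation are likewise illustrative.

```python
import math

def change_detection_metrics(tp, tn, fp, fn):
    """Rates computed from pixel counts (assumed, commonly used definitions)."""
    n = tp + tn + fp + fn
    pcc = (tp + tn) / n
    # Expected agreement by chance, used by the kappa coefficient.
    pre = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return {
        "MAR": fn / (tp + fn),    # missed changed pixels / all truly changed pixels
        "FAR": fp / (tn + fp),    # false alarms / all truly unchanged pixels
        "OER": (fp + fn) / n,     # all errors / all pixels
        "PCC": pcc,
        "kappa": (pcc - pre) / (1.0 - pre),
    }

def psnr(image_a, image_b, max_value=255.0):
    """PSNR in dB between two equally sized flat pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(image_a, image_b)) / len(image_a)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

def iou(target, prediction):
    """Intersection-over-Union (Jaccard similarity) of two flat binary masks."""
    intersection = sum(1 for t, p in zip(target, prediction) if t and p)
    union = sum(1 for t, p in zip(target, prediction) if t or p)
    return intersection / union if union else 1.0

metrics = change_detection_metrics(tp=40, tn=50, fp=5, fn=5)
score = iou([1, 1, 0, 0], [1, 0, 1, 0])
```

Averaging `iou` over all classes of a segmentation map gives mIoU.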

VIII. RESULTS ANALYSIS
One of the most significant challenges for meta-analysis of satellite imaging research is its inconsistent nature due to different applications, tools, and datasets. For this reason, it is appropriate to focus on the most relevant experimental results. This section summarizes the most important experimental results in GAN-based satellite imaging. To do so, we compare GANs in different satellite imaging applications as shown in Table 11. In this table, each color represents a different category of research. For example, blue cells represent the comparison between three different GANs used for change detection [61] in terms of SSIM and MSE measures. The facts illustrated in Table 11 can be summarized as follows.
• In most of the cases, conditional GANs show better performance compared to unconditional GANs [56].
• In all of the cases, Pix2pix outperforms traditional CGANs except for ground view image synthesis. This failure is due to the lack of generalization, which causes more artifacts after transforming fundamentally different views. In such cases, when the GAN is too complex, there is less capability for generalization because of susceptibility to over-fitting.
• In the field of cloud removal, SpA-GANs and STGANs show superiority over traditional CGANs and CycleGANs [19].
We also need to evaluate the significance of GANs' impact in different application categories. To do so, we run the paired and unpaired t-tests on the reported results in the literature.
The unpaired t-test is run to investigate whether GANs improve performance compared to non-GAN approaches. For example, the experimental results reported in [42] for satellite-to-map conversion allow us to create distinct result populations on which to run unpaired t-tests. In this research, the segmentation-based methods and GAN-based methods were compared in terms of IS, FID, and SSIM. To evaluate the significance of GANs' impact, we divided the GAN methods into two groups to keep the size of the result populations equal for the t-test. The obtained results show that the two-tailed p-values are less than 0.0001 for IS, FID, and SSIM, which is considered to be extremely statistically significant. Table 12 and Table 13 show the related information on IS, where SD denotes standard deviation and SEM denotes the standard error of the mean, which measures how far the sample mean is likely to be from the true population mean. Furthermore, we evaluated the impact of DE-CycleGAN on weak object detection based on the results reported in Gao et al. [18]. In this research, the performance of weak object detection has been reported with and without DE-CycleGANs. These results let us run a paired t-test to evaluate the significance of the impact of DE-CycleGAN. The paired t-test is run when each experiment has been repeated twice: with and without GANs. The t-test results show that the two-tailed p-value equals 0.0090, which is considered to be very statistically significant by conventional criteria.
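The two tests can be sketched with the Python standard library as follows. The score lists are hypothetical placeholders and are not the values reported in [18] or [42]; the unpaired version assumes equal variances (Student's pooled form), and the sketch returns only the t statistic and degrees of freedom, since turning them into a p-value requires the t distribution (for example via `scipy.stats.ttest_ind` and `scipy.stats.ttest_rel`).

```python
import math
from statistics import mean, stdev, variance

def unpaired_t(group_a, group_b):
    """Student's two-sample t statistic with pooled variance, and its df."""
    na, nb = len(group_a), len(group_b)
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    t = (mean(group_a) - mean(group_b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2

def paired_t(with_gan, without_gan):
    """Paired t statistic on per-experiment differences, and its df."""
    diffs = [x - y for x, y in zip(with_gan, without_gan)]
    sem = stdev(diffs) / math.sqrt(len(diffs))  # standard error of the mean difference
    return mean(diffs) / sem, len(diffs) - 1

# Hypothetical scores for illustration only.
gan_scores = [3.1, 3.4, 3.3, 3.6]
baseline_scores = [2.2, 2.5, 2.4, 2.3]
t_unpaired, df_unpaired = unpaired_t(gan_scores, baseline_scores)

detection_with = [0.81, 0.77, 0.85, 0.79]
detection_without = [0.74, 0.73, 0.80, 0.75]
t_paired, df_paired = paired_t(detection_with, detection_without)
```

The paired form is appropriate only when each entry of the two lists comes from the same experiment run with and without the GAN.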

IX. FUTURE RESEARCH DIRECTIONS
This section presents the potential future research directions in GAN-based satellite imaging.

A. WEAK OBJECT SYNTHESIS
Weak object detection was mentioned as one of the challenges in satellite imaging. Weak object synthesis is even more challenging due to the highly skewed nature of classes in the segmented images containing the weak object. Although object augmentation can be used to address imbalanced regions, more advanced architectures to deal with this challenge are highly anticipated in the future.

B. FORGED IMAGE GENERATION AND DETECTION
Forged image detection has a long history in computer vision [91], [92]. Satellite images often contain sensitive scenes and structures, which may lead to attempts to hide sensitive regions or objects in them. Although research in this area is underway [46], more advanced techniques for GAN-based satellite image manipulation and forgery detection will be required in the future.

C. EVENT GENERATION AND DETECTION
With the emergence of event-based cameras, a revolutionary type of imaging device known as a silicon retina, frame-based algorithms are no longer applicable. New machine learning models are also required to handle event-based data. For example, the Spiking Neural Network (SNN) [93] was proposed to learn discrete events rather than continuous values. SNNs are also considered more biologically realistic since biological neurons use discrete spikes to compute and transmit information [94]. The application of event-based cameras in satellite imaging [95] has already started, and GAN-based event generation and detection are expected to be put into practice in the near future.

X. CONCLUSION
In this paper, we reviewed GAN-based satellite imaging research with a focus on GAN applications and functionalities. First, we categorized the GAN applications in this area in accordance with seven different categories. Afterward, the most important related works in each category were summarized by functionality, GAN type, dataset, and year of research. The same summary was provided for each GAN functionality. Next, we reviewed different deep neural network architectures used as the generator. Finally, we summarized the main challenges in GAN-based satellite imaging. Among the major trends observed for the application of GANs to satellite images, there are some meta-trends observed that will impact the continued adoption of GANs in the near future. The first is the use of GANs for deep fakes, which is already a concern because of the ability to continually ''tune'' GANs to create a greater resemblance between the target and output images. Deep fakes of satellite images will be especially concerning where they are used in order to deceive a real-world adversary, such as in warfare. They could also be used by terrorists or other rogue organizations in order to interfere with disaster response, among other scenarios. The second major trend of note is that, as for other AI toolsets, GANs are increasingly being componentized and modularized in order to work with larger machine intelligence/computer vision systems. Examples of this include the hybridization of GAN and GPS data for map translation and the use of GAN and LSTM to extrapolate cloud motion. The third trend of note, indicated by the combination of GAN and W-Net technologies, is the possibility of discovering differences in background noise by using the different features between multiple images. 
Such approaches could be used as a relatively reliable means of discovering when a deep fake has occurred since noise differences may persist even when functional aspects (object recognition, accurate map translation, etc.) of the GAN output images are equivalent to target images. We expect to see all three of these areas of research continue to garner increased attention in the near future.