Spectral–Temporal Fusion of Satellite Images via an End-to-End Two-Stream Attention With an Effective Reconstruction Network

Due to technical and budget constraints on current optical satellites, the acquisition of satellite images with the best resolutions in all aspects is not feasible. In this article, aiming to produce products with high spectral (HS) and temporal resolutions, we introduce a two-stream spectral–temporal fusion technique based on the attention mechanism, called STA-Net. STA-Net combines high spectral and low temporal (HSLT) resolution images with low spectral and high temporal (LSHT) resolution images to generate products with the best characteristics of both. The proposed technique involves two stages. In the first, two fused images are generated by a two-stream architecture based on residual attention blocks. The temporal difference estimator stream estimates the temporal difference between HS images at the desired and neighboring dates. The reflectance difference estimator, the second stream, predicts the reflectance difference between the input images (HS–LS) to map LS images into HS products. In the second stage, a reconstruction network combines the two-stream outputs via an effective learnable weighted-sum strategy. The two-stage model is trained in an end-to-end fashion using an effective loss function to ensure the best fusion quality. To the best of our knowledge, this work represents the first attempt to address spectral–temporal fusion using an end-to-end deep neural network model. Experimental results on two real datasets of Sentinel-2 (HSLT: 10 spectral bands and a long revisit period) and Planetscope (LSHT: four spectral bands and daily images) images prove the effectiveness of the proposed technique with respect to baseline techniques.


I. INTRODUCTION
Thanks to the increased demand for satellite images with higher resolution, current spaceborne sensors benefit from recent technological progress, enabling the acquisition of a wide range of data with different properties in terms of spatial, spectral, temporal, and radiometric resolutions. Interpretation and analysis of such data have received increasing attention from the remote sensing (RS) community. In particular, temporal image series play a significant role in monitoring land surface dynamics over time for various applications, including vegetation monitoring and the detection and monitoring of land-cover changes. Unfortunately, despite the substantial technological progress in optical satellites, the capture of satellite images with the best characteristics in all aspects is not yet feasible due to technical constraints and budget limitations. Researchers have proposed powerful image fusion algorithms to combine satellite images with different characteristics into one product [1]. RS image fusion is an effective method aimed at merging one or multiple satellite data sources to generate a single product with better interpretability. RS image fusion techniques are in continuous evolution, thanks to growing demands from leading companies, such as Google Earth and Microsoft Virtual Earth [2], which aim to enhance the resolution of their commercial products, a process that can be achieved by effective fusion techniques. Initially, fusion techniques proposed a solution to enhance the spatial resolution of satellite images or to combine multimodal data. Over the past years, the aim of these methods has expanded to include more challenging fusion problems, such as the fusion of images with different but complementary spatial and temporal resolutions [3], [4].
For instance, on the one hand, satellites such as Sentinel-2, IKONOS, and Landsat produce images with spatial resolutions varying from 3 to 30 m, which is recommended for dynamic monitoring [5], change detection [6], and land-cover mapping applications [7]. However, the observations of these kinds of satellites over a specific area are characterized by relatively long revisit cycles (Sentinel-2: 5 days; Landsat: 16 days). Besides, this period can increase due to cloud coverage or poor atmospheric conditions. This coarse temporal resolution limits their application for monitoring and detecting rapid change, in particular for monitoring plant health and phenology [8]. On the other hand, satellites such as MODIS and SPOT VEGETATION can capture daily observations but with a coarse spatial resolution ranging between 250 and 1000 m. Such a spatial resolution does not guarantee sufficient spatial detail for monitoring and detecting changes in specific areas of interest. The emergence of CubeSats, especially those designed by Planet Labs, can currently provide daily products at high resolution (3 m) with four bands. Planetscope exploits more than 180 nanosatellites to offer a valuable data source with great promise for dealing with the spatial-temporal constraints of current conventional satellite platforms. Despite these satellites' superiority in the spatial-temporal aspect, the acquired images have a small number of spectral bands with broad bandwidths, which reduces their capabilities in some analyses, such as monitoring vegetation seasonality and the highly sensitive discrimination and detection of ice and snow [9]. Sentinel-2, on the other hand, includes a larger number of narrower bands (13), mainly in the red-edge domain, making it suitable for vegetation monitoring. As a matter of fact, Planetscope and Sentinel-2 have different but complementary properties.
Planetscope has a high spatial-temporal resolution but a low spectral one, while Sentinel-2 has a high spectral resolution but a low temporal one. In recent years, many fusion methods have been introduced, attempting to integrate satellite images from various sensors at different resolutions to produce daily satellite images with the best resolutions. For instance, the generation of daily Sentinel-2 images is fruitful for many applications requiring high spectral and temporal resolutions. One can cite disaster monitoring [10], crop growth dynamics monitoring [11], early-stage anomaly detection [12], and change detection in vegetation areas [13], and, in general, any kind of early crop monitoring practice. Therefore, the production of time series of high spectral resolution data on a daily basis, thanks to Sentinel-2–Planetscope fusion, is crucial, and it is possible only with an effective multisource multitemporal fusion technique.

A. Related Works
The generation of data that simultaneously includes the complementary properties of two kinds of satellite images can offer more informative products suitable for several RS applications, especially for monitoring rapidly changing areas. For that matter, multisource and multitemporal data fusion techniques have emerged to overcome the limits of a single sensor and introduce a possible solution in a cost-effective manner [4]. Over the past years, several works have been proposed to deal with the multisource and multitemporal fusion problem. According to Chen et al. [14], most of these methods can be grouped into three main categories: reconstruction-based techniques, unmixing-based techniques, and learning-based techniques. Regarding the reconstruction-based techniques, the fused synthetic image is generated via a weighted sum involving appropriate filters of spectrally similar neighboring pixels of the input data. The spatial and temporal adaptive reflectance fusion model (STARFM) [15] is the pioneering algorithm. It aimed at combining Landsat and MODIS data to produce daily synthetic Landsat images at 30 m spatial resolution. Since then, several works have been developed to improve STARFM's efficiency [16]. However, this category may lack efficacy in estimating the desired image when a land-cover change occurs, as predicting such a change from similar pixels of the input images remains difficult. The second category, related to the unmixing-based methods, generally includes the following steps: 1) clustering of the available fine resolution images at prior dates; 2) linear spectral unmixing of the pixels of the coarse images; and 3) generation of fused images by substituting the spectral information using the unmixing model at the desired date. Zhukov et al. [17] introduced the first unmixing-based technique to combine multisensor input images captured at different dates.
Based on this work, many approaches were introduced to improve the fusion performance [3], [18], [19]. Besides, Li et al. [20] introduced a time-effective approach to accelerate the fusion process while maintaining satisfactory accuracy, confirming that the approaches in the literature are usually time-consuming, which may be improper for practical applications. Nonetheless, these techniques suffer from large estimation errors in endmember unmixing and from insufficient modeling of the within-class variability of the fine-scale pixels inside a single coarse one [3]. Therefore, they may not be effective in detecting endmember changes within a coarse pixel due to a land-cover change. The third category includes learning-based fusion techniques. They were developed based on machine learning mechanisms, including sparse representation [21], dictionary learning [22], extreme learning [23], and artificial neural networks [24]. This category aims to learn a mapping between prior multisensor and multitemporal image pairs, which is then used to estimate the desired images at the prediction dates. It is worth mentioning that some works [25], [26] proposed an integrated spatio-spectral-temporal fusion framework to combine multisource data with different spatial, spectral, and temporal resolutions. Such a framework generally employs a maximum a posteriori probability criterion to define an inverse fusion problem. However, this approach relies on a model optimization that, in turn, relies on prior knowledge. This process makes the product quality questionable, which can limit its exploitation for practical RS applications.
Over the past few years, deep learning, mainly convolutional neural networks (CNNs), has achieved impressive success in many computer vision applications, including image segmentation, image denoising, and super-resolution (SR). SR aims to increase the spatial resolution of low-resolution images to produce high-resolution images, which is almost the same goal as multisensor fusion. Inspired by the state-of-the-art SR-CNN [27], Dong et al. proposed a two-stage CNN-based approach [28] to learn a complex mapping from MODIS coarse images to Landsat fine images. Liu et al. [29] introduced a CNN-based technique called the spatial-temporal fusion two-stream network (StfNet), which employs residual learning of the difference between available and desired dates using the SR-CNN architecture. Later, Tan et al. [30] introduced an effective generative adversarial network spatiotemporal fusion model, termed GAN-STFM, aiming to limit the model inputs to only one pair of coarse-fine resolution images. It is clear that learning-based approaches, particularly CNN-based ones, have boosted the fusion performance with respect to the traditional fusion methods. However, these methods treat multisensor multitemporal fusion as an SR task, a strategy that involves significant drawbacks from the point of view of fusion quality. For instance, in multisensor fusion, the reconstruction scale generally ranges between 8 and 16, a large gap compared with SR (from 2 to 4). Consequently, these methods cannot effectively extract the texture details required to reconstruct the fused images [31]. Besides, CNN-based methods are borrowed from the pioneering SR-CNN architecture, which is shown to be insufficient for generating enough high-frequency detail due to its shallowness (it includes only three layers) [32]. Moreover, SR-CNN was significantly outperformed by advanced architectures, such as deep residual networks [33] and the attention mechanism [34].
It should be stressed that this kind of approach deals mainly with the spatial-temporal fusion problem, which aims to combine satellite images of high spatial but low temporal resolution, such as Landsat and Sentinel-2, with images of lower spatial but higher temporal resolution, such as MODIS and Sentinel-3, to synthesize high spatial-temporal data. Currently, spatial-temporal fusion represents the main approach to generate daily Sentinel-2 images, as it exploits freely available public satellite data. However, such a fusion category is considered a different and challenging task compared with SR and traditional RS fusion approaches (e.g., pansharpening) due to the following factors.
1) Resolution factor: The scale ratio in spatial-temporal fusion ranges from 8 to 16, which is higher than the resolution ratio in SR and pansharpening (generally from 2 to 4). Such a high ratio can be problematic and leads to lower fusion performance, in particular when a borrowed SR model is explicitly applied to learn the end-to-end mapping.
2) Temporal factor: In spatial-temporal fusion, the inputs are captured at different dates, which makes the problem even more complex, contrary to SR and pansharpening, where the images are acquired at the same time by the different modalities.
3) Spectral factor: Contrary to natural images used in the traditional SR problem, which include only three bands, satellite images may include multiple bands covering different regions of the optical electromagnetic spectrum.
To the best of the authors' knowledge, no work in the literature addresses spectral-temporal satellite image fusion capable of generating products with high spectral resolution on a daily basis.

B. Motivation
To tackle the drawbacks of the spatial-temporal category and benefit from the complementary spectral-temporal relationship between the Planetscope and Sentinel-2 satellites, in this article, we propose the first effort to deal with the spectral-temporal fusion of Planetscope and Sentinel-2 images to produce daily Sentinel-2 products. The proposed technique, called STA-Net, is a novel deep two-stream spectral-temporal fusion network built on a residual attention mechanism and a reconstruction network with a learned weighted-sum strategy. STA-Net mainly aims at integrating high temporal, low spectral resolution Planetscope images and low temporal, high spectral resolution Sentinel-2 images to produce products with both high spectral and high temporal resolution. This process can generate daily Sentinel-2 data with high accuracy. More specifically, this article makes the following contributions.
1) MODIS, which has been widely used in the state of the art, has a high-frequency coverage but a coarse spatial resolution of 250 m, which makes the estimation of Sentinel-2 at 10 m a complex task. In contrast, Planetscope can produce daily images at 3 m resolution, which can ease the fusion process and generate more accurate Sentinel-2 data.
2) Instead of using the basic SR-CNN [27], which is outperformed by deeper CNNs, we adopt a deep CNN to boost the fusion performance, allowing the network to learn more complex structures at multiple levels of abstraction [35].
3) An end-to-end two-stream architecture based on residual attention blocks (RABs) is proposed to extract relevant features from a Sentinel-2 image at a prior date and a Planetscope one at the prediction date, separately. The temporal difference estimator (TDE) focuses on learning the temporal difference, whereas the reflectance difference estimator (RDE) concentrates on learning the reflectance difference between Planetscope and Sentinel-2 images. Next, a reconstruction block is introduced to generate the final Sentinel-2 image in a learned weighted-sum manner.
4) A novel loss is developed to ensure that the estimated output is as close as possible to the target involving the two-stream outputs. It also penalizes bias error in the predicted image to guarantee high spectral quality.
5) The generated Sentinel-2-like data can be exploited in several agricultural contexts for monitoring phenomena that require a high spectral resolution with dense time series.

C. Article Outline
The rest of this article is organized as follows. Section II provides the background of the attention mechanism. Section III presents the proposed spectral-temporal fusion method STA-Net. Section IV describes the considered datasets and gives the results and discussions. Finally, Section V concludes this article.

A. Attention Mechanism
Over the past few years, after its successful application to machine translation [36], the attention mechanism has received great attention from the machine learning community, and it is now considered a vital part of various deep neural network models for several applications, including machine translation [37], speech recognition [38], and computer vision [39]. The intuition behind the attention mechanism can be understood through human biological systems, as the human visual system tends to focus on adequate information while ignoring the irrelevant parts in a way that aids perception [40]. Besides improving performance on several applications, the attention mechanism has been widely used for enhancing the interpretability of neural networks, which are mostly treated as black-box models [41], since it is challenging to interpret precisely how the output is inferred from the input.
1) Channel Attention (CA): CA [42] represents a meaningful application of the attention mechanism in which each feature map is associated with a specific weight that defines its degree of relevance. CA was employed in RS pansharpening [43] to generate high-resolution multispectral images, allowing the network to focus on the pertinent features from the multispectral and panchromatic images. CA can assist the CNN to pay more attention to important features and less to the irrelevant ones, which leads to more effective feature extraction. Let X = [x_1, ..., x_C] be feature maps with C channels of size H × W. A global average pooling operation (H_GP) is first applied to the feature maps to aggregate the spatial information of each channel, which can be calculated as follows:

z_c = H_GP(x_c) = (1 / (H · W)) Σ_{i=1}^{H} Σ_{j=1}^{W} x_c(i, j)

where x_c(i, j) denotes the pixel value at (i, j) in the cth channel x_c. Next, the outputs pass through a gating mechanism that includes two fully connected layers (G_1 and G_2), which can be expressed as follows:

s = f(G_2 δ(G_1 z))

where f(·) and δ(·) denote the sigmoid and ReLU activation functions, respectively. The sigmoid defines the importance degree of each channel of the feature maps by assigning a weight value between 0 and 1. CA is illustrated in Fig. 1.
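As a concrete illustration, the channel-attention gating described above can be sketched in PyTorch; this is a minimal sketch, and the reduction ratio r and layer widths are assumptions, not values from the article:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention (CA) sketch: global average pooling (H_GP) followed
    by a two-layer gating mechanism (G_1, G_2). The reduction ratio r is an
    assumed hyperparameter."""
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                   # H_GP: aggregate H x W per channel
        self.g1 = nn.Conv2d(channels, channels // r, 1)      # G_1 (1x1 conv acts as FC)
        self.g2 = nn.Conv2d(channels // r, channels, 1)      # G_2
        self.relu = nn.ReLU(inplace=True)                    # delta
        self.sigmoid = nn.Sigmoid()                          # f: weights in (0, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.gap(x)                                      # (N, C, 1, 1)
        s = self.sigmoid(self.g2(self.relu(self.g1(z))))     # per-channel weights
        return x * s                                         # rescale each channel
```

The 1 × 1 convolutions play the role of the two fully connected layers while keeping the tensor layout convolutional.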

III. PROPOSED METHOD
In this work, Sentinel-2 and Planetscope images are considered to validate the proposed approach. Let S be a Sentinel-2 image of b_s bands and P be a Planetscope image of b_p bands, both captured over the same geographic region. We propose a spectral-temporal fusion technique aiming to estimate a Sentinel-2 image (S_t) captured at time t from an associated Planetscope image P_t captured at the same time and a pair of Sentinel-Planet images (S_{t-1} and P_{t-1}) captured at a prior date t − 1. As a result, we generate products with a high spectral resolution and frequent coverage. It should be noted that Sentinel-2 and Planetscope have different resolutions. Besides the spatial difference, there is also a spectral difference, not only in the number of bands but also in the spectral wavelength range, even for the similar overlapping bands (i.e., RGB and NIR). From a theoretical perspective, the fusion process is expected to be easier for the similar bands than for the nonoverlapping ones, which are supposed to be more challenging due to the difference in both spatial and spectral properties. Indeed, the proposed method and the integrated spatial-spectral-temporal framework [25] are similar in nature, as both can produce images with high spectral and temporal resolutions. However, the former has two significant characteristics that make it different from the latter. First, the integrated framework requires three different kinds of data with complementary spatial, spectral, and temporal properties, whereas the proposed method needs only two modalities with complementary spectral and temporal resolutions, which makes it easier and more suitable for real-life applications. Second, the integrated framework needs the definition of a complex spatial-temporal-spectral relationship for the different input data, as it is based on the maximum a posteriori probability criterion.
However, since such prior knowledge is not always available, the proposed approach, being deep-learning-based, does not require establishing such a complicated relationship, as it learns it automatically.
Unlike most learning-based techniques, the proposed two-stream spectral-temporal technique based on the attention mechanism, referred to as STA-Net, requires only one pair of images at a prior date rather than two pairs at prior and posterior dates [28], [29]. This makes the proposed approach more suitable for generating fused products without waiting for a posterior date, in particular for estimating a Sentinel-2 image at the current date. To make the most of the available information, STA-Net predicts the unknown Sentinel-2 image in a two-stream manner involving two stages. On the one hand, the first stream estimates S_t by learning the unavailable temporal changes between S_t and S_{t-1}. On the other hand, the second one estimates S_t by learning the unknown difference between S_t and P_t to map the Planetscope image into a Sentinel-2 product. Next, a reconstruction network integrates the two-stream outputs to produce the final fused product via a learned weighted sum. The general flowchart of STA-Net is illustrated in Fig. 2.

A. First Stage
Two-stream architectures have been successfully applied to several tasks [44], [45], including image fusion [46], [47], in which two kinds of information with different and complementary properties are available. S_{t-1} and P_t contain different and complementary information, as they were acquired by different sensors and have different spatial, spectral, and temporal resolutions; hence, it is possible to produce Sentinel-2-like images by learning a complex mapping using an appropriate CNN. Inspired by this strategy, we introduce a two-stream CNN based on the attention mechanism. The two streams have the same objective, i.e., producing Sentinel-2-like images, but they use different concepts.
1) First Stream: TDE: TDE aims to generate the first intermediate fused product (I_1) by learning a complex mapping (φ_1) of the temporal changes. Taking only P_t and S_{t-1} as inputs would not make full use of the available information, since P_{t-1} is also available and, combined with P_t, can assist the CNN in learning valuable features. This stream therefore takes two inputs: S_{t-1} and P_t − P_{t-1} (denoted D_T^{t,t-1}), which are concatenated to act as a single input. Since S_t and S_{t-1} can be highly correlated, residual learning is employed to learn only the residual difference between S_t and S_{t-1}. This difference represents the temporal change within the study area, which is added to S_{t-1} to reconstruct the first intermediate fused image I_1. Residual learning is proven to improve accuracy and ease training [48] compared with traditionally stacked convolutional layers. Besides, it provides more interpretability to the fusion algorithm. The image generated by the first stream can be summarized by the following formula:

I_1 = S_{t-1} + I_{D_T}, with I_{D_T} = φ_1(S_{t-1}, D_T^{t,t-1}; θ_1)

where I_{D_T} indicates the temporal difference image that needs to be added to S_{t-1} to produce I_1, and θ_1 denotes the network parameters to be trained.
2) Second Stream: RDE: Most works predict the desired image using a mapping into S_t or by estimating the difference image that needs to be injected into S_{t-1}. This strategy alone may not lead to effective performance, especially when considerable changes occur within the area. In our work, assuming that Sentinel-2 and Planetscope images are captured within the same region but have different reflectance responses, as they are acquired by different sensors, the second stream, called RDE, aims to reconstruct a Sentinel-2-like image (I_2) by learning a complex mapping (φ_2) from P_t. In other words, this network learns the difference between Planetscope and Sentinel-2 images with the aim of transforming the Planetscope image into a Sentinel-2-like image. However, such a strategy cannot be applied directly due to the difference in the number of bands between the two constellations, in particular for the additional red-edge bands of Sentinel-2 that are unavailable in Planetscope products. Therefore, aiming to adjust the equivalent spectral bands, these Sentinel-2 bands are estimated using the closest Planetscope bands in terms of root mean squared error (RMSE), forming eight-band versions of the Planetscope images (P̂_t and P̂_{t-1}). The Sentinel-2 bands B2, B3, B4, and B8 are estimated by the associated Planetscope bands Blue, Green, Red, and NIR, respectively, whereas the Sentinel-2 red-edge bands B5 and B6 are estimated based on the Planetscope red band, and B7 and B8a are predicted via the Planetscope NIR band. Aiming to ease the learning process for the network, we provide two inputs to the CNN: P̂_t and S_{t-1} − P̂_{t-1} (denoted D_R^{t-1}). D_R^{t-1} provides additional accessible knowledge to the network, as it includes the reflectance difference at the prior date, which can assist the network in learning the mapping from the inputs to S_t.
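The band-matching step described above, pairing each Sentinel-2 band with the Planetscope band that is closest in RMSE terms, can be sketched as follows; `match_bands` is a hypothetical helper for illustration, not code from the article:

```python
import numpy as np

def match_bands(sentinel, planet):
    """For each Sentinel-2 band, pick the index of the Planetscope band with
    the lowest RMSE. sentinel: (Bs, H, W) and planet: (Bp, H, W) are assumed
    to be co-registered reflectance arrays on the same grid."""
    indices = []
    for s_band in sentinel:
        rmse = [np.sqrt(np.mean((s_band - p_band) ** 2)) for p_band in planet]
        indices.append(int(np.argmin(rmse)))  # closest Planetscope band
    return indices
```

Applying the returned indices to the Planetscope stack yields the eight-band version used as the RDE input.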
The image produced by this stream can be expressed as follows:

I_2 = P̂_t + I_{D_R}, with I_{D_R} = φ_2(P̂_t, D_R^{t-1}; θ_2)

where I_{D_R} denotes the radiometric difference that must be injected into P̂_t to produce I_2, and θ_2 represents the network parameters to be optimized. The detailed architecture of each stream is shown in Fig. 3. The two streams have the same architecture but different weights, as they are trained using different inputs. Each stream includes three main parts: shallow feature extraction, deep feature extraction, and a difference reconstruction part. First, one convolution is performed to extract shallow features from the corresponding inputs of each stream. Next, four RABs are applied for deep feature extraction. The attention block used is described in Section III-A3. Finally, from the elementwise sum of the shallow and deep features, two convolutions estimate the residual difference that needs to be added to S_{t-1}/P̂_t to produce I_1/I_2 for the TDE/RDE streams, respectively.
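The layout of one stream (shallow convolution, four deep blocks, a global skip connection, and a two-convolution difference head) might be sketched as follows in PyTorch; plain conv-ReLU-conv blocks stand in for the paper's RABs, and the channel width is an assumption:

```python
import torch
import torch.nn as nn

class StreamNet(nn.Module):
    """Sketch of one stream (TDE or RDE). The predicted difference image is
    added to the base image (S_{t-1} for TDE, the eight-band Planetscope
    image for RDE). Width and kernel sizes are assumed values."""
    def __init__(self, in_ch: int, out_ch: int, width: int = 64):
        super().__init__()
        self.shallow = nn.Conv2d(in_ch, width, 3, padding=1)       # shallow features
        self.deep = nn.Sequential(*[                               # 4 deep blocks
            nn.Sequential(nn.Conv2d(width, width, 3, padding=1),
                          nn.ReLU(inplace=True),
                          nn.Conv2d(width, width, 3, padding=1))
            for _ in range(4)])
        self.recon = nn.Sequential(                                # difference head
            nn.Conv2d(width, width, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, out_ch, 3, padding=1))

    def forward(self, x: torch.Tensor, base: torch.Tensor) -> torch.Tensor:
        f = self.shallow(x)
        diff = self.recon(f + self.deep(f))   # elementwise sum of shallow and deep features
        return base + diff                    # residual learning: add difference to the base
```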

3) RABs:
It has been shown that residual blocks can be utilized to build effective deep CNNs [49]. However, since traditional residual blocks apply equal attention to all features, this kind of network is generally difficult to train and struggles to reconstruct high-frequency details [34]. To overcome this limitation, the attention mechanism was proposed in [42], and it offers complementary characteristics: it can focus on more informative features and ignore useless ones, which helps networks easily capture the important features and reconstruct finer texture details. Inspired by this trend [34], [50], we propose an RAB, which combines the effectiveness of the attention mechanism with regular residual blocks. The architecture of a single attention block is shown in Fig. 4. It includes two parts to model two kinds of information suitable for spectral-temporal fusion, CA and spatial attention (SA), as satellite images include low- and high-frequency components. The high-frequency components provide valuable information representing edges, texture, and other kinds of details, and focusing on such components can help the network reconstruct the desired fused product. Accordingly, to pay more attention to high-frequency information, CA, as described in Section II-A1, is introduced to better exploit each channel of the feature maps. This strategy can prioritize the channels with more relevant information. It is common knowledge that channels in each feature map can encode different representations depending on the applied filter's objective. For instance, some filters capture horizontal edges while others extract vertical ones, and each of them plays a significant role in reconstructing the fused product. To separate the spatial information from the depthwise one, we perform SA using a depthwise convolution [51] to exploit the spatial interdependencies of each channel while maintaining channel-specific characteristics.
Contrary to regular convolutions, which are applied across multiple channels, depthwise convolution treats each channel individually to produce a two-dimensional feature map for each one. This part can be expressed as follows:

B_SA = f_depth(X)

where f_depth denotes the depthwise convolution operation via three kernels. The final output of an attention block combines the spatial and spectral attention outputs and can be expressed as follows:

X̂ = X ⊕ (B_CA ⊗ B_SA)

where X̂ represents the final output of the attention block, B_CA and B_SA denote the outputs of CA and SA, respectively, and ⊕ and ⊗ indicate the elementwise sum and product, respectively. The RAB is composed of successive stacked attention blocks.
Assuming that the output of the ith RAB is F_i, the output of the next one can be calculated as follows:

F_{i+1} = H_AB(F^1_{i+1})

where H_AB(·) indicates the operation of the RAB, and F^1_{i+1} is the C-channel feature map obtained after applying convolution, ReLU, and convolution to F_i.
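A hedged PyTorch sketch of a single attention block follows, combining conv-ReLU-conv features with channel attention and depthwise spatial attention in a residual manner; the reduction ratio and kernel sizes are assumptions:

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Attention block sketch: features F1 = conv(ReLU(conv(F_i))) are
    modulated by channel attention (B_CA) and depthwise spatial attention
    (B_SA), then summed with the block input, i.e. X ⊕ (B_CA ⊗ B_SA)."""
    def __init__(self, c: int, r: int = 4):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(c, c, 3, padding=1),
                                  nn.ReLU(inplace=True),
                                  nn.Conv2d(c, c, 3, padding=1))
        # CA: global pooling + two-layer gating (as in Section II-A1)
        self.ca = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(c, c // r, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(c // r, c, 1), nn.Sigmoid())
        # SA: depthwise conv (groups=c) treats each channel individually
        self.sa = nn.Conv2d(c, c, 3, padding=1, groups=c)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1 = self.body(x)
        return x + self.ca(f1) * self.sa(f1)  # elementwise sum of input and attention product
```

The `groups=c` argument makes the convolution depthwise, so spatial filtering never mixes channels.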

B. Second Stage: Reconstruction Network
At the end of the first stage, two outputs are generated, each with complementary features. To make the most of them, the fusion stage extracts the hierarchical characteristics of the two outputs I_1 and I_2 to capture complementary properties and produce the final fused product (I_F). To this end, inspired by the learning capacity of CNNs, we introduce a reconstruction block to merge the two outputs and recover the desired image. Instead of predicting the latter directly in a black-box manner lacking physical interpretability, this stage blends the two inputs by learning an appropriate pixelwise weighted sum, which guides the network to select the best pixels and boosts the fusion performance. The intuition behind this strategy is that the performance of the two streams' outputs varies depending on the spatiotemporal features. For instance, areas with minor changes are better preserved by the TDE, as S_t is almost equal to S_{t-1}, whereas areas with significant changes and low spatial variation are better reconstructed by the second stream (RDE), since S_t and P_t are highly correlated. The final fused product can be expressed by the following formula:

I_F = W_1 · I_1 + W_2 · I_2, with [W_1, W_2] = φ_F(I_1, I_2; θ_F)

where W_i represents the learned weights output by the network, · denotes the pixelwise product, I_i indicates the intermediate fused product of the ith stream, I_F represents the final fused product, and θ_F indicates the network parameters to be optimized. The architecture of the second stage is illustrated in Fig. 5. First, the two-stream outputs are concatenated into a single input that passes through two convolution layers to extract features encouraging the network to select the best pixels of the input. Next, two parallel convolutions are applied to estimate the appropriate weights for the associated intermediate fused products I_1 and I_2, respectively.
Aiming at blending the latter products via a weighted sum, the estimated weights are multiplied pixel-by-pixel by their associated fused images to choose the best value for each pixel. The results are added together at pixel level to produce the final fused image I F .
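The reconstruction stage described above can be sketched as follows; this is a minimal sketch in which the feature width and kernel sizes are assumed, not taken from the article:

```python
import torch
import torch.nn as nn

class ReconstructionNet(nn.Module):
    """Second-stage sketch: concatenate I_1 and I_2, extract features with
    two convolutions, then two parallel convolutions predict pixelwise
    weights W_1 and W_2; the output is W_1·I_1 + W_2·I_2."""
    def __init__(self, bands: int, width: int = 32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2 * bands, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True))
        self.w1 = nn.Conv2d(width, bands, 3, padding=1)   # weights for I_1
        self.w2 = nn.Conv2d(width, bands, 3, padding=1)   # weights for I_2

    def forward(self, i1: torch.Tensor, i2: torch.Tensor) -> torch.Tensor:
        f = self.features(torch.cat([i1, i2], dim=1))
        # learned pixelwise weighted sum of the two intermediate products
        return self.w1(f) * i1 + self.w2(f) * i2
```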

C. Proposed Loss
The design of the loss function is vital for network training and prediction. Therefore, unlike some CNN-based techniques [28], [29] that are often time-consuming during training since they train each part of the network separately, STA-Net employs a combined loss function to optimize the network parameters in an end-to-end manner. This strategy updates all the network parameters simultaneously via a single loss function, which leads to fast and accurate fusion results. The objective of the training is to minimize the following loss function:

L = α_1 L_1 + α_2 L_2 + α_3 L_3

where α_1, α_2, and α_3 represent the loss weights used to balance the contribution of each part, and each part is defined as follows:

L_1 = (1 / N_s) Σ_{i=1}^{N_s} ||I_F^(i) − S_t^(i)||_1
L_2 = (1 / N_s) Σ_{i=1}^{N_s} ( ||I_1^(i) − S_t^(i)||_2^2 + ||I_2^(i) − S_t^(i)||_2^2 )
L_3 = (1 / N_s) Σ_{i=1}^{N_s} ( (Ī_1^(i) − S̄_t^(i))^2 + (Ī_2^(i) − S̄_t^(i))^2 )

where S_t indicates the reference Sentinel-2 image, i denotes the sample index within a minibatch of N_s samples, and the bar operator indicates the mean value. The first part is the mean absolute error (known as l_1 or MAE) between the final predicted and the reference images, aiming to obtain fused images as close as possible to the references. As the final output combines the two-stream intermediate results, the second part encourages the first-stage networks to produce Sentinel-2-like products similar to the reference ones via a mean squared error loss (l_2), while also allowing end-to-end training. Concerning the third part, as Planetscope and Sentinel-2 images have different radiometric responses, it ensures that the intermediate predicted images have the same mean as the reference images to preserve their spectral information. The MAE (l_1) is used for the first part, as it provides better performance and convergence behavior than the l_2 loss [49], which guarantees the best fusion result for the final fused product. At the same time, l_2 is utilized for the other parts because it is less sensitive to small variations than l_1.
This is mainly the case when the two images are very similar to each other, which allows the network to give more focus on the final predicted product while ensuring competitive performance for the two-stream outputs.
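As a concrete illustration, the three-part loss can be written compactly in NumPy; this is a sketch under our own naming and default weights (the actual model is optimized within a deep-learning framework):

```python
import numpy as np

def combined_loss(i_f, i1, i2, s_t, a1=1.0, a2=1.0, a3=1.0):
    """Illustrative three-part STA-Net-style loss on a batch of images.

    i_f    : final fused output
    i1, i2 : intermediate two-stream outputs (TDE and RDE)
    s_t    : reference Sentinel-2 image
    a1-a3  : loss weights balancing the three parts
    """
    l1 = np.mean(np.abs(i_f - s_t))                      # MAE on final output
    l2 = np.mean((i1 - s_t) ** 2) + np.mean((i2 - s_t) ** 2)  # MSE on streams
    l3 = (i1.mean() - s_t.mean()) ** 2 \
         + (i2.mean() - s_t.mean()) ** 2                 # mean (radiometry) matching
    return a1 * l1 + a2 * l2 + a3 * l3
```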

A. Datasets
To assess the proposed technique's fusion performance, two datasets acquired by the Planetscope and Sentinel-2 satellites, over the same area and at the same dates, are used for the training and evaluation procedures, respectively. The first dataset (denoted as the Sfax dataset) was captured over the region of Sfax city, Tunisia (35°…), on May 15, 2020 and November 21, 2020. The interval between the two acquisitions allows significant phenological changes to appear, due to plant growth and the different types of vegetation, as well as shadow variations caused by the changing sun inclination. For each dataset, the first date represents the image at the prior date (t − 1), and the second date is the desired image at t to be estimated, which serves as the ground-truth image for evaluating the fusion product. The size of each training dataset is 2100 × 2100 pixels at the Sentinel-2 10 m scale. For the evaluation, 25 images of size 256 × 256 pixels were selected from each test dataset to quantitatively assess the performance of the proposed approach. Furthermore, a qualitative evaluation was carried out visually on one of the selected scenes.
For both constellations, Sentinel-2 and Planetscope, products with high processing levels are considered: Level-2A with Bottom-of-Atmosphere reflectance for Sentinel-2, which includes an atmospheric correction, and the Analytic Ortho Scene (Level 3B) for Planetscope. Regarding Sentinel-2, eight spectral bands are considered in this work. These comprise the broad spectral bands B2 (Blue, 458-523 nm), B3 (Green, 543-578 nm), B4 (Red, 650-680 nm), and B8 (NIR, 785-900 nm) at 10 m ground sampling distance, and the vegetation red-edge bands B5 (Red-Edge 1, 698-713 nm), B6 (Red-Edge 2, 733-748 nm), B7 (Red-Edge 3, 773-793 nm), and B8a (Narrow NIR, 855-875 nm) at 20 m, which are valuable for several vegetation-study applications, such as identifying vegetation types [52] and detecting crop disease [53]. For Planetscope, the four available spectral bands (Blue, Green, Red, and Near-Infrared) at 3 m are used. All selected datasets are cloud-free, geometrically corrected images. Sentinel-2 bands at 20 m were resampled to fit the resolution of the 10 m bands. Furthermore, Planetscope images at 3 m were upscaled to 10 m to fit the Sentinel-2 resolution. Although downsampling Planetscope images to 10 m may lose some spatial detail, it is more suitable for our approach to operate at this resolution for three reasons. First, our model aims to generate Sentinel-2 products at 10 m, the highest spatial resolution of Sentinel-2. Second, our technique is not intended to be sensitive to minor changes smaller than the 10 m resolution, which should be ignored. Third, the fusion performance remained approximately the same at both 10 m and 3 m resolutions. As postprocessing, the resampled Sentinel-2 bands (i.e., B5, B6, B7, and B8a) are downsampled using a Gaussian low-pass kernel that mimics the modulation transfer function, restoring their original 20 m resolution.
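The MTF-like postprocessing step can be sketched as a Gaussian low-pass followed by decimation. The kernel width below is an illustrative value, not Sentinel-2's calibrated MTF parameter, and the function name is ours:

```python
import numpy as np

def mtf_downsample(band, factor=2, sigma=1.0, radius=3):
    """Gaussian low-pass (stand-in for the sensor MTF) then decimation.

    band : 2-D array on the 10 m grid; returns the band on a coarser grid.
    """
    # normalized 1-D Gaussian kernel
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    k /= k.sum()
    # separable 2-D convolution with edge replication
    pad = np.pad(band, radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, pad)
    smoothed = np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)
    return smoothed[::factor, ::factor]  # decimate to the target resolution
```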
Table I summarizes the spatial, spectral, temporal resolutions and the required preprocessing of Sentinel-2 and Planetscope used in this work.

B. Implementation Details
For the training stage, the training images were cropped into patches of size 41 × 41 pixels, generating 3000 samples for the training process. A convolutional filter of size 3 × 3 was used in all weight layers of the network. Regarding the optimization, the network was trained for 1200 epochs (25 818 iterations) with the Adam optimizer [54] using β1 = 0.9 and β2 = 0.999. The batch size was set to 64. The loss weights α1, α2, and α3 were empirically set to 1 in the present work; they are set empirically instead of being learned, as we noticed that this strategy leads to more stable training and better fusion performance. The learning rate was initialized to 10^-4 and divided by 10 every 300 epochs. The network was implemented and tested on an NVIDIA Titan Xp GPU with 32 GB of RAM. Training stops when the loss does not improve for 50 epochs. In the prediction phase, since STA-Net can process images of arbitrary size within the limits of GPU memory, the test image is predicted without cropping, unlike in the training phase.
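The learning-rate schedule described above (start at 10^-4, divide by 10 every 300 epochs) can be expressed as a simple step-decay function; this is our own sketch of the rule, not the authors' code:

```python
def step_lr(epoch, base_lr=1e-4, drop_every=300, factor=10):
    """Step-decay schedule: divide base_lr by `factor`
    every `drop_every` epochs."""
    return base_lr / (factor ** (epoch // drop_every))
```

For example, epochs 0-299 use 1e-4, epochs 300-599 use 1e-5, and so on.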

C. Quality Assessment
Quantitative validation is an indispensable step for evaluating and comparing fusion techniques. Thanks to the availability of Sentinel-2 images at the desired dates to serve as references, the fused images can be evaluated against their associated target images in a full-reference manner. For this reason, several full-reference metrics have been proposed to measure the spectral and spatial quality of fused products. In this work, the fusion performance is evaluated through five widely used metrics: the root-mean-square error (RMSE), the correlation coefficient (CC) [55], the spectral angle mapper (SAM) [2], the structural similarity (SSIM) [56], and the universal image quality index (UIQI) [57]. In addition to the quantitative validation, a qualitative assessment was performed via visual inspection of the fused products, which helps identify other kinds of spectral and spatial distortions that may not be captured quantitatively.

TABLE II: COMPARISON OF FUSION PERFORMANCE ON COLEAMBALLY DATASET DEPENDING ON THE EMPLOYED ARCHITECTURE

TABLE III: COMPARISON OF FUSION PERFORMANCE ACHIEVED BY DIFFERENT NETWORK'S WIDTH
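For reference, three of these metrics (RMSE, CC, and SAM) admit compact NumPy implementations. The versions below follow the standard definitions rather than any implementation used in the paper:

```python
import numpy as np

def rmse(x, y):
    """Root-mean-square error between two arrays."""
    return float(np.sqrt(np.mean((x - y) ** 2)))

def cc(x, y):
    """Pearson correlation coefficient between two bands."""
    return float(np.corrcoef(x.ravel(), y.ravel())[0, 1])

def sam(x, y, eps=1e-12):
    """Mean spectral angle (radians) between per-pixel spectra.
    x, y : arrays of shape (H, W, B)."""
    dot = np.sum(x * y, axis=-1)
    norm = np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1) + eps
    return float(np.mean(np.arccos(np.clip(dot / norm, -1.0, 1.0))))
```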

D. Ablation Study
Aiming to investigate the influence of the network's components, an ablation study was performed to show the effectiveness of the proposed method as well as to select the optimal parameters that improve the fusion accuracy. More precisely, this study assesses the direct impact of the two-stream architecture, the weighted-sum strategy, the loss function, and the employed attention blocks on fusion efficiency.
1) Influence of Two-Stream Architecture: The use of a two-stream architecture is one of the main contributions of this work. Therefore, to show the effectiveness of the proposed two-stream architecture over a one-stream one, we implemented a one-stream network by stacking the inputs of each stream of the original method into a single input. Table II reports the quantitative fusion results achieved on the Coleambally dataset for each architecture. The proposed two-stream architecture shows better fusion ability than the single-stream one in all aspects: it achieved the best scores in all metrics, proving the suitability of a two-stream architecture for the proposed method.
2) Influence of the Network's Depth: An ablation study was also conducted to investigate the impact of the network's depth (i.e., the number of attention blocks) on the fusion performance. Since the number of network parameters grows linearly with the depth, the best tradeoff between fusion performance and depth must be found carefully. The quantitative results in Table III show that fusion performance and model depth are positively related; however, this trend reverses beyond a certain depth.

3) Influence of Weighted-Sum Strategy: Employing a weighted-sum strategy to produce the desired fused images is another original aspect of the present work. To examine the impact of this strategy on the fusion accuracy of the proposed method, we compared it with a typical strategy that does not employ any weighted sum, i.e., one that produces the desired image directly from the intermediate fused products via the second stage's reconstruction network. It was also compared with the intermediate images I_1 and I_2, generated by the TDE and RDE, respectively. Table IV reports the fusion scores obtained on the Coleambally dataset with each mechanism. The proposed weighted-sum strategy is clearly advantageous over the typical ones, since it produces the best fusion accuracy in all considered aspects. Furthermore, such a mechanism can boost the fusion performance by combining the intermediate images, through learned weights, into a single, more accurate product. Consequently, including a weighted-sum strategy in the last layer effectively enhances the fusion results, which further proves the effectiveness of the proposed method.

4) Influence of Loss Function:
Aiming to evaluate the influence of the loss function on the fusion quality, we compared the proposed loss function with the l1 and l2 losses, which are widely used in the literature for several image-enhancement applications, especially satellite image fusion. Table V reports the fusion scores obtained with each loss function on the Coleambally dataset. It can be seen that l1 achieves higher fusion scores than l2, which shows the significance of employing l1 as a principal part of the proposed loss. Moreover, the proposed loss function offers the best fusion results in all bands in terms of RMSE. The obtained scores prove the effectiveness of the proposed loss function in improving fusion performance; in particular, they show the importance of optimizing the network's parameters using a combined loss function that considers the output of each stream.

5) Influence of RABs:
In this experiment, we analyze the impact of the chosen attention blocks, as they play a significant role in the fusion process.

E. Quantitative Validation
To evaluate the fusion performance of the proposed technique, it was compared with the reconstruction-based approach STARFM [15]; the common CNN baseline for image processing, SRCNN [27], trained for spectral-temporal fusion; and the well-established deep-learning-based spatiotemporal fusion methods StfNet (two-stream convolutional neural network for spatiotemporal image fusion) [29] and GAN-STFM [30], adapted to deal with spectral-temporal fusion. Both SRCNN and StfNet were implemented and trained by ourselves, with all parameters set as described in the original papers to ensure optimal performance. As STARFM combines satellite images with similar spectral properties and requires the same number of input and output bands, the additional Sentinel-2 bands are estimated from the closest Planetscope bands in terms of RMSE. The Sentinel-2 bands B2, B3, B4, and B8 are estimated from the corresponding Planetscope bands Blue, Green, Red, and NIR, respectively. The remaining Sentinel-2 red-edge bands B5 and B6 are estimated from the Planetscope Red band, while B7 and B8a are predicted from the Planetscope NIR band. Tables VII and VIII report the quantitative scores of the considered fusion techniques and the associated S_{t-1} on the Sfax and Coleambally datasets, respectively. As can be seen, the correlation between the observations at the two dates is low for both datasets because of the long period between the acquisitions, which makes the fusion task more complex for the CNN-based approaches to learn from the data. As expected, the traditional approach STARFM achieves the weakest quantitative scores, as it cannot address phenology changes in homogeneous regions. SRCNN, StfNet, and GAN-STFM obtain a mean RMSE of around 0.01 on both datasets. STA-Net clearly enhances the accuracy, with a mean RMSE of roughly 0.007.
The mean SSIM score computed over the eight estimated bands reaches its highest value of 0.97 for the proposed approach, indicating that the fused and reference images have the best structural similarity. Besides, STA-Net offers the highest scores in terms of CC and UIQI, which denotes an effective reconstruction of small-size structures [59]. In terms of spectral fidelity, measured using the SAM index, STA-Net preserves the spectral signature better, as it produces the best SAM results among the compared techniques. Surprisingly, the fusion accuracy of the B5 and B6 bands surpasses that of B4 and B8, which have corresponding bands in Planetscope (RGB and NIR). This phenomenon may be due to the difference in spectral characteristics between B4 and B8 and the corresponding Red and NIR bands of Planetscope, as there is only a partial overlap between them (see Section IV-A). All the aforementioned results indicate that the proposed technique offers the best fused products in terms of spatial, spectral, and radiometric properties.
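The RMSE-based matching used above to pair each extra Sentinel-2 band with its closest Planetscope band when building STARFM's inputs can be sketched as follows; this is a hypothetical helper, not the authors' code:

```python
import numpy as np

def closest_band(target, candidates):
    """Return the name of the candidate band closest to `target` in RMSE.

    target     : 2-D array (a Sentinel-2 band)
    candidates : dict mapping band name -> 2-D array (Planetscope bands)
    """
    errs = {name: float(np.sqrt(np.mean((band - target) ** 2)))
            for name, band in candidates.items()}
    return min(errs, key=errs.get)  # band with the smallest RMSE
```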

F. Qualitative Validation
Visual inspection is considered a fundamental step, alongside the quantitative validation, for validating each fusion approach. It can highlight various noticeable distortions and artifacts in the fused images, which helps compare the performance of the considered fusion techniques. Figs. 6 and 7 illustrate the fusion results of the considered techniques on the Sfax and Coleambally datasets, respectively, along with the reference Sentinel-2 image (S_t), its associated Planetscope product at the same date (P_t), and the Sentinel-2 image at the prior date (S_{t-1}). At first sight, the visual results are in line with the quantitative observations from Tables VII and VIII. All the methods are able to estimate the phenological changes that occurred between the desired and prior dates. STARFM suffers from a serious blurring effect and loses a lot of detail in some heterogeneous areas; moreover, high spectral distortion is noticed in some parts of the image, as the color appears different from the reference image (red rectangle). Regarding SRCNN, its fused product suffers from a serious blurring effect over the whole image and loses a lot of detail in some heterogeneous areas; a significant spectral distortion is also noticed in some parts of the fused image, as the color appears dissimilar from the reference image (blue rectangle). StfNet, on the other hand, performs better than SRCNN but lacks spatial detail in some regions (blue rectangle), even though the color is better preserved than with the latter on both datasets. GAN-STFM offers better detail reconstruction on the Sfax dataset (highlighted in red) but generates a significant spectral distortion on the Coleambally dataset, as the whole image appears bluer than the reference image.
The proposed method STA-Net achieves better fusion accuracy than the aforementioned methods in terms of spatial and spectral quality: our fusion result is the closest to the reference image, without any noticeable distortion or artifacts in the fused product. The colors are also better preserved by the proposed method than by SRCNN and StfNet; for instance, the color in StfNet's result differs from the original image (blue rectangle). In terms of spatial information, StfNet lacks structural detail because its shallow architecture cannot capture sufficient high-frequency details, particularly in heterogeneous areas and at the edges of the regions highlighted in red. In contrast, since the proposed network is deeper and benefits from the attention mechanism to select the best features, the details and contours are well reconstructed with sharper edges (e.g., the regions highlighted in red). From the above comparisons, it can be concluded that the two-stream strategy combined with the attention mechanism boosts the fusion performance and produces more accurate products.

G. Limitations of the Proposed Method
The proposed method provides very competitive results not only for spectral-temporal fusion but also for generating data with high spatial and temporal resolutions (cf. Section 1.2 of the supplementary material). These results show that our work can be extended to different satellite image fusion problems. However, as with most works, the proposed method is subject to some limitations, mainly regarding its applicability to fusing other modalities and different kinds of data. One example is its behavior when generating dense time series of nonreflectance data, such as land surface temperature (LST) products that include thermal bands [60], [61], which are recommended for climate-change monitoring applications.

V. CONCLUSION
In this article, we introduced STA-Net, an end-to-end two-stream fusion technique based on RABs and trained with an effective loss function to integrate Planetscope and Sentinel-2 images. The proposed approach includes two stages. In the first stage, based on RABs, the TDE predicts the temporal residual between the Sentinel-2 images at the desired and prior dates, while the RDE simultaneously estimates the reflectance difference between the Sentinel-2 and Planetscope images. Two intermediate fused images are then produced by injecting the corresponding temporal and reflectance differences, respectively. The second stage reconstructs the desired fused product via a learned weighted sum that combines the two-stream outcomes. An effective loss involving the two-stream outputs is introduced to guarantee the best performance. To the best of our knowledge, this is the first attempt to fuse Planetscope and Sentinel-2 images to produce daily Sentinel-2 images using such a network.
The experiments were conducted on Planetscope and Sentinel-2 images using quantitative and qualitative evaluations on two datasets; they showed that the proposed approach yields the best fusion performance in terms of spatial and spectral information compared with the considered state-of-the-art techniques. In future work, we intend to explore more advanced deep-learning models to further improve the fusion quality while making the product more realistic. We also plan to extend our approach to generate fused images at the Planetscope 3 m resolution, and to extend STA-Net's applicability to producing LST data for dynamic monitoring and prediction in climate-change tasks.