LFD-Net: Lightweight Feature-Interaction Dehazing Network for Real-Time Remote Sensing Tasks

Currently, remote sensing equipment is evolving toward intelligence and integration, incorporating edge computing techniques to enable real-time responses. One of the key challenges in enhancing downstream decision-making capabilities is the preprocessing step of image dehazing. Existing dehazing methods usually suffer from steep computational costs caused by densely connected residual modules, as well as difficulties in maintaining visual quality. To tackle these problems, we design a lightweight network structure based on the atmospheric scattering model (ASM) to extract, fuse, and weight multiscale features. Our proposed LFD-Net demonstrates strong interpretability by exploiting the gated fusion module and attention mechanism to realize feature interactions between multilevel representations. On the SOTS dataset, LFD-Net reaches an average of 54.41 frames per second (FPS), approximately eight times faster than seven of the most popular methods with equivalent metrics. After image dehazing by LFD-Net, the performance of object detection is significantly improved. The mean average precision at IoU = 0.5 (mAP@0.5) based on YOLOv5 is improved by 4.73% on the DAIR-V2X dataset, which verifies the practicability and adaptability of LFD-Net for real-time vision tasks.

drones, vehicles, and ground monitoring devices, one can obtain comprehensive information about ground objects. However, real-time accurate information extraction in complex and highly dynamic conditions, such as traffic regulation, crime tracking, and disaster relief, remains particularly difficult [4], [5], [6]. A key preprocessing step to improve the image quality is to remove the negative effects of prevailing haze, and it would be a good option to deploy dehazing algorithms on remote sensing terminal platforms, which could significantly reduce data transmission costs and achieve faster response. Therefore, it is necessary to propose a lightweight dehazing algorithm to remove the constraints of limited power and computing resources on edge devices, and optimize the dehazing efficiency while ensuring accuracy and reliability.
Dehazing methods for remote sensing images are mainly of three types: prior knowledge-based methods, physical model-based methods, and deep learning-based methods. Most of the earliest dehazing methods are based on prior knowledge. For instance, dark channel prior (DCP) makes an approximation that pixels in haze-free outdoor images have at least one relatively low intensity value among RGB channels [7]; a semiphysical guided-filter-based approach is adopted to refine the coarse haze thickness map to restore textural information [8]; depth estimation and image segmentation are incorporated with DCP to generate the final transmittance [9]. These prior knowledge-based methods are typically subject to empirical or statistical regularities, leading to limited application scenarios.
In addition, the atmospheric scattering model (ASM) has been extensively introduced in physical model-based dehazing methods. It is physically grounded and applies to various image scenes without restriction through the estimation of the global atmospheric light and the transmission map. For instance, an end-to-end DehazeNet combines dark channel, maximum contrast, color attenuation, and hue disparity priors to compute the transmission map and assigns a default value to the atmospheric light [10]; a Haze Density Prediction Network is designed for a more accurate approximation of the atmospheric light to better fit nighttime occasions [11]; a multidecoder framework is presented to handle multiple bad weather restoration, with rain veiling effect embedded into the conventional ASM [12]; and a differential guided layer is embedded with the backbone and substituted into the physical scattering equation [13]. Approaches based on ASM are usually more lightweight, but they may produce unnatural color tones due to inaccurate estimation of atmospheric light.
Compared with traditional dehazing methods, deep learning-based methods have gradually become the research hotspot due to their stronger modeling and generalization capabilities. Dehazing methods based on convolutional neural networks (CNNs) are extensively adopted, which will be discussed in detail in Section II.
Our proposed Lightweight Feature-interaction Dehazing Network (LFD-Net) utilizes convolutional layers of different kernel sizes as a sequence to extract multilevel features. The feature interaction process is addressed coherently by taking in, redistributing, and reassigning weights to the extracted features. Each component of our network performs its own function, but also interacts efficiently and effectively as a whole. Moreover, we utilize multiple metrics for evaluation, which are highly relevant and sensitive to remote sensing tasks. Overall, our main contributions are threefold as follows.
1) Our method employs ASM to jointly approximate the atmospheric light and transmission map to enhance image restoration capability and inference efficiency. It incorporates the convolutional operations into more specialized modules while maintaining the conciseness.
2) Our proposed method is designed to provide interpretability by assigning distinct tasks to each module, as demonstrated by the results of our visualization and ablation experiments. The feature-interaction process relies heavily on elementwise multiplication, which has been shown to enhance the performance of pure convolutional operations.
3) Our proposed method has been extensively validated across various scenarios of space-air-ground remote sensing land observation tasks to demonstrate its stability, practicability, and generalization capabilities. It can effectively address common challenges such as halo effect, gridding artifact, and color inconsistency, and achieves an excellent tradeoff between accuracy and efficiency, which considerably improves the performance of object detection.

II. RELATED WORK
The increasing prevalence of intelligent remote sensing devices that support real-time responses, as opposed to relying on data transmission to servers, has highlighted the importance of studying lightweight dehazing methods, which are crucial for context-aware and fast-response remote sensing systems. However, there exists a tradeoff between the efficiency and accuracy of lightweight dehazing approaches. Some approaches employ knowledge distillation [14], [15] or pruning techniques [16], which may sacrifice accuracy for efficiency. In contrast, other methods directly construct lightweight networks to address this issue. For instance, AOD-Net [17] serves as a baseline for other lightweight dehazing models by concatenating multilevel features using different patterns. FAOD-Net [18] and GAOD-Net [19] utilize depthwise and pointwise convolutions to reduce parameters and aggregate context information in a pyramid pooling module. FAMED-Net [20] employs cascaded and densely connected pointwise convolutional and pooling layers at multiple scales. LD-Net [21] tackles the semantic gap by concatenating convolutional layers and incorporates a Color Visible Restoration module to enhance color consistency.
Nevertheless, achieving a balance between high performance on specific datasets and generalization to diverse practical applications remains a central challenge. The design and evaluation of dehazing methods should consider this tradeoff comprehensively. While current methods may exhibit promising results under certain conditions, their lack of efficiency and generalization capabilities limits their suitability for real-world and real-time applications.
Our proposed LFD-Net considers dehazing as an image reconstruction task with an emphasis on feature extraction and feature utilization processes, as discussed in Sections II-A and II-B. In contrast to stacking deep residual modules in these procedures, we employ the gated fusion and attention mechanism only once, which improves both efficiency and interpretability. Moreover, it is important to design comprehensive evaluation metrics for dehazing methods, as described in Section II-C.

A. Feature Extraction
One of the key challenges in image reconstruction is the extraction of multilevel or multiscale features, which can be facilitated by using a symmetric encoder-decoder structure. The U-Net architecture, originally designed for effective extraction of context information at different scales or levels [22], has been widely used as a backbone in various reconstruction tasks. In [23], the Strengthen-Operate-Subtract boosting strategy is incorporated into the decoder, and a dense feature fusion module utilizing a back-projection feedback scheme is leveraged to compensate for the missing spatial information from high-resolution features. In [24], the U-Net architecture is modified to incorporate discrete wavelet transform and inverse discrete wavelet transform in place of conventional downsampling and upsampling. In [25], hybrid convolution, which combines standard convolution with dilated convolution, is applied in the U-Net encoder to expand the receptive field and extract image features in more detail.
As opposed to a fixed backbone like U-Net, some methods utilize more flexible structures with multiple paths to diversify color information or perform various tasks. For instance, in [26], image dehazing and depth estimation are addressed simultaneously in a framework with four decoders sharing information from the same encoder. In [27], a multicolor space encoder that incorporates RGB, LAB, and HSV is applied to extract representative features in separate paths. In [28], a quadruple color-cue is integrated into a multilook architecture with multiweighted training loss for autonomous vehicular application. These color spaces are often designed manually; they work well for specific applications, but may lack adaptability and generalization for others.

B. Feature Utilization
Another major challenge in image reconstruction tasks is the efficient utilization of extracted features, which has prompted the exploration of various feature fusion strategies and attention mechanisms. For instance, in [29], a novel attention-based multiscale estimation module is implemented in the backbone on a grid network to alleviate the bottleneck issue encountered in conventional multiscale approaches. In [30], a block structure integrated with channelwise attention (CA) and pixelwise attention (PA) is stacked to form a group structure, which is progressively triple-stacked and concatenated to feed into another CA-PA attention mechanism for feature fusion. In [31], a multilevel fusion module is presented to integrate low-level and high-level representations. In addition, a residual mixed-convolution attention module is developed to guide the network to focus on significant features during the learning process. In [32], the feature fusion method progressively aggregates the features of the hazy image and the generated reference image to remove the useless features.
Moreover, the self-attention mechanism proposed in the transformer has also been practiced in dehazing methods. For instance, a transformer-based channel attention module and a spatial attention module are combined to form an attention module that enhances channel and spatial features [33]. Long-range dependencies of image information can be effectively extracted through transformer blocks in image dehazing [34]. Recently, it has been revealed in [35] that the self-attention mechanism inherently functions as a two-order feature interaction. In our method, gated convolution has been developed as an alternative that achieves results competitive with self-attention while reducing the computational cost.

C. Quality Evaluation
Existing methods usually focus on high performance quantified by metrics in terms of peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM). More specifically, PSNR measures the ratio between the maximum possible value of a pixel and the power of corrupting noise that affects the restoration fidelity. Instead of directly estimating absolute error, SSIM reveals interdependencies within pixels by luminance masking and contrast masking between spatially close image pairs. Besides, the CIE2000 Delta E formula (CIEDE2000) and Spatial-Spectral Entropy-based Quality (SSEQ) are also introduced in our comparison metrics, because color and texture are significant for object recognition and terrain classification in remote sensing applications. CIEDE2000 is used to quantify the visual difference between two colors. It takes into account the chromaticity and luminance of the colors being compared, as well as the surrounding colors and the viewing conditions [36]. SSEQ is calculated by separating the image into its spatial and spectral components, then calculating and combining the entropy of each component [37]. The halo effect in many remote sensing images can lead to significant degradation over large areas compared to high spatial resolution close-range images. In the comparison experiments, we calculate the average CIEDE2000 of each pixel in image pairs and the average absolute value of relative error on SSEQ (i.e., ΔSSEQ).
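As an illustration of the full-reference fidelity metric described above, the following minimal sketch computes PSNR from the mean squared error using the standard definition 10 log10(MAX^2 / MSE). It also includes a helper for an average absolute relative error, which is one plausible reading of ΔSSEQ; the SSEQ scores themselves are assumed to be computed elsewhere, and the function names and the choice of the reference score as the denominator are assumptions made for illustration only.

```python
import math


def mean_squared_error(restored, reference):
    """Mean squared error between two equally sized flat lists of pixel values."""
    if len(restored) != len(reference) or not restored:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((r - g) ** 2 for r, g in zip(restored, reference)) / len(restored)


def psnr(restored, reference, max_value=1.0):
    """PSNR in dB: 10 * log10(max_value^2 / MSE); infinite for identical inputs."""
    err = mean_squared_error(restored, reference)
    if err == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


def mean_abs_relative_error(scores, reference_scores):
    """Average of |score - reference| / |reference| over paired quality scores.

    Assumption: the relative error is taken with respect to the reference score.
    """
    if len(scores) != len(reference_scores) or not scores:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(s - r) / abs(r) for s, r in zip(scores, reference_scores))
    return total / len(scores)
```

Pixel values are passed as flat lists in [0, max_value] to keep the sketch dependency-free; an array library would normally be used for full images.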

A. Preliminaries
ASM is employed in our method to overcome the difficulty of raw pixel prediction from reconstructed images via a light model. It is physics based, more suitable for real-world scenarios, and can be formulated as
$$I(\theta) = J(\theta)\,t(\theta) + A\,\bigl(1 - t(\theta)\bigr) \tag{1}$$
where the atmospheric light A is treated as a constant, t ∈ (0, 1] denotes the pixelwise transmittance of light, θ represents the pixel coordinate of an H × W image of height H and width W, with I and J being the hazy input and haze-free output, respectively. Therefore, the haze-free approximation J(θ) can be written as
$$J(\theta) = \frac{1}{t(\theta)}\,I(\theta) - A\,\frac{1}{t(\theta)} + A \tag{2}$$
To encapsulate these two factors (i.e., A and t(θ)) into one variable, the formula of the reformulated ASM is as follows:
$$J(\theta) = K(\theta)\,I(\theta) - K(\theta) + b \tag{3}$$
where K(θ) represents the new incorporated variable, which can be derived as
$$K(\theta) = \frac{\dfrac{1}{t(\theta)}\bigl(I(\theta) - A\bigr) + (A - b)}{I(\theta) - 1} \tag{4}$$
To be specific, K is the intermediate evaluation parameter of the network. The ultimate goal is to generate a separate K value for each input channel, typically in terms of RGB. That is, K of size 3 × H × W is substituted into (3) at the end of the network, with the most commonly used default value b = 1.
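To make the role of K concrete, the sketch below works through the reformulation for a single pixel value. It assumes the standard scattering model I = J t + A (1 − t) and the AOD-Net-style definition of K, namely K = ((I − A)/t + (A − b)) / (I − 1); the function names are hypothetical and the code is an illustration rather than the network's implementation.

```python
def synthesize_haze(j, a, t):
    """Forward scattering model: hazy value from clean value j, light a, transmittance t."""
    return j * t + a * (1.0 - t)


def dehaze_direct(i, a, t):
    """Invert the scattering model when both a and t are known."""
    return (i - a) / t + a


def k_from_physics(i, a, t, b=1.0):
    """Single variable K that absorbs both a and t (undefined when i == 1)."""
    if i == 1.0:
        raise ZeroDivisionError("K is undefined for a saturated pixel (i == 1)")
    return ((i - a) / t + (a - b)) / (i - 1.0)


def dehaze_with_k(i, k, b=1.0):
    """Reformulated model: clean value from the hazy value and K only."""
    return k * i - k + b


# Round trip for one pixel: clean 0.3, atmospheric light 0.8, transmittance 0.6.
hazy = synthesize_haze(0.3, 0.8, 0.6)
k = k_from_physics(hazy, 0.8, 0.6)
restored = dehaze_with_k(hazy, k)
```

In the network, K is predicted per pixel and per channel instead of being derived from known A and t; the round trip above only checks that both formulations recover the same haze-free value.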

B. Network Design
The proposed LFD-Net distinguishes itself from both heavyweight and lightweight frameworks with its concise and effective approach to feature extraction and interaction, as shown in Fig. 1 and Fig. 2. To optimize the lightweight structure design of the LFD-Net, the gated fusion module and attention mechanism are used only once, instead of being incorporated as parts of more complex blocks. This approach significantly improves efficiency while maintaining strong performance in dehazing tasks.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
In CNNs, convolution kernels of varying sizes are used to extract features at different levels of abstraction. Specifically, smaller kernels are effective at capturing local features, while larger kernels are more suited for capturing features with larger receptive fields, which are considered more global features. The most commonly used kernel size is 3 × 3. However, stacking convolutional layers with this typical kernel size is not efficient enough in lightweight models. The concatenation layers are utilized to combine the low-level and high-level features, which compensates for the loss of information from the initial layers as the network proceeds deeper. Therefore, the formation of convolutional and concatenation layers is crucial and needs to be designed flexibly to meet specific needs. Different from existing methods, we further simplify the formation of convolutional layers during feature extraction. Based on this, we also introduce feature interaction strategies including the gated fusion module and attention mechanism.

A residual connection between Conv 1 and Conv 3 is utilized to refine feature representations between low-level and high-level features.
A concatenation layer, namely Concat 1, is applied to combine the multilevel features from the extraction process. These features are then fed into the gated fusion module for spatial interactions, which includes a convolutional operation, namely Conv 4. The output features are passed to the second concatenation layer, namely Concat 2, which progressively integrates the features extracted in the Conv 3 layer. This is because higher level information is more global and is thus distributed to lower levels in the gated fusion module while performing feature interactions. This information is also indispensable for image restoration, especially for the following attention mechanism, which makes it necessary to involve the Concat 2 layer. The attention mechanism adaptively learns channelwise and pixelwise weights to enhance useful features. After that, all features are fed into the high-resolution stage, which consists of two convolutional layers, namely Conv 5 and Conv 6. The details of the proposed method are illustrated in Table I.

C. Gated Fusion Module
Our proposed LFD-Net replaces densely connected residual blocks with effective feature-interaction-based strategies. The gated fusion module aims to perform two-order interactions among multilevel features. This idea is demonstrated in transformer-based architecture through two successive pixelwise products (i.e., K, V) [38]. While transformers are effective, the computational cost is huge when dealing with low-level preprocessing tasks. Transformer-ensembled CNNs usually expand the flexibility of convolutional operations through adding dynamic weights to improve the modeling power of convolution [35], [39], [40]. Similar techniques have been practiced in image dehazing methods [41], [42], but are still in need of further exploration and interpretation.
Our proposed method also takes advantage of pixelwise multiplication by applying it directly to successive feature levels, namely the output of the concatenation layer Concat 1, which combines the sequence of convolutional layers Conv 1, Conv 2, and Conv 3. For illustration, these features are denoted as $F_1$, $F_2$, and $F_3$, respectively. In addition, these three convolutional operations are denoted as $C_1$, $C_2$, and $C_3$, and the ith feature map of the output layer is denoted as $G_i$. The process of the gated fusion module can be expressed mathematically as follows:
where $F_{k,i}$ is the original ith feature map of the kth group. As shown in (5), the input of the gated fusion module consists of three levels. The number of output feature maps is one-third of the number of input feature maps, equal to the number of feature maps in each level of the input. The gated fusion module enhances the features within a feature map with neighboring pixels and introduces interactions by dynamically assigning weights to other feature maps through pixelwise multiplication. This reinforces the ability of convolution to retain and utilize multilevel features in an intensive and expansive manner.
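The channel bookkeeping of such a fusion can be illustrated with a small sketch. It assumes a GCA-Net-style gated fusion in which each output map is a pixelwise weighted sum of the corresponding maps from the three levels; this form is an assumption for illustration and may differ from the exact expression used by the network. The gate values, which in practice would come from a convolution over the concatenated features, are simply passed in here, and the function name is hypothetical.

```python
def gated_fusion(levels, gates):
    """Fuse K feature levels into one by gated pixelwise multiplication.

    levels, gates: nested lists of shape [K][C][H][W].
    Returns a nested list of shape [C][H][W], so K * C input maps become C maps.
    """
    num_levels = len(levels)
    channels = len(levels[0])
    height = len(levels[0][0])
    width = len(levels[0][0][0])
    fused = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for k in range(num_levels):
        for i in range(channels):
            for y in range(height):
                for x in range(width):
                    # Pixelwise product of a gate with the i-th map of level k.
                    fused[i][y][x] += gates[k][i][y][x] * levels[k][i][y][x]
    return fused


# Three levels, two maps per level, one pixel per map (6 input maps in total).
levels = [[[[1.0]], [[2.0]]], [[[3.0]], [[4.0]]], [[[5.0]], [[6.0]]]]
gates = [[[[0.5]], [[0.5]]] for _ in range(3)]
fused = gated_fusion(levels, gates)
```

With three levels of C maps each, the fused output has C maps, which matches the statement that the output has as many feature maps as a single input level.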

D. Attention Mechanism
According to (5), the gated fusion module adaptively enhances and interacts with multilevel features. However, in cases where the haze is unevenly distributed, as often occurs in aerial imaging, accurately assessing the extent and density of the haze region remains challenging. This can result in the presence of fancy shades or dark spots. Attention mechanisms, which have been designed to focus on distinctive parts when processing large amounts of information [43], can be utilized to address this issue in image dehazing. Specifically, CA selects the feature levels for features associated with the haze region, while PA refines the selected haze region. In [30], the attention mechanism [44] is integrated into a block structure and stacked in the feature extraction process. In our proposed method, by contrast, the attention mechanism is utilized only once as a single module to finalize feature weights before the high-resolution stage, leaving a large space for weight adjustment.
The adopted attention mechanism is composed of channelwise attention (CA) and pixelwise attention (PA), as depicted in Fig. 3, serving as a compensation to the gated fusion module.All of the convolution operations used in the attention mechanism have a kernel size of 1 × 1, similar to a multilayer perceptron architecture, with global average pooling and channelwise mixing [45].In this mechanism, elementwise product is also used in place of absolute convolutional operations to increase the flexibility and reduce computational complexity.
In detail, CA first assigns weights to each channel by a global average pooling. The average pooling value of the cth feature map, namely $M_c$, can be formulated as follows:
Then, two successive convolutional layers with activation layers are utilized as linear transformation to obtain a 1-D weight vector that is multiplied elementwise with the cth feature map as follows:
where $C^{*}_{1}$ and $C^{*}_{2}$ are the two convolutional layers, respectively, with δ(·) and σ(·) being the corresponding activation functions.
Similarly, PA transforms the output feature maps of CA, $M^{*}$, on a pixel scale, with the output, namely $M^{\bullet}$, derived as follows:
where $C^{\bullet}_{1}$ and $C^{\bullet}_{2}$ are the two convolutional operations, respectively, with σ(·) being the shared activation function.
Unlike the gated fusion module, which reduces the number of channels to one-third, the attention mechanism maintains an equal number of input and output channels. This suggests that the attention mechanism is able to effectively preserve the feature representation through channelwise interaction, leading to fine-tuning of pixelwise features with relatively low computational cost. In comparison to the approach presented in [21], which utilizes 1 × 1 convolutional layers at the beginning and end of the network, our method incorporates fully connected layers into the attention mechanism with elementwise product to further enhance the power of convolutional operations.
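The two attention stages can be sketched in a few lines. The sketch below treats each 1 × 1 convolution as a bias-free linear map over the channel dimension, uses ReLU for δ and the sigmoid for σ, and emits one weight per pixel in PA; these choices, together with the function names and weight shapes, are assumptions for illustration and not the trained configuration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    return max(0.0, x)


def linear(weights, vec):
    """A bias-free 1x1 convolution applied to one channel vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def channel_attention(feat, w1, w2):
    """feat: [C][H][W]. Global average pooling, two 1x1 convs, channelwise reweighting."""
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feat]
    hidden = [relu(v) for v in linear(w1, pooled)]          # delta assumed to be ReLU
    weights = [sigmoid(v) for v in linear(w2, hidden)]      # one weight per channel
    return [[[weights[c] * p for p in row] for row in ch] for c, ch in enumerate(feat)]


def pixel_attention(feat, w1, w2):
    """feat: [C][H][W]. Two 1x1 convs with a shared sigmoid, pixelwise reweighting."""
    channels, height, width = len(feat), len(feat[0]), len(feat[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for y in range(height):
        for x in range(width):
            vec = [feat[c][y][x] for c in range(channels)]
            hidden = [sigmoid(v) for v in linear(w1, vec)]
            weight = sigmoid(linear(w2, hidden)[0])          # one weight per pixel
            for c in range(channels):
                out[c][y][x] = weight * vec[c]
    return out


# Two channels, one row, two columns; zero weights give a uniform gate of 0.5.
feat = [[[2.0, 4.0]], [[6.0, 8.0]]]
ca_out = channel_attention(feat, [[0.0, 0.0]], [[0.0], [0.0]])
pa_out = pixel_attention(feat, [[0.0, 0.0]], [[0.0]])
```

Both stages return as many channels as they receive, in line with the observation that the attention mechanism keeps the channel count unchanged.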

E. Loss Function
While a combination of L1 loss, L2 loss, SSIM, or perceptual loss as loss functions has been shown to achieve good performance in previous works [46], [47], [48], [49], our experiments indicate that the most widely used L2 loss is the most suitable loss function for LFD-Net. The L2 loss is defined as follows:
where I is the input hazy image and J is the haze-free output. The intermediate value being approximated is K, which is not a direct output and thus introduces a natural discrepancy with the output from VGG, rendering it impractical to utilize perceptual loss. Furthermore, the small number of parameters in the proposed method minimizes the risk of overfitting, so regularization terms (e.g., L1 loss) may even be counterproductive.

A. Dataset
To validate the dehazing effect of LFD-Net for space-air-ground integrated remote sensing land observation, we first conduct training (i.e., outdoor training set OTS) and validation (i.e., synthetic objective testing set SOTS, hybrid subjective testing set HSTS) experiments of ground-based observation from the REalistic Single Image DEhazing dataset RESIDE [50]. To validate the generalization ability of LFD-Net, we also use the real hazy and haze-free outdoor images dataset O-HAZE [51]. We fine-tune the pretrained weights from ground-based observation data by using the aerial image dataset AID [52] for satellite (i.e., space-based) and drone (i.e., air-based) observation, and test on the Remote sensing Image Cloud rEmoving dataset RICE [53]. However, we lack a dataset to test the performance of the downstream perception task under hazy conditions. To solve this problem, we synthesize hazy images on DAIR-V2X [54] and VisDrone2019 [55], and evaluate the performance of object detection tasks using hazy and dehazed images for comparison.

B. Experiment Results
We faithfully reproduce seven methods for various outdoor scenarios, including DCP [7], AOD-Net [17], GridDehazeNet [29], Wavelet-U-Net [24], GCA-Net [42], FFA-Net [30], and D4 [56]. All the experiments are conducted on a PC with an R9-5900HX CPU (E5-1650) and an NVIDIA RTX-3080 GPU. Quantitative comparison results on the outdoor SOTS and O-HAZE datasets can be found in Tables II and III, respectively. The visual comparison results from the outdoor SOTS and O-HAZE datasets are shown in Figs. 4 and 5. Furthermore, we also perform experiments using real-world hazy images with no reference, both from HSTS and randomly selected from the Internet, as depicted in Figs. 6 and 7.
In the remote sensing domain, to the best of our knowledge, pretrained models for dehazing methods are not publicly available. We therefore reproduce seven SOTA methods using their default outdoor weights, including AOD-Net [17], GridDehazeNet [29], GCA-Net [42], FFA-Net [30], MSBDN [23], D4 [56], and DehazeFormer [57]. As expected, the performance of AOD-Net is limited due to its small number of parameters, while the other methods show similar performance before fine-tuning. In this article, our pretrained model is open to the public for further comparison.
To demonstrate the effectiveness and efficiency of our proposed method, we present a comprehensive comparison using various metrics including PSNR, SSIM, CIEDE2000, ΔSSEQ, and FPS. The comparison results are summarized in Tables II, III, and IV. In addition, we provide a comparison of model sizes in Table V.
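For reference, a throughput figure such as FPS can be obtained by timing repeated inference calls. The following is a minimal sketch under stated assumptions: `process_frame` stands in for any dehazing function, the helper name and warm-up count are hypothetical, and the timing protocol used in the experiments may differ.

```python
import time


def measure_fps(process_frame, frames, warmup=1):
    """Average frames per second of process_frame over the given frames."""
    if not frames:
        raise ValueError("at least one frame is required")
    # Warm-up runs are excluded from timing (caches, lazy initialization).
    for frame in frames[:warmup]:
        process_frame(frame)
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    if elapsed <= 0.0:
        return float("inf")
    return len(frames) / elapsed


# Toy usage: a stand-in "model" that sums the pixel values of each frame.
toy_frames = [[0.1, 0.2, 0.3]] * 50
toy_fps = measure_fps(sum, toy_frames)
```

When inference runs asynchronously on a GPU, the device must be synchronized before each clock read; otherwise the measured time covers only kernel launches rather than the computation itself.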
Observations reveal that many networks suffer from inconsistencies within color blocks or misrepresent original information, as reflected in terms of CIEDE2000 and ΔSSEQ. For instance, lightweight methods such as AOD-Net [17] and D4 [56] produce relatively dark visual quality, resulting in a significant loss of texture information and making it difficult to distinguish objects for downstream tasks. DCP [7], a traditional dehazing method, exhibits relatively high dehazing capacity; however, it is susceptible to severe color shift as it heavily relies on prior assumptions about color distributions. While GCA-Net [42] encounters color shift occasionally in the synthetic SOTS dataset, it performs well in realistic scenarios like O-HAZE, which has thick and irregular haze. However, its halo effect and color imbalance are magnified in RICE1, which makes it less adaptive to generalized scenarios, as shown in Fig. 8. FFA-Net [30] performs well on specific datasets but distinctly lacks dehazing capability on RICE1, where there are a variety of landforms and terrains, rendering it not generalizable enough for shifted domains.
From the experiment, we can observe that incorporating attention mechanisms may prevent the image from being uniformly dehazed without region discrepancy (i.e., FFA-Net) compared to networks with absolute convolutional and concatenation layers (i.e., AOD-Net, LD-Net). However, a stack of sophisticated modules incorporating attention mechanisms may also confuse the model when selecting regions of interest, leading to insufficient attention paid to each hazy region and overfitting on specific datasets with limited data diversity, rendering these approaches not flexible enough for real-world vision tasks.
Fig. 4. Visual comparison on outdoor SOTS. We compare our method with DCP [7], AOD-Net [17], GridDehazeNet [29], Wavelet-U-Net [24], GCA-Net [42], FFA-Net [30], and D4 [56]. Our proposed method exhibits adaptability to diverse scenarios and possesses a noteworthy level of generalization.
Fig. 5. Visual comparison results on O-HAZE. We compare our method with DCP [7], AOD-Net [17], GridDehazeNet [29], Wavelet-U-Net [24], GCA-Net [42], FFA-Net [30], and D4 [56]. AOD-Net and D4 produce relatively dark results. GCA-Net performs well on irregular haze but suffers from inconsistency in color blocks. Our proposed method exhibits adaptability to diverse scenarios and possesses a noteworthy level of generalization.
Fig. 8. Visual comparison results on O-HAZE. We compare our method with DCP [7], AOD-Net [17], GridDehazeNet [29], Wavelet-U-Net [24], GCA-Net [42], FFA-Net [30], and D4 [56]. AOD-Net and D4 produce relatively dark results. GCA-Net performs well on irregular haze but suffers from inconsistency in color blocks. Our proposed method exhibits adaptability to diverse scenarios and possesses a noteworthy level of generalization.
TABLE V. COMPARISON OF THE PARAMETERS OF MODELS
TABLE VI. EXPERIMENT OF LFD-NET OUTDOOR SOTS DATASET
Nevertheless, attention mechanisms are well adapted to U-Net or U-Net-like structures, where multiscale features are addressed symmetrically. Wavelet-U-Net [24] and GridDehazeNet [29] have excellent performance, but this may come at the cost of inference time, which is 6.6× and 5.4× longer than that of our proposed method, respectively. Wavelet-U-Net transforms the image into the wavelet space using discrete wavelet transformation, which adds to the computational cost to some extent. GridDehazeNet also utilizes attention mechanisms, but as a bridge between multiscale features, which resembles the design of U-Net [22]. It has three rows and six columns, with each row corresponding to a different scale, constructing a grid network, which may compromise the inference speed.
However, their performance on the HSTS dataset in Fig. 6 and on randomly selected hazy images in Fig. 7 demonstrates that they may also suffer occasional degradation when dealing with remote objects that are occluded, as well as objects located in areas uniformly covered with thick haze but with limited prior semantic information. While the images randomly selected for our study in Fig. 7 may not be representative of specific datasets, they are still valuable for consideration as they reflect scenarios that can occur in real-world practice. Although accurately verifying the generalization of algorithms is challenging, our approach has demonstrated effectiveness even when encountering severe domain shifts, as evidenced by our experiments. For efficiency, our proposed method neither adopts the U-Net structure nor leverages stacked attention mechanisms, which saves computational cost to a large extent while exhibiting adaptability to diverse scenarios and a noteworthy level of generalization.

C. Ablation Study
The experimental results confirm that our proposed LFD-Net is effective and efficient for real-time applications. Since it has a different principle than other methods, we perform a series of ablation studies to ensure that each component of the network is indispensable. The detailed experimental conditions and corresponding metrics tested on outdoor SOTS are listed in Table VI.
Inspired by [17] and [21], we add a second concatenation layer (i.e., Concat 2) to our method. In Case 1, we omit Concat 2 and observe a slight loss of detailed texture information due to the reduced high-level information.
In Cases 2, 3, and 4, we investigate the importance of the gated fusion module and attention mechanism in our model. These cases demonstrate that these two subnetworks work together to facilitate feature interaction. Specifically, the removal of the attention mechanism leads to the occasional appearance of black spots on the images, which significantly degrades the overall performance. In comparison with other lightweight methods, our method partially addresses this issue. In addition, the gated fusion module is a crucial component in enhancing the dehazing capability, serving as a bridge between the multilevel feature extraction process ending at the first concatenation layer Concat 1 and the attention mechanism beginning at the second concatenation layer Concat 2.
When both the attention mechanism and the gated fusion module are involved, the detailed information in the images is further refined, making it more authentic and faithful to the original information.This structure helps to preserve and interact with multilevel information to improve the overall image quality.

D. Visualization Results
We have visualized the intermediate feature maps before and after the gated fusion module, as depicted in Fig. 9. As shown in Fig. 9(a), the incorporated convolutional layer combines features of three levels from Conv 1, Conv 2, and Conv 1 + Conv 3 to generate three distinctive feature maps. They are distinguished from each other by their focus on close or distant objects and the lightness or contrast of the pixels.
In Fig. 9(b)-(d), we demonstrate the changes in specific feature maps after the gated fusion module. Fig. 9(b) shows that the contrast of the image is enhanced with the hierarchical information, resulting in distant objects becoming more distinct. Fig. 9(c) and (d) show more abstract feature representations, which are significantly shifted compared to the input features. Specifically, Fig. 9(c) emphasizes the outline of substances, while Fig. 9(d) highlights the blocks within substances.
The gated fusion module reallocates the distributed feature representations of the multilevel layers through feature-interaction strategies. The feature extraction process is compressed into three successive convolutional layers, for which we compensate by intralevel enhancement and interlevel combination.
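The reweighting performed by the gated fusion module can be illustrated with a minimal sketch. It assumes that the three feature maps produced by the incorporated convolution act as per-level gates, that each gate is multiplied elementwise with its feature level, and that the weighted levels are summed; the summation and the names `gated_fusion`, `levels`, and `gates` are illustrative assumptions and are not taken from the LFD-Net implementation.

```python
from typing import List

Map2D = List[List[float]]  # one single-channel feature map


def gated_fusion(levels: List[Map2D], gates: List[Map2D]) -> Map2D:
    """Return sum_k gates[k] * levels[k], computed elementwise.

    Assumption: one gate map per feature level, all of the same size.
    """
    if not levels or len(levels) != len(gates):
        raise ValueError("need one gate map per feature level")
    rows, cols = len(levels[0]), len(levels[0][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for feat, gate in zip(levels, gates):
        for r in range(rows):
            for c in range(cols):
                # Elementwise product, accumulated across levels.
                fused[r][c] += gate[r][c] * feat[r][c]
    return fused


# Toy 1x2 feature maps standing in for the three feature levels.
levels = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]]]
# Hypothetical gate values; in a network these would be learned.
gates = [[[1.0, 0.0]], [[0.0, 1.0]], [[0.5, 0.5]]]
fused = gated_fusion(levels, gates)
print(fused)  # [[3.5, 7.0]]
```

In this toy case the first pixel is dominated by the first level and the second pixel by the second level, which mirrors how gating lets different spatial positions draw on different levels of the hierarchy.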

E. Application for Object Detection Task
As a severe weather condition, haze can significantly reduce the effectiveness of remote sensing land observation systems. For instance, in autonomous navigation applications, object detection can be significantly impacted by hazy environments, resulting in degraded image quality and potentially jeopardizing the safety of the system. Therefore, a preprocessing procedure for image enhancement before performing those tasks is of great importance. As far as we know, there is no dataset with built-in synthetic hazy images for object detection. In our experiment, we randomly select 100 images from each dataset (DAIR-V2X and VisDrone2019) and produce their synthetic hazy versions. We use the default outdoor pretrained weight for the former and the fine-tuned remote sensing pretrained weight for the latter. Both object detection processes are based on YOLOv5. Our experimental results show that the mean average precision when IoU = 0.5 (mAP@0.5) of the dehazed condition improves by 4.73% compared to the hazy condition on DAIR-V2X, and by 0.81% on VisDrone2019.
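The text does not state how the synthetic hazy versions are produced. One common way, sketched below under that caveat, is to apply the standard atmospheric scattering model I(x) = J(x) t(x) + A (1 - t(x)) with transmission t(x) = exp(-beta d(x)). The function name `synthesize_haze`, the scattering coefficient `beta`, the atmospheric light `airlight`, and the availability of a per-pixel depth map are all assumptions made for illustration.

```python
import math
from typing import List

Image = List[List[float]]  # single-channel image with values in [0, 1]


def synthesize_haze(clear: Image, depth: Image,
                    beta: float = 1.0, airlight: float = 0.8) -> Image:
    """Apply I = J * t + A * (1 - t) with t = exp(-beta * d) per pixel.

    beta and airlight are hypothetical defaults, not values from the paper.
    """
    hazy = []
    for row_j, row_d in zip(clear, depth):
        out_row = []
        for j, d in zip(row_j, row_d):
            t = math.exp(-beta * d)  # transmission decays with depth
            out_row.append(j * t + airlight * (1.0 - t))
        hazy.append(out_row)
    return hazy


# A pixel at depth 0 is unchanged; a very distant pixel tends to the airlight.
clear = [[0.2, 0.6]]
depth = [[0.0, 50.0]]
hazy = synthesize_haze(clear, depth)
```

Under this model, distant pixels lose contrast first, which is consistent with the observation that small or remote objects are the hardest to detect in hazy scenes.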
Furthermore, the overall detection result of a particular scene is shown in Fig. 10(a), while Fig. 10(b)-(e) illustrates the most representative perspectives of the dehazing effect. In Fig. 10(b) and (c), it can be seen that dehazing improves the detection rate, as an additional car instance is detected in the dehazed condition compared to the hazy condition. In Fig. 10(d), the roadblock is mistakenly identified as a car in the hazy condition, but the dehazing method is able to remove this error. In Fig. 10(e), another car instance is shown before and after dehazing the synthetic hazy image.
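The missed car and the roadblock mistaken for a car correspond to a false negative and a false positive under the IoU = 0.5 criterion. The sketch below illustrates that criterion only; it is not the full mAP@0.5 computation and not the YOLOv5 evaluation code. Boxes are assumed to be `(x1, y1, x2, y2)` tuples for a single class, predictions are assumed to be sorted by descending confidence, and the names `iou` and `count_matches` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_matches(preds: List[Box], truths: List[Box],
                  thr: float = 0.5) -> Tuple[int, int, int]:
    """Greedily match predictions to ground truth.

    Returns (true positives, false positives, false negatives).
    """
    unmatched = list(truths)
    tp = fp = 0
    for p in preds:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= thr:
            unmatched.remove(best)  # each ground-truth box matches once
            tp += 1
        else:
            fp += 1
    return tp, fp, len(unmatched)


truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(1, 1, 10, 10), (50, 50, 60, 60)]  # one good hit, one spurious box
result = count_matches(preds, truths)
print(result)  # (1, 1, 1)
```

Here the first prediction overlaps its ground-truth box with IoU 0.81 and counts as a true positive, the second matches nothing and counts as a false positive, and the second ground-truth box remains undetected.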
In Fig. 11(a), we show the overall remote sensing object detection results from the perspective of a drone in a particular scene. In the images captured by the drone, the types of land cover are more complex and the objects are smaller than in the driving perspective from DAIR-V2X. Fig. 11(b)-(e) illustrates the difficulties object detection methods encounter when detecting smaller pedestrian instances, especially in hazy conditions. However, dehazing methods can partially address this issue and enhance the detection rate of small objects like pedestrians, as shown in Fig. 11(c) and (d). In Fig. 11(b) and (e), two additional pedestrian instances are detected after dehazing compared to the original conditions, similar to the case in Fig. 10(e). The experimental results show that haze can have unpredictable effects relative to normal conditions, and that, to some extent, the output of our method can represent high-level semantic information even better than the ground-truth clear images.

V. CONCLUSION
In this article, we propose a novel end-to-end model called LFD-Net for remote sensing image dehazing. As a preprocessing step for downstream vision tasks, it not only ensures the effectiveness and efficiency required for real-time applications, but also outperforms SOTA methods in terms of region balance and color fidelity. By designing this framework, we demonstrate the potential of CNN-based networks by performing two-order spatial interaction. Specifically, we show that the capabilities of deep neural networks can be enhanced not only by adding more complex modules to make them deeper, but also by effectively combining individual and natural feature extraction, fusion, and attention with feature interaction strategies, particularly in the field of image superresolution. The experiments on various scenarios also show that the performance of a model is not always proportional to the number of parameters, and that fewer parameters may, to some extent, help mitigate overfitting, which might be useful for future network design.

Fig. 2. Architecture of the lightweight feature-interaction dehazing network. The reformulated ASM generates an explicit output by substituting the evaluated K value. The network primarily consists of convolutional layers and concatenation layers, with the use of elementwise product in the gated fusion module and attention mechanism.

Fig. 9. Visualization results of the changes in layers before and after the gated fusion module. (a) The output of the convolutional operation Conv 4 in the gated fusion module, consisting of three feature maps. (b) The 15th feature map of each level of Concat 1 and the output of the gated fusion module. (c) The 30th feature map of each level of Concat 1 and the output of the gated fusion module. (d) The 31st feature map of each level of Concat 1 and the output of the gated fusion module. (b) shows that the contrast of the image is enhanced, resulting in distant objects becoming more distinct. (c) and (d) show more abstract feature representations, which are significantly shifted compared to the input. Specifically, (c) emphasizes the outline of substances, while (d) highlights the blocks within substances.

Fig. 10. Reference object detection results. (a) Comparison of object detection results under ordinary, simulated hazy, and dehazed conditions. (b)-(e) Detailed subscenes of detection results under different conditions. (b) and (c) demonstrate an improvement in the detection rate, detecting an additional car instance in the dehazed condition compared to the hazy condition. (d) corrects the error of mistaking a roadblock for a car in the hazy condition. (e) shows the detection of another car compared to the ground-truth clear image.

Fig. 11. Reference remote sensing object detection results. (a) Comparison of remote sensing object detection results under ordinary, simulated hazy, and dehazed conditions. (b)-(e) Detailed subscenes of detection results under different conditions, in which the detection rate for pedestrians is enhanced to a large extent. In particular, (b) and (e) highlight instances of pedestrians that are not detected in the ordinary conditions but are detected after dehazing, similar to the results from DAIR-V2X.

TABLE I. DETAILS OF THE LFD-NET ARCHITECTURE

TABLE II. AVERAGE COMPARISON OF METRICS ON SOTS FOR 492 JPG IMAGES

TABLE III. AVERAGE COMPARISON OF METRICS ON O-HAZE FOR 45 JPG IMAGES

TABLE IV. AVERAGE COMPARISON OF METRICS ON RICE1 FOR 500 PNG IMAGES