Adaptive Dual-Stream Sparse Transformer Network for Salient Object Detection in Optical Remote Sensing Images

Convolutional neural networks (CNNs) have demonstrated excellent performance in salient object detection for optical remote sensing images (ORSI-SOD). However, because CNNs extract features with a sliding window, they struggle to capture global representations. Therefore, an end-to-end detection model assisted by the vision transformer, known as the adaptive dual-stream sparse transformer network (ADSTNet), is proposed for ORSI-SOD. It effectively addresses the compensation issue of global and local information in ORSI-SOD. In particular, an adaptive interaction encoder has been devised, amalgamating the multiscale sparse transformer and the pyramid atrous attention to constitute the adaptive dual-stream sparse encoder. This encoder collaborates with the CNN to enhance long-range dependency modeling and preserve global information more effectively based on local features. In addition, a directional feature reconfiguration is constructed to extract texture details from multiple directional dimensions. Finally, we propose the adaptive feature cascade decoder that synthesizes content information from the foreground, edges, and background to enhance the representational capacity of the image. Furthermore, a structural loss function, known as the weight compensation mechanism, is introduced to balance the performance of boundary and saliency map (salmap) segmentation losses. Extensive experiments show that the proposed model outperforms 26 state-of-the-art ORSI-SOD methods across eight evaluation metrics on two standard datasets. Furthermore, to verify its robustness, the generalization performance of the model on the latest challenging ORSI-4199 dataset is reported.

The emergence of traditional methods and CNNs has stimulated the development of ORSI-SOD [36], [37], [38]. Traditional methods for SOD [26], [39], [40] often rely on low-level attributes such as color information content [41] and saliency feature analysis [42]. However, they fail to generate accurate information representations for some deep and low-level features. In contrast, CNNs can automatically learn features from large-scale data, exhibiting stronger adaptability to complex scenes and noise. MCCNet [43] utilizes multiple content feature information for complementarity, and ACCoNet [44] employs multiscale information interaction. CNNs are more adept at extracting local region features. As shown in Fig. 1, the local attention of a standard CNN structure tends to focus on neighboring features around a key point, making it difficult to capture global representations. In response to this issue, a number of CNN-based methods [16], [45], [46], [47], [49] have been proposed to capture a wide receptive field by utilizing deeper network architectures. They also explore global cues through different techniques such as global pooling or nonlocal modules. However, the adoption of deeper network layers unavoidably incurs considerable computational overhead, while maintaining the standard structure of deep neural networks can pose challenges in achieving long-range dependencies.
Therefore, we believe that a framework capable of capturing global information can also advance ORSI-SOD. VST [50] was a pioneer in introducing the transformer to SOD, replacing conventional CNN models with self-attention mechanisms to explore global information. Subsequently, ASTT [51] designed the adaptive spatial tokenization module to mitigate the impact of optical image features on SOD and also employed the transformer to explore global information. These works have demonstrated the necessity of replacing CNNs with transformer architectures to explore global information in ORSI-SOD. Moreover, various transformer variants have been developed by researchers for other domains of SOD, resulting in substantial advances in RGB SOD [52], RGB-D/T SOD [53], [54], and Video SOD (VSOD) [55]. However, as shown in Fig. 1, transformers cannot extract local information as effectively as CNNs. This is because they lack the inductive biases of CNNs, which leads to a degradation in performance.
Inspired by CNN- and Transformer-based approaches in SOD, it is worthwhile to explore the fusion of these two methods to maximize the representational capability of SOD techniques. However, objects in optical images captured from high altitudes are typically characterized by small and varying scales, so directly applying natural scene image SOD (NSI-SOD) methods to ORSI-SOD faces significant limitations and results in unsatisfactory performance. In addition, incorporating boundary information as supervision can compel the network to learn more accurate pixel-level edge information, which is crucial in ORSI-SOD. Currently, there is no comprehensive CNN and transformer fusion architecture that is suitable for ORSI-SOD and also incorporates boundary supervision.
In this regard, we propose the adaptive dual-stream sparse transformer network (ADSTNet), which effectively combines features at different levels from both local and global perspectives and achieves precise detection and localization in ORSI-SOD through boundary-guided assistance. The encoding stage employs a sparse framework that balances the encoding of local region information and global object relationships. Specifically, one branch of the encoder extracts spatial features using a CNN, and these are combined with the global dependencies established by the adaptive dual-stream sparse encoder (ADSE); the two interact to alleviate discrepancies. Features at different levels are adaptively captured by this encoder, thereby enhancing the interpretability of the model results. To acquire more accurate representations of salient object boundary features in ORSIs, we introduce a directional feature reconfiguration (DFR) as a plug-and-play component that enhances boundary information. It is noteworthy that, to the best of our knowledge, this is the first attempt to apply dedicated boundary detection operators to the ORSI-SOD task.
Furthermore, we propose an adaptive feature cascade decoder (AFCD) to guide the decoder learning process using boundary masks as explicit supervision. We construct a comprehensive loss function that balances boundary loss and saliency map loss through a weight compensation mechanism, further improving the accuracy and robustness of ORSI-SOD. In this manner, ADSTNet achieves the best results compared to 26 state-of-the-art (SOTA) models, with optimal performance in terms of S-measure, adaptive F-measure, adaptive E-measure, and other evaluation metrics.
Our contributions can be summarized as follows.

II. RELATED WORK
In this section, we begin by summarizing the work on ORSI-SOD. Subsequently, a concise overview of the advancements in vision transformer and the utilization of transformers in SOD is provided. Lastly, we elucidate the boundary detection operators employed in image processing.

A. Salient Object Detection for ORSI
Recently, extensive research has been conducted by scholars to address the various challenges faced by the emerging task of ORSI-SOD within the SOD community. Among these efforts, the CNN-based encoder-decoder structures have gained significant popularity [48], [57], [58], [59]. Hou et al. [60] introduced deep supervision into SOD, implicitly enhancing the multiscale feature representation of salient objects. The implementation of this approach significantly enhances detection accuracy and has had a profound impact on subsequent CNN-based methods. Li et al. [61] extracted feature information from images at three different resolutions to optimize the detection drawbacks caused by varying object scales. Bai et al. [62] proposed a global-local-global context-aware network to obtain the final comprehensive representation of salient objects in a spatially and semantically global manner. Furthermore, cross-scale interaction is achieved through an enlarged receptive field in the network proposed by Zheng et al. in [37], which utilizes dilated convolutions and attention mechanisms to capture potential fine-grained information.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
In addition, some works have drawn inspiration from NSI-SOD and proposed strategies to incorporate contextual and boundary information to adapt to ORSI-SOD features, alleviating challenges in ORSI. For instance, contextual features are fused to enhance the encoded representations [44], [63]; the contributions of boundaries, foreground, and background to global information are explored through the complementary integration of multiple content information [43]; and additional edge labels are utilized to improve the model's boundary perception capability [47], [64], [65]. Despite the foundation laid by CNN-based ORSI-SOD models in improving performance, their performance is limited by the constrained long-range semantic contextual relationships of CNNs, as convolutions have a limited receptive field. To address this limitation, we propose a sparse transformer-assisted dual-stream encoder that enhances the global perception of local features and captures local details of global representation, thus compensating for the deficiency of CNNs in capturing global information.

B. Vision Transformer
While CNNs [66] have demonstrated excellent performance in visual tasks [67], [68], [69], [70], they are still constrained by their strategy of gradually expanding the receptive field through local window movements, which hinders effective modeling of long-distance relationships [72], [73], [74]. In a parallel field, natural language processing, another popular technique called the transformer has emerged. The transformer leverages its self-attention mechanism to capture extensive global relationships and has achieved notable success. Recognizing the significance of global information in visual tasks, researchers have introduced transformers into image processing to overcome the limitations of CNNs, thereby mitigating the risks associated with compromising feature resolution and representational capacity. The effective integration of both aspects can significantly address the challenges associated with a singular focus. Some works [75], [76], [77] proposed to linearly combine CNN and Transformer to achieve the combination of local mechanism and dynamic attention. The works in [78], [79] proposed a dual-stream network based on CNN and Transformer to fully explore the representation ability of local and global pattern features in image classification. In addition, the authors in [80] constructed a dual-transformer with two parallel pathways, integrating pixel pathways and semantic pathways to enhance self-attention. In contrast, the authors in [72] built an interactive structure to achieve information exchange and joint feature learning between CNN and Transformer, fully learning the relationships between different positions.
The accomplishments of vision transformers in NSI-SOD have also been showcased for ORSI-SOD. For instance, a pioneering study by Liu et al. [50] presented a unified RGB and RGB-D SOD model based on a vision transformer, achieving saliency and boundary detection by introducing task-specific labels. Wang et al. [52] proposed a transformer architecture consisting of an FCN decoder and three additional modules to capture salient local and global information in RGB images. The interplay of information from different modalities facilitates the learning of deeper information representations by the network model. To fully exploit the essence of different modalities, Liu et al. [53] introduced a dual-stream Swin Transformer equipped with spatial alignment and channel calibration modules, effectively integrating multimodal information and aggregating intralayer features. A transformer-based model was proposed by Zhang et al. [81] to capture implicit details and create the challenging RGBD COSAL1K dataset. This model incorporates two class labels to extract intrasaliency and intersaliency information, respectively. Moreover, a cross-reference transformer model that integrates appearance and motion cues from VSOD was presented by Huang et al. [55]. Furthermore, Zhang et al. [82] presented a transformer-guided dual-stream structure that enhances information features through cascading. However, these methods are limited by a large number of parameters, making optimization challenging. In this article, we suggest utilizing a sparse transformer-assisted CNN to acquire global information while reducing noise caused by irrelevant information, thereby enhancing foreground-background discrimination.

C. Operators in Image Processing
In the field of digital image processing, operators play a crucial role as fundamental components. Among them, boundary detection operators, as one of the most central elements, have garnered widespread attention and research. Two types of commonly used edge detection operators exist, namely: 1) first-order derivative operators; and 2) second-order derivative operators. The first-order derivative operators include Roberts, Prewitt, and Sobel, while the second-order derivative operators include Laplacian [83]. In recent years, boundary detection operators have regained importance in pixel-level computer vision tasks, such as camouflage object detection [3], [84], manipulation detection [85], and MISEG [12], gaining wide applications and research interests. Within this article, we utilize edge detection operators to construct the AFCD as an explicit mask extractor. The purpose of this is to guide the implicit feature learning process in ORSI-SOD. Our research is the first to apply boundary detection operators in ORSI-SOD, synthesizing high-quality information predictions from the feature maps transmitted by the backbone encoder, resulting in satisfactory results. This work not only expands the application scope of edge detection operators, but also provides new insights and approaches for research in the field of ORSI-SOD.

III. PROPOSED METHOD
In the present section, the proposed ADSTNet network is introduced. Subsequently, each component is described in order, as presented in Sections III-B to III-D. Finally, the loss function, which balances the weighted compensation mechanism, is elucidated.

A. Overview of the Proposed Architecture
The proposed ADSTNet follows an encoder-decoder structure, and its main framework is illustrated in Fig. 2, consisting of three parts: adaptive interaction encoder, DFR, and cascade decoder. Firstly, the input image $I \in \mathbb{R}^{3 \times 256 \times 256}$ is fed into the stem to obtain initial fine features $f_{stem} \in \mathbb{R}^{64 \times 128 \times 128}$. Then, the adaptive interaction encoder is employed to capture more extensive local and global information. Here, the adaptive interaction encoder comprises two components: the well-known Res2Net serves as the CNN encoder, extracting multiscale and multilevel local features $f_{le}$, while the ADSE, composed of pyramid atrous attention (PAA) and multiscale sparse transformer (MST), is designed to extract more comprehensive global information $f_{ge}$. Specifically, Res2Net is divided into four stages, sequentially extracting information denoted as $f^{t}_{ce} \in \mathbb{R}^{h_t \times w_t \times c_t}$, where $h_t = 256/2^{t+1}$, $w_t = 256/2^{t+1}$, $c_t \in \{256, 512, 1024, 2048\}$, and $t \in \{1, 2, 3, 4\}$ is the stage index. Meanwhile, the features $f_{ADSE}$ generated by ADSE are also output to produce the primary saliency map of the ADSE under reverse supervision, forcing the learning of more accurate information and laying the foundation for subsequent feature refinement. The multilevel integration (MLI) eliminates semantic discrepancies between $f_{le}$ and $f_{ge}$ interactively, greatly enhancing the global perception ability of local features and the local details of global representations, and delivers rich high-level semantic information $f_e$ to the AFCD for feature parsing. Because the encoder captures multiscale information, each detection head includes a $1 \times 1$ convolutional layer and upsampling to restore the resolution and obtain saliency maps. In addition, to enhance the auxiliary function of boundaries, we employ a mixed loss function assisted by a weighted compensation mechanism to implicitly assign clear boundaries to the saliency maps, ensuring plausible predictions. We also show the inference details based on the proposed ADSTNet in Algorithm 1. Next, we will provide detailed explanations for each component.
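The stage resolutions stated above follow directly from the relation h_t = w_t = 256 / 2^(t+1). The short Python sketch below only tabulates these shapes; the function name `stage_shapes` is hypothetical and is not part of the authors' code.

```python
def stage_shapes(input_size=256, channels=(256, 512, 1024, 2048)):
    """Return (stage index t, height, width, channels) for the four CNN stages."""
    shapes = []
    for t, c in enumerate(channels, start=1):
        side = input_size // 2 ** (t + 1)  # h_t = w_t = input_size / 2^(t+1)
        shapes.append((t, side, side, c))
    return shapes


for t, h, w, c in stage_shapes():
    print(f"stage {t}: {h} x {w} x {c}")
```

For a 256 × 256 input this gives spatial sizes of 64, 32, 16, and 8 for stages 1 to 4, that is, each stage halves the resolution of the previous one, starting from half of the 128 × 128 stem output.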

B. Adaptive Dual-Stream Sparse Encoder
To address the difficulty that CNNs have in extracting global information, we propose an ADSE that complements global and local information, as shown in Fig. 3. Specifically, the features from the stem are concurrently fed into two branches, each compensating for the other's deficiencies and progressively enhancing the missing information, thereby yielding purified, high-quality, and effective features.
To achieve a balance between computational efficiency and global information, we construct a new mechanism, MST, for capturing global information in ADSE. Similar to ViT, the encoding layer consists of a multihead attention layer and a feed-forward network (FFN). However, the multihead attention enriches the object information by transforming the received Q, K, and V information through dimensionality changes. The transformed information is then fed into the attention module to compute correlation scores. Finally, all the results are summed and enhanced using convolution, batch normalization, and ReLU operations to improve the feature extraction capability for remote sensing images and enhance information accuracy. Specifically, to maximize pixel information, we partition the image $f_{stem}$ into $s \times s$ patches, where $s$ takes values of 4, 8, 16, and 32. As a result, the original image is reshaped into $HW/s^2$ patches of size $s \times s$, each of which is then flattened into a vector $v_i \in \mathbb{R}^{s^2 \times c}$. Subsequently, a linear projection is utilized to transform each patch vector into the embedding $e_i \in \mathbb{R}^{c}$ that encodes the patch representation. Finally, the patches and positional encodings are fed into the encoder to obtain the output.
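The multiscale patch partition described above can be illustrated with a small pure-Python sketch. It assumes a feature map stored as nested lists in H × W × C layout and uses a toy channel count; the helper name `partition_into_patches` is hypothetical, and the linear projection and positional encoding steps are omitted.

```python
def partition_into_patches(feature, s):
    """Split an H x W x C feature map (nested lists) into flattened s x s patches.

    Returns H*W / s^2 patches, each a list of s*s pixel vectors of length C.
    """
    height, width = len(feature), len(feature[0])
    assert height % s == 0 and width % s == 0, "s must divide H and W"
    patches = []
    for top in range(0, height, s):
        for left in range(0, width, s):
            patch = [feature[top + dy][left + dx]
                     for dy in range(s) for dx in range(s)]
            patches.append(patch)
    return patches


# Toy stand-in for f_stem: 128 x 128 spatial size, 2 channels instead of 64.
H = W = 128
f_stem = [[[float(y), float(x)] for x in range(W)] for y in range(H)]
for s in (4, 8, 16, 32):
    patches = partition_into_patches(f_stem, s)
    print(f"s={s}: {len(patches)} patches, {len(patches[0])} pixels per patch")
```

Larger values of s trade a shorter token sequence for coarser patches, which is the usual motivation for mixing several patch sizes.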
Furthermore, we introduce a sparse attention mechanism to reduce noise and additional computational overhead caused by irrelevant information. The sparse attention also aims to improve foreground-background discrimination and to alleviate the blurriness in the foreground edge regions that is caused by the naive ViT computing attention over all pixels. Inspired by [86], we construct a sparse multihead attention (SMAT) and apply self-attention across channels instead of spatial dimensions to reduce time and memory complexity. We compute the similarity between pairs of reshaped queries and keys, considering only the k most similar pixel values, which leads to a more concentrated foreground and more discriminative foreground edge regions. This can also find an approximate match for a particular region or object in the image. We then normalize the k largest pixels in each row of the similarity matrix using softmax, setting other elements to zero, as derived below, where $T_k(\cdot)$ is the learnable top-k selection operator. Finally, the similarity matrix is multiplied with the values matrix to obtain the final result. Here, k is an adjustable parameter that dynamically controls the level of sparsity. It is obtained through a weighted average of certain fractions, and we set it to [1/2, 3/4]. This dynamic selection enables attention to transition from dense to sparse, as derived below. The prepared information is fed into the residual FFN (RFFN) to complete the composition, achieving feature enhancement and extraction through linear projection. This reduces information loss in the sequence and effectively enhances semantic modeling capability. The aforementioned procedure can be delineated as follows, where $f_{MST}$ is the output of the MST, $\mathrm{SMAT}(\cdot)$ is the SMAT, and $\mathrm{RFFN}(\cdot)$ is the RFFN.
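To make the top-k selection concrete, the following pure-Python sketch implements channel-wise attention in which each row of the similarity matrix keeps only its largest entries before the softmax. It is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the scaling or temperature term is omitted, the per-ratio weights are fixed rather than learned, and the "weighted average of certain fractions" is interpreted here as a weighted sum of the outputs obtained with several keep ratios.

```python
import math


def topk_softmax_row(row, top_k):
    """Softmax over the top_k largest entries of row; all others are set to 0."""
    keep = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:top_k]
    peak = max(row[i] for i in keep)  # subtract the max for numerical stability
    exps = {i: math.exp(row[i] - peak) for i in keep}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(row))]


def sparse_channel_attention(q, k, v, keep_ratio):
    """Top-k attention across channels.

    q, k, v are C x N matrices (C channels, N flattened spatial positions);
    keep_ratio is the fraction of the C similarity scores kept in each row.
    """
    channels, positions = len(q), len(q[0])
    top_k = max(1, int(channels * keep_ratio))
    # C x C similarity between channel descriptors.
    scores = [[sum(q[i][n] * k[j][n] for n in range(positions))
               for j in range(channels)] for i in range(channels)]
    attn = [topk_softmax_row(row, top_k) for row in scores]
    # (C x C) @ (C x N) -> C x N
    return [[sum(attn[i][j] * v[j][n] for j in range(channels))
             for n in range(positions)] for i in range(channels)]


def multi_ratio_attention(q, k, v, ratios=(1 / 2, 2 / 3, 3 / 4), weights=None):
    """Weighted sum of sparse attention outputs for several keep ratios."""
    if weights is None:  # fixed uniform weights stand in for learnable ones
        weights = [1.0 / len(ratios)] * len(ratios)
    outputs = [sparse_channel_attention(q, k, v, r) for r in ratios]
    channels, positions = len(q), len(q[0])
    return [[sum(w * out[i][n] for w, out in zip(weights, outputs))
             for n in range(positions)] for i in range(channels)]
```

Because the similarity matrix is C × C instead of N × N, its size no longer grows with the image resolution, which is the complexity benefit the text attributes to attending across channels.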
To assist in the preliminary extraction of multilevel information by the Transformer, we design a PAA. While transformers may discount local information extraction, we believe that incorporating additional information supervision can guide the learning process. On the other hand, although conventional convolutions extract more detailed information, they come at the cost of enormous computational complexity, contradicting our initial goal of achieving a balance between accuracy and real-time performance. Therefore, in this study, we leverage the power of pyramid atrous convolutions with four different receptive field sizes ($r = 1, 3, 5, 7$). To further enhance performance, we introduce a $1 \times 1$ convolutional layer for feature smoothing. Then, we refine the features by summing up all the results and eliminating noise in the output. Finally, we combine the obtained features with the original information to facilitate information propagation. This process can be expressed as follows, where $f_{PAA}$ is the output of the PAA, $\mathrm{conv}_{1 \times 1}$ is a $1 \times 1$ convolutional layer, and $\mathrm{conv}_{r=1}$ is a convolution with the receptive field of 1.

Algorithm 1
Input: Optical RSI $I \in \mathbb{R}^{3 \times 256 \times 256}$, multiscale patches mp (32, 16, 8, 4) for MST
Output: Salmap $S \in \mathbb{R}^{1 \times 256 \times 256}$
1: // Step 1: Information coding in Adaptive Interaction Encoder
2: while k in (1/2, 2/3, 3/4) do
...
7: end while
...

C. Directional Feature Reconfiguration
To enhance the representation of boundary information, we propose a DFR, as illustrated in Fig. 4. It has been observed that salient images in remote sensing exhibit topological structures where objects such as buildings, ships, and airplanes are often arranged in a cluttered manner, deviating from horizontal or vertical orientations. Therefore, in addition to conventional horizontal and vertical boundary detection computations, we consider inclined boundaries to have significant influence on salient objects. Drawing inspiration from the utilization of the Sobel operator in traditional image processing, we design a dedicated gradient-based boundary detection operator to extract boundary information in four directions: $0^\circ$, $45^\circ$, $90^\circ$, and $135^\circ$. Specifically, we construct four convolution kernels of size $3 \times 3$ with fixed parameters and apply convolution operations with a stride of 1. These four convolutions are defined as follows, where $K_x$, $K_m$, $K_y$, and $K_n$ represent specialized operators for feature extraction along the horizontal, $45^\circ$ inclined, vertical, and $135^\circ$ inclined directions, respectively. The DFR is employed to obtain the boundary gradient maps by applying it to the output features of stage 1 and stage 4. The output of stage 1 exhibits finer features compared to the stem's output, devoid of rough information interference, while the features from stage 4 encompass rich semantic information of the overall image [64]. Subsequently, we apply four basic convolution groups to the input features to obtain the gradient map $G^{t}_{xymn}$ ($t = 1, 4$). Then, the features are smoothed using a $1 \times 1$ convolution and normalized through the sigmoid function to attenuate noise. Finally, the boundary-enhanced feature map is obtained by integrating the normalized features with the input features. The aforementioned procedure can be delineated as follows, where the product is element-wise and $\sigma$ represents the sigmoid operation. The $G^{t}_{xymn}$ is obtained by applying a specialized boundary detection operator on $f^{t}_{ce}$, where $G_x$, $G_y$, $G_m$, and $G_n$ are concatenated along the channel dimension. The boundary information obtained from $f^{t}_{ce}$ is carried by its boundary-enhanced counterpart. In particular, our initial step involves the application of a $1 \times 1$ convolution coupled with bilinear upsampling to the output of stage 4, thereby aligning its features with the dimensions of stage 1. Subsequently, we utilize distinct $1 \times 1$ convolutions to make the channel dimensions of the two feature maps uniform. This is followed by a pair of convolution layers that derive the final feature map. The aforementioned procedure can be delineated as follows, where the variable $\psi$ represents a convolution group consisting of $3 \times 3$ Conv, BatchNorm, and ReLU, and $f_{DFR}$ denotes the output of DFR. To mitigate the influence of internal edge noise, we utilize the boundary information generated from the ground truth (GT) saliency map as a supervisory signal, disregarding the interference from internal boundary information. In addition, we employ a weight compensation mechanism to enhance the supervision on boundary information, better serving the compensation of information in the decoder.
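The four-direction gradient extraction can be sketched with fixed 3 × 3 kernels. The kernel values below are the commonly used Sobel-style set for 0°, 45°, 90°, and 135°; they are an assumption for illustration, since the kernel definitions are not reproduced in this text, and the helper names are hypothetical. The 1 × 1 smoothing, sigmoid gating, and fusion of stage 1 and stage 4 are not modeled.

```python
# Assumed Sobel-style kernels; the paper's exact values may differ.
K_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # 0 degrees (horizontal)
K_M = [[0, 1, 2], [-1, 0, 1], [-2, -1, 0]]    # 45 degrees
K_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]    # 90 degrees (vertical)
K_N = [[-2, -1, 0], [-1, 0, 1], [0, 1, 2]]    # 135 degrees


def conv3x3(image, kernel):
    """Stride-1 3x3 cross-correlation with zero padding (same output size)."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[dy + 1][dx + 1] * image[yy][xx]
            out[y][x] = acc
    return out


def directional_gradients(image):
    """Return the four gradient maps in the order (G_x, G_m, G_y, G_n)."""
    return [conv3x3(image, kernel) for kernel in (K_X, K_M, K_Y, K_N)]


# A 5 x 5 single-channel map with a vertical edge between columns 1 and 2.
edge = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
g_x, g_m, g_y, g_n = directional_gradients(edge)
print(g_x[2][2], g_m[2][2], g_y[2][2], g_n[2][2])
```

On this vertical edge the horizontal kernel responds most strongly, the two diagonal kernels respond partially, and the vertical kernel gives no response at the center, which is why the inclined kernels add information for objects that are not axis-aligned.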

D. Adaptive Feature Cascade Decoder
The boundary features obtained from DFR are utilized as a valuable source of prior knowledge to boost the image representation capability of the encoder. We propose the AFCD as illustrated in Fig. 5. The integration of boundary features enables AFCD to employ a cascaded structure that enhances the representation of both foreground and background features. This, in turn, facilitates the complementary fusion of multiple contents within the image. Specifically, AFCD has three inputs: 1) the prior boundary knowledge extracted by DFR; 2) the multiscale features from the encoder; and 3) the features from the upper-level AFCD. Within AFCD, three separate pathways are implemented, each dedicated to strengthening the feature representation in edges, foreground, and background. With regard to the boundary information, a fusion of the prior knowledge and encoder features is performed to acquire boundary-enhanced information. This process can be succinctly expressed as follows, where $f_{edge}$ represents the output of fusing the boundary prior with the CNN-encoded information. For foreground information, $f_{edge}$ is aligned and fused with features $f_{AFCD}$ from the previous AFCD to strengthen the representation. Specifically, $f_{AFCD}$ is adjusted in scale using bilinear interpolation, followed by spatial attention (SA) [87] and channel attention (CA) [70] mechanisms and three convolutional layers for cascaded fusion, enabling complementary fusion of multiple contents and enhancing information representation. Simultaneously, the output $f_{AFCD}$ from the previous decoder is reshaped and passed through a sigmoid function to obtain background information. Subsequently, three convolutional layers with batch normalization and ReLU activation are applied to obtain optimized features. This process can be described as follows, where SA and CA denote spatial attention and channel attention, respectively, $\vartheta$ denotes the $3 \times 3$ Conv, and $f_{fg}$ and $f_{bg}$ denote the features of the foreground and background, respectively. Finally, the aforementioned results are summed together to obtain the final output $f_{AFCD}$, which includes foreground, background, edges, and features from the previous decoder.

E. Loss Function
Given that ADSTNet is a multitask model, addressing both interior and boundary segmentation, we introduce a comprehensive loss function to simultaneously optimize these two tasks. Moreover, a weight compensation mechanism is incorporated to facilitate effective feature learning. The interior segmentation loss combines the cross-entropy loss ($L_{CE}$) and the mean intersection-over-union loss ($L_{mIoU}$). This combination is expressed mathematically as follows, where $G_i$ and $S_i$ denote the GT and the predicted label for the $i$th pixel in the image, respectively, and the total number of pixels in the image is denoted by $N$. Because of the class imbalance between foreground and background pixels in boundary detection, we employ the Dice loss to improve training on highly imbalanced data. The Dice loss ($L_{Dice}$) is expressed as follows. In summary, our designed overall loss consists of the salmap loss ($L_{sal}$) and the boundary loss ($L_{bnd}$). It is crucial to note that, for the boundary detection loss, only the predictions generated by the DFR, which reconstructs directional features, are taken into account. On the other hand, for the primary image salmap loss, a deep supervision strategy is adopted to obtain predictions from decoder features at different levels. As a result, the total loss ($L$) is expressed as follows, where the weight factor $\gamma$ is introduced, with $\gamma = 3$ chosen to enhance the auxiliary role of the boundary and achieve dynamic balance, and $D$ is the number of AFCDs. A series of ablation experiments were carried out to investigate the optimal value of parameter $\gamma$.
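The loss terms named above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the paper's exact formulation: it uses the common soft forms of the cross-entropy, IoU, and Dice losses on flattened maps, adds the cross-entropy and IoU terms with equal weight, and combines them as a sum of saliency losses over the decoder outputs plus $\gamma$ times the boundary loss. All function names are hypothetical.

```python
import math

EPS = 1e-7  # numerical guard, an implementation detail of this sketch


def bce_loss(pred, gt):
    """Mean binary cross-entropy over flattened prediction and GT maps."""
    total = 0.0
    for s, g in zip(pred, gt):
        s = min(max(s, EPS), 1.0 - EPS)
        total -= g * math.log(s) + (1.0 - g) * math.log(1.0 - s)
    return total / len(pred)


def iou_loss(pred, gt):
    """Soft IoU loss: 1 - intersection / union."""
    inter = sum(s * g for s, g in zip(pred, gt))
    union = sum(s + g - s * g for s, g in zip(pred, gt))
    return 1.0 - (inter + EPS) / (union + EPS)


def dice_loss(pred, gt):
    """Soft Dice loss: 1 - 2 * intersection / (sum of both maps)."""
    inter = sum(s * g for s, g in zip(pred, gt))
    return 1.0 - (2.0 * inter + EPS) / (sum(pred) + sum(gt) + EPS)


def total_loss(sal_preds, sal_gt, bnd_pred, bnd_gt, gamma=3.0):
    """Deeply supervised saliency loss plus gamma-weighted boundary loss."""
    sal = sum(bce_loss(p, sal_gt) + iou_loss(p, sal_gt) for p in sal_preds)
    return sal + gamma * dice_loss(bnd_pred, bnd_gt)


gt = [1.0, 0.0, 1.0, 0.0]
print(total_loss([gt, gt], gt, [0.0, 0.0], [1.0, 1.0]))
```

With perfect saliency predictions and a completely missed boundary, the total is dominated by the boundary term scaled by $\gamma$, which shows how the weight raises the influence of boundary errors.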

IV. EXPERIMENTS
In this section, extensive experiments were conducted to evaluate the proposed ADSTNet. The datasets, evaluation criteria, and experimental settings are described in Section IV-A. A comprehensive comparison between our proposed model and all competing methods is presented in Section IV-B. Section IV-C provides ablation studies and related discussions. Lastly, an analysis of failure cases is performed.

A. Datasets and Implementation Details
1) Datasets: Our model was comprehensively evaluated on three datasets, namely ORSSD [16], EORSSD [17], and ORSI-4199 [65], to demonstrate its superiority. These datasets are annotated, which facilitates model training. ORSSD is the first publicly available dataset designed for investigating saliency detection performance, consisting of 600 training and 200 testing images. The dataset encompasses a wide variety of object scales, types, and backgrounds. EORSSD serves as a supplement to ORSSD, enhancing the diversity and complexity of the dataset with 1400 training and 600 testing images. ORSI-4199 is a more challenging dataset compared to ORSSD and EORSSD, comprising 2000 training and 2199 testing images. To better train the model, we applied data augmentation techniques inspired by methods such as [44], [88]. Specifically, we employed methods like mirror flipping and rotations at $90^\circ$, $180^\circ$, and $270^\circ$, resulting in augmented training sets of 4200, 9800, and 14000 images for ORSSD, EORSSD, and ORSI-4199, respectively.
2) Evaluation Criteria: To objectively evaluate the performance of all models, we employed eight quantitative analysis metrics, namely S-measure ($S_\alpha$, $\alpha = 0.5$) [89], max F-measure ($F^{max}_\beta$, with $\beta^2$ set to 0.3) [90], mean F-measure ($F^{mean}_\beta$), adaptive F-measure ($F^{adp}_\beta$), max E-measure ($E^{max}_\xi$) [91], mean E-measure ($E^{mean}_\xi$), adaptive E-measure ($E^{adp}_\xi$), and MAE ($\mathcal{M}$). Among these metrics, a smaller $\mathcal{M}$ value is preferred, while larger values are desirable for the other seven metrics. In addition, we utilized two qualitative indicators, namely the F-measure curve and the precision-recall (PR) curve, to visually illustrate the variations among the models through the tool [92]. A model whose F-measure curve approaches 1 performs better, and likewise, a PR curve that approaches the point (1, 1) represents optimal performance.
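Two of these metrics are simple enough to sketch directly. The Python below computes MAE and the F-measure with $\beta^2 = 0.3$ on flattened maps, following the usual SOD conventions: the max and mean variants sweep 256 thresholds, and the adaptive variant uses twice the mean saliency as its threshold. These conventions are assumptions about the evaluation tool [92], and the function names are hypothetical.

```python
def mae(pred, gt):
    """Mean absolute error between a saliency map and its GT (flattened)."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)


def f_measure(pred, gt, threshold, beta_sq=0.3):
    """F-measure of the map binarized at the given threshold."""
    binary = [1 if p >= threshold else 0 for p in pred]
    tp = sum(b * g for b, g in zip(binary, gt))
    precision = tp / sum(binary) if sum(binary) else 0.0
    recall = tp / sum(gt) if sum(gt) else 0.0
    denom = beta_sq * precision + recall
    return (1 + beta_sq) * precision * recall / denom if denom else 0.0


def max_and_mean_f(pred, gt, steps=256):
    """Max and mean F-measure over evenly spaced thresholds in [0, 1]."""
    scores = [f_measure(pred, gt, i / (steps - 1)) for i in range(steps)]
    return max(scores), sum(scores) / len(scores)


def adaptive_f(pred, gt):
    """F-measure at the adaptive threshold (twice the mean saliency, capped at 1)."""
    threshold = min(2.0 * sum(pred) / len(pred), 1.0)
    return f_measure(pred, gt, threshold)


pred = [0.9, 0.8, 0.2, 0.1]
gt = [1, 0, 1, 0]
print(mae(pred, gt), f_measure(pred, gt, 0.5), max_and_mean_f(pred, gt)[0])
```

Setting $\beta^2$ below 1 weights precision more heavily than recall, which is why a prediction that covers only the most confident salient pixel can still score a high max F-measure.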
3) Implementation Details: To maximize the performance of our model, we initialized the backbone with Res2Net [56] weights and resized each image to $256 \times 256$ as input. The evaluation of our model was conducted on an NVIDIA TITAN RTX GPU. We employed the Adam [93] optimizer with a learning rate of 1e-4 and a batch size of 8 on PyTorch [94]. The model was trained for 50 epochs, and the learning rate was reduced to 0.1 times its value every 30 epochs. To prevent exploding gradients during training, we implemented gradient clipping with a maximum norm of 0.5 using the clip gradient function of the optimizer. The saliency performance of the resulting models was subsequently assessed on the test sets of ORSSD [16], EORSSD [17], and ORSI-4199 [65].
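The schedule and clipping described above can be illustrated without any framework. The sketch below assumes a step decay (multiply by 0.1 every 30 epochs) and clipping by the global L2 norm with the same small stabilizing constant that PyTorch's `clip_grad_norm_` uses; whether the authors clip by norm or by value is not stated here, and both function names are hypothetical.

```python
import math


def learning_rate(epoch, base_lr=1e-4, decay=0.1, step=30):
    """Step decay: the rate is multiplied by `decay` every `step` epochs."""
    return base_lr * decay ** (epoch // step)


def clip_by_global_norm(grads, max_norm=0.5):
    """Rescale a flat list of gradients so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-6))
    return [g * scale for g in grads]


for epoch in (0, 29, 30, 49):
    print(epoch, learning_rate(epoch))
print(clip_by_global_norm([3.0, 4.0]))
```

Under this reading, a 50-epoch run uses 1e-4 for the first 30 epochs and 1e-5 for the remaining 20.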
1) Quantitative Comparison: The quantitative comparison results between ADSTNet and 26 other models on the EORSSD and ORSSD can be found in Table I. It is evident that ADSTNet demonstrates superior or competitive performance compared to other SOTA methods across all benchmark datasets. On the EORSSD, although our model falls behind ACCoNet in terms of $F^{max}_\beta$, it exhibits significant advantages in other metrics. Particularly noteworthy is the 0.0563 lead of ADSTNet over ACCoNet in $F^{adp}_\beta$. Similarly, when compared to DAFNet, which excels in $E^{max}_\xi$, our model shows room for improvement in $\mathcal{M}$ but achieves a comprehensive victory in other metrics, such as a 0.2105 improvement in $F^{adp}_\beta$ and a 0.1235 improvement in $E^{adp}_\xi$. Furthermore, when comparing ADSTNet to VST and ASTT, both utilizing the transformer framework, our model slightly lags behind ASTT in $\mathcal{M}$ but leads in the other metrics, which we consider an acceptable trade-off. Likewise, on the ORSSD, our model consistently maintains a top-three position in all comparative results. Specifically, compared to the second-best performing ACCoNet, our proposed model exhibits a marginal difference of 0.0058 in $S_\alpha$, but compensates significantly by leading with 0.0173 in $F^{adp}_\beta$. In addition, we present the comparative results of all methods on the PR curve and F-measure curve in Fig. 6. It is evident that our proposed model achieves satisfactory performance on the F-measure curve, both on the EORSSD and ORSSD, with curves closer to 1 compared to other models. On the PR curve, ADSTNet demonstrates a competitive performance with other comparative methods, but exhibits a surpassing trend, approaching the (1, 1) coordinate point in the later stages. Particularly on the EORSSD, ADSTNet stands out from its peers, and the gap between ADSTNet and the first-ranked ASTT on the ORSSD is also minimal.

TABLE II: ELEVEN REPRESENTATIVE MODELS FOR FURTHER EVALUATION ON THE ORSI-4199 DATASET
ADSTNet consistently demonstrates excellent performance on the ORSI-4199, mirroring its results on the EORSSD and ORSSD, as shown in Table II. Our model ranks first in five of the eight metrics, second in one metric, and third in two metrics. Specifically, compared with the highly competitive SUCA method, our model achieves parity in $E_\xi^{\text{mean}}$ and outperforms it by 0.0240 in $F_\beta^{\text{adp}}$, albeit with a slight lag of 0.0084 in $S_\alpha$. Moreover, our method remains competitive across all metrics. For instance, in $F_\beta^{\max}$ we achieve 0.8698 (ours) compared with 0.8560 (CorrNet), in $E_\xi^{\max}$ we attain 0.9433 (ours) compared with 0.9369 (GateNet), and in $M$ we obtain 0.0318 (ours) compared with 0.0357 (ERPNet). It is noteworthy that our model is the only approach to exceed 0.92 in $E_\xi^{\text{adp}}$.
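For readers less familiar with the reported metrics, the sketch below shows how two of them are commonly computed in the SOD literature: the mean absolute error $M$ and the F-measure at a single threshold. The definitions, including the weight β² = 0.3, are the conventional ones and are assumed here rather than taken from this section; the function names and the flat-list input format are hypothetical.

```python
def mae(pred, gt):
    """Mean absolute error between a saliency map and a binary ground truth.

    Both inputs are flat lists of equal length with values in [0, 1].
    """
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)


def f_measure(pred, gt, threshold=0.5, beta_sq=0.3):
    """F-measure of the saliency map binarized at a single threshold.

    beta_sq = 0.3 is the conventional weighting in SOD (an assumption here).
    """
    binary = [1 if p >= threshold else 0 for p in pred]
    true_pos = sum(1 for b, g in zip(binary, gt) if b == 1 and g == 1)
    if true_pos == 0:
        return 0.0                       # avoids division by zero
    precision = true_pos / sum(binary)   # fraction of predicted pixels that are correct
    recall = true_pos / sum(gt)          # fraction of salient pixels that are found
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


if __name__ == "__main__":
    prediction = [0.9, 0.8, 0.2, 0.1]
    ground_truth = [1, 0, 1, 0]
    print(mae(prediction, ground_truth))
    print(f_measure(prediction, ground_truth))
```

The maximum, mean, and adaptive variants differ only in how the threshold is chosen: sweeping all thresholds and taking the best or the average value, or deriving one threshold from the map itself.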

2) Computational Complexity Comparison:
The computational complexity of our method was evaluated from three perspectives: inference speed (I/O time), network parameters, and FLOPs. Table I reports the data acquired from the publicly available ORSI-SOD benchmark [43], [62], as well as from our own retraining efforts. We found that the majority of CNN-based techniques run in real time (at a rate of 25-30 fps). In contrast, our method reaches an inference speed of 39.5 fps on the ORSSD, EORSSD, and ORSI-4199. The number of network parameters is at a middle level. However, compared with the second-ranked ACCoNet, our network has far fewer parameters: 62.09 M (ours) versus 102.55 M (ACCoNet). ADSTNet is also competitive in FLOPs, with only 6.62 G separating it from CorrNet, which ranks first in FLOPs. Compared with SUCA, which ranks third on ORSI-4199, our model retains a significant advantage in speed, parameters, and FLOPs, with improvements of 15.5 fps, 55.62 M, and 29.68 G, respectively. Based on the quantitative and computational complexity comparisons above, it can be inferred that our method is highly competitive in both accuracy and efficiency.
3) Qualitative Comparison: As illustrated in Fig. 7, we present representative ORSI-SOD methods for each category, along with their corresponding timelines. Five different scenarios are compared, including morphologically regular buildings, elongated rivers with topographic structures, low-light conditions on complex shorelines, multiple tiny objects in complex backgrounds, and more challenging scenes.
Our model demonstrates satisfactory results in all five showcased scenarios. In the first scenario, our model captures detailed information in local regions and highlights salient objects more effectively than other models. In contrast, ASTT and MSCNet fail to accurately localize the overall contours of small buildings and misidentify regions in certain instances (1st and 2nd instances). Our model accurately discriminates the salient regions of the building and provides well-defined boundaries, which is a key advantage over other models (3rd instance).
In the second scenario, which involves elongated rivers with irregular topographic structures and varying background colors, most compared models produce incomplete detections and fail to capture global information or accurately represent the true width of the rivers. This deficiency, as seen in GateNet and DSG, can hinder subsequent processes in practical applications. In contrast, our model overcomes these challenges, producing saliency maps that closely resemble the GT, with clear and distinct boundary information.
In the third scenario, which features rich boundary information, some methods suffer from blurred boundaries and detection omissions due to interference between objects and surrounding scenes. Examples include ERPNet, CMC, PoolNet, PA-KRN, and HDCT. In addition, in the third example, where the islands have elongated extensions, some methods such as ASTT overlook this challenging and easily neglected information. In comparison, our proposed method overcomes these obstacles and achieves highly accurate detection.
The fourth scenario involves the detection of multiple tiny objects, a well-known challenge in remote sensing saliency detection where objects may be missed due to their small size. While some methods successfully detect all tiny objects, they make errors in boundary details. ASTT and MSCNet exhibit imperfect boundary information and produce blurry detections. In addition, ERPNet and PoolNet miss some small vehicles. In contrast, our model preserves regional information and accurately captures small objects in local regions of remote sensing images.
Lastly, the fifth scenario combines challenges from the previous four scenarios, incorporating various shapes of tiny objects and complex background noise. In traditional natural scene and remote sensing methods, such as CMC, HDCT, and DSG, there are instances where color-salient background regions are erroneously identified as salient objects, deviating significantly from our expectations. Moreover, CNN-based methods are prone to mistakenly classifying object parts as background due to color similarities with the surrounding environment. These methods also encounter difficulties in detecting adjacent objects, leading to adhesive detections, as observed in MSNet, PoolNet, and PA-KRN. In contrast, our model minimizes the influence of complex interference factors and achieves satisfactory results.
Overall, the proposed model exhibits high detection accuracy for multiple tiny objects, low-light interference, and topographic structures. It excels in boundary delineation and outperforms other methods in capturing global information.

4) Compare With SOTA Methods of Semantic Segmentation:
To better illustrate the difference between ORSI-SOD and semantic segmentation models, we retrained six SOTA semantic segmentation models on the EORSSD and ORSSD using the authors' specified parameters. These models include DeepLabV3+ [30], HRNet [31], PointRend [32], Segmenter [33], SegNeXt [34], and SAN [35]. As shown in Table III, although these six models demonstrated commendable performance on remote sensing datasets, a notable performance gap remains when compared with our proposed ADSTNet. Specifically, among the semantic segmentation models, SAN achieved the best results, with $S_\alpha$ scores of 0.9019 and 0.9176 on the two datasets, respectively. Even so, it trails ADSTNet by 0.0292 and 0.0203 on the EORSSD and ORSSD, respectively. Notably, on the ORSSD, ADSTNet broke through the 0.009 bottleneck in $M$, achieving a 55% gain over Segmenter.
In conclusion, the dissimilarities in object localization between the two tasks hinder the effective alignment of semantic segmentation models with salient object detection tasks. Future efforts will focus on synergizing these tasks to design a universal model that can unlock greater value.

C. Ablation Studies and Related Discussions
To evaluate the effectiveness and indispensability of our proposed global-local-boundary information compensation scheme and key modules, we conducted extensive ablation experiments on the EORSSD and ORSSD. To ensure fairness in the experiments, each variant was retrained under the consistent experimental settings described in Section IV-A.
1) Loss Ablation: The hyperparameter γ is incorporated into the coarse prediction branch, specifically the subitem $L_{Dice}$, of the overall loss function ($L$). In contrast to the conventional loss weight parameters in [65], [51], and [64], we set γ to be greater than 1, in line with our initial intention of constructing boundary detection. When applying the DFR for edge extraction, we found that it also detected internal boundaries within objects, which deviated from our expectations. Therefore, we introduced a weight compensation mechanism to amplify the importance of the boundary loss, forcing the DFR to primarily learn external boundary information and disregard internal fine details. In our investigation of γ ranging from 0 to 8, we observed that a value of 3 yielded the optimal performance, as shown in Fig. 8. We also conducted an additional experiment with γ set to 0.5: its performance on ORSSD followed the overall upward trend, but it exhibited a dramatic anomaly on EORSSD. This indirectly suggests that γ should not lie in the range below 1. Thus, setting γ to a value greater than 1 is deemed necessary.
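As an illustration of how such a weight enters the objective, the sketch below combines a binary cross-entropy term on the saliency map with a Dice term on the boundary prediction scaled by γ. The exact composition of $L$ is not given in this section, so the form `bce + gamma * dice`, the Dice variant with squared terms in the denominator, and all function names are assumptions made for illustration only.

```python
import math


def bce_loss(pred, gt, eps=1e-7):
    """Mean binary cross-entropy over flat lists; pred in (0, 1), gt in {0, 1}."""
    total = 0.0
    for p, g in zip(pred, gt):
        p = min(max(p, eps), 1.0 - eps)   # keep log() finite
        total += -(g * math.log(p) + (1 - g) * math.log(1.0 - p))
    return total / len(pred)


def dice_loss(pred, gt, eps=1e-7):
    """One common Dice loss form: 1 - 2*|P.G| / (|P|^2 + |G|^2)."""
    intersection = sum(p * g for p, g in zip(pred, gt))
    denominator = sum(p * p for p in pred) + sum(g * g for g in gt)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def total_loss(sal_pred, sal_gt, edge_pred, edge_gt, gamma=3.0):
    """Assumed objective: saliency BCE plus gamma-weighted boundary Dice.

    gamma = 3 is the best value reported in the ablation; gamma > 1 makes
    boundary errors count more than they would in an unweighted sum.
    """
    return bce_loss(sal_pred, sal_gt) + gamma * dice_loss(edge_pred, edge_gt)


if __name__ == "__main__":
    sal_pred, sal_gt = [0.9, 0.2, 0.7], [1, 0, 1]
    edge_pred, edge_gt = [0.6, 0.1, 0.4], [1, 0, 0]
    for gamma in (0.5, 1.0, 3.0):
        print(gamma, total_loss(sal_pred, sal_gt, edge_pred, edge_gt, gamma))
```

With the saliency term fixed, raising γ increases only the penalty for boundary mismatch, which is the effect the weight compensation mechanism relies on.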
2) Verification Process of the Individual Modules: To evaluate the effectiveness of the individual modules, we defined a simple U-shaped baseline network for the ablation experiments, with Res2Net [56] as the encoder and a decoder consisting of three consecutive convolutional layers. ADSE, DFR, and AFCD were verified with a controlled-variable method, in which only one component was modified at a time. The numerical results for different combinations of these modules are presented in Table IV. In addition, visualizations are provided in Fig. 9.
Our proposed model demonstrates comprehensive optimization from both quantitative and qualitative perspectives. In quantitative comparisons, as shown in Table IV, the baseline network achieved only 0.9277 in $S_\alpha$, 0.8981 in $F_\beta^{\max}$, 0.9724 in $E_\xi^{\max}$, and 0.0110 in $M$ on ORSSD. The addition of ADSE and DFR improved performance: $S_\alpha$ increased to 0.9353 and 0.9374, improvements of 0.0076 and 0.0097 over the baseline, respectively. While the improvements from individual components may appear marginal, the synergistic effects among these components yield a discernible enhancement. Specifically, on the EORSSD, the incorporation of ADSE alone results in a modest increase of only 0.003 in $F_\beta^{\max}$ compared with scenarios without it (No.2 and No.5). However, the inclusion of both ADSE and AFCD surpasses the sole presence of DFR, with a substantial improvement of 0.0113 in $F_\beta^{\max}$. Similarly, on the ORSSD, although the introduction of DFR alone leads to a marginal improvement of only 0.0002 in $M$, the collaborative action of all components improves $M$ by 0.0024. We also conducted an experiment in which we removed the ADSE while retaining DFR and AFCD (i.e., No.6 in Table IV). This ablation corresponds to using only the CNN as the encoder. With the removal of ADSE's encoding information, and thus with only locally acquired information, there is a notable [...] 3) versus 0.9753 (No.4) versus 0.9807. These findings underscore the pronounced synergy among the proposed components, demonstrating that their collective impact exerts a more substantial influence on network performance.
From a qualitative standpoint, the model incorporating all components demonstrated superiority over individual elements in object saliency, as illustrated in Fig. 9. The full model thus compensates for the limited performance of the individual components. In the first example, the variant with DFR achieved more accurate boundary delineation than the one without it. With the assistance of ADSE, the model captured more comprehensive global information, as observed in the second example. For irregular objects, especially those with sharp edges, such as the third example, the combined effect of boundary and global information resulted in a clearer overall object contour. In particular, the pairing of DFR with AFCD produces saliency maps with enhanced boundary clarity, surpassing the clarity achieved with individual components. When relying solely on ADSE and AFCD without DFR, inaccuracies arise in outlining object boundaries. This observation underscores the contribution of each component and highlights the substantial potential for performance enhancement through the collaborative interplay among these components.

3) Effectiveness of Each Component in ADSE:
To validate the role of the transformer in guided learning, ablation studies were conducted on each component of ADSE. As shown in Table V, although PAA can contribute to local information acquisition to some extent, its impact is limited by the information loss during the dilated convolution process. The addition of PAA resulted in only marginal performance improvements in $M$ for EORSSD (0.0003) and $S_\alpha$ for ORSSD (0.0015), while exhibiting weaker performance in other metrics. However, the deficiency of PAA was effectively compensated under the supervision of saliency maps. Notably, this compensation improved $F_\beta^{\max}$ by 0.0011 on EORSSD and $S_\alpha$ by 0.0055 on ORSSD. In the absence of the comprehensive global information provided by MST, the network's performance declines to varying degrees, as shown in Table V. For instance, on EORSSD, $F_\beta^{\max}$ is 0.8638 (No. 2) compared with 0.8676 (No. 3), and on ORSSD, $M$ is 0.0104 (No. 2) compared with 0.0088 (No. 3). Applying implicit supervision to the global information extracted by MST leads to performance enhancement, notably exemplified by a 0.0099 improvement in $E_\xi^{\max}$ on EORSSD. Conversely, when MST alone is omitted (i.e., No.5), $M$ on ORSSD degrades by 0.0017 relative to the complete ADSTNet with all components (i.e., No.6). These findings underscore the collaborative contribution of each component within ADSE. Furthermore, introducing saliency map supervision in the early stages proves to be a judicious and effective strategy, encouraging ADSE to assimilate more valuable information.
In addition, the role of each component was further demonstrated through visualization, which can be found in Fig. 10. MST has a constructive impact on global information capture; nevertheless, it falters in accurately localizing crucial objects. PAA, in turn, excels in delineating local object features. Introducing both MST and PAA enhances the model's capacity to capture comprehensive information. Specifically, the collaborative action of MST and PAA redirects attention from broad global contexts to essential objects, facilitating precise localization across all positions. The collective contribution of these components surpasses that of individual elements. However, the sparse nature of the components constrains detailed information, and implicit supervision is introduced to address this limitation. Under this guidance, MST and PAA are compelled to refine and acquire more precise information features, yielding more satisfactory results than in the previous settings. These visual cues confirm that the proposed components significantly enhance overall network performance, in agreement with the preceding data analyses, and reinforce the importance of each component working in cooperation.
4) Implication of the Number K: In the proposed SMAT architecture, the choice of the parameter k significantly influences overall performance, as illustrated in Fig. 11. Our findings underscore the importance of choosing k properly to regulate boundary sparsity. Specifically, when k takes a single fixed value, such as k = 1/2, or when k is not used at all (w/o Top-K), the overall network is highly sensitive to k, resulting in unsatisfactory performance. Consequently, we introduced a dynamic range to determine the optimal value of k, which outperforms a single fixed value. Throughout our experiments, we observed that an overly small k prevented the network from capturing comprehensive information, causing a notable performance decline. Conversely, when k was excessively large, the network incorporated irrelevant information and noise, which degraded performance. Through careful adjustment, we determined that optimal overall performance is achieved when k falls within the range [1/2, 3/4]. Within this range, the network achieved 0.0086 and 0.0065 in $M$ and superior results of 0.9709 and 0.9740 in $E_\xi^{\text{mean}}$. These experiments validate the efficacy of dynamically selecting k, which allows the network to adapt more flexibly to various scenarios and datasets and thereby achieve more robust and efficient salient object detection in optical remote sensing images.
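The role of k can be illustrated with a minimal sketch of top-k sparse attention for a single query, assuming that k denotes the fraction of attention scores that are kept before the softmax. This is a generic top-k sparsification and not necessarily the exact SMAT formulation; the function name `topk_sparse_softmax` is hypothetical.

```python
import math


def topk_sparse_softmax(scores, ratio):
    """Softmax over the largest ceil(ratio * n) scores; all others get weight 0.

    `scores` holds the raw attention scores of one query against n keys and
    `ratio` plays the role of k (for example 1/2 or 3/4).
    """
    n = len(scores)
    keep = max(1, math.ceil(ratio * n))
    # Indices of the `keep` largest scores (ties resolved by position).
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    kept = set(order[:keep])
    peak = max(scores[i] for i in kept)   # subtract the peak for numerical stability
    exps = [math.exp(scores[i] - peak) if i in kept else 0.0 for i in range(n)]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    scores = [2.0, 1.0, 0.0, -1.0]
    for ratio in (0.5, 0.75, 1.0):
        print(ratio, topk_sparse_softmax(scores, ratio))
```

A small ratio discards most keys and can miss useful context, whereas a ratio of 1 reduces to dense attention and lets low-scoring, potentially noisy keys contribute, which matches the trade-off described above.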

5) Different Directions Role in the DFR:
The DFR ablation study was conducted to further demonstrate the necessity of capturing boundary information in different directions. We split the directions into conventional horizontal and vertical orientations, as well as oblique orientations of 45° and 135°, while ensuring the completeness of information pairs. As shown in Table VI, without the assistance of boundary information in the $K_x K_y$ and $K_m K_n$ directions (i.e., No.1), the overall detection performance degraded to varying degrees. Specifically, compared with the complete DFR configuration, the absence of $K_x K_y$ boundary information led to a reduction of 0.0037 in $F_\beta^{\max}$ for EORSSD and 0.0094 for ORSSD. Similarly, $E_\xi^{\max}$ decreased by 0.0026 and 0.0084, respectively. However, there was still room for improvement compared with the variant that had $K_x K_y$ but lacked $K_m K_n$. Specifically, there was minimal difference in performance on EORSSD, but a notable gap of 0.0020 in $S_\alpha$ and 0.0034 in $F_\beta^{\max}$ on ORSSD. This indirectly confirms our observation that objects in remote sensing images are not strictly oriented horizontally. We further substantiated the importance of our proposed components through qualitative comparisons. In Fig. 12, we present three instances and annotate the conventional indications of different detection directions. Information from any of these directions contributes to the model's performance. In the first instance, the extraction of boundary information at 90° allows for a complete depiction of the object's overall contour in horizontal orientations. In addition, the boundary and contour information obtained under the joint effect of multidirectional boundary detection is superior to that obtained under the aforementioned single condition, particularly in the third scenario. These results validate that the proposed DFR contributes significantly to the detection performance of the overall model.
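To make the idea of multidirectional boundary extraction concrete, the sketch below applies four 3×3 gradient kernels to a small image. The kernels actually used as $K_x$, $K_y$, $K_m$, and $K_n$ are not given in this section, so standard Sobel kernels and their diagonal variants are used here as stand-ins; the dictionary keys and function names are hypothetical.

```python
# Stand-in kernels (assumption): Sobel operators and their diagonal variants.
KERNELS = {
    "x": [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]],        # intensity change along columns
    "y": [[-1, -2, -1], [0, 0, 0], [1, 2, 1]],        # intensity change along rows
    "diag_up": [[0, 1, 2], [-1, 0, 1], [-2, -1, 0]],   # change toward the top-right corner
    "diag_down": [[-2, -1, 0], [-1, 0, 1], [0, 1, 2]], # change toward the bottom-right corner
}


def correlate3x3(image, kernel):
    """Valid-mode 3x3 cross-correlation on a 2D list of numbers."""
    height, width = len(image), len(image[0])
    output = []
    for r in range(height - 2):
        row = []
        for c in range(width - 2):
            row.append(sum(kernel[i][j] * image[r + i][c + j]
                           for i in range(3) for j in range(3)))
        output.append(row)
    return output


def directional_edges(image):
    """Absolute response of every directional kernel, keyed by direction name."""
    return {name: [[abs(v) for v in row] for row in correlate3x3(image, kernel)]
            for name, kernel in KERNELS.items()}


if __name__ == "__main__":
    # A vertical step edge: dark on the left, bright on the right.
    step = [[0, 0, 1, 1] for _ in range(4)]
    for name, response in directional_edges(step).items():
        print(name, response)
```

For this step edge the "x" kernel responds strongly, the "y" kernel not at all, and the two diagonal kernels respond partially, which illustrates why obliquely oriented objects benefit from the additional directions.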
6) Impact of DFR Information Sources: To further validate the impact of DFR information sources on the aggregation of boundary information, a series of experiments was conducted to explore the optimal boundary information sources. As shown in Table VII, information from stages 1 to 4 was combined in various ways. The results indicate that, with only two information sources, performance is superior when either $f_{ce}^{1}$ or $f_{ce}^{4}$ is included, compared with scenarios where this information is lacking, such as No.1, No.2, No.3, and No.4. Specifically, when based on $f_{ce}^{3}$, adding $f_{ce}^{1}$ improved $S_\alpha$ by 0.0045 on EORSSD and ORSSD. Similarly, when based on $f_{ce}^{4}$, $F_\beta^{\text{mean}}$ increased by 0.0210 and 0.0093 on the two datasets, respectively. However, when three information sources [...] ($f_{ce}^{1}$, $f_{ce}^{2}$, $f_{ce}^{3}$, $f_{ce}^{4}$) for boundary information did not yield satisfactory performance in all ablation experiments. This further affirms that not all boundary information in the encoder is suitable for aggregating texture features, and that the effective combination of boundary information from lower levels and region information from higher levels significantly enhances network performance. In addition, we explored combining $f_{ce}^{1}$ and $f_{ce}^{4}$ in a manner similar to FPN [71] to assess overall network efficiency, as shown in Fig. 13. The model performs better on the benchmark datasets without FPN, with improvements of 0.02 and 0.0123 in $F_\beta^{\text{mean}}$ on EORSSD and ORSSD, respectively. In future work, we will continue to explore the boundary information interaction mode best suited to this network to ensure smooth information transfer.
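For reference, the FPN-like alternative evaluated in Fig. 13 can be pictured as a top-down fusion in which the coarse high-level map is upsampled to the resolution of the low-level map and added to it. The sketch below is a simplified illustration with single-channel maps and nearest-neighbor upsampling; a real FPN neck also applies lateral convolutions, which are omitted here, and the function names are hypothetical.

```python
def upsample_nearest(feature, factor):
    """Nearest-neighbor upsampling of a 2D list by an integer factor."""
    output = []
    for row in feature:
        wide = [value for value in row for _ in range(factor)]
        output.extend([list(wide) for _ in range(factor)])
    return output


def fpn_like_fuse(low_level, high_level):
    """Top-down fusion: upsample the coarse map and add it element-wise.

    `low_level` stands in for a fine map such as the stage-1 feature and
    `high_level` for a coarse map such as the stage-4 feature (assumption:
    the fine resolution is an integer multiple of the coarse one).
    """
    factor = len(low_level) // len(high_level)
    upsampled = upsample_nearest(high_level, factor)
    return [[a + b for a, b in zip(fine_row, coarse_row)]
            for fine_row, coarse_row in zip(low_level, upsampled)]


if __name__ == "__main__":
    fine = [[1] * 4 for _ in range(4)]
    coarse = [[1, 2], [3, 4]]
    for row in fpn_like_fuse(fine, coarse):
        print(row)
```

Because the upsampled coarse map is blocky, adding it directly can blur fine boundary cues from the low-level map, which is one plausible reading of why the FPN-like neck performs worse in this ablation.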

7) Influence of DFR With Plug-and-Play:
To further validate the portability of the proposed DFR module, we selected three ORSI-SOD network architectures that incorporate both boundary information and region features and whose source code is accessible. The results are presented in Table VIII, demonstrating that DFR exhibits favorable plug-and-play characteristics. Transferring DFR to MJRBM, ERPNet, and EMFINet led to performance improvements. On the EORSSD, $F_\beta^{\text{mean}}$ increased by 0.0089, 0.0079, and 0.0060, respectively. On the ORSSD, $E_\xi^{\text{mean}}$ improved by 0.0137, 0.0024, and 0.0058. Notably, with the support of DFR, EMFINet surpassed the 0.01 threshold in $M$ on the ORSSD, establishing itself as a frontrunner. In addition, we report the computational overhead introduced by transplanting the DFR module. While maintaining the original architecture, the parameter count increased by less than 1 M for all architectures. MJRBM experienced a modest increase of 0.15 M, attributed to the architectural similarity between ADSTNet and MJRBM, which facilitates the transplantation of the module. In conclusion, DFR exhibits plug-and-play characteristics and can contribute to performance enhancements in other SOTA networks.

TABLE IX ABLATION ANALYSIS FOR IMAGE SIZE

8) Effect of Image Size:
To further investigate the impact of remote sensing image dimensions on saliency detection models, we conducted extensive experiments to find the optimal training dimensions, aiming for enhanced model robustness. Initially, we analyzed the distribution of image sizes in the EORSSD and ORSSD, as illustrated in Fig. 14. Given that EORSSD is an expansion of the ORSSD, increasing from the original 800 images to 2000, the overall distribution is similar, with 256×256 being the most common size (1080 images), followed by 600×600 (1040 images). Taking this as a starting point, and following [48], [51], we conducted experiments on sizes neighboring 256×256, namely 224×224 and 288×288, as shown in Table IX. The 224×224 size performs the poorest, while 256×256 and 288×288 perform similarly, with 256×256 slightly better, achieving four top positions and three second positions in the experiments. Subsequently, experiments were conducted at 600×600, revealing that the increase in image size did not lead to a significant performance improvement. This is because upsampling the majority of images to 600×600 introduces noninherent information, which acts as noise and interferes with image information representation. Larger images transformed into smaller ones may lose information but retain intrinsic details, resulting in less interference with the information representation during training than the noise introduced by upsampling. This also explains why the performance of 288×288 is slightly worse than that of 256×256. In summary, setting the image size to 256×256 best unleashes the network's potential and demonstrates satisfactory robustness.

9) Effect of Data Augmentation:
To examine the effect of data augmentation on ADSTNet's performance, a series of experiments was designed to validate the efficacy of this augmentation strategy. ADSTNet was first trained on the original dataset, with the resulting performance metrics shown by the yellow bar in Fig. 15. Subsequently, data augmentation techniques were applied to expand the training set, with the results represented by the [...] across these datasets, adding robustness and reliability to the model. In summary, data augmentation proved to be a key factor, helping the model learn a more nuanced understanding of diverse object features.

10) Flexibility of Our Approach:
To substantiate the efficacy of our proposed method across diverse backbone networks, two variants, namely ADSTNet-VGG and ADSTNet-ResNet, were introduced. These variants used VGG [67] and ResNet [68] as encoding backbone networks, respectively, and were validated on the EORSSD and ORSSD. As presented in Table X, compared with our initial method employing the Res2Net backbone (ADSTNet-Res2Net), these two variants exhibited slightly inferior performance, indirectly indicating the superior feature encoding capabilities of Res2Net over ResNet and VGG. Notably, while ADSTNet-ResNet performed very similarly to ADSTNet-Res2Net, the comprehensive evaluation favored ADSTNet-Res2Net, which secured the top position in all five evaluated metrics. This underscores the advantages of Res2Net. In summary, ADSTNet adapts robustly to different backbone networks and is able to leverage the information obtained from various encoders.

D. Analysis of Failure Samples
As stated previously, the aim of this article is twofold: first, to propose a novel framework for ORSI-SOD that effectively combines global and local information extraction, and second, to enhance the contribution of features at different levels throughout the entire image and improve the delineation of salient object contours through multidirectional boundary assistance so as to achieve accurate localization. However, ADSTNet still faces certain limitations when confronted with challenging scenarios, as shown in Fig. 16. For instance, due to inherent biases, our model struggles to differentiate highly camouflaged noise information. In the first instance, the target object bears a striking resemblance to the surrounding distractors, and the connecting bridge seamlessly blends with the background color, resulting in minimal discernible changes. Although our model successfully detects the objects, it also exhibits instances of both false positives and false negatives, indicating the potential for further advancements in camouflage detection. In addition, effectively addressing complex shadows still poses a challenge for ADSTNet. Specifically, while the shadows on the fuselage can be effectively suppressed, those at the rear of the aircraft are inaccurately preserved in the second row of Fig. 16. Similarly, achieving comprehensive detection of salient objects still presents challenges. For instance, the bridge and winding river fail to be completely detected in the third and fourth rows of Fig. 16. This is primarily due to their close resemblance to the surrounding environment, making their complete detection a formidable task. These observations reflect the inherent challenges of the scene, which persist even in the latest network architectures combining transformer and CNN components, for instance, ASTT, ACCoNet, and CorrNet.

A. Effectiveness
The proposed ADSTNet in this article aims to acquire comprehensive image information and enhance information representation through progressive multilevel information interaction and boundary feature assistance, as confirmed by a series of experiments in Section IV. Such conceptualizations hold promise for bringing a more integrated detection framework to ORSI-SOD and are potentially transferable to other computer vision domains. However, there is still room for improvement in terms of real-time performance, particularly for deployment on airborne or spaceborne satellite equipment to realize greater value.

B. Hierarchical Information Adaptive Interaction
Acquiring more comprehensive information in the encoder is crucial, as it significantly affects what the decoder can achieve. This article introduces an adaptive interaction encoder that seeks to integrate local and global information complementarily. However, the current approach confines information matching solely to the terminal stage. Despite the supervision of matched information, certain limitations endure. Future work will concentrate on optimizing the fusion of the two kinds of information, aiming to effectively diminish noise interference and irrelevant features during the fusion process.

C. Boundary-Assisted Optimization
The introduced DFR demonstrates effective adaptability to the complex features of remote sensing information, establishing a robust foundation for the accurate localization of regional information. While DFR can distill multidirectional boundary information, this process is currently manually defined. In the future, inspiration from domain-adaptive principles could be considered to develop an automated boundary information extraction module, tailored for a broader range of complex scenarios, thereby assisting various visual downstream tasks.

VI. CONCLUSION
In this work, we address the issue of compensating global and local information in the encoder-decoder framework and propose a novel end-to-end SOD model called ADSTNet. In the encoder, we introduce an adaptive interaction encoder that combines CNN with ADSE, whose multiscale feature processing enables a more comprehensive understanding of image details and contextual information. Moreover, the sparse attention mechanism selectively focuses on critical features, so that the model's decision-making process can be interpreted with high accuracy. For the decoder, we propose an AFCD that dynamically adjusts and adapts the decoding process based on multipath input data. Furthermore, we introduce a plug-and-play DFR, which analyzes image information across multiple directional dimensions to extract distinctive features. This module assists the AFCD in compensating for diverse content information. To strengthen the learning ability of the model, we incorporate a structural loss function with a weight compensation mechanism. This loss function enhances the model's capacity to capture salient objects accurately. Comprehensive experiments on three benchmark datasets exhibit the advantages of our proposed model compared with 26 SOTA methods. Our model effectively combines global and local information, and the experiments confirm the effectiveness of each component. Despite the advantages of our approach, we plan to further refine our research by developing a lightweight transformer-based ORSI-SOD model. This model aims to enable practical deployment and achieve precise saliency detection, particularly in challenging scenarios such as camouflage situations.

Fig. 1. Visualization of the local attention of CNN, the global attention of Transformer, and the hybrid attention of the proposed ADSTNet in the feature space. Comparing these attention maps shows that the hybrid attention captures more comprehensive and accurate information. Best viewed in color.

1) We propose an encoder-decoder structure, namely ADSTNet, which combines the strengths of CNN and Transformer so that local and global information complement each other efficiently. With the support of boundary information, it achieves better feature extraction at different levels and enhances representation learning. In addition, we guide the model's loss through a weight compensation mechanism for implicitly supervised feature learning, thus improving the model's robustness.
2) We introduce the ADSE, which enhances the global perception of local features and the local details of global representations in the adaptive interaction encoding. We also propose the plug-and-play DFR, which strengthens the representation capability of boundary information using a dedicated boundary detection operator.
3) We design an AFCD that explicitly enhances the encoding feature representation through complementary learning from multiple contents. It effectively captures intraclass and interclass consistency within the feature space.

The rest of this article is organized as follows. Section II offers an overview of related research, while Section III provides a detailed description of the proposed model. Section IV reports on the comprehensive evaluation of our model, including ablation analyses and an analysis of failure cases. Finally, Section VI concludes this article.

Fig. 2. Overall architecture of the proposed ADSTNet, including the ADSE, DFR, and AFCD. The model's loss is guided by a weight compensation mechanism for implicitly supervised feature learning, which improves the model's robustness.

Fig. 8. Performance trends produced with different values of the hyperparameter $\gamma$. The curves of $S_\alpha$, $F_\beta^{\max}$, $E_\xi^{\max}$, and $M$ on (a) EORSSD and (b) ORSSD.
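For readers unfamiliar with the metrics named in this caption, the sketch below computes two of them in plain Python under the definitions commonly used in salient object detection: $M$ as the mean absolute error between a saliency map and its binary ground truth, and $F_\beta^{\max}$ as the maximum F-measure over binarization thresholds with $\beta^2 = 0.3$. These definitions, the threshold grid, and the function names `mae` and `max_f_measure` are assumptions for illustration; the exact evaluation protocol used for the reported numbers is not given in this section.

```python
def mae(pred, gt):
    """Mean absolute error between a saliency map in [0, 1] and a binary mask."""
    n = len(pred) * len(pred[0])
    return sum(abs(p - g) for pr, gr in zip(pred, gt) for p, g in zip(pr, gr)) / n


def max_f_measure(pred, gt, beta2=0.3, steps=255):
    """Maximum F-measure over evenly spaced binarization thresholds.

    beta2 is beta squared; 0.3 is the value conventionally used in SOD (assumed here).
    """
    best = 0.0
    for t in range(steps + 1):
        thr = t / steps
        tp = fp = fn = 0
        for pr, gr in zip(pred, gt):
            for p, g in zip(pr, gr):
                hit = p >= thr
                if hit and g == 1:
                    tp += 1
                elif hit:
                    fp += 1
                elif g == 1:
                    fn += 1
        if tp == 0:
            continue  # precision and recall are both zero; F-measure is zero
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall)
        best = max(best, f)
    return best


pred_map = [[0.9, 0.2], [0.8, 0.1]]
gt_mask = [[1, 0], [1, 0]]
m_score = mae(pred_map, gt_mask)
f_score = max_f_measure(pred_map, gt_mask)
```

Lower $M$ is better while higher $F_\beta^{\max}$ is better, which is why the curves for these two metrics move in opposite directions when a setting improves the model.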

Fig. 10. Visualization of the effectiveness of the components in ADSE. Best viewed in color.

Fig. 11. Ablation analysis of the impact of the number $k$ in SMAT on (a) EORSSD and (b) ORSSD.

Fig. 13. Ablation analysis for the information fusion of $f_{ce}^{1}$ and $f_{ce}^{2}$ through an FPN-like neck. (a) EORSSD. (b) ORSSD.

Fig. 14. Spatial analysis chart of image pixels on (a) EORSSD and (b) ORSSD. Please zoom in for a better view.

Fig. 15. Bar chart depicting comparative experiments with and without data augmentation on (a) EORSSD and (b) ORSSD.

Fig. 16. Failure cases encountered on EORSSD, particularly in challenging scenes.

Boundary extraction based on stage 1 and stage 4 of the CNN encoder.

TABLE IV. ABLATION STUDY ON EVALUATING THE INDIVIDUAL CONTRIBUTION OF EACH CONTENT IN ADSTNET

TABLE V. EFFECTIVE CONTRIBUTION OF COMPONENTS IN ADSE, INCLUDING MST, PAA, AND SUPERVISION (SV)

reduction in $M$. For instance, on EORSSD, it decreased from 0.0068 (No. 5) to 0.0065 (No. 6), and on ORSSD, it decreased from 0.0095 (No. 5) to 0.0086 (No. 6). Simultaneously, under the AFCD-based condition, combining the global information obtained by ADSE with CNN encoding information resulted in a notable enhancement of 0.0065 on EORSSD for $F_\beta^{\max}$. Furthermore, ADSTNet, which incorporates all components, achieved comprehensive superiority. When comparing ADSTNet to individual components for DFR, ADSE, and this amalgamation also achieved superior results on $F_\beta^{\max}$ in ORSSD: 0.9713 (No. 2) versus 0.9735 (No.

TABLE VI. NECESSITY OF CAPTURING BOUNDARY INFORMATION IN DIFFERENT DIRECTIONS

Fig. 12. Visual comparison results of capturing boundary information in different directions by DFR. Best viewed in color.

TABLE VIII. ABLATION ANALYSIS FOR THE TRANSPLANTABILITY PROPERTIES OF THE DFR