MPTC-FPN: A Multilayer Progressive FPN With Transformer-CNN Based Encoder for Salient Object Detection

Due to the development of Convolutional Neural Networks (CNNs), significant progress has been made in Salient Object Detection (SOD). However, CNN-based methods struggle to learn global context information. Recently, the rapid development of the vision transformer has provided a new perspective for improving the performance of salient object detection. Benefiting from its powerful global modeling capability, the transformer can supplement rich global contextual information. However, because it lacks the ability to learn local details, adopting the transformer alone as the encoder is suboptimal. Therefore, how to skillfully combine local details and global context information is crucial. We combine CNN and transformer to propose a Multilayer Progressive FPN with a Transformer-CNN Based Encoder for Salient Object Detection (MPTC-FPN). Like most previous methods, we adopt the FPN network as the basic structure. Unlike previous methods, however, we use six initial features before feature fusion instead of the traditional four or five. We use a low-level feature generation module (LFGM) to generate a lower-level feature that supplements local details. In addition, we propose a module to reduce the difference between features (DRM), making the features more conducive to fusion. On the basis of FPN, we add a large number of feature fusion nodes, which makes the process of feature fusion smoother. Moreover, we adjust the supervision strategy: we use multiple supervision points and adopt an appropriate weight distribution strategy among them. A series of comprehensive experiments demonstrates that our proposed method outperforms previous state-of-the-art methods on five datasets.


I. INTRODUCTION
Salient object detection has a wide range of applications, such as content-aware image editing [5] and robot navigation [6]. Traditional salient object detection (SOD) methods [8], [9], [10], [11], [12], [36] mostly rely on hand-crafted features, such as color contrast and boundary background priors. However, the lack of high-level semantic information limits their accuracy when generating saliency maps. In recent years, the rapid development of convolutional neural networks has injected new vitality into the field of salient object detection and has greatly improved its performance compared to traditional methods. Encoder-decoder network architectures dominate the field. These methods usually include two parts: an encoder and a decoder. The encoder usually uses a pre-trained convolutional neural network model, such as VGG [14] or ResNet [15], as the backbone network to extract features at different levels. Decoders are usually carefully designed by researchers spending plenty of effort.

On the basis of FPN, we add a large number of feature fusion nodes and adopt a layer-by-layer fusion method. The purpose is to reduce the span between features at different levels during the feature fusion stage, so that the process of feature fusion is smoother. Simultaneously, we use the proposed CAT module to fuse features at two different levels, and we reduce the number of channels between layers to spare computational resources. Because of the addition of a large number of feature fusion nodes, we have more supervision points to choose from than FPN. Therefore, we adopt a multi-supervision-point strategy and use an appropriate weight distribution strategy for supervised training, which further improves the accuracy of the final generated saliency map.

Our main contributions can be summarized as follows:

• A hybrid encoding method is adopted to combine transformer and CNN. The low-level feature generation module (LFGM) is used to generate a lower-level feature, while the transformer is used to capture long-range dependencies.

• Based on FPN, we propose a novel deep network structure called MPTC-FPN. The structure of MPTC-FPN is more suitable for multi-supervised strategies.

• A feature difference reduction module (DRM) is proposed to reduce the gap between different levels of features and make them more conducive to fusion.

• The CAT module is used for feature fusion, and a layer-by-layer progressive strategy is adopted during fusion. In addition, in the feature fusion stage, we continuously reduce the number of channels to save computing resources (an illustrative sketch of such a fusion node is given after this list).
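The paper does not spell out the internal design of the CAT module, so the following is only a sketch of what a two-input fusion node with channel reduction could look like; the concatenate-and-reduce structure, the class name CATFusion, and all layer choices are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CATFusion(nn.Module):
    """Illustrative two-input fusion node (the CAT module is not specified
    in detail in the paper; this concat-and-reduce form is an assumption)."""

    def __init__(self, high_ch, low_ch, out_ch):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(high_ch + low_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, high_feat, low_feat):
        # bring the coarser (higher-level) feature to the finer resolution
        high_feat = F.interpolate(high_feat, size=low_feat.shape[-2:],
                                  mode='bilinear', align_corners=False)
        # concatenate the two levels and reduce channels to save computation
        return self.reduce(torch.cat([high_feat, low_feat], dim=1))
```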

In this section, we will introduce some recent salient object detection methods and the application of the transformer in computer vision.

The vast majority of traditional salient object detection methods [8], [9], [10] rely on hand-crafted features. Zhao et al. [19] proposed a multi-context deep learning framework that uses both global and local context modeling.

Transformer [16] was first proposed in the field of natural language processing and applied to machine translation, where it achieved remarkable results on many tasks. Dosovitskiy et al. [17] first introduced the transformer to the field of computer vision and achieved state-of-the-art results on multiple standard image classification datasets. Compared with CNN-based methods, their proposed Vision Transformer (ViT) requires fewer computational resources [17]. Wang et al. [26] proposed the

In order to solve the above problems, we adopt the form

In this section, we will describe our proposed method. In the first part, we give an overview of the proposed network structure. In the second part, we will describe the encoder part in more detail, especially the hybrid encoding strategy adopted. In the third part, we will introduce the decoder part and the modules it uses in detail. In the fourth part, we will elaborate on the proposed DRM module.

In the fifth and final part, we give a brief introduction to the loss function we use. A more intuitive representation of the entire network is shown in Figure 1.

As we mentioned earlier, high-level features contain semantic information, which can precisely locate salient objects or regions. Low-level features are rich in local detail information, which can well complement the local details in the generated saliency map. In previous works, most methods used the five levels of features extracted from the backbone network, then fused them with a well-designed decoder, and finally generated a better saliency map. For various reasons, some methods abandon the first-level features and use only the remaining four levels of features extracted by the backbone network before performing feature fusion to generate the final result.

Our encoder structure differs from those of previous methods: we adopt a hybrid encoding method, which includes the five levels of features encoded by the transformer and one newly added lower-level feature. Transformers use self-attention to capture long-range dependencies in the data, which is important for capturing global contextual information. The Swin Transformer constructs hierarchical feature maps, and this hierarchical architecture reduces the computational complexity with respect to image size to linear, which greatly improves computational efficiency and allows it to serve as a general computer vision backbone. We choose Swin-B pre-trained on the ImageNet-1K dataset [29] as the backbone network. The input image size is 384 × 384, and the five levels of feature maps extracted by the backbone network are 96 × 96, 48 × 48, 24 × 24, 12 × 12, and 12 × 12, with 128, 256, 512, 1024, and 1024 channels, respectively. We label these five levels of feature maps as F2, F3, F4, F5, and F6. To refine the generated saliency map and supplement local spatial details, we introduce a lower-level feature. We adopt the low-level feature generation module (LFGM) to generate a feature map of size 192 × 192 with 64 channels. This lower-level feature contains a large amount of local detail information, which is complementary to the powerful global modeling ability of the transformer. This complementary form can not only accurately locate salient objects and regions but also supplement local spatial details during feature fusion, so higher-quality saliency maps can be generated, which greatly improves accuracy. We label this lower-level feature as F1.
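As a minimal sketch, the hybrid encoder can be assembled as follows. The Swin-B feature shapes match those listed above, but the internal design of LFGM (a small convolutional stem here), the way the backbone exposes F6, and all function and class names are assumptions.

```python
import torch
import torch.nn as nn

class LFGM(nn.Module):
    """Hypothetical low-level feature generation module.

    The paper only states that LFGM produces a 192x192 feature map with
    64 channels from a 384x384 input; this conv stem is an assumption.
    """
    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),  # 384 -> 192
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.stem(x)  # (B, 64, 192, 192)


def build_feature_pyramid(image, swin_backbone, lfgm):
    """Assemble the six initial features F1..F6 used before fusion.

    `swin_backbone` is assumed to expose five feature maps with the shapes
    reported in the paper (96/48/24/12/12 spatial size and
    128/256/512/1024/1024 channels for a 384x384 input); how F6 is obtained
    from Swin-B is not detailed in the text.
    """
    f2, f3, f4, f5, f6 = swin_backbone(image)
    f1 = lfgm(image)  # 192x192, 64 channels, supplements local details
    return [f1, f2, f3, f4, f5, f6]
```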
In the decoder part, we add a large number of feature fusion nodes and form a layer-by-layer progressive structure. Because of the introduction of feature F1 in the hybrid encoding, the depth of our network is further deepened: starting from the six feature nodes on the leftmost side of the network structure, features are fused progressively layer by layer.

In the DRM, the input feature is upsampled and downsampled so that, together with the original, there are features at three different sizes. We then use an atrous convolutional layer to process the upsampled feature and an asymmetric convolutional layer to process the downsampled feature. For the original-size feature, we use an ordinary convolution. Then we restore the previously upsampled and downsampled features to their original size using downsampling and upsampling, respectively.
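A minimal sketch of the DRM processing described above follows. The kernel sizes, the dilation rate, and the final fusion of the three branches (an element-wise sum here) are assumptions, since only the overall procedure is described in the text.

```python
import torch.nn as nn
import torch.nn.functional as F

class DRM(nn.Module):
    """Sketch of the feature difference reduction module (DRM).

    The input feature is rescaled to three sizes; the upsampled branch is
    processed with an atrous convolution, the downsampled branch with an
    asymmetric convolution, and the original branch with a plain convolution.
    The rescaled branches are then restored to the original size. The branch
    fusion (summation) and all hyper-parameters are assumptions.
    """
    def __init__(self, channels):
        super().__init__()
        self.atrous = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)
        # asymmetric (factorized) convolution: 1x3 followed by 3x1
        self.asym = nn.Sequential(
            nn.Conv2d(channels, channels, (1, 3), padding=(0, 1)),
            nn.Conv2d(channels, channels, (3, 1), padding=(1, 0)),
        )
        self.plain = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        up = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
        down = F.interpolate(x, scale_factor=0.5, mode='bilinear', align_corners=False)

        up = self.atrous(up)      # atrous conv on the upsampled feature
        down = self.asym(down)    # asymmetric conv on the downsampled feature
        orig = self.plain(x)      # plain conv on the original-size feature

        # restore the rescaled branches to the original resolution
        up = F.interpolate(up, size=(h, w), mode='bilinear', align_corners=False)
        down = F.interpolate(down, size=(h, w), mode='bilinear', align_corners=False)
        return up + down + orig   # assumed fusion by element-wise sum
```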
The final loss function is defined as L_total, the overall loss over all supervision points. We adjust the weight ratio between the different supervision points.
In our paper, we set α to 0.6. The binary cross-entropy loss can be formulated as

L_bce = −(1/N) Σ_{i=1}^{N} [ SM_gt(i) · log(SM_pred(i)) + (1 − SM_gt(i)) · log(1 − SM_pred(i)) ],

where SM_pred and SM_gt represent the generated saliency map and the ground truth, respectively, and N is the number of pixels. Our goal is to reduce the loss function as the number of training epochs increases.
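Since the exact form of L_total is not reproduced above, the sketch below shows one plausible weighting of the supervision points, assuming the final prediction receives weight α and the auxiliary supervision points share the remaining weight equally; only the use of binary cross-entropy, multiple supervision points, and α = 0.6 are stated in the text.

```python
import torch.nn.functional as F

def total_loss(side_outputs, final_output, gt, alpha=0.6):
    """Hypothetical multi-supervision loss.

    `final_output` is the saliency map from the last decoder node and
    `side_outputs` are the maps from the additional supervision points.
    The exact weighting used in the paper is not reproduced here; this
    sketch gives the final prediction weight `alpha` and splits the
    remaining (1 - alpha) evenly over the auxiliary outputs.
    """
    loss = alpha * F.binary_cross_entropy_with_logits(final_output, gt)
    if side_outputs:
        aux_w = (1.0 - alpha) / len(side_outputs)
        for out in side_outputs:
            # auxiliary maps may be at lower resolution; match the ground truth
            out = F.interpolate(out, size=gt.shape[-2:], mode='bilinear',
                                align_corners=False)
            loss = loss + aux_w * F.binary_cross_entropy_with_logits(out, gt)
    return loss
```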

In this section, we will describe the content of five subsections: experimental details, selection of datasets, evaluation metrics, performance comparisons, and ablation studies. We demonstrate the superiority of our proposed method by comparing it with previous state-of-the-art methods. In addition, we conduct a series of ablation experiments to explore the impact of each module or strategy used in our proposed MPTC-FPN on the experimental results. For the training process of the model, we also plot the convergence curve of the loss function, as shown in Figure 3.

The proposed approach is implemented in PyTorch. The SGD optimizer [32] with a weight decay of 5e-4 and a momentum of 0.9 is adopted to optimize the network. We use a warm-up strategy with 6 warm-up epochs. Meanwhile, the poly policy is adopted to adjust the learning rate.
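As a concrete illustration, the schedule below implements a linear warm-up for the first 6 epochs followed by a poly decay. The poly power of 0.9 and the linear warm-up shape are assumptions; only "warm up for 6 epochs" and the use of the poly policy are stated.

```python
def poly_lr_with_warmup(base_lr, epoch, max_epoch, warmup_epochs=6, power=0.9):
    """Sketch of the warm-up + poly learning-rate schedule.

    Linearly ramps the learning rate during the warm-up epochs and then
    decays it with the poly policy.
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, max_epoch - warmup_epochs)
    return base_lr * (1.0 - progress) ** power


# usage with the SGD optimizer described above (names are illustrative)
# for epoch in range(max_epoch):
#     lr = poly_lr_with_warmup(base_lr, epoch, max_epoch)
#     for group in optimizer.param_groups:
#         group['lr'] = lr
```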

We adopt the following evaluation metrics. (1) Mean Absolute Error (MAE) measures the average pixel-wise absolute difference between the predicted saliency map and the ground truth:

MAE = (1 / (H × W)) Σ_{x=1}^{H} Σ_{y=1}^{W} |P(x, y) − G(x, y)|,

where P denotes the predicted saliency map and G denotes the ground truth, and H and W are the height and width of the image.
(2) F-measure is denoted as F_β. It is computed as the weighted harmonic mean of precision and recall:

F_β = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall).

As in previous work, we set β² to 0.3 to emphasize the importance of precision, and we report the maximum value of F_β.

(3) Weighted F-measure [38] is denoted as F^ω_β. It uses weighted precision and weighted recall to measure the accuracy of different models:

F^ω_β = ((1 + β²) × Precision^ω × Recall^ω) / (β² × Precision^ω + Recall^ω),

where β² is also set to 0.3.

(4) S-measure (S_m) [39] combines region-aware (S_r) and object-aware (S_o) similarity and focuses on measuring the overall structural similarity:

S_m = α × S_o + (1 − α) × S_r.
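For reference, a minimal sketch of the MAE and maximum F-measure computations defined above is given below. The 256-threshold sweep used for the maximum F-measure follows common practice but is not stated in the text, and the function names are illustrative.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a predicted saliency map and the ground
    truth; both are assumed to be float arrays in [0, 1] of shape (H, W)."""
    return np.abs(pred - gt).mean()


def max_f_measure(pred, gt, beta2=0.3, num_thresholds=256):
    """Maximum F-measure over a sweep of binarization thresholds.

    beta2 is set to 0.3 to emphasize precision, as in the paper; the
    256-threshold sweep is a common convention, not stated in the text.
    """
    gt_bin = gt > 0.5
    best = 0.0
    for t in np.linspace(0, 1, num_thresholds):
        pred_bin = pred >= t
        tp = np.logical_and(pred_bin, gt_bin).sum()
        precision = tp / (pred_bin.sum() + 1e-8)
        recall = tp / (gt_bin.sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        best = max(best, f)
    return best
```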
Some predicted saliency maps of the proposed method (MPTC-FPN) and seven other state-of-the-art methods are shown in Figure 4. In the first and second rows, which show small-object detection scenes, our method locates the salient objects more accurately. In the third row, MPTC-FPN can effectively distinguish salient object regions even when the contrast between the salient objects and the background is low. The fourth row shows a scene with a complex background. Additional comparisons are presented in Figure 5.

We further explore the weight distribution parameter between the multiple supervision points, and the results are shown in Table 4. The total loss with different values of α has different effects on the final saliency results. From Table 4, the best result is obtained when α is 0.6.
6) COMPARISON OF DIFFERENT LOSS FUNCTIONS
On the basis of the best results obtained previously, we conduct ablation studies on the loss function used. We adopted