Pyramid Spatial Context Features for Salient Object Detection

Salient object detection aims to obtain the most attractive objects from the input images, which serves as a pre-processing step for many image processing tasks. This paper presents a novel deep neural network design for salient object detection by formulating a pyramid spatial context module, PSC module for short, to capture the spatial context information at multiple scales. To achieve this, we first adopt convolutional operations with different dilated rates to generate the feature maps with different receptive fields, and then use two-round recurrent translations to explore multiple types of spatial context features on these feature maps. By further inserting this module into a deep network, namely PSCNet, we are able to optimize the network in an end-to-end manner for salient object detection. We evaluate the proposed method on six public benchmark datasets by comparing it with 25 salient object detection methods. The experimental results demonstrate that our PSCNet performs favorably against all the other methods.


I. INTRODUCTION
Saliency detection is one of the useful pre-processing steps for many image processing tasks, including video compression [1], video abstraction [2], image editing [3], texture smoothing [4], object detection [5], [6], and few-shot learning [7]. It aims to find the most obvious objects from the input image, and this task has been widely studied in the past years. Early works adopt low-level vision cues to identify the salient objects, where these visual cues include contrast, color, and texture [8]-[10]. However, the low-level cues lack the high-level semantic information that is important to distinguish the salient objects from the background. Hence, recent works [11]-[15] leverage deep convolutional neural networks (CNNs) to learn to detect salient objects, while more and more works [16]-[33] use both the highly semantic features from deep CNN layers and the low-level features from shallow CNN layers for saliency detection.
Moreover, very recent works [35], [36] explore the non-local features for saliency detection by further reasoning the spatial context from the whole feature map. These context features are obtained by propagating the information from one pixel to the adjacent pixel, and this process is repeated over the whole feature map. Hence, each pixel can obtain the global context information from any position of the feature map. However, this process is performed on the feature map generated by a single convolutional operation, so we can only obtain a single type of spatial context feature, which makes it difficult to capture the complex structures of salient objects and the noisy background. Also, their results tend to lose some salient details or include background noises, as shown in Figure 1.

(The associate editor coordinating the review of this manuscript and approving it for publication was Shaojun Wang.)
In this paper, we propose a pyramid spatial context module, PSC module for short, to generate multiple types of spatial context features, which are able to capture the spatial context information at different scales, thus improving the performance of salient object detection by analyzing the fine-grain and coarse-grain characteristics of the salient objects. Our module first performs convolutional operations with different dilated rates to generate the convolutional features with multiple receptive fields, and then applies two-round recurrent translations to each convolutional feature map individually to obtain multiple types of spatial context features. The spatial context features aggregated from the feature map with the small receptive field tend to explore the fine-grain characteristics of the images, which helps to enhance the detail structures of the salient objects, while the spatial context features aggregated from the feature map with the large receptive field tend to capture the coarse-grain characteristics of the images, which helps to distinguish the salient objects from the background. By jointly using different types of spatial context information, we are able to capture the image features at multiple levels and produce more accurate detection results, as shown in Figure 1(b). Next, we insert the PSC module into a deep convolutional neural network to build the PSCNet for salient object detection. Finally, we use six benchmark datasets to evaluate the effectiveness of our method by comparing it with 25 state-of-the-art methods for salient object detection. Experimental results show that we achieve the best performance on all the benchmark datasets.

FIGURE 1. An example (a) shows that our method generates more accurate salient objects, while PoolNet [29], BASNet [34], CPD [28], and PiCANet [35], [36] tend to lose a part of the salient objects or include the background noises.

We summarize the contributions of this paper as follows:
• First, we formulate a new module, named the pyramid spatial context (PSC) module, to capture different kinds of spatial context features on the convolutional feature maps with multiple receptive fields.
• Second, we build a novel deep convolutional neural network architecture, PSCNet, by adopting the PSC module to aggregate multi-level spatial context features and train the network in an end-to-end manner for salient object detection.
• Third, we compare our PSCNet with 25 other methods for salient object detection on six benchmark datasets using three evaluation metrics. Experimental results show that our method outperforms the other methods on all the benchmark datasets.

II. RELATED WORK
Salient object detection, especially based on a single input image, has been widely studied in the past years. Early methods adopt hand-crafted features to detect salient objects, and these hand-crafted features include color [37], [38], texture [39], [40], image contrast [9], [41], and other types of low-level visual cues [42]; please see [43] for an exhaustive survey. However, these hand-crafted features have limited feature expression ability, leading to a failure to find the salient objects in complex environments.
With the development of deep convolutional neural networks (CNNs), many works leverage the highly semantic features learned automatically from CNNs for salient object detection. Recently, more and more works [16]-[33], [44] try to jointly use the low-level detail information and high-level semantic information at different layers of a CNN and achieve more accurate saliency detection results. For example, Li and Yu [20] used multi-layer features with different resolutions to find the semantic characteristics and the visual contrast of salient objects. Hou et al. [18] developed short connections to merge the features at multiple layers for salient object predictions. Zhang et al. [23] formulated a boundary-preserving refinement strategy and integrated multi-layer feature maps to predict the saliency details. Hu et al. [19] recurrently aggregated deep features from a deep network to leverage the complementary information among multiple CNN layers. Deng et al. [17] explored the residual learning strategy to alternately enhance the high-level features at deep layers and the low-level features at shallow layers. Zhang et al. [22] designed a bi-directional message-passing model to select useful features by a gate function. Zhang et al. [24] utilized an attention-guided network to progressively and selectively integrate multi-level features. Wang et al. [26] developed a pyramid attention structure to better predict the salient objects. Zhu et al. [33] explored the local and global information of a CNN by learning attentional dilated features. Feng et al. [45] formulated an attentive feedback network for salient object detection by exploring the salient boundaries. Zhao and Wu [27] introduced dilated convolutions as well as channel-wise and spatial-wise attentions to use multi-scale features. Wu et al. [28] discarded the feature maps at shallow layers for acceleration. Wang et al. [30] iteratively integrated feature maps in a top-down and bottom-up manner.
Qin et al. [34] developed a boundary-aware network for salient object detection. Wu et al. [31] leveraged the advantages of multi-task learning by formulating foreground contour detection and edge detection tasks as supervisions to enhance the network's ability to predict more accurate salient objects.
Very recently, Liu and Han [35], [36] incorporated the global context feature to detect salient objects by adopting a spatial long short-term memory model. Zhao [15] adopted fully connected operations and Chen et al. [46] used global average pooling to obtain global information. However, the fully connected operations and the spatial long short-term memory model are time-consuming, and global average pooling can only obtain a single feature for all the pixels on the spatial domain of the feature maps. Beyond these works that aggregate a single type of context feature, we propose to generate multiple levels of spatial context features to capture both the fine-grain and coarse-grain characteristics of the salient objects. Moreover, please also refer to [47] for a review of visual saliency detection with comprehensive information, [48]-[52] for salient object detection in RGBD images, [53] for salient object detection in optical remote sensing images, [54] for video salient object detection, and [55] for salient object detection in stereoscopic images.

III. METHODOLOGY
A. NETWORK OVERVIEW
Figure 2 illustrates the overall architecture of our pyramid spatial context network (PSCNet), which takes a single image as the input and outputs a predicted saliency map in an end-to-end manner. In the beginning, we adopt a deep convolutional neural network to generate convolutional feature maps with different resolutions: the feature maps with large resolutions at shallow layers help to discover the detail structures, while the feature maps with small resolutions at deep layers help to generate the highly semantic features that distinguish the salient objects from the background. In the implementation, we take DeepLab [56] as the feature extraction network, which keeps the resolutions of the last two layers unchanged and uses dilated convolutions to enlarge the receptive fields. After obtaining the convolutional features, we leverage the proposed pyramid spatial context (PSC) module to aggregate multi-level context features, then combine the context features from different levels with the original convolutional features, and finally use a 1×1 convolution to reduce the number of feature channels and obtain F_o. Note that we adopt batch normalization [57] and the ReLU [58] non-linear transformation after each convolutional operation. In the end, we predict the saliency map from F_o as the output of the whole network. In the following subsections, we first introduce how to obtain the spatial context through two-round recurrent translations and then elaborate on how to build our pyramid spatial context module.

B. SPATIAL CONTEXT FEATURES
To aggregate the spatial context features from the 2D feature maps, some works [70]-[72] perform two-round data translations on the spatial domain of the feature maps to propagate the context information pixel by pixel. As shown in Figure 3, given the input feature map, we adopt data translations in four directions (up, down, left, and right). For each pixel x_{i,j} on the feature map (the green point shown in Figure 3(a)), the translated feature x'_{i,j} aggregates the information from one of its four adjacent pixels:

x'_{i,j} = x_{i,j} + ω_up · x'_{i+1,j} (up),
x'_{i,j} = x_{i,j} + ω_down · x'_{i−1,j} (down),
x'_{i,j} = x_{i,j} + ω_left · x'_{i,j+1} (left),
x'_{i,j} = x_{i,j} + ω_right · x'_{i,j−1} (right),

where ω_up, ω_down, ω_left, and ω_right are parameters that are learned automatically; we initialize them to one at the beginning of the training process, and the learned ω usually lies in [0, 1]. These recurrent translations in the four directions are performed individually over the whole feature map. After we combine the results of the four directions, each pixel obtains the spatial context information from its entire row and column of the feature map; see Figure 3(b). Next, we take the feature map F_m in Figure 3(b) as the input, perform the second-round recurrent translation in the four directions, and combine the results as the output. At this time, each pixel can obtain the context information from the whole feature map, as shown in Figure 3(c).

TABLE 1. Component analysis. Note that ''+SC'' indicates we only use the spatial context in our network and ''PSCNet'' denotes we use the proposed pyramid spatial context to detect salient objects.
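The two-round recurrent translation can be sketched as follows. This is an illustrative NumPy sketch with the weights ω fixed to their initial value of one, not the paper's actual implementation:

```python
import numpy as np

def translate(F, direction, w=1.0):
    """One recurrent data translation over a 2D feature map F.
    Information is propagated pixel by pixel along the given direction,
    scaled by the (here fixed) weight w."""
    H, W = F.shape
    out = F.copy()
    if direction == "down":        # each row receives from the row above
        for i in range(1, H):
            out[i] += w * out[i - 1]
    elif direction == "up":        # each row receives from the row below
        for i in range(H - 2, -1, -1):
            out[i] += w * out[i + 1]
    elif direction == "right":     # each column receives from the left
        for j in range(1, W):
            out[:, j] += w * out[:, j - 1]
    elif direction == "left":      # each column receives from the right
        for j in range(W - 2, -1, -1):
            out[:, j] += w * out[:, j + 1]
    return out

def spatial_context(F, w=1.0):
    """One round: run the four directions individually and combine (sum)."""
    return sum(translate(F, d, w) for d in ("up", "down", "left", "right"))

F = np.zeros((5, 5))
F[2, 2] = 1.0                              # a single activated pixel
one_round = spatial_context(F)             # context reaches row 2 and column 2
two_rounds = spatial_context(one_round)    # context reaches every pixel
```

Running this on a single activated pixel makes the two stages visible: after one round the activation has spread only along its own row and column (a cross of 9 nonzero pixels on the 5×5 map), while after the second round every pixel on the map has received some context.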

C. PYRAMID SPATIAL CONTEXT FEATURES
The spatial context features obtained in Section III-B are largely affected by the adjacent pixels when the parameter ω is less than one. To generate multiple types of spatial context features at multiple scales with larger receptive fields, motivated by ASPP [56], we first adopt convolutions with different dilated rates to produce a set of feature maps. As shown in Figure 2, after we obtain the feature map extracted by the feature extraction network, we apply (i) a convolutional operation with a kernel size of one, (ii) a convolutional operation with a kernel size of three and a dilation rate of six, (iii) a convolutional operation with a kernel size of three and a dilation rate of 12, and (iv) a convolutional operation with a kernel size of three and a dilation rate of 18. Also, we perform batch normalization and the ReLU non-linear operation after each convolutional operation. Hence, we obtain four feature maps with different receptive fields.
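As a sanity check on the four branches, the effective spatial extent of a dilated convolution follows the standard arithmetic k_eff = d·(k − 1) + 1; a small illustrative sketch:

```python
def effective_kernel(k, d):
    """Effective spatial extent of a k x k convolution with dilation rate d."""
    return d * (k - 1) + 1

# (kernel size, dilation rate) for the four parallel branches of the module
branches = [(1, 1), (3, 6), (3, 12), (3, 18)]
extents = [effective_kernel(k, d) for k, d in branches]
# extents == [1, 13, 25, 37]: progressively larger receptive fields
```

This shows why the branches see the image at different scales: the same 3×3 kernel covers a 13-, 25-, or 37-pixel extent depending on the dilation rate, while the 1×1 branch stays strictly local.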
Next, we adopt the two-round data translations in four directions for each feature map to obtain the spatial context features at different scales. Further, we concatenate the spatial context features obtained from the four feature maps and also combine them with the original convolutional feature; see the orange block in Figure 2. Since the scales of the spatial context features are different, we name the aggregated features pyramid spatial context features and name this module the pyramid spatial context module. Finally, we apply a convolutional operation with a kernel size of one to merge the concatenated features and obtain the feature map F_o with fewer feature channels, and the feature map F_o is used to predict the saliency map as the output of the network.
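The aggregation step, concatenating the four context features with the original feature and merging them through a 1×1 convolution, can be sketched at the tensor level. The shapes and random weights below are toy assumptions, not the network's actual dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 8, 8, 16

backbone = rng.standard_normal((H, W, C))                      # extracted feature
contexts = [rng.standard_normal((H, W, C)) for _ in range(4)]  # 4 PSC branches

# Concatenate the four spatial-context features with the original feature.
stacked = np.concatenate(contexts + [backbone], axis=-1)       # (H, W, 5C)

# A 1x1 convolution is a per-pixel linear map over channels: it merges the
# concatenated features and reduces them back to C channels. The weights are
# random here; in the network they are learned.
W1x1 = rng.standard_normal((5 * C, C))
F_o = np.einsum("hwc,cd->hwd", stacked, W1x1)                  # (H, W, C)
```

The einsum makes the "1×1 convolution as channel mixing" interpretation explicit: spatial positions are untouched, and only the channel dimension is contracted.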

D. TRAINING AND TESTING STRATEGIES 1) LOSS FUNCTION
Cross-entropy loss was used to optimize our PSCNet:

L = − Σ_{i,j} [ g_{i,j} log p_{i,j} + (1 − g_{i,j}) log(1 − p_{i,j}) ],

where g and p denote the ground truth image and the predicted saliency map, respectively; g_{i,j} is zero or one, while p_{i,j} belongs to [0, 1].
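A minimal NumPy sketch of this per-pixel binary cross-entropy; the clipping constant eps is an implementation assumption to avoid log(0):

```python
import numpy as np

def saliency_bce(g, p, eps=1e-7):
    """Pixel-wise binary cross-entropy between the ground truth g
    (each entry 0 or 1) and the predicted saliency map p (in [0, 1])."""
    p = np.clip(p, eps, 1.0 - eps)   # guard against log(0)
    return -np.mean(g * np.log(p) + (1.0 - g) * np.log(1.0 - p))

g = np.array([[1.0, 0.0], [0.0, 1.0]])
good = saliency_bce(g, np.array([[0.9, 0.1], [0.1, 0.9]]))  # confident, correct
bad  = saliency_bce(g, np.array([[0.1, 0.9], [0.9, 0.1]]))  # confident, wrong
```

As expected, confident correct predictions yield a much smaller loss than confident wrong ones, which is what drives the optimization.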

2) TRAINING PARAMETERS
The feature extraction part in our PSCNet (the orange blocks in Figure 2) was initialized by the weights trained on ImageNet [73], and the other parts of our network were initialized by random noise. Next, we adopted Adam [74] to optimize the network, setting the first momentum value as 0.9, the second momentum value as 0.999, and the weight decay as 0.0005. We set the learning rate as 0.00005, reduce it to 0.000005 after 160k training iterations, and terminate the training process after 250k iterations. At the same time, we horizontally flipped the images for data augmentation and used a mini-batch size of one. Our method takes around six days to train the whole network on a single TITAN X (Pascal) GPU.
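The step learning-rate schedule described above can be expressed as a small helper; this is a sketch of the stated schedule, not the authors' code:

```python
def learning_rate(iteration):
    """Step schedule from the text: 5e-5 for the first 160k iterations,
    then 5e-6 until training terminates at 250k iterations."""
    return 5e-5 if iteration < 160_000 else 5e-6

assert learning_rate(0) == 5e-5
assert learning_rate(200_000) == 5e-6
```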

3) INFERENCE
We use the predicted saliency map of our PSCNet as the output and upsample it to the size of the original input image. Our method takes 0.159s on average to process an input image on a single TITAN X (Pascal) GPU.

E. DISCUSSION
To obtain various types of spatial context, we first adopt convolutions with different dilated rates to obtain the convolutional features. When the dilated rate is small, we mainly focus on the local regions, which helps to discover the fine detail information. When the dilated rate increases, we obtain convolutional features with large receptive fields, which helps to discover the whole saliency region. Second, we take the convolutional feature maps with different receptive fields as the inputs to the two-round data translations to obtain the global spatial context. The spatial context aggregated from the feature maps with a small receptive field reflects the fine-grain characteristics of the salient objects, since each pixel of the feature map only contains the information of itself and only propagates its own information to other pixels. In contrast, the spatial context aggregated from the feature maps with a large receptive field reflects the coarse-grain characteristics of the salient objects, since each pixel of the feature map contains the information of a local region and propagates the information of that local region to other pixels. Hence, by combining the multiple types of spatial context features, we can capture the spatial context information at different scales and improve the performance of salient object detection by analyzing the fine-grain and coarse-grain characteristics of the salient objects.

TABLE 2. Comparison results between our method (PSCNet) and other methods using the F_β, S_m, and MAE metrics. ''-'' indicates results that are not publicly available. We do not use any post-processing method, e.g., CRF.

IV. EXPERIMENTAL RESULTS
This section describes the six benchmark datasets and evaluation metrics used for salient object detection and compares our PSCNet with 25 other methods for salient object detection.

A. DATASETS AND EVALUATION METRICS
Six benchmark datasets for salient object detection are used in our experiments: (a) ECSSD [39] includes 1,000 images that have meaningful and complex structures. (b) PASCAL-S [59] includes 850 images that are selected from the PASCAL VOC2010 segmentation dataset [75], and there are several salient objects in each image. (c) SOD [60] includes 300 images that are selected from the BSDS dataset [76]. We use the training set of DUTS as our training set to train the PSCNet, following the recent works on salient object detection [22], [24], [32], [36]. Then, three common metrics are used for the quantitative evaluation: F-measure (F_β), structure measure (S_m), and mean absolute error (MAE):

F_β = ((1 + β²) · Precision · Recall) / (β² · Precision + Recall),
S_m = α · S_o + (1 − α) · S_r,
MAE = (1 / (W_S × H_S)) · Σ_{i=1}^{W_S} Σ_{j=1}^{H_S} |S(i, j) − G(i, j)|,

where F_β is the combination of the precision and recall, and we follow [18], [77] to set β² as 0.3; S_m [78] computes the structural similarity between the predicted map S and the ground truth image G, where S_o and S_r represent the object-aware and region-aware structural similarity and α is set as 0.5; and MAE denotes the average pixel-wise absolute difference between S and G, where W_S and H_S represent the width and height of S or G. In general, the larger the F_β or S_m and the smaller the MAE, the better the result. The implementations of F_β, S_m, and MAE in [18], [78] are used to evaluate all the results.
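Two of the three metrics can be sketched compactly in NumPy. The fixed binarization threshold below is a simplifying assumption (the referenced implementations use more careful thresholding), and S_m is omitted since its object-aware and region-aware terms are more involved:

```python
import numpy as np

def f_measure(S, G, beta2=0.3, thresh=0.5):
    """F_beta from precision and recall, computed at a single fixed
    threshold for illustration."""
    B = S >= thresh                      # binarized prediction
    tp = np.sum(B & (G > 0.5))           # true-positive pixels
    precision = tp / max(B.sum(), 1)
    recall = tp / max((G > 0.5).sum(), 1)
    if precision + recall == 0:
        return 0.0
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)

def mae(S, G):
    """Mean absolute error over all W_S x H_S pixels."""
    return np.mean(np.abs(S - G))

G = np.zeros((4, 4))
G[1:3, 1:3] = 1.0          # a 2x2 salient object
S = G.copy()               # a perfect prediction
```

A perfect prediction gives F_β = 1 and MAE = 0, while an all-background prediction misses every salient pixel and gives F_β = 0.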

B. ABLATION STUDY
We perform an ablation study to evaluate the components in PSCNet. Here, we build two baseline networks: one is DeepLab [56], which is constructed by removing the pyramid spatial context module from the overall network architecture; the other is ''+SC'', which is constructed by adopting a single two-round recurrent translation to aggregate the context information from the features extracted by the feature extraction network. Table 1 shows the results: after leveraging the two-round recurrent translation to aggregate the spatial context, we obtain an obvious improvement, and our PSCNet achieves the best performance on all the benchmark datasets by formulating the pyramid spatial context module to aggregate the spatial context information at multiple levels. Moreover, we performed experiments using one round and three rounds of data translations to aggregate the image context: one-round data translation can only obtain the local context information along each row and each column, while three-round data translations introduce more parameters that make the network difficult to train.

C. COMPARISON WITH THE STATE-OF-THE-ARTS
We compare our PSCNet with 25 other state-of-the-art methods for salient object detection; see the first column in Table 2 for the names of these methods. Among them, BSCA [69] and DRFI [9] use hand-crafted features to detect salient objects, while all the other methods utilize deep convolutional neural networks to learn features for saliency detection. For a fair comparison, the results of the compared methods are obtained by using the saliency maps provided by the authors or by using their implementations to predict the saliency maps with the released training models. Table 2 shows the quantitative comparison with the 25 methods in terms of F_β, S_m, and MAE on the six benchmark datasets. Our PSCNet achieves the best performance on all the benchmark datasets when compared with these 25 methods. In particular, our method shows a large improvement on the DUTS-test dataset, which includes many challenging cases, demonstrating the strong capability of our PSCNet. In the future, we will explore the potential of our PSC module design for other layer separation tasks, such as mirror detection [79], lane marking detection [80], shadow detection [81]-[83] and removal [84], [85], reflection removal [86], [87], rain removal [88], haze removal [89], [90], etc. Figure 4 and Figure 5 present the visual comparison results on salient object detection. From these figures, we can see that the other methods (d)-(h) may include the background noise or fail to find all of the salient objects. In contrast, our PSCNet is able to generate saliency maps that are more consistent with the ground truth images (b).
Particularly, our method can produce accurate saliency maps in challenging cases, such as (i) small salient objects (see the second and fourth rows in Figure 4 and the third row in Figure 5), (ii) complex background (see the first, second, and last two rows in Figure 4 and the first, second, and fourth rows in Figure 5), and (iii) multiple objects (see the second to seventh rows in Figure 4 and the second and third rows in Figure 5).

V. CONCLUSION
This paper develops a novel deep network architecture, named PSCNet, for salient object detection by exploring multiple levels of spatial context features. Our key idea is to aggregate multiple levels of spatial contexts on the feature maps with different receptive fields to obtain the fine-grain and coarse-grain characteristics of the images. As a result, we can produce more discriminative features to reduce the non-salient noise, enhance the salient details, and improve the saliency accuracy. In the end, we evaluate our PSCNet on six common benchmark datasets by comparing it with 25 methods for salient object detection. Experimental results show that our method outperforms all the others, both visually and quantitatively. In the future, we will explore the potential of our PSC module design for other computer vision tasks.