Salient Object Detection Using Recurrent Guidance Network With Hierarchical Attention Features

Fully convolutional networks (FCNs) play a significant role in salient object detection tasks due to their capability of extracting abundant multi-level and multi-scale features. However, most FCN-based models utilize multi-level features in a single, indiscriminative manner, which makes it difficult to predict saliency maps accurately. To address this problem, in this article, we propose a recurrent network that uses hierarchical attention features as a guidance for salient object detection. First, we divide multi-level features into low-level features and high-level features. Multi-scale features are extracted from high-level features using atrous convolutions with different receptive fields to obtain contextual information. Meanwhile, low-level features are refined to supplement the convolutional features with detailed information. We observe that the attention focus of hierarchical features differs considerably because of their distinct information representations. For this reason, a two-stage attention module is introduced for hierarchical features to guide the generation of saliency maps. Effective hierarchical attention features are obtained by aggregating the low-level and high-level features, but the attention of the integrated features may be biased, leading to deviations in the detected salient regions. Therefore, we design a recurrent guidance network to correct the biased salient regions, which can effectively suppress distractions in the background and progressively refine salient object boundaries. Experimental results show that our method exhibits superior performance in both quantitative and qualitative assessments on several widely used benchmark datasets.


I. INTRODUCTION
As a common preprocessing step for various computer vision tasks, salient object detection aims to locate the most prominent areas in an image, and it is widely used in image segmentation [1], visual tracking [2], image retrieval [3], and video compression [4], among others. Hence, it has received widespread attention from researchers.
Recently, a rich set of deep saliency methods have been proposed in the literature. Benefiting from the powerful feature extraction capabilities of convolutional neural networks (CNNs) [5], the majority of early works used CNNs to extract multi-level features of images. After performing operations such as convolution, pooling, and activation on these features, they were fused using fully connected layers to predict saliency scores [6], [7]. Unfortunately, due to the large number of parameters contained in fully connected layers, traditional CNN-based models suffered from high training computational complexity. To alleviate this limitation, Long et al. discarded the fully connected layers in CNNs and proposed fully convolutional networks (FCNs) [8]. FCNs have greatly facilitated the development of saliency detection and have become the basic framework for most saliency detection models.
Compared with conventional CNN-based models, FCN-based models have become more popular because they can accept inputs of any size and achieve end-to-end training. The convolutional features extracted by FCNs are mainly divided into two parts: low-level features encode detailed information and high-level features contain semantic information, both of which are essential for saliency detection. However, how to utilize these features effectively is a crucial issue that needs to be addressed. Early FCN-based models [9], [10] predicted saliency maps by using high-level features while ignoring low-level features, resulting in coarse salient object boundaries. Therefore, many approaches [11]–[13] are devoted to leveraging low-level and high-level features jointly for saliency detection, but some drawbacks remain. On the one hand, integrating high-resolution low-level features increases the computational complexity of models. For instance, in [13], shortcut connections are used to fuse deeper side outputs into shallower side outputs. However, the complexity of this model increases and its performance tends to saturate during the gradual fusion of features from high level to low level. On the other hand, features yielded in these ways lack rich contextual information, which captures the interconnections between the objects in an image. Considering the difference in the information contained in the two levels of convolutional features, we process them hierarchically. In order to filter out distractions mixed into the detailed information, we introduce a feature refinement module for low-level features.
Meanwhile, to capture context information, atrous convolutional layers are adopted to output multi-scale features.
In fact, features are not equally important in terms of spatial location or channels. Existing works [14], [15] use attention mechanisms to distinguish differences in these two aspects. This mechanism can not only suppress undesired distractions, but also concentrate attention on salient regions. Therefore, we utilize a two-stage attention module to locate salient regions precisely. Even though the attention mechanism is beneficial for salient object detection, biased attention features that result in inaccurate predictions may be generated when hierarchical attention features are fused. In recent works, recurrent neural networks (RNNs) have shown surprising performance in saliency detection owing to their memorability. Many RNN-based methods exploit convolutional layers as the basic unit to refine features [10], [15]. Although these models perform better than CNN-based ones, they have a poor ability to adaptively forget interference features and select useful ones. One solution is to adopt gate control structures that are able to control the transmission of information. Gate control structures can keep essential features and discard useless ones, which contributes to learning long-term dependencies. Inspired by this structure, we propose a recurrent guidance network that utilizes the gate control structure of LSTM [16] to progressively refine attention features. Since the proposed recurrent network maintains the dependency on the previous time step and refines the boundaries of salient objects in a guided way, more accurate saliency maps can be obtained.
Considering the effectiveness of hierarchical processing and the advantages of recurrent networks, in this article, we propose a recurrent guidance network with hierarchical attention features that accurately detects salient objects.

II. RELATED WORK
In the past two decades, many saliency detection methods have been developed, which yield a qualitative leap in this field. In this section, we briefly review some works related to the component of our method.

A. MULTI-LEVEL FEATURES BASED METHODS
Multi-level features have proved to be beneficial for predicting saliency maps. Low-level features contain detailed information for reconstructing the edges of objects, while high-level features contain semantic information for determining the classification of objects. Many existing methods propose different strategies to leverage multi-level features. In [7], the visual saliency model is constructed by introducing a neural network architecture with fully connected layers, which uses features at three different scales extracted from nested windows to predict the saliency degree of each superpixel. However, this patch-by-patch scanning is extremely computationally expensive. In addition, it causes a loss of global information, because the spatial information in images cannot be propagated through fully connected layers. To address these problems, most saliency detection methods use FCNs to extract multi-level features and process them with pixel-level operations. NLDF [17] merges multi-level features and local contrast to obtain local features, and then combines the local features with global features generated by the top layer, which serves as a scoring module, to produce the final saliency map. In [12], multi-level features are directly concatenated across different resolutions and channels. After multiple aggregated features are further processed in a top-down manner, boundary refinements are introduced at each aggregated feature. Besides, Zhang et al. [18] proposed a gate-controlled bidirectional structure to integrate multi-level features with rich contextual information, which realizes bidirectional transmission of information. The above methods apply the same processing to features of all levels, but the obvious difference between low-level and high-level features indicates that such unified processing schemes cannot fully exploit the advantages of each level.

B. ATTENTION MECHANISMS
The attention mechanism can distinguish the importance of each region and refine convolutional features by suppressing distractions in the background. In general, attention mechanisms are mainly divided into three categories: spatial attention, channel attention, and hybrid attention. The essence of spatial attention is to locate the target more accurately by weighting spatial positions. In [19], unlike traditional spatial attention mechanisms, a reverse attention (RA) block that emphasizes non-target regions through deeper outputs is embedded to guide residual saliency learning. The channel attention mechanism takes into account the importance of each feature channel, so that models based on it can concentrate on useful channels and enhance discriminative learning capabilities. For example, SENet [20] adopts the channel attention mechanism and won first place in the ImageNet 2017 classification competition. In fact, the hybrid attention mechanism is able to obtain discriminative features effectively. Hence, most saliency models prefer to utilize hybrid attention. Wang et al. [21] combined a pyramid multi-scale feature extraction module with the attention mechanism. The pyramid attention features not only have a larger receptive field, but also pay more attention to salient regions. However, this method only uses spatial attention. Intuitively, both the spatial and channel mechanisms have a positive effect on convolutional features. Motivated by this, Zhang et al. [15] proposed a layer-wise attention module that reasonably combines spatial attention and channel attention to accurately locate salient regions.

C. RECURRENT NETWORKS
Recurrent networks can not only further adjust salient areas, but also refine the boundaries of salient objects by iteratively correcting the feature deviations of previous time steps. Many methods use recurrent networks to achieve feature refinement. RFCN [10] transmits the output of the previous stage to the next stage to refine features. However, because the output saliency map is concatenated with the original image at each step, the effect of the saliency prior is weakened. Hu et al. [22] recurrently combined aggregated multi-scale features with the features of each distinctive layer in order to progressively highlight salient regions and reduce non-salient cues. Unfortunately, their final saliency map fusion method may produce extra regions. Different from the strategy of directly fusing complete saliency maps, Kuen et al. [23] proposed a recurrent attention convolution-deconvolution network (RACDNN) that employs a refinement strategy to integrate saliency maps of sub-regions. Spatial attention is implemented by a spatial transformer, and the recurrent network refines the corresponding areas of the entire prediction maps. Although RACDNN refines multiple sub-regions, the overlap of sub-regions may cause redundancy in the saliency map. In [15], Zhang et al. designed a progressive attention guided module that selectively integrates multiple contextual information and a multi-path recurrent feedback module that transmits global semantic information from top layers to shallower layers to predict saliency maps. These methods improve performance by using recurrent networks, but there is still room for improvement.

III. THE PROPOSED METHOD

A. MOTIVATION
Multi-level features have advantages over single-level features in saliency detection methods [13]. However, applying indiscriminative operations to features of all levels may not suit features that carry distinct information. To obtain effective features with rich contextual and detailed information, our model introduces a feature refinement module (FRM) and a multi-scale feature extraction module (MFEM) to process features hierarchically. A two-stage attention module is also exploited to distinguish the importance of features across channels and spatial locations. Then, the hierarchical features are fused using convolutional layers. Many existing recurrent saliency methods [10], [15], [22], [23] use convolutional layers as the basic unit, which are weak at selecting features and forgetting useless ones because they lack a structure that controls information propagation. To overcome this limitation, a recurrent network guided by the previous time step is designed to further adjust salient regions and refine salient object boundaries by using the gate control structure of LSTM. Fig. 1 illustrates the overall framework of our proposed method, which takes an image as input and outputs a binary saliency map in an end-to-end fashion.

B. HIERARCHICAL FEATURES EXTRACTION
To reduce the computational complexity of the proposed model and achieve pixel-wise saliency detection, we use an FCN to extract multi-level features. Considering that the information involved in multi-level features is beneficial for salient object detection, we perform hierarchical operations on these features, dividing them mainly into low-level and high-level features. The details of the hierarchical feature extraction are as follows.
For an image I of size w × h, the features extracted from the five convolutional blocks of the FCN are represented as F_i (i = 1, 2, 3, 4, 5), as shown in Fig. 1. The first two blocks output high-resolution low-level features that contain detailed information, and the remaining three extract low-resolution high-level features with semantic information. Meanwhile, multi-scale contextual information can capture the interaction between different objects, which has a positive effect on saliency detection. Therefore, we use the MFEM to extract contextual information from high-level features. To enlarge the receptive fields, each MFEM consists of four atrous convolutional layers with dilation rates of 1, 3, 5, and 7 and a kernel size of 3 × 3. Compared with ordinary convolutional layers, atrous convolutional layers are able to obtain the same receptive field with fewer parameters. The receptive field enlarges as the dilation rate increases: for a 3 × 3 kernel, the effective receptive field is 7 × 7 with a dilation rate of 3 and 11 × 11 with a dilation rate of 5. The concatenation of these four features with different receptive fields perceives the interconnections between different regions, so that the MFEM can capture contextual information to locate salient objects accurately. Let F_i^m (i = 3, 4, 5) denote the multi-scale features. We fuse F_i^m using a convolutional layer to obtain the high-level contextual features

F_high = Conv(Cat(Up_3(F_3^m), Up_4(F_4^m), Up_5(F_5^m))),

where Cat(·) and Up_i(·) respectively represent the concatenation and upsampling operations by a factor of 2^i. Since high-resolution low-level features would increase the computational cost of our model, we downsample the low-level features to the same size as F_3. However, the aggregated low-level features contain many non-salient cues.
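A minimal PyTorch sketch of the MFEM is given below. The dilation rates and 3 × 3 kernel size follow the description above, while the branch channel width (64) and the 1 × 1 fusion convolution are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MFEM(nn.Module):
    """Multi-scale feature extraction module: four parallel 3x3 atrous
    convolutions (dilation rates 1, 3, 5, 7) whose outputs are concatenated
    and fused. Channel widths are assumptions, not the paper's exact ones."""
    def __init__(self, in_ch, branch_ch=64):
        super().__init__()
        self.branches = nn.ModuleList([
            # padding = dilation keeps the spatial resolution unchanged
            nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=d, dilation=d)
            for d in (1, 3, 5, 7)
        ])
        # 1x1 convolution fuses the concatenated multi-scale features
        self.fuse = nn.Conv2d(4 * branch_ch, branch_ch, kernel_size=1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1))
```

Because padding equals the dilation rate in every branch, all four outputs share the input's spatial size, which makes the channel-wise concatenation well defined.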
The combination of convolutional operations and activation functions can not only smooth the features, but also enhance the representative ability of our model. Therefore, as shown in Fig. 1, an FRM consisting of three convolutional layers and two PReLU layers is introduced to eliminate the interference in low-level features, yielding low-level features with important detailed information. The processing of the entire module is formulated as follows:

F_low = FRM(Cat(D_1(F_1), D_2(F_2))),

where D_i(·) represents the downsampling operation by a factor of 2^i.
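The FRM can be sketched similarly. The three convolutional layers interleaved with two PReLU activations follow the description above; the channel widths and kernel sizes are assumptions:

```python
import torch
import torch.nn as nn

class FRM(nn.Module):
    """Feature refinement module: three convolutional layers with two
    PReLU activations in between, filtering non-salient cues out of the
    aggregated low-level features. Channel widths are assumed."""
    def __init__(self, in_ch, out_ch=64):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.PReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.PReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        )

    def forward(self, low_feats):
        # low_feats: concatenation of the downsampled low-level features
        return self.refine(low_feats)
```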

C. TWO-STAGE ATTENTION MODULE
Directly using convolutional features to detect salient objects generates suboptimal results, because convolutional features often contain undesired background information that distracts attention from salient objects. Instead of treating features equally across channels and spatial positions, we utilize a two-stage attention module to distinguish the importance of each channel and pixel, which makes the predicted saliency maps closer to the ground truth. As illustrated in Fig. 2, the two-stage attention module is composed of two parts: a channel attention unit and a spatial attention unit. Suppose the input features F are unfolded along the channel dimension as F = [F^1, F^2, ..., F^C], where F^i is the i-th channel and C is the total number of channels. A channel attention unit is first applied to F to discriminate the importance of channels. The average pooling operation obtains relatively uniform global information, while the maximum pooling operation retains the most important information in the features, so both contribute to the inference of attention. Channel attention takes advantage of the element-wise summation of the average-pooled and maximum-pooled features to squeeze the spatial positions of F:

v_c = W_c2 * (W_c1 * (AvgPool(F) + MaxPool(F))),

where W_c1 and W_c2 denote convolutional kernels and * is the convolution operation. After that, we apply a sigmoid operation to v_c to compute the weight of each channel i:

ca_i = σ(v_c^i),

where ca_i is the attention weight of channel i. In order to achieve channel attention, we apply ca_i to F:

F_ca^i = ca_i × F^i,

where F_ca^i represents the i-th channel attention feature and × is a scalar multiplication operation. Similar to channel attention, spatial attention assigns weights to spatial positions, which determines the importance of each location. To summarize the features, the average pooling and maximum pooling operations are applied to F_ca:

v_s = W_s * Cat(AvgPool(F_ca), MaxPool(F_ca)),

where W_s is a convolutional kernel. Spatial attention is implemented by a sigmoid function, i.e.,

SA(x, y) = σ(v_s(x, y)),

where (x, y) represents a pixel coordinate.
Then, to implement spatial attention, we apply SA to F_ca:

F_csa = SA ⊗ F_ca,

where F_csa denotes the final attention features and ⊗ represents element-wise multiplication. Benefiting from the selection ability of the channel attention unit and the spatial attention unit, the two-stage attention module can adaptively concentrate attention on salient objects and significantly improve the performance of salient object detection. The experimental results in Section IV support our arguments.
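The two-stage attention module can be sketched in PyTorch as follows. The sum of pooled descriptors, the sigmoid gating, and the channel-then-spatial ordering follow the equations above; the reduction ratio, the 7 × 7 spatial kernel, and the channel widths are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStageAttention(nn.Module):
    """Channel attention followed by spatial attention. The channel unit
    squeezes spatial positions with avg+max pooling; the spatial unit
    convolves concatenated avg/max maps. Reduction ratio is assumed."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # W_c1, W_c2: a bottleneck pair of 1x1 convolutions
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )
        # W_s: convolution over the two concatenated spatial descriptors
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        # channel attention: element-wise sum of pooled descriptors
        avg = F.adaptive_avg_pool2d(x, 1)
        mx = F.adaptive_max_pool2d(x, 1)
        ca = torch.sigmoid(self.mlp(avg + mx))         # B x C x 1 x 1
        x_ca = ca * x
        # spatial attention on the channel-refined features
        avg_s = x_ca.mean(dim=1, keepdim=True)
        max_s = x_ca.max(dim=1, keepdim=True).values
        sa = torch.sigmoid(self.spatial(torch.cat([avg_s, max_s], dim=1)))
        return sa * x_ca                               # F_csa
```

Since both attention maps lie in (0, 1), the module can only attenuate features, never amplify them, which is what suppresses background distractions.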

D. RECURRENT GUIDANCE MODULE
Discriminative attention features are obtained by aggregating the processed hierarchical features. Even though hierarchical attention features contain affluent detail and contextual information, the attention of the fused features may be biased, resulting in inaccurate predictions. Recurrent networks possess a powerful memory by establishing connections between hidden features and input features, which helps to promote the refinement capability of deep network models. In consequence, recurrent networks can gradually refine salient objects to correct biased saliency maps. Unlike conventional recurrent networks, we design a recurrent guidance network to further improve the performance of salient object detection. Specifically, our recurrent network combines the saliency map generated from the output features of the previous step with the hierarchical attention features as a guidance, and uses the hidden states to jointly yield the current features. The proposed recurrent guidance module maintains long-term dependencies with previous time steps and gradually refines salient objects in an iterative manner. The details of our recurrent guidance module are as follows. As shown in Fig. 3, the hierarchical attention features F_fuse are used as the input of the recurrent guidance module to initialize the hidden layer H_0. The initial saliency map S_0 is generated from H_0. In the recurrent guidance network, we introduce guidance layers, which generate guidance features so that each time step maintains a dependency on the hierarchical attention features. The guidance features G_t are produced by the Hadamard product of the attention features F_fuse and the saliency map S_{t-1} generated at the previous time step:

G_t = F_fuse ⊙ S_{t-1}.

To enhance the feature selection and representation capabilities of our method, the recurrent structure adopts an LSTM that replaces the input features with the guidance features.
Benefiting from the ability of gate structures to control the propagation of information, the LSTM can adaptively forget interference features and retain effective ones. Specifically, the gate control structures in a typical LSTM consist of an input gate I_t, a forget gate F_t, an output gate O_t, a candidate memory C̃_t, cell states C_t, and hidden states H_t (see [16] for more details). They are updated as follows:

I_t = σ(W_i * G_t + U_i * H_{t-1} + b_i),
F_t = σ(W_f * G_t + U_f * H_{t-1} + b_f),
O_t = σ(W_o * G_t + U_o * H_{t-1} + b_o),
C̃_t = tanh(W_c * G_t + U_c * H_{t-1} + b_c),
C_t = F_t ⊙ C_{t-1} + I_t ⊙ C̃_t,
H_t = O_t ⊙ tanh(C_t),

where W_j, U_j, and b_j (j = i, f, o, c) are convolutional parameters, and σ(·) and tanh(·) denote the sigmoid and tanh activation functions, respectively. Saliency maps are generated from the hidden states of the current time step:

S_t = Conv_1(H_t),

where Conv_1(·) represents a 1 × 1 convolutional layer. In fact, a convolutional layer with a kernel size of 1 × 1 can be used as a classifier to distinguish the category of each pixel. The recurrent guidance module progressively refines saliency maps while suppressing interference information in the background, which improves the salient object detection effect to a certain extent. Besides, to make the saliency maps closer to the ground truth, we adopt a total loss function to supervise the training of the saliency maps. This loss function is formulated as follows:

L = Σ_{i=0}^{t} L_ce(S_i, G; θ_i),

where t is the time step, θ_i is the parameter corresponding to S_i, and L_ce is the cross-entropy function, i.e.,

L_ce(S, G; θ) = −(1/N) Σ_{n=1}^{N} [I(G_n = 1) log S_n + I(G_n = 0) log(1 − S_n)],

where N is the total number of pixels and I(·) is the indicator function.
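A minimal sketch of the recurrent guidance module is given below: a convolutional LSTM whose input at each step is the guidance product of the fused features and the previous saliency map. Producing all four gates from a single shared convolution, the kernel size, and the class interface are our assumptions:

```python
import torch
import torch.nn as nn

class RecurrentGuidanceModule(nn.Module):
    """ConvLSTM-style recurrence where the input at step t is the guidance
    feature G_t = F_fuse * S_{t-1}. Gate fusion into one convolution and
    the kernel size are sketch-level assumptions."""
    def __init__(self, ch, steps=4):
        super().__init__()
        self.steps = steps
        # one convolution over [G_t, H_{t-1}] yields all four gates
        self.gates = nn.Conv2d(2 * ch, 4 * ch, kernel_size=3, padding=1)
        # 1x1 convolution acts as a per-pixel classifier
        self.to_saliency = nn.Conv2d(ch, 1, kernel_size=1)

    def forward(self, f_fuse):
        h = f_fuse                          # H_0 initialized from F_fuse
        c = torch.zeros_like(f_fuse)        # initial cell state
        maps = [torch.sigmoid(self.to_saliency(h))]   # S_0
        for _ in range(self.steps):
            g = f_fuse * maps[-1]           # guidance features G_t
            i, f, o, cand = torch.chunk(
                self.gates(torch.cat([g, h], dim=1)), 4, dim=1)
            i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
            c = f * c + i * torch.tanh(cand)
            h = o * torch.tanh(c)
            maps.append(torch.sigmoid(self.to_saliency(h)))
        return maps                         # S_0 ... S_t, all supervised
```

Returning every intermediate map matches the total loss above, which supervises the saliency map at each time step.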

IV. EXPERIMENTS

A. EXPERIMENTAL SETUP

1) DATASETS
In order to verify the effectiveness of our method and make it convincing, we conduct experiments on five widely used public benchmark datasets: ECSSD [24], HKU-IS [7], PASCAL-S [25], SOD [26], and DUTS [27]. ECSSD, with 1000 images, contains many structurally complex images. HKU-IS, composed of 4447 images, is one of the relatively large saliency detection datasets. It contains many images with multiple disconnected salient objects and low contrast.
PASCAL-S has 850 natural images selected from the validation set of the PASCAL-VOC 2009 segmentation dataset. SOD consists of 300 complex images that typically contain low-contrast scenes and natural images with inapparent salient regions; it is one of the most challenging datasets. To our knowledge, DUTS is currently the largest saliency detection dataset, containing 10553 training images and 5019 test images. We leverage the relatively comprehensive DUTS training set to train our model, because it contains variable salient objects. The DUTS test set and the other datasets are used only for testing.

2) EVALUATION METRICS
We evaluate our method and other state-of-the-art methods using three evaluation metrics: precision-recall (PR) curves, maximum F-measure (maxF) [28], and mean absolute error (MAE) [29]. All three metrics are widely used to evaluate the performance of saliency detection methods. Precision is the proportion of true salient pixels in the predicted salient regions, and recall is the proportion of detected salient pixels in the true salient regions. We use S and G to represent the predicted saliency map and the ground truth, respectively. S is converted into a binary mask M using a threshold. The precision and recall are computed by comparing M and G:

Precision = |M ∩ G| / |M|,  Recall = |M ∩ G| / |G|,

where | · | denotes the total number of non-zero entries in the mask. PR curves are drawn from the precision and recall calculated at each threshold. MaxF weights precision and recall and serves as an overall performance evaluation metric. It is defined as the maximum over all thresholds of

F_β = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall).

To emphasize the importance of precision, β² is usually set to 0.3, as suggested in [24]. MAE describes the dissimilarity between the predicted saliency map and the ground truth by measuring the distance between them, which is formulated as

MAE = (1 / (W × H)) Σ_{x=1}^{W} Σ_{y=1}^{H} |S(x, y) − G(x, y)|,

where W and H represent the width and height of the saliency map, respectively.
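The three metrics can be computed as in the following NumPy sketch; the number and spacing of thresholds used to sweep the F-measure are assumptions:

```python
import numpy as np

def precision_recall(pred, gt, threshold):
    """Precision/recall of a thresholded saliency map against a binary GT."""
    m = pred >= threshold                    # binary mask M
    tp = np.logical_and(m, gt).sum()         # |M ∩ G|
    precision = tp / max(m.sum(), 1)         # guard against empty masks
    recall = tp / max(gt.sum(), 1)
    return precision, recall

def max_f_measure(pred, gt, beta2=0.3, n_thresholds=256):
    """Maximum F-measure over uniformly spaced thresholds in [0, 1]."""
    best = 0.0
    for t in np.linspace(0.0, 1.0, n_thresholds):
        p, r = precision_recall(pred, gt, t)
        if p + r > 0:
            f = (1 + beta2) * p * r / (beta2 * p + r)
            best = max(best, f)
    return best

def mae(pred, gt):
    """Mean absolute error between a saliency map and its ground truth."""
    return np.abs(pred - gt.astype(np.float64)).mean()
```

A perfect prediction (pred identical to the binary ground truth) attains maxF of 1 and MAE of 0 under these definitions.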

3) IMPLEMENTATION DETAILS
Our model is implemented in the PyTorch framework and trained on a GTX 1080 GPU with the Adam optimizer [30]. All images used for training and testing are resized to 352 × 352. The FCN's convolutional layers are initialized from VGG-16 pre-trained for classification, while the other convolutional layers are initialized with PyTorch's default strategy. The batch size is set to 10. The initial learning rate is 10^-4 and it decays by 10% every 50 iterations. Note that no post-processing steps are used in our testing process.
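The optimizer settings above can be reproduced as follows; the placeholder model is hypothetical, and `StepLR` with `gamma=0.9` implements the 10% decay every 50 iterations:

```python
import torch

# Hypothetical stand-in for the full network; only the optimizer and
# learning-rate schedule reflect the settings described above.
model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# decay the learning rate by 10% (multiply by 0.9) every 50 iterations
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.9)

for iteration in range(100):
    optimizer.step()      # in training, this follows loss.backward()
    scheduler.step()
```

After 100 iterations the learning rate has been multiplied by 0.9 twice, i.e. it equals 10^-4 × 0.81.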

B. COMPARISON WITH THE STATE-OF-THE-ART
Our method is compared with eight state-of-the-art methods on the aforementioned benchmark datasets: MDF [7], Amulet [12], RAS [19], RADF [22], PAGR [15], DSS [13], CapSal [31], and PAGE [21]. For a fair comparison, these methods are evaluated using their source codes with default settings, or the saliency maps provided by the authors are used directly.

1) QUANTITATIVE EVALUATION
The PR curves, MAE, and maxF of the methods are given for quantitative evaluation. The balance point is the value at which precision equals recall on the PR curve, which is one of the commonly used criteria for assessing the quality of PR curves; a method with a larger balance point value performs better. As can be seen from the PR curves in Fig. 4, the balance point of our PR curve is the highest on all datasets except ECSSD. The PR curves also demonstrate that the performance of our method is comparable to that of PAGE on ECSSD, while it significantly outperforms the other competing methods. Table 1 shows the MAE and maxF values of all methods. Compared with the second best method, the MAE value of our method is reduced by 9.6%, 2.4%, 8.1%, and 0.9% on the DUTS, ECSSD, HKU-IS, and SOD datasets, respectively. This indicates that the gap between our predicted saliency maps and the ground truths is relatively small. As for maxF, it improves by 2.1%, 0.8%, 1.7%, and 1.6% on DUTS, HKU-IS, PASCAL-S, and SOD, respectively. Taking all three evaluation metrics into consideration, our method shows better performance than the others. Fig. 5 shows the visual comparisons of our method with the state-of-the-art methods. For simple salient objects (the first row of Fig. 5), we find that most of these methods can roughly detect the contours of salient objects, but our method handles the details more finely. In low-contrast images (the second to fourth rows of Fig. 5), the body of the salient object cannot be completely extracted by most saliency detection methods. However, our method detects salient objects more accurately and ensures consistency within the salient regions. For example, in the fourth row of Fig. 5, although the computer has low color contrast with the background, our method precisely detects its entire contour.
In addition, whether in complex images or in images with multiple small salient objects (the fifth to eighth rows of Fig. 5), our method not only highlights the salient objects, but also produces relatively fine boundaries. Therefore, in terms of visual quality, our method outperforms the other methods.

C. ABLATION ANALYSIS
To further confirm the effectiveness of our model, we conduct ablation analysis to investigate the prominent contributions made by each module of our model.

1) EFFECTIVENESS OF HIERARCHICAL FEATURES
To verify that hierarchical features are beneficial for salient object detection, we compare the strategy of directly fusing multi-level features (denoted as ''Baseline'') with our hierarchical feature fusion strategy (denoted as ''+HF'').
The quantitative evaluation, shown in Table 2, displays the superiority of our hierarchical features. It can be clearly seen in Fig. 6 that the salient areas detected by the ''Baseline'' are incomplete, while the hierarchical feature fusion strategy produces more complete and prominent salient objects, which further proves the effectiveness of our hierarchical feature extraction.

2) PERFORMANCE OF TWO-STAGE ATTENTION MODULE
In Section III, we claim that the two-stage attention module can focus attention on salient areas to accurately locate salient objects. To confirm this, we add the two-stage attention module (denoted as ''+TSAM'') on top of the previous ''+HF''. In Table 2, we find that the two-stage attention module yields a clear improvement in both MAE and maxF. The visual results in Fig. 6 show that our TSAM makes the saliency information more prominent while simultaneously suppressing non-salient regions.

3) ANALYSIS OF RECURRENT GUIDANCE MODULE
We add a recurrent guidance module (denoted as ''+RGM'') to ''+TSAM'' to investigate the important role of the RGM. As reported in Table 2, ''+RGM'' outperforms ''+TSAM'' in the quantitative evaluation. The visual results exhibited in Fig. 6 further confirm that ''+RGM'' is superior to ''+TSAM''. The time step also affects the results: if it is too long, the computational cost increases; if it is too short, the details are insufficiently refined. Therefore, we compare the time steps t = 3, 4, 5. It can be seen from Table 2 that ''+RGM (t = 4)'' performs better than the others. In terms of the visual effects shown in Fig. 6, it is obvious that ''+RGM (t = 4)'' is closer to the ground truth and obtains delicate salient object boundaries. Therefore, we empirically choose t = 4 as the setting of our model.

V. LIMITATIONS
Similar to most saliency detection methods, our method cannot accurately predict the saliency maps of all images. Fig. 7 shows some failure cases of our method, which mainly fall into two circumstances. In the first circumstance, the salient objects detected by our method are incomplete and cannot maintain internal continuity. As shown in the first two rows of Fig. 7, this failure is primarily caused by the high color contrast between the internal parts of the salient region.
In the other circumstance, the background is predicted as a salient region. The main reason for this case is that some objects in non-salient areas differ greatly from the surrounding environment, as shown in the last two rows of Fig. 7. In fact, most saliency detection methods share these two limitations. A possible improvement is to introduce more complex images into the training datasets.

FIGURE 7. Failure cases of our method. As can be seen from the figure, our method cannot accurately detect salient objects in these two circumstances.

VI. CONCLUSION
To predict salient objects with delicate boundaries, we propose a saliency detection method using a recurrent guidance network with hierarchical attention features. According to the distinct information involved in multi-level features, the proposed method divides them into high-level and low-level features. The multi-scale feature extraction module is introduced to obtain rich contextual information from high-level features, and the feature refinement module is utilized to obtain detailed information from low-level features. The hierarchical features are combined with two-stage attention modules to focus on salient objects. Finally, we design a recurrent guidance network to correct biased attention features and enhance salient object boundaries. Comparisons with state-of-the-art methods on five public benchmark datasets demonstrate the effectiveness of our method in terms of both quantitative assessment and visual subjective quality.