SdBAN: Salient Object Detection Using Bilateral Attention Network With Dice Coefficient Loss

Visual attention plays an important role in saliency detection by highlighting meaningful context regions. In this paper, we present a novel saliency detection method using a bilateral attention network. The proposed network consists of two branches: i) a spatial path using an encoder-decoder structure to learn spatial cues and ii) a context path using an attention mechanism to learn contextual cues. A feature aggregation module finally predicts salient objects by concatenating the cues. To train the network while mitigating the class imbalance problem, we minimize the dice coefficient loss together with the classical cross-entropy loss. The proposed network can predict salient regions in an end-to-end manner without post-processing. Experimental results show that the proposed network achieved better performance than existing state-of-the-art methods in most cases. Furthermore, the proposed network takes only 0.03 seconds to process a $224 \times 224$ image. The code for the proposed method can be found at the following URL: https://github.com/tiruss/SdBAN


I. INTRODUCTION
Saliency detection aims at extracting the most visually noticeable region in an image. Unlike other segmentation approaches such as semantic segmentation and boundary detection, saliency detection only distinguishes the most visually attractive and interesting object from the background. It can be applied to various computer vision fields such as image segmentation [1], object recognition [2], action recognition [3], weakly supervised semantic segmentation [4]- [7], visual tracking [8], video compression [9], [10] and video summarization [11].
Existing hand-crafted feature-based saliency detection methods commonly measure the contrast. Itti and Koch proposed the contrast difference between the center pixel and its neighborhood [12]. Klein and Frintrop used the Kullback-Leibler Divergence (KLD) to measure the difference [13]. However, these difference measurement-based saliency detection methods commonly fail when there is no significant difference between the object and background, or when the background has a complex pattern or clutter. Wang et al. applied a learning-based discriminative model to guarantee high performance in various types of domains [14]. To provide a pre-specified prior, they need additional pre- and post-processing steps. Kong et al. proposed an exemplar-aided method that complements heuristic saliency assumptions by leveraging only a few exemplar images [15]. Zeng et al. proposed a game-theoretic method that does not require labeled training data [16]. Zhou et al. proposed a superpixel-based two-layer diffusion process [17].
The associate editor coordinating the review of this manuscript and approving it for publication was Long Wang.
In recent years, convolutional neural networks (CNNs) have demonstrated unparalleled performance in the salient object detection and segmentation fields. Specifically, fully convolutional networks (FCNs) greatly improve the ability to preserve spatial information [18]. Mnih et al. proposed a U-shape structure to reduce the loss of details of an object [19]. By fusing the hierarchical features of the backbone network, the U-shape structure gradually increases the spatial resolution and fills in some missing details. For that reason, recent saliency detection studies are based on FCNs and U-shape structures. He et al. proposed a super-pixelwise convolutional neural network using hierarchical contrast features [20]. For each scale of super-pixel, two contrast sequences were fed into the convolutional network for more detailed features. Li and Yu proposed a deep contrast network to emphasize the contrast information [21]. It concatenates a pixel-level FCN stream and a segment-wise spatial pooling stream. A fully connected conditional random field (CRF) is also used for refining the output from the contrast network. Liu et al. used a hierarchical recurrent convolutional neural network for saliency detection [22]. This network consists of two stages: i) generating a coarse output map using a deep CNN and ii) hierarchical refinement of the details using a recurrent CNN. Both Li's and Liu's works used multiscale features that are extracted by convolutional layers. Hou et al. proposed skip-connections between layers to find a salient object in a deep neural network without loss of information [23]. Hu et al. tried to find the salient object by minimizing the loss of each pooling layer and refining the result using guided super-pixel filtering [24]. Fu et al. proposed a deep framework for salient object detection that effectively fuses multi-scale outputs [25]. To fuse differently scaled outputs, they proposed: i) a linear model using a fully connected layer, ii) a nonlinear model using the FCN for concatenation, and iii) a joint fusion of the two models. Edge-based salient object detection approaches were recently proposed in [26], [27]. Zhao et al. proposed an edge guidance network using an explicit edge modeling method [26], which estimates deterministic object boundaries by adding a complementary salient edge to multi-scale information.
FIGURE 1. Example of learned spatial and context feature maps: (a) feature map computed by the spatial path using the U-net architecture, (b) feature map of the context path using the pixel-wise attention mechanism, (c) fusion of (a) and (b) using the feature fusion block, (d) the input image, (e) the prediction result using the proposed method, and (f) the ground truth mask.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Wu et al. proposed a stacked cross refinement network for an edge-aware network [27], which simultaneously learns both the saliency map and the salient object boundaries using consecutively stacked cross refinement units (CRUs). Compared with existing hand-crafted feature-based methods, CNN-based methods can produce generalized results in various domains and give significantly improved performance without using pre-specified priors. However, since these methods learn from the entire image, the background may be detected as a salient object when the salient object is smaller than the background. This is called the class imbalance problem, and we discuss related experimental results in Section IV-E2.
To solve this problem, we present a novel saliency detection method using an attention mechanism to assign a higher weight to informative regions. The proposed network consists of two branches: a spatial path and a context path. The spatial path has an encoder-decoder structure with skip-connections to learn the spatial information. On the other hand, the context path has an attention mechanism to learn the context information. We also propose a feature aggregation block to effectively concatenate the two branches without loss of information as shown in Fig. 2. To train the proposed network, we minimize a harmonic loss function that combines the dice coefficient and cross-entropy losses. The cross-entropy loss cannot solve the class imbalance problem by itself since it tends to decrease when the object size is small. The dice coefficient is devised as an index to measure the similarity of two images. By minimizing the dice coefficient loss, the background is ignored and only the object region is considered. As a result, minimization of the dice coefficient loss can solve the class imbalance problem. Since learning with only the dice coefficient loss becomes unstable, we added the classical cross-entropy loss for stable learning without the class imbalance problem.
The main contributions of this work are summarized as follows:
1) We present a bilateral attention network to learn both spatial and context information. The spatial path is an encoder-decoder structure with skip-connections, which is robust to object size variations. The context path assigns a higher weight to the informative region of the image through the attention mechanism. This process will be shown to be robust even when the background is complex and the difference between object and background is not significant.
2) We propose a harmonic loss function that combines the dice coefficient loss and the cross-entropy loss for stable learning without the class imbalance problem.
3) Extensive experiments show that the proposed method compares favorably to the state-of-the-art methods, both in terms of visual quality and in terms of different metrics.

II. RELATED WORKS

A. ATTENTION MECHANISM
Recently, the attention mechanism, which concentrates computational resources on the informative regions of an image, has been applied in various deep neural networks. Over the last few years, the attention mechanism has been studied in natural language processing [28]-[30]. Mnih et al. proposed a method to adaptively select a region of interest in an image through a recurrent attention model [19]. To the best of the authors' knowledge, this is the first attempt to apply the attention mechanism to computer vision tasks. However, training a network that includes the recurrent attention model is challenging since it is not easy for the attention model to focus on a definite point in the image, which is called the hard attention problem. To solve that problem, Bahdanau et al. proposed a soft attention model, which calculates attention weights for all input features [31]. This allowed the RNN encoder-decoder network to overcome the limitation of encoding all the sentence information in a fixed-length vector, and significantly improved performance in machine translation. In recent years, attention mechanisms have been introduced into various computer vision applications. Xu et al. applied the recurrent attention model to image captioning by highlighting the area corresponding to each word of the sentence describing the given image [32]. Sermanet et al. enhanced the performance of image classification by extracting discriminative regions in the image through the recurrent attention model [33]. Chen et al. replaced average-pooling and max-pooling for multi-scale features with an attention module to increase the performance of semantic segmentation [34]. Li et al. applied the region of interest (ROI) to the object detection field through the attention model [35]. These studies showed that the attention mechanism successfully assigns higher weights to informative regions to increase the performance of object detection. Liu et al. proposed the pixel-wise contextual attention network (PiCANet) to apply the attention mechanism to saliency detection [36]. In PiCANet, an attention-guided network selectively integrates multi-level contextual information to alleviate the distraction of cluttered features. This method is robust to background changes and successfully detects objects in most cases. However, it cannot preserve high-level features with semantic information, resulting in a blurred boundary of the object. To solve this problem, Krähenbühl and Koltun used a conditional random field (CRF) in post-processing [37]. Feng et al. used a boundary-enhanced loss (BEL) with the attention feedback module to detect salient objects [38], where the context of the object is learned through attention, and the boundary of the object is learned through BEL.
The proposed network is different from feature integration-based approaches described above in that our bilateral network can separately obtain both spatial and context information from different paths to preserve the advantages of both paths.

B. ENCODER-DECODER ARCHITECTURE
In computer vision, image segmentation is the process of assigning a pixel-by-pixel label to an entire image, and its performance depends on the ability to preserve multi-scale features. Most existing multi-scale feature handling networks are based on an encoder-decoder architecture. The encoder of the network compresses the information of the object layer by layer, where the low-level layers contain detailed information of the object and the high-level layers contain the context information of the object. Most encoders use a pretrained network to extract general features efficiently from a small amount of training data. The feature vectors compressed by the encoder are then reconstructed by the decoding layers. Through this structure, more generalized results can be obtained.
Badrinarayanan et al. proposed SegNet, which uses an encoder-decoder structure for semantic segmentation [39]. This is the first attempt to apply the encoder-decoder structure to the pixel-wise prediction task. SegNet showed better localization than simple upsampling-based methods. Ronneberger et al. proposed an encoder-decoder structure using skip-connections, called U-net [40]. A skip-connection concatenates pairs of encoder and decoder layers of the same size. A successive layer can then learn to assemble a more precise output based on this information. Saliency detection is distinguished from object detection and segmentation tasks in that the shape of an object is not constant. It is important to create a more generalized network since there is no fixed shape for a salient object. The proposed network uses the U-net structure to obtain generalized spatial information.

III. PROPOSED METHOD
The proposed network consists of spatial and context paths as shown in Fig. 3. The spatial path performs semantic segmentation whereas the context path generates the contextual attention vector of the object. Since the low-level layer in the spatial path has high-resolution spatial information, it is not suitable for finding the context of the object through the attention mechanism. Therefore, the context attention block (CAB) is applied to the feature map compressed through convolution and pooling. In order to preserve the characteristics of each path, the feature maps of the two paths are concatenated through the feature fusion block (FFB) at the last layer of the decoder.

A. SPATIAL PATH
To extract features, we use the pretrained ResNet-50 as the encoder in the spatial path [41]. This network is modified to be fully convolutional to produce dense feature maps while preserving spatial location. More specifically, we replaced the last fully-connected layers of the original ResNet-50 with four deconvolution blocks to reconstruct features. The reason for using four deconvolution blocks is that the spatial decimation factor of the ResNet-50 is 16 when four max-pooling layers of stride 2 are employed. In addition, skip-connections are applied between pairs of encoder and decoder layers of the same scale to preserve multi-scale features.
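The relation between the number of stride-2 stages and the decimation factor can be checked with a few lines of Python. This is only an arithmetic sketch under the assumptions stated in the text (a 224 × 224 input and four stride-2 downsampling stages); the helper name `feature_map_sizes` is hypothetical and not part of the released code.

```python
def feature_map_sizes(input_size=224, num_stages=4):
    """Return the spatial size after each stride-2 downsampling stage."""
    sizes = [input_size]
    for _ in range(num_stages):
        # Each stride-2 stage halves the spatial resolution.
        sizes.append(sizes[-1] // 2)
    return sizes


sizes = feature_map_sizes()
# Ratio between the input size and the most compressed feature map.
decimation = sizes[0] // sizes[-1]
print(sizes, decimation)
```

Each stage that halves the resolution in the encoder needs one deconvolution block that doubles it in the decoder, which is why four stages lead to four blocks.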

B. CONTEXT PATH
One of the problems of the saliency detection task is inconsistent prediction results when the background is complex or the difference between background and object is small. These problems are mainly due to the lack of context. Global average pooling can be used to find global contexts [42], [43]. However, global context only carries high-level semantic information, which is not helpful for recovering the spatial information. Therefore, a multi-scale receptive view is needed to restore spatial information successfully. To accurately guide multi-scale features, we design a context attention block (CAB) as shown in Fig. 4. A CAB calculates the channel attention vector for each scale feature. Both high- and low-level features provide consistent guidance and discriminative information. In this way, the channel attention vector can select discriminative features.

1) CONTEXT ATTENTION BLOCK
In the FCN architecture, the convolution operator has a score map as an output. The score map is interpreted as the probability of a class for each pixel. Let $s$ be the scale of the feature map. The score $y_s$ is the sum over all feature maps:
$$y_s = \sum_{i=1}^{S} w_i * x_i, \quad (1)$$
where $S$ represents the largest scale, $x_i$ the $i$-th scale feature map, $w_i$ the $i$-th scale convolution kernel, and $*$ the convolution operation.
Since the convolution operation takes all input feature maps with an equal weight, the predicted output may become incorrect when background is noisy or the object is relatively small. To solve this problem, we use a weighting parameter α, which becomes large for a highly discriminative region of the object.
To determine the optimal value of $y_s$, global average pooling (GAP) is performed before the max-pooling layer in the encoder, and $d \in \{1, \ldots, D\}$ scores of feature maps are obtained. The final weight vector $\alpha_d$ is then obtained by a $1 \times 1$ convolution operation followed by the softmax function, which maps the scores to values between 0 and 1.
Finally, we construct an attended contextual feature $y_A$ as a weighted sum of $\alpha_d$ and the original feature map:
$$y_A = \sum_{d=1}^{D} \alpha_d x_d, \quad (3)$$
where $x_d$ denotes the $d$-th feature map. As shown in Fig. 5, the proposed context attention block (CAB) weights the discriminative region in the feature maps. The output of the CAB can be regarded as the heat map of attention.
Components of the CAB look similar to the squeeze-and-excitation (SE) block proposed by Hu et al. [44]. The CAB is different from the SE block in that the intermediate fully connected layer is replaced with a $1 \times 1$ convolution layer to preserve the spatial relationship and reduce computational overhead.
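The CAB computation described in this subsection can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `context_attention_block`, the kernel `w`, the bias `b`, and the channel-last layout are hypothetical, and the 1 × 1 convolution is applied to the pooled 1 × 1 × D descriptor, where it reduces to a matrix product.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - z.max())
    return e / e.sum()


def context_attention_block(x, w, b):
    """x: (H, W, D) feature map, w: (D, D) 1x1 kernel, b: (D,) bias (all assumed)."""
    pooled = x.mean(axis=(0, 1))      # global average pooling -> (D,)
    scores = pooled @ w + b           # 1x1 convolution on a 1x1xD tensor
    alpha = softmax(scores)           # channel weights in (0, 1), summing to 1
    attended = x * alpha              # reweight each channel by broadcasting
    heat = attended.sum(axis=-1)      # weighted sum over channels -> (H, W)
    return attended, heat, alpha


rng = np.random.default_rng(0)
x = rng.random((14, 14, 8))                  # toy feature map
w = rng.standard_normal((8, 8)) * 0.1        # toy 1x1 convolution kernel
b = np.zeros(8)
attended, heat, alpha = context_attention_block(x, w, b)
print(attended.shape, heat.shape, alpha.sum())
```

The single-channel `heat` array corresponds to the weighted sum over feature maps and can be visualized as an attention heat map, while `attended` keeps the channel dimension for later fusion.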

C. DICE SIMILARITY COEFFICIENT LOSS
The size of a salient object is often much smaller than that of the background. This makes the learning process get trapped in a local minimum of the loss function, yielding a network whose predictions are strongly biased towards the background. To solve this problem, the weighted cross-entropy loss and the class-balanced cross-entropy loss [45] are used in [23], [46].
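Weighted cross entropy (WCE) is a variant of CE in which all positives are weighted by a coefficient $\beta$, and is defined as
$$L_{WCE} = -\sum_{i} \left[ \beta\, g_i \log p_i + (1 - g_i) \log (1 - p_i) \right], \quad (4)$$
where $p_i \in P$ is the $i$-th pixel of the predicted saliency map $P$, and $g_i \in G$ the corresponding pixel of the ground truth $G$. If $\beta$ is larger than unity, the foreground gets more weight, and vice versa.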
For the pixel-wise prediction, Xie and Tu used a simpler strategy called class-balanced cross-entropy (CBCE) that adaptively weights positives and negatives as [45]
$$L_{CBCE} = -\sum_{i} \left[ \beta\, g_i \log p_i + (1 - \beta)(1 - g_i) \log (1 - p_i) \right], \quad (5)$$
where $\beta = |N_-| / |N|$ and $1 - \beta = |N_+| / |N|$. $|N_+|$ and $|N_-|$ represent the numbers of salient and non-salient pixels, respectively, and $|N|$ is the total number of pixels. This simple approach can solve the class imbalance problem. However, Deng et al. argued that the CBCE loss causes 'thickness' in the edge detection task [47]. This is due to the nature of the cross-entropy loss. More specifically, the cross-entropy loss is calculated as the average of the per-pixel losses, and each per-pixel loss is calculated independently without considering whether its adjacent pixels are salient or not. As a result, the cross-entropy loss considers the loss in a local sense rather than a global sense.
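The two weighted variants differ only in how positives and negatives are weighted, which a short NumPy sketch makes explicit. The helper names `weighted_bce`, `wce_loss`, and `cbce_loss` are hypothetical, the losses are summed over pixels, and the clipping constant `eps` is an assumption added to keep the logarithms finite.

```python
import numpy as np


def weighted_bce(p, g, w_pos, w_neg, eps=1e-7):
    """Cross entropy with separate weights for positive and negative pixels."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return float(-(w_pos * g * np.log(p)
                   + w_neg * (1.0 - g) * np.log(1.0 - p)).sum())


def wce_loss(p, g, beta):
    """Weighted cross entropy: only the positives are scaled by beta."""
    return weighted_bce(p, g, w_pos=beta, w_neg=1.0)


def cbce_loss(p, g):
    """Class-balanced cross entropy: beta is the fraction of negative pixels."""
    beta = 1.0 - float(g.mean())
    return weighted_bce(p, g, w_pos=beta, w_neg=1.0 - beta)


# One salient pixel out of four, so beta = 0.75.
g = np.array([1.0, 0.0, 0.0, 0.0])
missed_object = np.array([0.2, 0.01, 0.01, 0.01])   # error on the rare positive
false_alarm = np.array([0.99, 0.8, 0.01, 0.01])     # same-size error on a negative
print(cbce_loss(missed_object, g), cbce_loss(false_alarm, g))
```

With the class-balanced weights, missing the rare salient pixel costs more than an equally large error on a background pixel, which is the intended correction for class imbalance.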
Milletari et al. proposed another objective function that maximizes the dice coefficient between images [48] to solve the class imbalance problem. The dice coefficient is an index that measures the overlap between the ground truth and the prediction output in segmentation-like tasks. The dice coefficient, denoted as $D$, can be computed as
$$D = \frac{2 \sum_{i} p_i g_i}{\sum_{i} p_i^2 + \sum_{i} g_i^2}. \quad (6)$$
In saliency detection tasks, the ground truth and predicted saliency maps can be viewed as two sets. In (6), the denominator considers the total saliency of both maps at the global scale, while the numerator considers the overlap between the two sets at a local scale. Therefore, the dice coefficient loss considers the loss information in both local and global manners. The dice coefficient in (6) reaches its optimum when its gradient with respect to $p_j$ is equal to zero, where the gradient is
$$\frac{\partial D}{\partial p_j} = \frac{2 \left[ g_j \left( \sum_{i} p_i^2 + \sum_{i} g_i^2 \right) - 2 p_j \sum_{i} p_i g_i \right]}{\left( \sum_{i} p_i^2 + \sum_{i} g_i^2 + \epsilon \right)^2}, \quad (7)$$
where $\epsilon$ is a smoothing term to avoid division by zero.
So that the loss converges to zero for a perfect prediction, we modified the loss as
$$L_D = 1 - D. \quad (8)$$
The class imbalance problem can be solved by minimizing (8). However, since the dice coefficient loss learns only about the object, the learning process is unstable due to the high variance. To learn both object and background, we used the binary cross-entropy loss
$$L_{CE} = -\sum_{i} \left( g_i \log p_i + (1 - g_i) \log (1 - p_i) \right). \quad (9)$$
The total loss function, denoted as $L_T$, is the weighted sum of the dice coefficient and the binary cross-entropy losses:
$$L_T = \tau L_D + (1 - \tau) L_{CE}, \quad (10)$$
where $\tau$ is the weighting parameter to balance the effect of $L_D$ and $L_{CE}$.
FIGURE 7. Subjective comparison of saliency detection performance using different combinations of losses. The proposed method using both binary cross-entropy and dice losses generated better saliency detection results than using only the CBCE loss and DSS [23] using the CBCE loss.
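The combination of the two losses can be sketched in NumPy as follows. This is a hedged illustration rather than the released training code: the function names are hypothetical, the dice term uses squared sums in the denominator following Milletari et al. [48], the smoothing constant `eps` is placed in both numerator and denominator as an assumption, and the cross-entropy term is averaged over pixels, which only rescales it relative to a sum.

```python
import numpy as np


def dice_loss(p, g, eps=1e-7):
    """1 - dice coefficient, so a perfect prediction gives a loss of zero."""
    intersection = (p * g).sum()
    denominator = (p ** 2).sum() + (g ** 2).sum()
    return float(1.0 - (2.0 * intersection + eps) / (denominator + eps))


def bce_loss(p, g, eps=1e-7):
    """Binary cross entropy averaged over all pixels."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return float(-(g * np.log(p) + (1.0 - g) * np.log(1.0 - p)).mean())


def total_loss(p, g, tau=0.5):
    """tau = 0 uses only cross entropy; tau = 1 uses only the dice loss."""
    return tau * dice_loss(p, g) + (1.0 - tau) * bce_loss(p, g)


g = np.array([[0.0, 1.0], [1.0, 0.0]])      # toy ground truth mask
p = np.array([[0.1, 0.9], [0.6, 0.2]])      # toy prediction
print(dice_loss(p, g), bce_loss(p, g), total_loss(p, g, tau=0.5))
```

Because the dice term depends only on the overlap with the object, a large background area does not dilute it, while the cross-entropy term still provides a per-pixel signal for the background.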

D. FEATURE FUSION BLOCK
Features from the proposed dual path network have different types of representation. Therefore, a simple concatenation deteriorates the performance. The information captured by the spatial path encodes most of rich detail information.
On the other hand, the information captured by the context path mainly encodes context information. In other words, the spatial path extracts a low-level feature map, whereas the context path extracts a high-level feature map. Therefore, we present a feature fusion block (FFB) that concatenates features of different levels without loss of information as shown in Fig. 8. Given various levels of features, we first concatenate the output features of both spatial and context paths. Next, the integrated features are obtained by a convolution operation and batch normalization. We also obtain the attention weight vector by applying the softmax function after global average pooling of the integrated feature, in a manner similar to SENet [44]. The weight vector guides the correct feature selection in the integrated feature.
The network architecture of the proposed method is summarized in Table 1.

A. DATASETS
We used five popular saliency benchmark datasets to evaluate the performance of our method. SOD dataset has 300 images with complex background and multiple objects per image [49]. HKU-IS dataset consists of 4,447 low-contrast images with multiple objects [50]. DUT-O dataset consists of 5,168 challenging images with complex background and one or more objects per image [51]. DUTS dataset consists of DUTS-TR consisting of 10,553 images for training and DUTS-TE consisting of 5,019 challenging images for testing [52]. ECSSD dataset consists of 1,000 images of various types and sizes [53].

B. EVALUATION METRICS
The performance was evaluated using the mean absolute error (MAE), precision-recall (PR) curve, F-measure [50], weighted F-measure [54], and S-measure [55], which are commonly used in salient object detection.

1) MAE
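The mean absolute error (MAE) between the predicted output $p_i$ and the ground truth $g_i$ is defined as
$$MAE = \frac{1}{W \times H} \sum_{i=1}^{W \times H} \left| p_i - g_i \right|, \quad (11)$$
where $W$ and $H$ respectively represent the width and height of images.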
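As a small worked example, the MAE of a 2 × 2 prediction can be computed directly in NumPy. The function name `mean_absolute_error` is hypothetical, and both maps are assumed to be scaled to the range [0, 1].

```python
import numpy as np


def mean_absolute_error(p, g):
    """Average absolute per-pixel difference between prediction and ground truth."""
    p = np.asarray(p, dtype=np.float64)
    g = np.asarray(g, dtype=np.float64)
    return float(np.abs(p - g).mean())


p = np.array([[0.0, 1.0], [0.5, 0.5]])   # toy prediction
g = np.array([[0.0, 1.0], [1.0, 0.0]])   # toy ground truth
mae = mean_absolute_error(p, g)
print(mae)
```

Two pixels are predicted exactly and two are off by 0.5, so the error averaged over the four pixels is 0.25.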

2) PRECISION-RECALL (PR) CURVE
The PR curves are calculated from the precision and recall values of predicted output p i and ground truth g i given a pre-specified threshold between 0 and 255. Specifically, the PR curve reflects the object retrieval performance in the sense of both precision and recall by binarizing the final saliency map using different thresholds.

3) F-MEASURE (F β )
The F-measure, denoted as $F_\beta$, is an overall performance measurement, and is computed as the weighted harmonic mean of precision $P$ and recall $R$:
$$F_\beta = \frac{(1 + \beta^2)\, P \times R}{\beta^2 P + R}, \quad (12)$$
where $\beta^2$ is set to 0.3 according to previous research to assign a higher weight to precision than to recall. More specifically, the maximum F-measure, denoted as $maxF_\beta$, is the maximum F-measure value computed from the PR curve, while the average F-measure, denoted as $meanF_\beta$, uses an adaptive threshold for binarization.
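The thresholding procedure behind the PR curve and the maximum F-measure can be sketched as follows. The function names are hypothetical, the saliency map is assumed to be an 8-bit image, and a pixel is counted as salient when its value is greater than or equal to the threshold, which is one common convention rather than a detail given in the text.

```python
import numpy as np


def precision_recall(pred_u8, gt, threshold):
    """Binarize an 8-bit saliency map and compare it with a boolean mask."""
    binary = pred_u8 >= threshold
    true_positive = np.logical_and(binary, gt).sum()
    precision = true_positive / max(binary.sum(), 1)   # guard empty prediction
    recall = true_positive / max(gt.sum(), 1)          # guard empty ground truth
    return float(precision), float(recall)


def f_measure(precision, recall, beta2=0.3):
    """F-measure with beta^2 = 0.3, which favors precision over recall."""
    denominator = beta2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1.0 + beta2) * precision * recall / denominator


def max_f_measure(pred_u8, gt):
    """Largest F-measure over all thresholds from 0 to 255."""
    return max(f_measure(*precision_recall(pred_u8, gt, t)) for t in range(256))


pred = np.array([[255, 200], [100, 0]], dtype=np.uint8)   # toy saliency map
gt = np.array([[True, True], [False, False]])             # toy ground truth
print(precision_recall(pred, gt, 50), max_f_measure(pred, gt))
```

Sweeping the threshold yields one (precision, recall) pair per threshold, and plotting these pairs gives the PR curve.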

4) WEIGHTED F-MEASURE (F w β )
Margolin et al. proposed the weighted F-measure, denoted as $F^w_\beta$, to compensate for the drawback of the original F-measure by considering both pixel dependency and pixel importance with an appropriate weight as [54]
$$F^w_\beta = \frac{(1 + \beta^2)\, P^w \times R^w}{\beta^2 P^w + R^w}, \quad (13)$$
where $P^w$ and $R^w$ respectively represent the weighted precision and recall. The weighted F-measure is different from the original F-measure in that it directly compares a non-binary map with a binary ground truth without thresholding, to avoid the interpolation flaw. $\beta^2 = 0.3$ is used to give more weight to precision than to recall. More details about this metric can be found in [54].

5) S-MEASURE (S α )
Fan et al. proposed the S-measure, denoted as $S_\alpha$, to quantify the structural similarity (SSIM) of the saliency map, which is widely used in the image quality assessment (IQA) field, and is defined as
$$S_\alpha = \alpha S_o + (1 - \alpha) S_r, \quad (14)$$
where the weighting coefficient $\alpha$ controls the balance between the two terms, and $\alpha = 0.5$ was used according to previous research. $S_o$ and $S_r$ respectively represent the object-aware and region-aware structural similarities. More details about this metric can be found in [55].

C. IMPLEMENTATION DETAILS
The proposed model was implemented on the TensorFlow framework with a single GTX 1080 Ti GPU for acceleration. For a fair comparison with previous works, the proposed model was trained with the DUTS-TR dataset [52]. For data augmentation, we resized each image to 256 × 256, and then used random cropping and random mirror-flipping for training. We trained our model using the Adam optimizer [56], with an initial learning rate of 0.002 decayed down to 0.00003 per epoch, 200 epochs, and a mini-batch size of 16. It took about 6 hours to converge. For the test, we resized the image to 224 × 224 to get the prediction result, and then restored it to the original size. Using ResNet-50 as a backbone [41], it took 0.03 seconds to predict one image.
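The cropping and flipping part of the augmentation can be sketched in NumPy as follows. This is an illustrative sketch, not the released pipeline: the function name `random_crop_and_flip` is hypothetical, the crop size of 224 and the flip probability of 0.5 are assumptions (the text states neither), and the resizing to 256 × 256 is assumed to have been done beforehand.

```python
import numpy as np


def random_crop_and_flip(image, mask, crop=224, rng=None):
    """Apply the same random crop and horizontal flip to an image and its mask."""
    if rng is None:
        rng = np.random.default_rng()
    height, width = image.shape[:2]
    top = int(rng.integers(0, height - crop + 1))    # random crop origin
    left = int(rng.integers(0, width - crop + 1))
    image = image[top:top + crop, left:left + crop]
    mask = mask[top:top + crop, left:left + crop]
    if rng.random() < 0.5:                           # mirror-flip half of the time
        image = image[:, ::-1]
        mask = mask[:, ::-1]
    return image, mask


rng = np.random.default_rng(0)
mask = rng.random((256, 256))                 # toy mask, already resized to 256 x 256
image = np.stack([mask] * 3, axis=-1)         # toy image whose channels equal the mask
image_aug, mask_aug = random_crop_and_flip(image, mask, rng=rng)
print(image_aug.shape, mask_aug.shape)
```

Applying identical geometric transforms to the image and the ground truth mask keeps the pixel-wise labels aligned, which is required for the pixel-wise losses.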

D. ABLATION STUDY

1) EFFECTIVENESS OF THE SdBAN
To demonstrate the effectiveness of the proposed network, we investigated each component of the proposed network as shown in Table 2, where Baseline represents the U-Net [40] without SdBAN. In Table 2, we designed our ablation study using three different settings: w/o Dice uses only the cross-entropy loss, w/o CAB is the baseline trained with the harmonic loss function, and w/o FFB uses simple concatenation of the feature maps of the two paths. The ablation study demonstrated that each component contributed fairly to the overall performance. In particular, the CAB made a significant contribution in the sense of $F^w_\beta$.
TABLE 3. Comparison of the saliency detection performance of 14 methods including ours in the sense of $maxF_\beta$, $meanF_\beta$, and MAE. The best and second-best results are highlighted in red and blue, respectively.
Fig. 9 shows the result of using different values of $\tau \in \{0, 0.25, 0.5, 0.75, 1\}$, which balances the two losses. $\tau = 0$ means that only the binary cross-entropy loss is used for learning, whereas $\tau = 1$ means that only the dice coefficient loss is used. Experimentally, we found $\tau = 0.5$ to be optimal. Since the dice coefficient loss is more effective than the binary cross-entropy loss, a higher $\tau$ tends to give higher performance. Fig. 10 shows the results of learning with only one of the two loss functions. The difference in MAE between using only the cross-entropy loss and using only the dice coefficient loss was not significant. However, qualitative evaluation shows the characteristics of each loss. The binary cross-entropy loss produces a soft boundary but takes into account the context of the object. The dice coefficient loss produces a sharp boundary, but does not consider the context of the object.

3) EFFECTIVENESS OF THE CONTEXT PATH
We used widely adopted networks as the backbone. Specifically, we used VGG16 [57] and ResNet50 [41] with six multi-scale feature maps of sizes 224, 112, 56, 28, 14, and 7. We did not consider 224 because it is so close to the input that the receptive field becomes very small. Figs. 12(c) and 12(d) respectively show the CAB outputs at 112 and 56. Instead of using a single path that separately adds the attention maps of each feature scale, we added the bilateral path as a separate path to learn the concatenated multi-scale attention maps. To concatenate the attention maps, 28 and 14 require ×8 and ×16 upsampling, respectively. This scheme can neither obtain fine information nor reduce errors. Experimental results showed that performance improved at 112 and 56, but not from 28 downward.

1) QUANTITATIVE EVALUATION
Results of the quantitative comparison of the proposed network with 12 state-of-the-art methods are shown in Table 3 and Table 4. As shown in the tables, our method outperforms other methods on all five benchmark datasets in the sense of MAE. Our method also gives the first or second-best performance in the sense of $maxF_\beta$, $meanF_\beta$, $F^w_\beta$, and $S_\alpha$. The F-measures of our method were relatively low compared to MAE. In particular, $maxF_\beta$ is low since the PR curves of the proposed network are short, as shown in Fig. 13. A shorter PR curve indicates that the prediction output is closer to a binary map, without blurring. In the F-measure curve of Fig. 13, we can see that the proposed method has a nearly constant value, while the other methods behave differently depending on the threshold. Therefore, the average value of our F-measure curve, denoted as $meanF_\beta$, shows the best performance in most cases. As shown in Table 4, our method gives the best results over all five datasets in the sense of $F^w_\beta$, while it gives either the best or second-best results in the sense of $S_\alpha$; both metrics were designed to compensate for the flaws of conventional metrics.
TABLE 4. Comparison of the saliency detection performance of 14 methods including ours in the sense of $F^w_\beta$ and $S_\alpha$. The best and the second-best results are highlighted in red and blue, respectively.

2) QUALITATIVE EVALUATION
For a subjective evaluation, we compare the saliency detection results of our method on several challenging images with those of existing state-of-the-art methods. As shown in Fig. 11, our method generated sharper object boundaries and was less sensitive to background clutter than other methods. For example, the first three rows of Fig. 11 show cases of irregular boundaries. Our method accurately detected object boundaries, while other methods produced blurry boundaries. The fourth row includes four objects, all of which were correctly detected by our method. Other methods missed the rightmost person since the existing networks are trained with a center-biased training dataset. In other words, conventional learning-based methods cannot successfully detect an off-centered object. On the other hand, since our method learns the global context of an object, it can detect an object located anywhere in the image, including objects near the image border. The fifth, sixth, and seventh rows of Fig. 11 show cases of complex background, where other methods detect the background as an object. Our method correctly detected objects in a robust manner, while others could not. The eighth row of Fig. 11 shows the case of unusually shaped structures, which were correctly detected by our method, but not by others. The ninth and tenth rows of Fig. 11 show the case of small objects, which were correctly detected by our method.

F. FAILURE CASE
Although the proposed method correctly detected the salient object in most cases, it fails in some cases. In the first row of Fig. 14, the shadow of the plane is erroneously detected. Differentiating a real object from its shadow is still challenging because the shadow is similar to the object and the contrast also changes drastically. In the second row, the salient object is blurred, while the track lane is distinct. In this case, the track lane is detected as a salient object instead of the running person. This is because most of the training data contain objects with distinct characteristics. In the third row, the salient object is a person, but the colors of the clothes vary, and the colors of the clothes are similar to the background clutter. Also, the person to the right of the object can be a salient object in some cases. In the proposed method, the right person is detected as a salient object, and a false detection occurs.
In summary, a deep learning based method is also sensitive to changes in contrast. Therefore, we will carry out a study on the regularization term in order to obtain generalized results in the future research.

G. CAMOUFLAGE CASE
Small objects and complex backgrounds are among the factors that make the salient object detection task difficult. However, if the difference between the object and the background is very small as shown in Fig. 15, detection is challenging even for human eyes. In this case, if the network does not understand the context of the object, it is difficult to obtain the correct prediction result. In Fig. 15, DSS [23] creates a grid artifact when it cannot detect a salient object. This problem is an effect of the dilated convolution. Dilated convolution has the advantage that the receptive field can be increased without increasing the number of parameters. However, if the difference between the background and the object is small, the background is recognized as an object and vice versa. To alleviate this problem, DSS used CRF [37] as post-processing. PiCANet [36] used an attention mechanism similar to that of the proposed method. However, as shown in the experimental results, the attention mechanism of PiCANet did not find a discriminative part of the image. Unlike the previous two methods, the proposed method responds sensitively to the small context of the object. It is not enough to predict the details of an object, but the discriminative region of the object is well detected.

H. PROCESSING TIME
Processing time of the proposed method is compared with other methods as shown in Table 5.

V. CONCLUSION
In this paper, we proposed a novel saliency detection method using a bilateral attention network (SdBAN) consisting of: i) a spatial path containing an encoder-decoder structure to learn the spatial information of the salient object and ii) a context path containing the attention-module structure to learn the context information. Weight vectors of different scales in the attention network are concatenated through the feature fusion module at the last layer to effectively preserve the information of each path. In addition, effective learning is achieved by incorporating a novel loss function based on the invariant index of the salient object scale and dice coefficient loss along with the cross entropy loss. As a result of the comparison with the state-of-the-art methods for five different datasets, we demonstrate that the proposed network performs best in most cases. The proposed method also outperforms existing methods in the sense of processing speed in frames per second. In addition to the quantitative evaluation, qualitative performance of the proposed method is much better than others especially in camouflage cases.