I. Introduction
Salient object detection (SOD) aims to detect the most visually conspicuous objects or regions in an image [1], [2], [3], [4], [5], [6], [7], [8], [9]. It has a wide range of computer vision applications, such as human-robot interaction [10], content-aware image editing [11], image retrieval [12], object recognition [13], image thumbnailing [14], weakly supervised learning [15], etc.

In the last decade, convolutional neural networks (CNNs) have significantly pushed this field forward. Intuitively, global contextual information (found in the top CNN layers) is essential for locating salient objects, while local fine-grained information (found in the bottom CNN layers) is helpful for refining object details [1], [8], [9], [16], [17], [18], [19]. This is why U-shaped encoder-decoder CNNs have dominated this field [2], [3], [16], [17], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36]: the encoder extracts multi-level deep features from raw images, and the decoder integrates the extracted features through skip connections to make image-to-image predictions [3], [16], [17], [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [37]. The encoder is usually an existing CNN backbone, e.g., ResNet [38], while most efforts are put into the design of the decoder [30], [31], [32], [33], [35]. Although remarkable progress has been made in this direction, CNN-based encoders share the intrinsic limitation of extracting features from images in a local manner. The lack of powerful global modeling has been the main bottleneck for CNN-based SOD.
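The U-shaped encoder-decoder data flow described above can be sketched with toy operations, using average pooling and nearest-neighbor upsampling as stand-ins for convolutional layers (all names here are illustrative, not taken from any cited work):

```python
import numpy as np

def pool2(x):
    # 2x2 average pooling: halves spatial resolution (encoder downsampling)
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample2(x):
    # nearest-neighbor upsampling: doubles spatial resolution (decoder)
    return x.repeat(2, axis=0).repeat(2, axis=1)

def u_shape(image, depth=3):
    # Encoder: extract multi-level features, each level coarser
    # (more "global") than the previous one
    feats = [image]
    for _ in range(depth):
        feats.append(pool2(feats[-1]))
    # Decoder: start from the coarsest feature and fuse finer
    # encoder features back in through skip connections
    x = feats[-1]
    for skip in reversed(feats[:-1]):
        x = upsample2(x) + skip  # skip connection restores local detail
    return x  # same resolution as the input: an image-to-image prediction
```

The sketch only illustrates the shape of the computation: the decoder output has the same resolution as the input, and each decoding step combines an upsampled coarse feature with a finer skip feature, mirroring how real SOD decoders fuse multi-level backbone features.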