Impact Statement:
RGB-D salient object detection (SOD) is a crucial preprocessing technique in various computer vision tasks such as image retrieval, image compression, and person search. However, improving detection accuracy remains a challenging problem. In this article, we propose CAINet, which utilizes the Swin Transformer as its backbone. It consists of four components: the CIEM fuses the complementary information of the two modalities; the HCRM accurately locates salient regions in the early prediction stage; the MAID supplements information across different levels during prediction; and the EEM efficiently enhances the edges of the saliency map. With these modules cooperating with one another, our CAINet achieves excellent performance. The encouraging results increase the feasibility of applying RGB-D SOD in artificial intelligence fields such as autonomous driving and medical imaging.
Abstract:
Most existing RGB-D salient object detection methods utilize convolutional neural networks (CNNs) to extract features. However, they fail to capture global information due to the inherent limitation of the sliding-window mechanism. On the other hand, with the introduction of depth cues, how to effectively incorporate cross-modal features has become an underlying challenge. In addition, in terms of cross-level feature fusion, most methods do not fully consider the complementarity between different layers and usually adopt simple fusion strategies, leading to the loss of detailed information. To relieve these issues, a cross-modal and cross-level attention interaction network (CAINet) is proposed. First, unlike most existing methods, we adopt two-stream Swin Transformers to extract RGB and depth features. Second, a high-level context refinement module (HCRM) is designed to further extract refined features and provide accurate guidance in the early prediction stage. Third, we design a cross...
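The abstract names the backbone and modules but, being an abstract, gives no implementation details. The PyTorch sketch below is therefore a minimal, hypothetical rendering of the two-stream design under stated assumptions: EncoderStage is a convolutional stand-in for a Swin Transformer stage, CIEM is sketched as channel-attention-weighted fusion, HCRM is reduced to a single refinement convolution producing a coarse guidance map, and MAID and EEM are omitted for brevity. All module internals are assumptions, not the authors' implementation.

# Illustrative sketch only; every module body here is a hypothetical stand-in.
import torch
import torch.nn as nn

class EncoderStage(nn.Module):
    """Placeholder for one Swin Transformer stage (downsamples by 2)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
            nn.BatchNorm2d(c_out), nn.GELU())
    def forward(self, x):
        return self.proj(x)

class CIEM(nn.Module):
    """Assumed cross-modal fusion: channel attention over concatenated features."""
    def __init__(self, c):
        super().__init__()
        self.att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * c, c, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(2 * c, c, 1)
    def forward(self, rgb, depth):
        cat = torch.cat([rgb, depth], dim=1)
        return self.fuse(cat) * self.att(cat)

class CAINetSketch(nn.Module):
    def __init__(self, chans=(32, 64, 128, 256)):
        super().__init__()
        dims = (3,) + chans
        self.rgb_stages = nn.ModuleList(
            EncoderStage(dims[i], dims[i + 1]) for i in range(4))
        self.depth_stages = nn.ModuleList(
            EncoderStage(dims[i], dims[i + 1]) for i in range(4))
        self.ciems = nn.ModuleList(CIEM(c) for c in chans)
        # HCRM stand-in: refine the deepest fused feature for early guidance.
        self.hcrm = nn.Conv2d(chans[-1], chans[-1], 3, padding=1)
        self.head = nn.Conv2d(chans[-1], 1, 1)

    def forward(self, rgb, depth):
        fused = []
        r, d = rgb, depth
        for rs, ds, ciem in zip(self.rgb_stages, self.depth_stages, self.ciems):
            r, d = rs(r), ds(d)
            fused.append(ciem(r, d))              # cross-modal fusion per level
        coarse = self.head(self.hcrm(fused[-1]))  # coarse saliency guidance
        return torch.sigmoid(coarse)

if __name__ == "__main__":
    rgb = torch.randn(1, 3, 224, 224)
    depth = torch.randn(1, 3, 224, 224)           # depth replicated to 3 channels
    print(CAINetSketch()(rgb, depth).shape)       # torch.Size([1, 1, 14, 14])

A full decoder would additionally aggregate the multi-level fused features (the role the abstract assigns to MAID) before edge enhancement; the sketch stops at the coarse guidance map to keep the two-stream fusion structure visible.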
Published in: IEEE Transactions on Artificial Intelligence (Volume: 5, Issue: 6, June 2024)