
Cross-Modal and Cross-Level Attention Interaction Network for Salient Object Detection


Impact Statement:
RGB-D salient object detection (SOD) is a crucial preprocessing technique in various computer vision tasks such as image retrieval, image compression, and person search. However, improving detection accuracy has always been a challenging problem. In this article, we propose CAINet, which utilizes the Swin Transformer as its backbone. It consists of four components: the CIEM is responsible for fusing the complementary information of the two modalities; the HCRM aims to accurately locate salient regions in the early stage of prediction; the MAID supplements information from different levels during the prediction process; and the EEM efficiently enhances the edges of the saliency map. With these modules cooperating with one another, our CAINet achieves excellent performance. The encouraging results increase the possibility of applying RGB-D SOD in artificial intelligence fields such as autonomous driving and medical imaging.

Abstract:

Most existing RGB-D salient object detection methods utilize convolutional neural networks (CNNs) to extract features. However, CNNs fail to capture global information due to the inherent limitation of the sliding-window mechanism. On the other hand, with the emergence of depth cues, how to effectively incorporate cross-modal features has become an underlying challenge. In addition, in terms of cross-level feature fusion, most methods do not fully consider the complementarity between different layers and usually adopt simple fusion strategies, thereby leading to the loss of detailed information. To alleviate these issues, a cross-modal and cross-level attention interaction network (CAINet) is proposed. First, different from most existing methods, we adopt two-stream Swin Transformers to extract RGB and depth features. Second, a high-level context refinement module (HCRM) is designed to further extract refined features and provide accurate guidance in the early prediction stage. Third, we design a cross...
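The abstract describes the overall design but not the internals of the modules (CIEM, HCRM, MAID, EEM). The following is a minimal PyTorch sketch of the general two-stream idea only: twin backbones, per-level cross-modal fusion, and top-down cross-level aggregation. Everything here is a hypothetical stand-in, not the paper's implementation; CrossModalFusion and TwoStreamSOD are invented names, and strided conv stems substitute for the pretrained Swin Transformer stages.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """Hypothetical stand-in for a cross-modal interaction module:
    each modality is re-weighted by channel attention computed from
    the other, then the two enhanced features are merged."""
    def __init__(self, channels: int):
        super().__init__()
        self.rgb_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.depth_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.merge = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, rgb, depth):
        rgb_enh = rgb * self.depth_gate(depth)    # depth guides RGB
        depth_enh = depth * self.rgb_gate(rgb)    # RGB guides depth
        return self.merge(torch.cat([rgb_enh, depth_enh], dim=1))

class TwoStreamSOD(nn.Module):
    """Two-stream encoder with per-level cross-modal fusion and a simple
    top-down decoder. Strided conv stems stand in for Swin stages; the
    channel widths mirror the usual Swin-T progression."""
    def __init__(self, dims=(96, 192, 384, 768)):
        super().__init__()
        def stage(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, 2, 1), nn.GELU())
        chans = [3] + list(dims[:-1])
        self.rgb_stages = nn.ModuleList(stage(c, d) for c, d in zip(chans, dims))
        chans[0] = 1  # depth maps are single-channel
        self.depth_stages = nn.ModuleList(stage(c, d) for c, d in zip(chans, dims))
        self.fusions = nn.ModuleList(CrossModalFusion(d) for d in dims)
        self.laterals = nn.ModuleList(nn.Conv2d(d, dims[0], 1) for d in dims)
        self.head = nn.Conv2d(dims[0], 1, 1)  # saliency logits

    def forward(self, rgb, depth):
        fused, r, d = [], rgb, depth
        for rs, ds, fu, lat in zip(self.rgb_stages, self.depth_stages,
                                   self.fusions, self.laterals):
            r, d = rs(r), ds(d)
            fused.append(lat(fu(r, d)))  # fuse modalities at every level
        x = fused[-1]
        for f in reversed(fused[:-1]):   # cross-level top-down aggregation
            x = f + F.interpolate(x, size=f.shape[-2:], mode="bilinear",
                                  align_corners=False)
        return self.head(x)

rgb = torch.randn(1, 3, 224, 224)
depth = torch.randn(1, 1, 224, 224)
print(TwoStreamSOD()(rgb, depth).shape)  # torch.Size([1, 1, 112, 112])
```

A faithful reproduction would swap the conv stems for pretrained Swin Transformer stages and replace the simple channel gates and additive decoder with the paper's CIEM, HCRM, MAID, and EEM designs.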
Published in: IEEE Transactions on Artificial Intelligence ( Volume: 5, Issue: 6, June 2024)
Page(s): 2907 - 2920
Date of Publication: 20 November 2023
Electronic ISSN: 2691-4581
