
Sparse R-CNN: An End-to-End Framework for Object Detection



Abstract:

Object detection is one of the most fundamental computer vision tasks. Existing detectors rely heavily on dense object candidates, such as $k$ anchor boxes pre-defined on every grid cell of an image feature map of size $H \times W$. In this paper, we present Sparse R-CNN, a very simple and sparse method for object detection in images. In our method, a fixed sparse set of learned object proposals ($N$ in total) is provided to the object recognition head to perform classification and localization. By replacing the $HWk$ (up to hundreds of thousands) hand-designed object candidates with $N$ (e.g., 100) learnable proposals, Sparse R-CNN makes all efforts related to object-candidate design and one-to-many label assignment completely obsolete. More importantly, Sparse R-CNN directly outputs predictions without the non-maximum suppression (NMS) post-processing step, and thus establishes an end-to-end object detection framework. Sparse R-CNN demonstrates accuracy, run-time, and training-convergence performance highly competitive with well-established detector baselines on the challenging COCO and CrowdHuman datasets. We hope that our work inspires a re-thinking of the dense-prior convention in object detectors and the design of new high-performance detectors.
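
The following is a minimal PyTorch sketch of the core idea in the abstract: a fixed set of $N$ learnable proposal boxes, paired with learned proposal features, is fed directly to a recognition head that outputs class logits and box deltas with no NMS. The module name SparseDetectionHead, the single-level feature map, and the simple additive fusion of pooled and proposal features are illustrative assumptions for brevity, not the authors' reference implementation (which uses dynamic instance interaction and iterative refinement).

# Minimal sketch, assuming a single FPN level and additive feature fusion;
# names are illustrative, not the authors' reference code.
import torch
import torch.nn as nn
from torchvision.ops import roi_align


class SparseDetectionHead(nn.Module):
    def __init__(self, num_proposals=100, feat_dim=256, num_classes=80):
        super().__init__()
        # N learnable proposal boxes in normalized (cx, cy, w, h) form,
        # initialized to cover the whole image.
        self.proposal_boxes = nn.Embedding(num_proposals, 4)
        nn.init.constant_(self.proposal_boxes.weight[:, :2], 0.5)
        nn.init.constant_(self.proposal_boxes.weight[:, 2:], 1.0)
        # One learned feature vector per proposal.
        self.proposal_feats = nn.Embedding(num_proposals, feat_dim)
        self.cls_head = nn.Linear(feat_dim, num_classes)
        self.reg_head = nn.Linear(feat_dim, 4)

    def forward(self, fpn_feat, image_size):
        # fpn_feat: (B, C, H, W) feature map; image_size: (height, width).
        B = fpn_feat.shape[0]
        h, w = image_size
        cx, cy, bw, bh = self.proposal_boxes.weight.unbind(-1)  # shared across images
        boxes = torch.stack(
            [(cx - bw / 2) * w, (cy - bh / 2) * h,
             (cx + bw / 2) * w, (cy + bh / 2) * h], dim=-1)     # (N, 4) in xyxy
        rois = [boxes for _ in range(B)]  # the same learned boxes for every image
        pooled = roi_align(fpn_feat, rois, output_size=7,
                           spatial_scale=fpn_feat.shape[-1] / w)
        # (B*N, C, 7, 7) -> (B, N, C), then condition on the proposal features.
        x = pooled.mean(dim=(-2, -1)).view(B, -1, pooled.shape[1])
        x = x + self.proposal_feats.weight.unsqueeze(0)
        return self.cls_head(x), self.reg_head(x)  # class logits, box deltas; no NMS


head = SparseDetectionHead()
feat = torch.randn(2, 256, 50, 67)              # e.g., a stride-16 map of an 800x1067 image
logits, deltas = head(feat, image_size=(800, 1067))
print(logits.shape, deltas.shape)               # torch.Size([2, 100, 80]) torch.Size([2, 100, 4])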
Page(s): 15650 - 15664
Date of Publication: 04 July 2023


PubMed ID: 37402189



I. Introduction

Object detection aims at localizing a set of objects of interest in an image and recognizing their categories. Dense priors have always been a cornerstone of successful detectors. In classic computer vision, the sliding-window paradigm, in which a classifier is applied over a dense image grid, led detection methods for decades [12], [16], [62]. Modern mainstream one-stage detectors pre-define marks on a dense feature-map grid, such as anchor boxes [37], [46] or reference points [59], [85], and predict the relative scalings and offsets to the bounding boxes of objects, as well as the corresponding categories. As shown in Fig. 1(a), RetinaNet [37] enumerates $k$ anchor boxes on each cell of an $H \times W$ feature map, for a total of roughly $10^{4}$–$10^{5}$ anchor boxes. Although two-stage pipelines operate on a sparse set of proposal boxes, their proposal-generation algorithms are still built on dense candidates [20], [47]. As shown in Fig. 1(b), Faster R-CNN selects $N$ (about 300 to 2000) predicted proposal boxes from hundreds of thousands of candidates produced by the Region Proposal Network (RPN) [47], extracts image features within the corresponding regions, and then refines the bounding-box locations and classifies the categories.
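
To make the scale gap concrete, the short computation below counts the dense candidates for a typical 800x1333 COCO input with FPN levels P3-P7 (strides 8 through 128) and k = 9 anchors per location, which are common RetinaNet defaults; the exact total depends on the configuration.

# Back-of-the-envelope count of dense anchor candidates, assuming an
# 800x1333 input, strides {8, 16, 32, 64, 128}, and 9 anchors per location.
import math

H, W, k = 800, 1333, 9
strides = [8, 16, 32, 64, 128]

per_level = [math.ceil(H / s) * math.ceil(W / s) * k for s in strides]
total_anchors = sum(per_level)

print(per_level)        # [150300, 37800, 9450, 2457, 693]
print(total_anchors)    # 200700 dense candidates, on the order of 10^5
print(100)              # vs. N = 100 learned proposals in Sparse R-CNN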
