Sparse R-CNN: An End-to-End Framework for Object Detection

Abstract:

Object detection is one of the most fundamental computer vision tasks. Existing work on object detection relies heavily on dense object candidates, such as $k$ anchor boxes pre-defined on all grids of an image feature map of size $H\times W$. In this paper, we present Sparse R-CNN, a very simple and sparse method for object detection in images. In our method, a fixed sparse set of learned object proposals ($N$ in total) is provided to the object recognition head to perform classification and localization. By replacing $HWk$ (up to hundreds of thousands) hand-designed object candidates with $N$ (e.g., 100) learnable proposals, Sparse R-CNN makes all efforts related to object candidate design and one-to-many label assignment completely obsolete. More importantly, Sparse R-CNN directly outputs predictions without the non-maximum suppression (NMS) post-processing procedure. Thus, it establishes an end-to-end object detection framework. Sparse R-CNN demonstrates highly competitive accuracy, run-time, and training convergence performance compared with well-established detector baselines on the challenging COCO and CrowdHuman datasets. We hope that our work can inspire a rethinking of the convention of dense priors in object detectors and the design of new high-performance detectors.

Published in: IEEE Transactions on Pattern Analysis and Machine Intelligence ( Volume: 45, Issue: 12, December 2023)

Page(s): 15650 - 15664

Date of Publication: 04 July 2023

PubMed ID: 37402189

DOI: 10.1109/TPAMI.2023.3292030

I. Introduction

Object detection aims at localizing a set of objects of interest in an image and recognizing their categories. Dense priors have long been a cornerstone of detector success. In classic computer vision, the sliding-window paradigm, in which a classifier is applied on a dense image grid, led detection methods for decades [12], [16], [62]. Modern mainstream one-stage detectors pre-define marks on a dense feature map grid, such as anchor boxes [37], [46] or reference points [59], [85], and predict the relative scaling and offsets to the bounding boxes of objects, as well as the corresponding categories. As shown in Fig. 1(a), RetinaNet [37] enumerates $k$ anchor boxes on each grid cell of an $H\times W$ feature map. The total number of anchor boxes is about $10^{4}$ to $10^{5}$. Although two-stage pipelines work on a sparse set of proposal boxes, their proposal generation algorithms are still built on dense candidates [20], [47]. As shown in Fig. 1(b), Faster R-CNN selects $N$ (about 300 to 2000) predicted proposal boxes from hundreds of thousands of candidates in the Region Proposal Network (RPN) [47], extracts their image features within the corresponding regions, and then refines the bounding box locations and classifies the categories.
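To make the scale of the dense prior concrete, the following minimal sketch counts the $HWk$ candidates produced when $k$ anchor boxes are placed on every cell of each feature map level. The function name `dense_anchor_count`, the 640 × 640 input size, the five strides, and $k = 9$ are illustrative assumptions, not settings taken from this paper.

```python
import math


def dense_anchor_count(image_h, image_w, strides, k):
    """Count anchors when k boxes are placed on every feature map cell.

    Each stride s is assumed to yield a feature map of size
    ceil(image_h / s) x ceil(image_w / s); this is a simplification.
    """
    total = 0
    for s in strides:
        h = math.ceil(image_h / s)  # feature map height H at this level
        w = math.ceil(image_w / s)  # feature map width W at this level
        total += h * w * k          # H * W * k candidates at this level
    return total


# Hypothetical configuration chosen only for illustration.
strides = (8, 16, 32, 64, 128)
dense = dense_anchor_count(640, 640, strides, k=9)
sparse = 100  # N learnable proposals, the example value given in the abstract

print(f"dense candidates: {dense}, sparse proposals: {sparse}")
```

Under these assumed settings the dense count falls within the $10^{4}$ to $10^{5}$ range quoted above, which is several hundred times the $N = 100$ proposals mentioned in the abstract.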
