I. Introduction
Object detection aims to localize a set of objects of interest in an image and recognize their categories. Dense priors have always been a cornerstone of successful detectors. In classic computer vision, the sliding-window paradigm, in which a classifier is applied on a dense image grid, was the leading detection method for decades [12], [16], [62]. Modern mainstream one-stage detectors pre-define marks on a dense feature map grid, such as anchor boxes [37], [46] or reference points [59], [85], and predict the relative scaling and offsets to the bounding boxes of objects, as well as the corresponding categories. As shown in Fig. 1(a), RetinaNet [37] enumerates k anchor boxes at each grid cell of an H\times W feature map. The total number of anchor boxes is about 10^{4} to 10^{5}. Although two-stage pipelines work on a sparse set of proposal boxes, their proposal generation algorithms are still built on dense candidates [20], [47]. As shown in Fig. 1(b), Faster R-CNN selects N (about 300 to 2000) predicted proposal boxes from hundreds of thousands of candidates in the Region Proposal Network (RPN) [47], extracts image features within the corresponding regions, and then refines the bounding box locations and classifies the categories.
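To make the scale of the dense prior concrete, the following minimal sketch counts the anchors obtained when k boxes are placed at every cell of each feature map, and enumerates the k boxes at one cell. The 640 x 640 input size, the pyramid strides (8 to 128), the choice of k = 9 (three scales by three aspect ratios), and the function names are illustrative assumptions, not values or code taken from this text.

```python
import math


def count_dense_anchors(image_h, image_w, strides=(8, 16, 32, 64, 128), k=9):
    """Count anchors when k boxes are placed at every cell of each feature map.

    strides and k are assumed values for illustration only.
    Returns (total, per_level), where per_level holds (stride, H, W, count).
    """
    total = 0
    per_level = []
    for stride in strides:
        # Feature map size H x W at this level of the (assumed) pyramid.
        feat_h = math.ceil(image_h / stride)
        feat_w = math.ceil(image_w / stride)
        count = feat_h * feat_w * k
        per_level.append((stride, feat_h, feat_w, count))
        total += count
    return total, per_level


def anchors_at_cell(cx, cy, base_size,
                    scales=(1.0, 2 ** (1 / 3), 2 ** (2 / 3)),
                    ratios=(0.5, 1.0, 2.0)):
    """Enumerate len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred at (cx, cy).

    ratio is height / width; each box keeps the area (base_size * scale) ** 2.
    """
    boxes = []
    for scale in scales:
        for ratio in ratios:
            area = (base_size * scale) ** 2
            w = math.sqrt(area / ratio)
            h = w * ratio
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


if __name__ == "__main__":
    total, per_level = count_dense_anchors(640, 640)
    for stride, feat_h, feat_w, count in per_level:
        print(f"stride {stride:3d}: {feat_h} x {feat_w} cells, {count} anchors")
    print("total anchors:", total)  # 76725 for the assumed settings
```

Under these assumed settings the count falls in the 10^{4} to 10^{5} range quoted above; larger inputs or more anchors per cell push it toward the upper end.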