A Survey of Deep Learning-Based Object Detection

Object detection is one of the most important and challenging branches of computer vision. It aims to locate instances of semantic objects of a certain class and has been widely applied in everyday life, for example in security monitoring and autonomous driving. With the rapid development of deep learning algorithms for detection tasks, the performance of object detectors has improved greatly. To give a thorough understanding of how the object detection pipeline has developed, this survey first analyzes the methods of existing typical detection models and describes the benchmark datasets. Afterwards, and primarily, we provide a systematic and comprehensive overview of a variety of object detection methods, covering both one-stage and two-stage detectors. Moreover, we list traditional and new applications and analyze some representative branches of object detection. Finally, we discuss how these object detection methods can be used to build an effective and efficient system, and we point out a set of development trends to help readers follow the state-of-the-art algorithms and further research.


I. INTRODUCTION
Object detection has been attracting increasing attention in recent years due to its wide range of applications and recent technological breakthroughs. The task is under extensive investigation in both academia and real-world applications, such as security monitoring, autonomous driving, transportation surveillance, drone scene analysis, and robotic vision. Among the many factors and efforts that have led to the fast evolution of image object detection techniques, a notable contribution should be attributed to the development of deep convolutional neural networks and GPU computing power. At present, deep learning models are widely used across the whole field of computer vision, including general image object detection and domain-specific object detection. Almost all state-of-the-art object detectors use deep learning networks as both their backbone and their detection network, for extracting features from the input images and for classification and localization, respectively. Object detection is a computer technology related to computer vision and image processing that deals with detecting instances of semantic objects of a certain class (such as humans, buildings, or cars) in digital images and videos. Well-researched domains of image object detection include multi-category detection, edge detection, salient object detection, pose detection, face detection, and pedestrian detection. Because a rising number of applications need scene understanding, image object detection, as an important part of it, has been widely used in many areas of modern life. Many benchmarks have played an important role in the object detection field so far, such as Caltech [1], KITTI [2], ImageNet [3], PASCAL VOC [4], and MS COCO [5]. In the ECCV VisDrone 2018 contest, the organizers released a novel benchmark dataset that contains a large number of images and videos captured from a drone platform.
Existing domain-specific image object detectors can usually be divided into two categories. One is the two-stage detector, whose most representative example is Faster R-CNN [6]. The other is the one-stage detector, such as YOLO [7] and SSD [8]. Two-stage detectors have high localization and object recognition accuracy, whereas one-stage detectors achieve high inference speed. The two stages of a two-stage detector are separated by the RoI (Region of Interest) pooling layer. In Faster R-CNN, for instance, the first stage, called the RPN (Region Proposal Network), proposes candidate object bounding boxes. In the second stage, features are extracted from each candidate box by the RoIPool operation for the subsequent classification and bounding-box regression tasks [9]. Fig.1 (a) shows the basic architecture of two-stage detectors. One-stage detectors predict boxes from input images directly, without a region proposal step, so they are time efficient and can be used on real-time devices. Fig.1 (b) shows the basic architecture of one-stage detectors.
Our survey focuses on describing and analyzing deep learning-based image object detection. Existing surveys typically cover a series of domains of general object detection and, because of the field's rapid development, may not include the state-of-the-art methods that provide novel solutions and new directions for these tasks. We list very recent novel solutions and leave out the basics so that readers can see the cutting edge of the field more easily. Unlike previous object detection surveys, this paper systematically and comprehensively reviews deep learning-based object detection methods and, most importantly, the up-to-date detection solutions and research trends. Our survey features in-depth analysis and discussion of various aspects, many of which, to the best of our knowledge, are covered for the first time in this field. Our intention is to provide an overview of how different deep learning methods are being used rather than a full summary of all related papers. To get into the field, we recommend that readers refer to [10] [11] [12] for more details on early methods.
The rest of the paper is organized as follows. Image object detectors need a powerful backbone network for rich feature extraction, and we discuss backbone networks in Section II. The typical pipelines of domain-specific image detectors serve as the basis and milestones of the task. In Section III, we elaborate on the most representative and pioneering deep learning-based approaches proposed before June 2019. The commonly used datasets and metrics are described in Section IV. General image object detection methods are analyzed systematically in Section V. In Section VI, we describe five typical fields for object detection and several popular branches of object detection. The development trends are summarized in Section VII.
arXiv:1907.09408v1 [cs.CV] 11 Jul 2019

II. BACKBONE NETWORKS
A backbone network acts as the basic feature extractor for the object detection task: it takes images as input and outputs feature maps of the corresponding input image. Most backbone networks are classification networks with the last fully connected layers removed. Improved versions of basic classification networks are also available. For instance, Lin et al. [13] add or remove layers, or replace some layers with specially designed ones. To better meet specific requirements, some works [7] [14] use a newly designed backbone for feature extraction.
Depending on the required trade-off between accuracy and efficiency, one can choose deeper and densely connected backbones, such as ResNet [9], ResNeXt [15], and AmoebaNet [16], or lightweight backbones, such as MobileNet [17], ShuffleNet [18], SqueezeNet [19], Xception [20], and MobileNetV2 [21]. Lightweight backbones can meet the requirements of mobile devices. Wang et al. [22] propose a novel real-time object detection system by combining PeleeNet with SSD [8] and optimizing the architecture for fast processing speed. Applications that demand higher precision, however, need high accuracy and therefore more complex backbones. Real-time scenarios such as video or webcam input, on the other hand, need not only high processing speed but also high accuracy [7], which requires a carefully designed backbone that fits the detection architecture and strikes a trade-off between speed and accuracy.
To achieve more competitive detection accuracy, deeper and densely connected backbones are adopted in place of their shallower and sparsely connected counterparts. He et al. [9] use ResNet [23] rather than VGG [24], which is adopted in Faster R-CNN [6], for a further accuracy gain because of its high capacity to capture rich features.
Newer high-performance classification networks can improve the precision and reduce the complexity of the object detection task. Adopting them is an effective way to further improve detector performance because the backbone network acts as the feature extractor. The quality of the features determines the upper bound of network performance, so backbone design is an important step that needs further exploration. Please refer to [25] for more details.

III. TYPICAL BASELINES
With the advent of deep learning and increasing computing power, great progress has been made in the general object detection domain. Since the first CNN-based object detector, R-CNN, was proposed, a series of significant contributions have promoted the development of general object detection. We introduce some representative object detection architectures to help beginners get started in this domain.

A. R-CNN
R-CNN is a region-based CNN detector. Ross Girshick et al. [26] proposed R-CNN for object detection tasks, and their work was the first to show that a CNN could lead to dramatically higher object detection performance on the PASCAL VOC datasets [4] than systems based on simpler HOG-like features. Deep learning was thereby verified to be effective and efficient in the field of object detection.
The R-CNN detector consists of four modules. The first module generates category-independent region proposals. The second module extracts a fixed-length feature vector from each region proposal. The third module is a set of class-specific linear SVMs that classify the objects in an image. The last module is a bounding-box regressor for precise bounding-box prediction. In detail, the authors first adopt the selective search method to generate region proposals. Then a CNN extracts a 4096-dimensional feature vector from each region proposal. Because the fully connected layers need input vectors of fixed length, all region proposals must be brought to the same size. The authors adopt a fixed 227 × 227 pixel input size for the CNN. Objects in different images vary in size and aspect ratio, so the region proposals extracted by the first module differ in size. Regardless of the size or aspect ratio of a candidate region, the authors warp all pixels in a tight bounding box around it to the required 227 × 227 size. The feature extraction network consists of five convolutional layers and two fully connected layers. All CNN parameters are shared across all categories, whereas each category trains its own SVM, and the SVMs do not share parameters with one another.
Pre-training on a larger dataset followed by fine-tuning on the target dataset is a good training method for deep convolutional neural networks to achieve fast convergence. First, Ross Girshick et al. [26] pre-train the CNN on a large-scale dataset (the ImageNet classification dataset [3]). Next, the CNN's ImageNet-specific 1000-way classification layer is replaced by an (N+1)-way classification layer (N object classes plus 1 for background), which is randomly initialized. The CNN parameters are then fine-tuned on the warped proposal windows using SGD (stochastic gradient descent).
Fig. 1 (caption, partially recovered). (b) shows the basic architecture of one-stage detectors, which predict bounding boxes from input images directly. Yellow cubes are a series of Conv layers (called a block) with the same resolution in the backbone network; because of the down-sampling operation after each block, the size of the following cubes gradually becomes smaller. Thick blue cubes are a series of Conv layers containing one or more convolutional layers. The flat blue cube represents the RoI pooling layer, which generates features of the same size for each object.

The authors define positive and negative examples differently in two settings. In the fine-tuning process, the IoU (intersection over union) overlap threshold is 0.5: region proposals below the threshold are defined as negatives, and those above it are defined as positives. Each positive proposal is assigned to the ground-truth box with which it has maximum IoU overlap. When training the SVMs, in contrast, only the ground-truth boxes are taken as positive examples for their respective classes, and proposals that have less than 0.3 IoU overlap with all ground-truth instances of a class are taken as negatives for that class. Because the proposals with overlap between 0.5 and 1 that are not ground truth expand the number of positive examples by approximately 30×, this larger set effectively helps avoid overfitting when fine-tuning the entire network.
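The two labeling rules can be illustrated with a minimal Python sketch. The function names, the (x1, y1, x2, y2) box format, and the convention that an IoU of exactly 0.5 counts as positive are assumptions made for illustration, not details stated in this survey.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def finetune_label(proposal, gt_boxes, gt_classes, pos_thresh=0.5):
    """Fine-tuning rule: the class of the best-overlapping ground-truth box
    if IoU >= pos_thresh (assumed inclusive), otherwise background (0)."""
    if not gt_boxes:
        return 0
    overlaps = [iou(proposal, gt) for gt in gt_boxes]
    best = max(range(len(gt_boxes)), key=lambda i: overlaps[i])
    return gt_classes[best] if overlaps[best] >= pos_thresh else 0


def svm_label(proposal, gt_boxes_of_class, is_ground_truth, neg_thresh=0.3):
    """SVM rule for one class: +1 for ground-truth boxes, -1 for proposals
    below neg_thresh IoU with every instance, None (ignored) otherwise."""
    if is_ground_truth:
        return 1
    max_iou = max((iou(proposal, gt) for gt in gt_boxes_of_class), default=0.0)
    return -1 if max_iou < neg_thresh else None
```

A proposal with IoU between 0.3 and 1 that is not itself a ground-truth box is ignored by the SVM rule but may still be a positive for fine-tuning, which is why the fine-tuning set is so much larger.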

B. Fast R-CNN
A year after R-CNN was proposed, Ross Girshick [27] proposed a faster, improved version called Fast R-CNN. Because R-CNN performs a ConvNet forward pass for each region proposal without sharing computation, it spends a long time on SVM classification. Fast R-CNN extracts features from the entire input image once and then passes them through the region of interest (RoI) pooling layer to obtain fixed-size features, which are the input of the subsequent fully connected layers for classification and bounding-box regression. In contrast to R-CNN, which feeds every region proposal to the CNN separately, the features are extracted from the entire image only once and are used for classification and localization together, which saves much of the CNN processing time as well as the large disk storage otherwise needed to store the features. As mentioned above, training R-CNN is a multi-stage process that covers a pre-training stage, a fine-tuning stage, an SVM classification stage, and a bounding-box regression stage. Fast R-CNN, by contrast, is trained end to end in a single stage, using a multi-task loss on each labeled RoI to jointly train for classification and bounding-box regression.
Another improvement is that Fast R-CNN uses an RoI pooling layer to extract a fixed-size feature map from region proposals of different sizes. This operation needs no region warping and preserves the spatial information of the region proposal features. For fast detection, Ross Girshick uses truncated SVD, which accelerates the forward pass through the fully connected layers.
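As a rough illustration of how RoI pooling produces a fixed-size output from regions of different sizes, the following NumPy sketch max-pools a single-channel feature map. The function name `roi_max_pool`, the inclusive integer RoI format, and the floor/ceil bin boundaries are assumptions chosen for this sketch, not the reference implementation.

```python
import numpy as np


def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool one RoI of a (H, W) feature map into a fixed-size grid.

    roi = (x1, y1, x2, y2): integer, inclusive feature-map coordinates.
    """
    x1, y1, x2, y2 = roi
    out_h, out_w = output_size
    roi_h, roi_w = y2 - y1 + 1, x2 - x1 + 1
    pooled = np.zeros(output_size, dtype=feature_map.dtype)
    for i in range(out_h):
        # Quantized bin boundaries along the height.
        ys = y1 + int(np.floor(i * roi_h / out_h))
        ye = y1 + int(np.ceil((i + 1) * roi_h / out_h))
        for j in range(out_w):
            # Quantized bin boundaries along the width.
            xs = x1 + int(np.floor(j * roi_w / out_w))
            xe = x1 + int(np.ceil((j + 1) * roi_w / out_w))
            pooled[i, j] = feature_map[ys:ye, xs:xe].max()
    return pooled
```

Whatever the RoI size, the output always has shape `output_size`, so the following fully connected layers receive a fixed-length input without any warping of the image region.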
Experimental results show that Fast R-CNN achieves 66.9% mAP, compared to 66.0% for R-CNN, on the PASCAL VOC 2007 dataset [4]. Training time drops from 84 hours for R-CNN to 9.5 hours, about 9 times faster. For test rate (s/image), Fast R-CNN takes 0.32 s per image, or 0.22 s with truncated SVD, which is 146× and 213× faster than R-CNN (47 s), respectively. These experiments were run on an Nvidia K40 GPU, and all of them demonstrate that Fast R-CNN does accelerate object detection.

C. Faster R-CNN
Faster R-CNN [6], proposed three months after Fast R-CNN, further improves the region-based CNN baseline. Fast R-CNN uses selective search to propose RoIs, which is slow and needs as much running time as the detection network itself. Faster R-CNN replaces it with a novel RPN (region proposal network), a fully convolutional network that efficiently predicts region proposals with a wide range of scales and aspect ratios. The RPN speeds up the generation of region proposals because it shares full-image convolutional features and a common set of convolutional layers with the detection network. The procedure is simplified in Fig.3 (b). Another novelty for detecting objects of different sizes is the use of multi-scale anchors as references. The anchors greatly simplify the generation of region proposals of various sizes, with no need for multiple scales of input images or features. A fixed-size window (3 × 3) slides over the output feature maps of the last shared convolutional layer; the center point of each window corresponds to a point in the original input image, which is the center point of k anchor boxes (k = 3 × 3 = 9). The authors define anchor boxes with 3 different scales and 3 aspect ratios. Each region proposal is parameterized relative to a reference anchor box. The distance between the predicted box and the corresponding ground truth is then measured to optimize the location of the predicted boxes.
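The anchor mechanism can be sketched as follows. The concrete scales (128, 256, 512 pixels), the aspect ratios (1:2, 1:1, 2:1), and the offset parameterization (tx, ty, tw, th) are taken from the Faster R-CNN paper rather than from this survey and should be read as assumptions; the helper names are hypothetical.

```python
import numpy as np


def make_anchors(center_x, center_y, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors (x1, y1, x2, y2) around one point.

    ratio is height / width; every anchor of a given scale s has area s * s.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append([center_x - w / 2, center_y - h / 2,
                            center_x + w / 2, center_y + h / 2])
    return np.array(anchors)


def to_center(box):
    """Convert (x1, y1, x2, y2) to (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1


def encode(box, anchor):
    """Parameterize a box relative to a reference anchor as (tx, ty, tw, th)."""
    x, y, w, h = to_center(box)
    xa, ya, wa, ha = to_center(anchor)
    return np.array([(x - xa) / wa, (y - ya) / ha, np.log(w / wa), np.log(h / ha)])
```

Because every sliding-window position reuses the same k anchor shapes, proposals of many sizes come from a single-scale feature map, and the regression target is zero exactly when a ground-truth box coincides with its anchor.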
Experiments indicated that Faster R-CNN greatly improves both precision and detection efficiency. On the PASCAL VOC 2007 test set, Faster R-CNN achieved an mAP of 69.9%, compared to 66.9% for Fast R-CNN, with shared convolutional computations. In addition, the total running time of Faster R-CNN (198 ms) is nearly 10 times lower than that of Fast R-CNN (1830 ms) with the same VGG [24] backbone, and the processing rate is 5 fps vs. 0.5 fps.

D. Mask R-CNN
Mask R-CNN [9] extends Faster R-CNN mainly for the instance segmentation task. Leaving aside the added parallel mask branch, Mask R-CNN can be seen as a more accurate object detector. Kaiming He et al. use Faster R-CNN with a ResNet [23]-FPN [13] backbone (feature pyramid network, a backbone that extracts RoI features from different levels of the feature pyramid according to their scale) for feature extraction, which achieves excellent precision and processing speed. FPN contains a bottom-up pathway and a top-down pathway with lateral connections. The bottom-up pathway is a backbone ConvNet that computes a feature hierarchy consisting of feature maps at several scales with a scaling step of 2. The top-down pathway produces higher-resolution features by upsampling spatially coarser, but semantically stronger, feature maps from higher pyramid levels. At the beginning, the top pyramid feature maps are taken from the output of the last convolutional layer of the bottom-up pathway. Each lateral connection merges feature maps of the same spatial size from the bottom-up pathway and the top-down pathway. When the dimensions of the feature maps differ, a 1 × 1 convolutional layer changes the dimension. Each lateral connection forms a new pyramid level, and predictions are made independently on each level. Because higher-resolution feature maps are important for detecting small objects while lower-resolution feature maps are rich in semantic information, the feature pyramid network extracts significant features.
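The top-down pathway with lateral connections can be sketched in NumPy as below. This is a simplified illustration under stated assumptions: nearest-neighbor upsampling by a factor of 2, merging by element-wise addition, random placeholder weights in the usage, and no smoothing convolution after the merge. The function names are hypothetical.

```python
import numpy as np


def conv1x1(x, weight):
    """1 x 1 convolution: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', weight, x)


def upsample2x(x):
    """Nearest-neighbor upsampling of a (C, H, W) map by a factor of 2."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def fpn_top_down(bottom_up, lateral_weights):
    """Build pyramid levels from bottom-up maps ordered fine to coarse.

    Each bottom-up map halves the resolution of the previous one. The 1 x 1
    lateral convolutions bring every level to the same channel dimension.
    """
    top = conv1x1(bottom_up[-1], lateral_weights[-1])  # coarsest level
    pyramid = [top]
    for feat, w in zip(reversed(bottom_up[:-1]), reversed(lateral_weights[:-1])):
        # Upsample the coarser level and merge it with the lateral feature.
        top = upsample2x(top) + conv1x1(feat, w)
        pyramid.append(top)
    return pyramid[::-1]  # fine to coarse, same order as the input
```

Every output level has the same number of channels, so a shared prediction head can be applied independently at each scale.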
Another way to improve accuracy is to replace RoI pooling with RoIAlign for extracting a small feature map from each RoI, as shown in Fig.2. Traditional RoI pooling quantizes floating-point numbers in two steps to obtain approximate feature values in each bin. First, quantization is applied when computing the coordinates of each RoI on the feature maps, given the coordinates of the RoI in the input image and the down-sampling stride. Then the RoI feature map is divided into bins to generate feature maps of the same size, and the bin boundaries are quantized as well. These two quantization operations cause misalignments between the RoI and the extracted features. To address this, RoIAlign avoids any quantization of the RoI boundaries or bins at both steps. It first computes the floating-point coordinates of each RoI on the feature map and then uses bilinear interpolation to compute the exact feature values at four regularly sampled locations in each RoI bin. It then aggregates the results using max or average pooling to obtain the value of each bin. Fig. 2 is an example of the RoIAlign operation.
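A minimal NumPy sketch of RoIAlign on a single-channel feature map follows. It assumes that integer coordinates coincide with feature-map cell centers, uses 2 × 2 = 4 sampling points per bin, and aggregates by averaging; these conventions and the function names are illustrative assumptions, and real implementations differ in details.

```python
import numpy as np


def bilinear(feature_map, y, x):
    """Bilinearly interpolate a (H, W) feature map at a continuous point (y, x)."""
    h, w = feature_map.shape
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feature_map[y0, x0] + dx * feature_map[y0, x1]
    bottom = (1 - dx) * feature_map[y1, x0] + dx * feature_map[y1, x1]
    return (1 - dy) * top + dy * bottom


def roi_align(feature_map, roi, output_size=(2, 2), samples=2):
    """Average-pool an RoI without rounding its boundaries or its bins.

    roi = (x1, y1, x2, y2): floating-point feature-map coordinates.
    """
    x1, y1, x2, y2 = roi
    out_h, out_w = output_size
    bin_h, bin_w = (y2 - y1) / out_h, (x2 - x1) / out_w
    pooled = np.zeros(output_size, dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            values = []
            for sy in range(samples):
                for sx in range(samples):
                    # Regularly spaced sampling points inside bin (i, j).
                    y = y1 + (i + (sy + 0.5) / samples) * bin_h
                    x = x1 + (j + (sx + 0.5) / samples) * bin_w
                    values.append(bilinear(feature_map, y, x))
            pooled[i, j] = np.mean(values)
    return pooled
```

On a feature map whose value equals its x coordinate, the pooled value of each bin is exactly the x coordinate of the bin center, which shows that no misalignment is introduced by rounding.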
Experiments showed that these two improvements increase precision. Using the ResNet-FPN backbone improved box AP by 1.7 points, and the RoIAlign operation improved box AP by 1.1 points on the MS COCO detection dataset.

E. YOLO
YOLO [7] (You Only Look Once) is a one-stage object detector proposed by Joseph Redmon et al. after Faster R-CNN [6]. Its main contribution is real-time detection on full images and webcam streams. First, the pipeline predicts fewer than 100 bounding boxes per image, whereas Fast R-CNN with selective search predicts 2000 region proposals per image. Second, YOLO frames detection as a regression problem, so a unified architecture can extract features from input images and directly predict bounding boxes and class probabilities. The YOLO base network runs at 45 frames per second with no batch processing on a Titan X GPU, compared to 0.5 fps for Fast R-CNN and 7 fps for Faster R-CNN.
The YOLO pipeline first divides the input image into an S × S grid, where each grid cell is responsible for detecting the objects whose centers fall into it. Each grid cell predicts B bounding boxes (x, y, w, h) with a confidence score for each, plus C-dimensional conditional class probabilities for C categories. The confidence score is the product of two parts: P(object), the probability that the box contains an object, and IOU (intersection over union), which shows how accurately the box covers that object. The feature extraction network contains 24 convolutional layers followed by 2 fully connected layers. For pre-training on the ImageNet dataset, the authors use the first 20 convolutional layers followed by an average pooling layer and a fully connected layer. For detection, the whole network is used for better performance. To obtain fine-grained visual information and improve detection precision, the input resolution is doubled from 224 × 224 in the pre-training stage to 448 × 448 in the detection stage.
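The size of the prediction tensor follows directly from this description: each cell outputs B boxes with 5 numbers each (x, y, w, h, confidence) plus C class probabilities. The short sketch below makes the arithmetic explicit. The values S = 7, B = 2, C = 20 are the PASCAL VOC settings from the YOLO paper, not from this survey, and the function names are hypothetical.

```python
def yolo_output_shape(S, B, C):
    """Shape of the YOLO prediction tensor: S x S cells, each with
    B boxes of 5 values (x, y, w, h, confidence) and C class probabilities."""
    return (S, S, B * 5 + C)


def class_specific_scores(p_object, iou, class_probs):
    """Box confidence P(object) * IOU, multiplied by the cell's
    conditional class probabilities to score each class for that box."""
    confidence = p_object * iou
    return [confidence * p for p in class_probs]


# Assumed PASCAL VOC settings: 7 x 7 grid, 2 boxes per cell, 20 classes.
shape = yolo_output_shape(7, 2, 20)   # (7, 7, 30)
num_boxes = 7 * 7 * 2                 # 98 boxes per image
```

With these settings the network predicts 98 boxes per image, consistent with the statement that YOLO predicts fewer than 100 bounding boxes.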
The experiments showed that YOLO is not good at accurate localization and that localization error is the main component of its prediction error. Fast R-CNN makes many background false-positive mistakes, whereas YOLO makes about 3 times fewer. Trained and tested on the PASCAL VOC dataset, YOLO achieves 63.4% mAP at 45 fps, compared to Fast R-CNN (70.0% mAP, 0.5 fps) and Faster R-CNN (73.2% mAP, 7 fps).

F. YOLOv2
YOLOv2 [28] is the second version of YOLO [7]. It adopts many design decisions from past works together with novel concepts to improve YOLO's speed and precision.
Batch Normalization. A fixed distribution of the inputs to a ConvNet layer has positive consequences for that layer. Normalizing over the entire training set is impractical because the optimization step uses stochastic gradient descent. Since SGD uses mini-batches during training, each mini-batch instead produces estimates of the mean and variance of each activation. The mean and variance are computed over the mini-batch of size m, and the m activations are then normalized to have zero mean and unit variance. As a result, the normalized activations of every mini-batch follow the same distribution. This operation can be seen as a BN layer [29] that outputs activations with the same distribution. YOLOv2 adds batch normalization to every convolutional layer, which accelerates convergence and helps regularize the model. Batch normalization yields more than 2% improvement in mAP.
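The normalization step can be written in a few lines of NumPy. This is a training-time sketch only: the learnable scale `gamma` and shift `beta` and the small constant `eps` come from the batch normalization paper [29], and the running statistics used at inference time are omitted.

```python
import numpy as np


def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each activation over a mini-batch.

    x has shape (m, features); statistics are computed over the m examples.
    """
    mean = x.mean(axis=0)                    # per-activation mini-batch mean
    var = x.var(axis=0)                      # per-activation mini-batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta              # learnable scale and shift
```

Whatever the scale and offset of the incoming activations, the normalized outputs of each mini-batch have (approximately) zero mean and unit variance before the scale and shift are applied.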
High Resolution Classifier. In the YOLO backbone, the classifier is trained at an input resolution of 224 × 224, and the resolution is then increased to 448 × 448 for detection. The network therefore has to adjust to a new input resolution when it switches to the object detection task. To address this, YOLOv2 adds a fine-tuning step for the classification network at 448 × 448 for 10 epochs on the ImageNet dataset, which increases mAP by almost 4%.
Convolutional With Anchor Boxes. In the original YOLO network, the coordinates of predicted boxes are generated directly by fully connected layers. Faster R-CNN instead uses anchor boxes as references and predicts offsets relative to them. YOLOv2 adopts this prediction mechanism: it first removes the fully connected layers and then predicts class and objectness for every anchor box. This change increases recall by 7% while decreasing mAP by 0.3%.
Predicting the size and aspect ratio of anchor boxes using dimension clusters. In Faster R-CNN, the sizes and aspect ratios of anchor boxes are chosen empirically. To make it easier for the network to learn to predict good detections, YOLOv2 runs K-means clustering on the training set bounding boxes to obtain good priors automatically. Using dimension clusters along with directly predicting the bounding box center location improves YOLO by almost 5% over the above version with anchor boxes.
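A minimal sketch of dimension clustering is given below. It assumes the distance d(box, centroid) = 1 − IoU(box, centroid) used in the YOLOv2 paper (a detail not stated in this survey), compares boxes by width and height only as if they shared a corner, and updates centroids by the mean; the function names and the random initialization are illustrative choices.

```python
import numpy as np


def wh_iou(boxes, centroids):
    """IoU between (n, 2) boxes and (k, 2) centroids given as (w, h),
    assuming all boxes share the same top-left corner."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    area_b = (boxes[:, 0] * boxes[:, 1])[:, None]
    area_c = (centroids[:, 0] * centroids[:, 1])[None, :]
    return inter / (area_b + area_c - inter)


def dimension_clusters(boxes, k, iterations=100, seed=0):
    """K-means on box (w, h) pairs with distance 1 - IoU."""
    boxes = np.asarray(boxes, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iterations):
        # Smallest 1 - IoU distance is the same as largest IoU.
        assign = np.argmax(wh_iou(boxes, centroids), axis=1)
        new = np.array([boxes[assign == c].mean(axis=0) if np.any(assign == c)
                        else centroids[c] for c in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids
```

Using IoU rather than Euclidean distance keeps large boxes from dominating the clustering error, so the resulting priors reflect box shapes at all sizes.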
Fine-Grained Features. High-resolution feature maps provide useful information for localizing smaller objects. Similar to the identity mappings in ResNet, YOLOv2 concatenates the higher-resolution features with the low-resolution features by stacking adjacent features into different channels, which gives a modest 1% performance increase.
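Stacking adjacent features into different channels amounts to a space-to-depth rearrangement. The NumPy sketch below is one way to write it; the function name `passthrough`, the channel ordering, and the 26 × 26 × 512 to 13 × 13 × 2048 example (taken from the YOLOv2 paper, not from this survey) are assumptions for illustration.

```python
import numpy as np


def passthrough(x, stride=2):
    """Move each stride x stride spatial neighborhood of a (C, H, W) map
    into the channel dimension, giving (C * stride**2, H / stride, W / stride)."""
    c, h, w = x.shape
    x = x.reshape(c, h // stride, stride, w // stride, stride)
    x = x.transpose(2, 4, 0, 1, 3)  # (row offset, col offset, C, H/s, W/s)
    return x.reshape(c * stride * stride, h // stride, w // stride)


# Assumed sizes: a 26 x 26 x 512 map becomes 13 x 13 x 2048 and can then be
# concatenated with a 13 x 13 low-resolution map along the channel axis.
high_res = np.zeros((512, 26, 26))
low_res = np.zeros((1024, 13, 13))
fused = np.concatenate([passthrough(high_res), low_res], axis=0)
```

No information is discarded by this rearrangement, unlike pooling: every high-resolution value is preserved, only moved to a different channel.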
Multi-Scale Training. To make the network robust to images of different sizes, every 10 batches the network randomly chooses a new image dimension size from {320, 352, ..., 608}. The same network can thus predict detections at different resolutions. At high-resolution detection, YOLOv2 achieves 78.6% mAP at 40 fps, compared to 63.4% mAP at 45 fps for YOLO on VOC 2007.
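The resizing schedule can be sketched as follows. The step of 32 between candidate sizes is read off the set {320, 352, ..., 608}; the function name, the seed, and the initial size are hypothetical choices for this illustration.

```python
import random


def multiscale_sizes(num_batches, low=320, high=608, stride=32, every=10, seed=0):
    """Return the input size used for each training batch.

    A new size is drawn from {low, low + stride, ..., high} every `every` batches.
    """
    rng = random.Random(seed)
    choices = list(range(low, high + 1, stride))  # 320, 352, ..., 608
    sizes, current = [], high
    for batch in range(num_batches):
        if batch % every == 0:
            current = rng.choice(choices)
        sizes.append(current)
    return sizes
```

Because the network is fully convolutional after the anchor-box change, the same weights can process any of these input sizes.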
In addition, YOLOv2 proposes a new classification backbone, Darknet-19, with 19 convolutional layers and 5 max-pooling layers, which requires fewer operations to process an image yet achieves high accuracy. The most competitive YOLOv2 version reaches 78.6% mAP at 40 fps, compared to 76.4% mAP at 5 fps for Faster R-CNN with a ResNet backbone and 76.8% mAP at 19 fps for SSD500. As described above, YOLOv2 achieves high detection precision together with a high processing rate, which it owes to 7 main improvements and a new backbone.

G. YOLOv3
YOLOv3 [30] is an improved version of YOLOv2. First, YOLOv3 uses multi-label classification (independent logistic classifiers) to adapt to more complex datasets containing many overlapping labels. Second, YOLOv3 predicts bounding boxes on feature maps of three different scales. The last convolutional layer predicts a 3-d tensor encoding class predictions, objectness, and bounding box. Third, YOLOv3 proposes a deeper and more robust feature extractor, called Darknet-53, inspired by ResNet.
According to experiments on the MS COCO dataset, YOLOv3 (AP: 33%) performs on par with the SSD variant DSSD513 (AP: 33.2%) under the MS COCO metrics while being 3 times faster, but it is quite a bit behind RetinaNet [31] (AP: 40.8%). Under the old detection metric of mAP at IOU = 0.5 (or AP50), however, YOLOv3 achieves 57.9% mAP, compared to 53.3% for DSSD513 and 61.1% for RetinaNet. Thanks to multi-scale predictions, YOLOv3 detects small objects better but performs comparatively worse on medium and larger objects.

H. RetinaNet
RetinaNet [31] is a one-stage object detector that uses focal loss as its classification loss function, proposed by Tsung-Yi Lin et al. [31] in February 2018. The architecture of RetinaNet is shown in Fig.4 (c). R-CNN is a typical two-stage object detector: the first stage generates a sparse set of region proposals, and the second stage classifies each candidate location. Because the first stage filters out the majority of negative locations, two-stage object detectors can achieve higher precision than one-stage detectors, which propose a dense set of candidate locations. The main reason is the extreme foreground-background class imbalance that one-stage detectors face during training. The authors therefore propose a loss function, called focal loss, which down-weights the loss assigned to well-classified or easy examples, so that training focuses on the hard examples and the vast number of easy negatives does not overwhelm the detector. RetinaNet inherits the fast speed of previous one-stage detectors while largely overcoming the difficulty that one-stage detectors have in training with unbalanced positive and negative examples.
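The down-weighting effect can be seen in a short NumPy sketch of the binary focal loss, FL(p_t) = −α_t (1 − p_t)^γ log(p_t). The formula and the default values γ = 2 and α = 0.25 are taken from the RetinaNet paper [31] rather than from this survey's text, and the function name and the clipping constant are assumptions.

```python
import numpy as np


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss per example.

    p: predicted foreground probability; y: 1 for foreground, 0 for background.
    """
    p = np.asarray(p, dtype=float)
    y = np.asarray(y)
    p_t = np.where(y == 1, p, 1.0 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)  # class balancing weight
    modulating = (1.0 - p_t) ** gamma               # near 0 for easy examples
    return -alpha_t * modulating * np.log(np.clip(p_t, eps, 1.0))
```

With γ = 2, an example classified correctly with probability 0.9 contributes 100 times less loss than under the α-weighted cross entropy, while a hard example with probability 0.1 keeps 81% of its loss. Setting γ = 0 recovers the α-weighted cross entropy.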
Experiments showed that RetinaNet with a ResNet-101-FPN backbone achieved 39.1% AP, compared to 33.2% AP for DSSD513, on the MS COCO test-dev dataset. With ResNeXt-101-FPN, it reached 40.8% AP, far surpassing the state-of-the-art one-stage detector DSSD513. RetinaNet improved the detection precision on small and medium objects by a large margin.

I. SSD
SSD [8] is a single-shot, one-stage detector for multiple categories. It directly predicts category scores and box offsets for a fixed set of default bounding boxes of different scales at each location in several feature maps of different scales, as shown in Fig.4 (a). The default bounding boxes have different aspect ratios and scales in each feature map. Across feature maps, the scale of the default bounding boxes is regularly spaced between the lowest layer and the highest layer, so that each specific feature map learns to be responsive to a particular scale of objects. For each default box, SSD predicts both the offsets and the confidences for all object categories. Fig.3 (c) shows the method. At training time, the default bounding boxes are matched to the ground-truth boxes; the matched default boxes are positive examples and the rest are negatives. Because the large majority of default boxes are negatives, the authors adopt hard negative mining: they sort the negatives by the confidence loss of each default box and pick the top ones so that the ratio between negatives and positives is at most 3:1. The authors also apply data augmentation, which proved to be an effective way to enhance precision by a large margin.
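Two of these steps lend themselves to a short sketch: the regularly spaced default-box scales and the 3:1 hard negative mining. The scale bounds s_min = 0.2 and s_max = 0.9 come from the SSD paper rather than this survey, and the function names are hypothetical.

```python
import numpy as np


def default_box_scales(m, s_min=0.2, s_max=0.9):
    """Scales for m feature maps, regularly spaced from s_min (lowest layer)
    to s_max (highest layer), as fractions of the input image size."""
    return [s_min + (s_max - s_min) * k / (m - 1) for k in range(m)]


def hard_negative_mask(conf_loss, is_positive, neg_pos_ratio=3):
    """Select the negatives with the highest confidence loss so that
    negatives : positives is at most neg_pos_ratio : 1."""
    conf_loss = np.asarray(conf_loss, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    num_neg = min(neg_pos_ratio * int(is_positive.sum()), int((~is_positive).sum()))
    neg_loss = np.where(is_positive, -np.inf, conf_loss)  # exclude positives
    order = np.argsort(-neg_loss)                         # highest loss first
    keep = np.zeros_like(is_positive)
    keep[order[:num_neg]] = True
    return keep
```

Only the positives and the selected hard negatives contribute to the confidence loss, which keeps the many easy background boxes from dominating training.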
Experiments showed that SSD512 with a VGG-16 [24] backbone achieved competitive results in both mAP and speed.

J. DSSD
DSSD [32] (Deconvolutional Single Shot Detector) is a modified version of SSD (Single Shot Detector) that adds a prediction module and a deconvolution module and uses ResNet-101 as the backbone. The architecture of DSSD is shown in Fig.4 (b). In the prediction module, Fu et al. add a residual block to each prediction layer and then perform element-wise addition of the outputs of the prediction layer and the residual block. The deconvolution module increases the resolution of feature maps to strengthen the features. Each deconvolution layer is followed by a prediction module so as to predict objects of various sizes. Training proceeds in three steps: first, the ResNet-101-based backbone network is pre-trained on the ILSVRC CLS-LOC dataset; then the original SSD model is trained on the detection dataset with 321 × 321 or 513 × 513 inputs; finally, the deconvolution module is trained with all the weights of the SSD module frozen. Experiments on both the PASCAL VOC and MS COCO datasets show the effectiveness of the DSSD513 model; the added prediction module and deconvolution module bring a 2.2% improvement on the PASCAL VOC 2007 test dataset.

K. RefineDet
RefineDet [33] contains two inter-connected modules, the anchor refinement module and the object detection module. These two modules are connected by a transfer connection block, which transfers and enhances features from the former module so that the latter module can better predict objects. Training is end to end, and the pipeline consists of three stages: preprocessing, detection (the two inter-connected modules), and NMS.
Classical one-stage detectors such as SSD, YOLO, and RetinaNet all use a one-step regression method to obtain the final results. The authors find that a two-step cascaded regression method better predicts hard-to-detect objects, especially small objects, and provides more accurate object locations.

L. Relation Networks for Object Detection
Hu et al. [34] propose an adapted attention module for object detection, called the object relation module, which considers the interaction between different targets in an image, including their appearance features and geometry information. The object relation module is added in the head of the detector before the two fully connected layers to obtain enhanced features for accurate classification and localization of objects. The relation module not only feeds enhanced features into the classifier and regressor but also replaces the NMS post-processing step and achieves higher accuracy than NMS. With Faster R-CNN, FPN, and DCN as the backbone network on the COCO test-dev dataset, adding the relation module increases accuracy by 0.2, 0.6, and 0.2, respectively.

M. DCNv2
To learn to adapt to the geometric variation reflected in the effective spatial support region of targets, Jifeng Dai et al. proposed deformable convolutional networks (DCN) [35]. Regular ConvNets can only focus on features within a fixed square region (determined by the kernel), so the receptive field does not properly cover all the pixels of a target object to represent it. Deformable ConvNets produce deformable kernels, and the offsets from the initial (fixed-size) convolution kernel are learned by the network. Deformable RoI pooling can likewise adapt to part localization for objects with different shapes. DCNv1 achieves significant accuracy improvements, almost 4% over three plain ConvNets, on the COCO test-dev set. The best mean average-precision under the strict COCO evaluation criteria (mAP @[0.

Fig. 4 (caption, partially recovered). (a) The blue modules are the layers added in the SSD framework, whose resolution gradually drops because of down-sampling. In SSD, the prediction layer acts on fused features of different levels. The head module consists of a series of convolutional layers followed by several classification layers and localization layers. (b) The red modules are the layers added in the DSSD framework, denoting the deconvolution operation. In DSSD, a prediction layer follows every deconvolution module. (c) RetinaNet uses ResNet-FPN as its backbone network, which generates a 5-level feature pyramid P3∼P7 corresponding to C3∼C7 (the feature maps of conv3∼conv7, respectively) to predict objects of different sizes.
Deformable ConvNets v2 [36] uses more deformable convolutional layers than DCNv1: instead of replacing only the conv layers in the conv5 stage, it replaces all the regular conv layers in the conv3-conv5 stages. All the deformable layers are modulated by a learnable scalar, which clearly enhances the deformable effect and accuracy. The authors further improve detection accuracy with feature mimicking, adding a feature mimic loss that drives the per-RoI features of DCN to be similar to good features extracted from cropped images. DCNv2 achieves 45

N. NAS-FPN
Recently, the authors from Google Brain adopted neural architecture search to find a new feature pyramid architecture, named NAS-FPN [16], consisting of both top-down and bottom-up connections to fuse features across a variety of scales. By repeating the FPN architecture N times and concatenating the copies into one large architecture during the search, the high-level feature layers select which levels of features to imitate. All of the highest-accuracy architectures have connections between high-resolution input feature maps and output feature layers, which indicates that generating high-resolution features is necessary for detecting small targets. Stacking more pyramid networks, increasing the feature dimension and adopting a high-capacity architecture all increase detection accuracy by a large margin. Experiments show that with ResNet-50 as the backbone and a feature dimension of 256, NAS-FPN surpasses the original FPN by 2.9% mean average-precision on the COCO test-dev dataset. The best configuration of NAS-FPN uses AmoebaNet as the backbone network and stacks 7 FPNs with a feature dimension of 384, which achieves 48.0% on COCO test-dev.

O. M2Det
To handle the large scale variation across object instances, Zhao et al. [37] propose a multi-level feature pyramid network (MLFPN) that constructs more effective feature pyramids. The authors adopt three steps to obtain the final enhanced feature pyramids. First, as in FPN, multi-level features extracted from multiple layers of the backbone are fused into the base feature. Second, the base feature is fed into a block composed of alternating Thinned U-shape Modules (TUM) and Feature Fusion Modules, and the decoder layers of each TUM are taken as the features for the next step. Finally, the decoder layers with equivalent scales are gathered to construct a feature pyramid containing multi-level features. At this point, multi-scale and multi-level features are prepared. The remaining part follows the SSD architecture to obtain bounding box localization and classification results in an end-to-end manner. As a one-stage detector, M2Det achieves an AP of 41.0 at a speed of 11.8 FPS with the single-scale inference strategy and an AP of 44.2 with the multi-scale inference strategy, using VGG-16 on the COCO test-dev set. It outperforms RetinaNet800 (Res101-FPN as backbone) by 0.9% with the single-scale inference strategy, but is about twice as slow as RetinaNet800.
In conclusion, the typical baselines enhance accuracy by extracting richer features of objects and by adopting multi-level and multi-scale features to detect objects of different sizes. To achieve higher speed and precision, the one-stage detectors use newly designed loss functions to filter out easy samples, which reduces the number of proposal targets by a large margin. To address geometric variation, adopting deformable convolution layers is an effective approach. Modeling the relationship between different objects in an image is also necessary to improve performance. Detection results of the above typical baselines on the MS COCO test-dev dataset are listed in Table 2.

IV. DATASETS AND METRICS
Detecting an object requires stating that the object belongs to a specified class and localizing it in the image. The localization of an object is typically represented by a bounding box, as in Fig. 5. Using challenging datasets as benchmarks is significant in many areas of research, because they allow a standard comparison between different algorithms and set goals for solutions. Early algorithms focused on face detection using various ad hoc datasets. Later, more realistic and challenging face detection datasets were created. Another popular challenge is the detection of pedestrians, for which several datasets have been created. The Caltech Pedestrian Dataset [1] contains 350,000 labeled instances with bounding boxes. General object detection datasets such as PASCAL VOC [4], MS COCO [5] and ImageNet-loc [3] are the mainstream benchmarks of the object detection task. The official metrics of each dataset are mainly adopted to measure the performance of detectors on that dataset.
A. PASCAL VOC dataset
1) Dataset: For the detection of basic object categories, a multi-year effort from 2005 to 2012 was devoted to the creation and maintenance of a series of benchmark datasets that were widely adopted. The PASCAL VOC datasets [4] contain 20 object categories (in VOC2007, such as person, bicycle, bird, bottle, dog, etc.) spread over 11,000 images. The 20 categories can be grouped into 4 main branches: vehicles, animals, household objects and people. Some categories increase the semantic specificity of the output: car and motorbike, for example, are different types of vehicle that do not look similar. In addition, visually similar classes increase the difficulty of detection, e.g. dog vs. cat. Over 27,000 object instance bounding boxes are labeled, of which almost 7,000 have detailed segmentations. Class imbalance exists in the VOC2007 dataset: the class person is by far the largest, with nearly 20 times more instances than the smallest class, sheep, in the training set. This problem is widespread in real scenes, and how detectors can handle it well remains an open question. Another issue is viewpoint (front, rear, left, right and unspecified): detectors need to treat different viewpoints separately. Some annotated examples are shown in the last two lines of Fig. 5.
2) Metric: For the VOC2007 criteria, the interpolated average precision (Salton and McGill 1986) was used to evaluate both classification and detection. It is designed to penalize the algorithm for missing object instances, for duplicate detections of one instance, and for false positive detections.
A detection is counted as correct when the IoU between the predicted box and the ground truth box exceeds a threshold t. In the VOC metric, t is set to 0.5.

B. MS COCO benchmark
1) Dataset: The MS COCO dataset [5], built for detecting and segmenting objects found in everyday life in their natural environments, contains 91 common object categories, 82 of which have more than 5,000 labeled instances. These categories cover the 20 categories in the PASCAL VOC dataset. In total the dataset has 2,500,000 labeled instances in 328,000 images. The MS COCO dataset also pays attention to varied viewpoints, and all of its objects appear in natural environments, which provides rich contextual information.
In contrast to the popular ImageNet dataset [3], COCO has fewer categories but more instances per category. The dataset is also significantly larger in the number of instances per category (27k on average) than the PASCAL VOC datasets [4] (about 10 times fewer than MS COCO) and the ImageNet object detection dataset (1k) [3]. MS COCO contains considerably more object instances per image (7.7) than PASCAL VOC (2.3) and ImageNet (3.0). Furthermore, MS COCO contains 3.5 categories per image on average, compared with PASCAL VOC (1.4) and ImageNet (1.7). In addition, only 10% of the images in MS COCO contain a single category, whereas in both ImageNet and PASCAL VOC more than 60% of images contain a single object category. Small objects need more contextual reasoning to be recognized, and the images in the MS COCO dataset are rich in contextual information. The biggest class is again person, with nearly 800,000 instances, while the smallest class is hair drier, with about 600 instances in the whole dataset. Another small class is hair brush, with nearly 800 instances. Apart from 20 classes with very many or very few instances, the number of instances in the remaining 71 categories is roughly the same. Three typical categories of images in the MS COCO dataset are shown in the first two lines of Fig. 5.

Fig. 5. [...] [5]. The images show three different types of images sampled in the dataset, including iconic objects, iconic scenes and non-iconic objects. In addition, the last two lines are annotated sample images from the PASCAL VOC dataset [4].
2) Metric: The MS COCO metric is strict and judges detection performance thoroughly. Whereas the IoU threshold in PASCAL VOC is set to a single value, 0.5, MS COCO calculates the mean average precision over 10 thresholds in [0.5, 0.95] with an interval of 0.05. In addition, the average precision for small, medium and large objects is calculated separately.
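The difference between the single-threshold VOC criterion and the ten-threshold COCO criterion can be illustrated with a minimal sketch. It is not the official evaluation code: it assumes a single image and a single class, boxes given as (x1, y1, x2, y2), greedy matching in descending score order, and an all-point area under the precision-recall curve. The official tools aggregate over all images and categories and differ in interpolation details. All function names are hypothetical.

```python
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, gt_boxes, iou_thr):
    """AP for one class on one image; detections is a list of (score, box)."""
    if not gt_boxes:
        return 0.0
    matched, tp_flags = set(), []
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Match each detection to the best still-unmatched ground truth box.
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gt_boxes):
            if j not in matched and box_iou(box, gt) > best_iou:
                best_iou, best_j = box_iou(box, gt), j
        if best_j >= 0 and best_iou >= iou_thr:
            matched.add(best_j)
            tp_flags.append(True)
        else:
            tp_flags.append(False)  # false positive or duplicate
    precisions, recalls, tp = [], [], 0
    for k, flag in enumerate(tp_flags, start=1):
        tp += flag
        precisions.append(tp / k)
        recalls.append(tp / len(gt_boxes))
    # Make precision monotonically non-increasing (interpolation step).
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def coco_style_ap(detections, gt_boxes):
    """Mean of AP over the 10 IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = [0.5 + 0.05 * i for i in range(10)]
    return sum(average_precision(detections, gt_boxes, t) for t in thresholds) / len(thresholds)
```

A detection that overlaps its ground truth box with an IoU of about 0.68 is a perfect result under the VOC-style threshold of 0.5, but it only passes four of the ten COCO-style thresholds, so the averaged score drops to 0.4.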
C. ImageNet benchmark
1) Dataset: Challenging datasets can encourage a step forward in vision tasks and practical applications. Another important large-scale benchmark dataset is the ImageNet dataset [3]. The ILSVRC task of object detection evaluates the ability of an algorithm to name and localize all instances of all target objects present in an image. ILSVRC2014 has 200 object classes and nearly 450k training images, 20k validation images and 40k test images. More comparisons with PASCAL VOC are in Table 3.
2) Metric: The PASCAL VOC metric uses the threshold t = 0.5. However, for small objects even deviations of a few pixels would be unacceptable according to this threshold. ImageNet uses a loosened threshold calculated as $t = \min\left(0.5, \frac{wh}{(w + 10)(h + 10)}\right)$, where w and h are the width and height of a ground truth box, respectively. This threshold allows the annotation to extend up to 5 pixels on average in each direction around the object.
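As a small worked example of the loosened threshold, the sketch below evaluates the formula for a small and a large ground truth box. The function name is hypothetical, and w and h are assumed to be in pixels.

```python
def imagenet_iou_threshold(w, h):
    """t = min(0.5, wh / ((w + 10)(h + 10))) for a ground truth box of size w x h."""
    return min(0.5, (w * h) / ((w + 10) * (h + 10)))


small = imagenet_iou_threshold(10, 10)    # 100 / 400, threshold is relaxed
large = imagenet_iou_threshold(100, 100)  # 10000 / 12100 > 0.5, so capped at 0.5
```

For a 10 × 10 box the required overlap falls to 0.25, while for a 100 × 100 box the threshold stays at the PASCAL VOC value of 0.5.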

D. VisDrone2018 benchmark
Last year, a new dataset consisting of images and videos captured by drones was released: VisDrone2018 [50], a large-scale visual object detection and tracking benchmark that aims at advancing visual understanding tasks on the drone platform. The images and video sequences in the benchmark were captured over various urban/suburban areas of 14 different cities across China from north to south. Specifically, VisDrone2018 consists of 263 video clips and 10,209 images (no overlap with video clips) with rich annotations, including object bounding boxes, object categories, occlusion, truncation ratios, etc. This benchmark has more than 2.5 million annotated instances in 179,264 images/video frames. Being the largest such dataset ever published, the benchmark enables extensive evaluation and investigation of visual analysis algorithms on the drone platform. VisDrone2018 has a large number of small objects, such as dense cars, pedestrians and bicycles, which makes certain categories difficult to detect. Moreover, a large proportion of the images in this dataset contain more than 20 objects (82.4% in the training set), and the average number of objects per image is 54 over the 6471 images of the training set. The dataset also contains dark night scenes, in which the brightness of the images is lower than in daytime, which complicates the correct detection of small and dense objects, as shown in Fig. 6. This dataset adopts the MS COCO metric.
E. Open Images V5
1) Dataset: Open Images [51] is a dataset of 9.2M images annotated with image-level labels, object bounding boxes, object segmentation masks, and visual relationships. Open Images V5 contains a total of 16M bounding boxes for 600 object classes on 1.9M images, which makes it the largest existing dataset with object location annotations. First, the boxes in this dataset have largely been drawn manually by professional annotators (Google-internal annotators) to ensure accuracy and consistency. Second, the images are very diverse and mostly contain complex scenes with several objects (8.3 per image on average). Third, the dataset offers visual relationship annotations, indicating pairs of objects in particular relations (e.g. "woman playing guitar", "beer on table"). In total it has 329 relationship triplets with 391,073 samples. Fourth, V5 provides segmentation masks for 2.8M object instances in 350 classes. Segmentation masks mark the outline of objects, which characterizes their spatial extent at a much higher level of detail. Finally, the dataset is annotated with 36.5M image-level labels spanning 19,969 classes.
2) Metric: On the basis of the PASCAL VOC 2012 mAP evaluation metric, Kuznetsova et al. propose several modifications that take into account some important aspects of the Open Images Dataset. First, for fair evaluation, unannotated classes are ignored so that detections of them are not wrongly counted as false positives. Second, if an object belongs to both a class and a subclass, an object detection model should give a detection result for each of the relevant classes. The absence of one of these classes would be considered a false negative in that class. Third, the Open Images Dataset contains group-of boxes, each of which covers a group of object instances (more than one, occluding each other or physically touching) without localizing the single objects inside it. If a detection lies inside a group-of box and the intersection of the detection and the box divided by the area of the detection is larger than 0.5, the detection is counted as a true positive. Multiple correct detections inside the same group-of box count as only one valid true positive. Table 4 and Table 5 list the comparison between several people detection benchmarks and pedestrian detection datasets, respectively.
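The group-of rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the official evaluation code: boxes are (x1, y1, x2, y2) tuples and both function names are hypothetical.

```python
def matches_group_of(det_box, group_box, threshold=0.5):
    """True if intersection(det, group) / area(det) is larger than the threshold."""
    iw = max(0.0, min(det_box[2], group_box[2]) - max(det_box[0], group_box[0]))
    ih = max(0.0, min(det_box[3], group_box[3]) - max(det_box[1], group_box[1]))
    det_area = (det_box[2] - det_box[0]) * (det_box[3] - det_box[1])
    return det_area > 0 and (iw * ih) / det_area > threshold


def group_of_true_positives(det_boxes, group_box):
    """Several matching detections in one group-of box count as a single true positive."""
    return 1 if any(matches_group_of(d, group_box) for d in det_boxes) else 0
```

Note that the ratio is normalized by the area of the detection, not by the union, so a small detection fully inside a large group-of box is accepted even though its IoU with that box is low.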

METHODS
Deep neural network based object detection pipelines generally have four steps: image pre-processing, feature extraction, classification and regression, and post-processing. Firstly, raw images from the dataset cannot be fed into the network directly. Thus, we need to resize them to the required sizes and make them clearer, for example by enhancing brightness, color and contrast. Data augmentation is also available for some requirements, such as flipping, rotation, scaling, cropping, translation and adding Gaussian noise. Furthermore, GANs [59] (generative adversarial networks) can generate new images on demand to enrich the diversity of inputs. For more details about data augmentation, please refer to [60]. Secondly, feature extraction is a key step for further detection. The feature quality directly determines the upper bound of the subsequent tasks, namely classification and regression. Thirdly, the detector head is responsible for proposing and refining bounding boxes, including classification scores and bounding box coordinates. Fig. 1 illustrates the basic procedure of the second and the third step. At last, the post-processing step deletes weak detection results. For example, NMS is a widely used method in which the highest-scoring object deletes its nearby objects with inferior classification scores.
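One detail of data augmentation that is specific to detection is that geometric transforms must be applied to the bounding boxes as well as to the pixels. A minimal sketch for horizontal flipping, assuming boxes in continuous (x1, y1, x2, y2) image coordinates and a hypothetical function name:

```python
def hflip_boxes(boxes, image_width):
    """Mirror (x1, y1, x2, y2) boxes to match a horizontally flipped image."""
    # The left edge becomes the right edge, so x1 and x2 swap roles.
    return [(image_width - x2, y1, image_width - x1, y2) for x1, y1, x2, y2 in boxes]


flipped = hflip_boxes([(10, 20, 30, 40)], image_width=100)
```

Here the box that spans x from 10 to 30 in a 100-pixel-wide image spans x from 70 to 90 after the flip, and its vertical extent is unchanged.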
To obtain precise detection results, there exist several approaches, which can be used alone or in combination with one another.

A. Enhanced features
Extracting effective features from input images is a vital prerequisite for the subsequent accurate classification and localization steps. To fully utilize the output feature maps of consecutive backbone layers, Lin et al. [13] aim to extract richer features by dividing them into different levels to detect objects of different sizes, as shown in Fig. 3. [...] propose a parallel feature pyramid (FP) network (PFPNet), where the FP is constructed by widening the network width instead of increasing the network depth. The additional feature transformation operation generates a pool of feature maps with different sizes, which yields feature maps with similar levels of semantic abstraction across the scales. Li et al. [65] concatenate features from different layers with different scales and then generate a new feature pyramid that is fed into multibox detectors to predict the final detection results. Chen et al. [66] propose WeaveNet, which iteratively weaves context information from adjacent scales together to enable more sophisticated context reasoning. Zheng et al. [67] extend better context information for the shallow layers of a one-stage detector [8].
Semantic relationships between different objects or regions of an image can help detect occluded and small objects. Bae et al. [68] use combined high-level semantic features for object classification and localization, combining the multi-region features stage by stage. Zhang et al. [33] use a semantic segmentation branch and a global activation module to enrich the semantics of object detection features within a typical deep detector. Scene contextual relations [69] can provide useful information for accurate visual recognition, and Liu et al. [70] adopt scene contextual information to further improve accuracy. Modeling relations between objects can also help object detection. Singh et al. [71] process context regions around the ground-truth object at an appropriate scale. Hu et al. [34] propose a relation module that processes a set of objects simultaneously, considering both appearance and geometry features through interaction. Mid-level semantic properties of objects, such as visual attributes, can also benefit object detection [72].
The attention mechanism is an effective method for making networks focus on the most significant regions. Some typical works [...] [80] design an architecture combining both global attention and local reconfigurations so as to gather task-oriented features across different spatial locations and scales.
Fully utilizing the effective region of an object can improve accuracy. Regular ConvNets can only attend to features within a fixed square region (determined by the kernel), so the receptive field does not properly cover all the pixels of a target object. Deformable ConvNets produce a deformable kernel: the offsets from the initial (fixed-size) convolution kernel are learned by the network. Deformable RoI pooling can also adapt to part localization for objects with different shapes. In [35] [36], network weights and sampling locations jointly determine the effective support region.
In summary, richer and more appropriate representations of an object can improve detection accuracy remarkably. Brain-inspired mechanisms are a powerful way to further enhance detection performance.

B. Increasing localization accuracy
Localization and classification are the two tasks of object detection. Under object detection evaluation metrics, localization precision is a vital measurable indicator, so increasing localization accuracy can improve detection performance remarkably. Designing a novel loss function to measure the accuracy of predicted boxes is an effective way to increase localization accuracy. Since intersection over union (IoU) is the most commonly used evaluation metric in object detection, regression quality can be estimated by the IoU between a predicted bounding box and its assigned ground truth box. For two bounding boxes, IoU is calculated as the intersection area divided by the union area.
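The IoU definition above translates directly into a few lines of code. The sketch assumes axis-aligned boxes in (x1, y1, x2, y2) form with x1 < x2 and y1 < y2; the function name is illustrative.

```python
def iou(box_a, box_b):
    """Intersection area divided by union area of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height are clamped at zero so disjoint boxes give no intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, two 2 × 2 boxes offset by one unit in each direction share an area of 1 and cover a union of 7, giving an IoU of 1/7.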
A typical work [81] adopts an IoU loss to measure how accurate the network's predictions are. This loss function is robust to the varied shapes and scales of different objects and converges well in a short time. Rezatofighi et al. [82] incorporate generalized IoU, as both a loss function and a new metric, into existing object detection pipelines, which yields a consistent improvement over the original smooth L1 loss counterpart. Tychsen et al. [47] adopt a novel bounding box regression loss for the localization branch. The IoU loss in this work considers every predicted box whose IoU with the assigned ground truth box is higher than a preset threshold, rather than only the box with the highest IoU. He et al. [83] propose a novel bounding box regression loss for learning bounding box localization and transformation variance together. He et al. [84] propose a novel bounding box regression loss which has a strong connection to localization accuracy. Pang et al. [63] propose a novel balanced L1 loss to further improve localization accuracy. Cabriel et al. [85] propose the Axially Localized Detection method to achieve very high localization precision at the cellular level. In general, researchers design new loss functions for the localization branch to make the retained predictions more accurate.
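To make the idea of an IoU-based regression loss concrete, the following sketch computes generalized IoU in its commonly stated form, GIoU = IoU - (C - U) / C, where U is the union area and C is the area of the smallest box enclosing both inputs, and uses 1 - GIoU as the loss. It is a scalar illustration with hypothetical function names, assuming non-degenerate (x1, y1, x2, y2) boxes, not the implementation used in [82].

```python
def giou(box_a, box_b):
    """Generalized IoU of two (x1, y1, x2, y2) boxes with positive area."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Smallest axis-aligned box that encloses both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (enclose - union) / enclose


def giou_loss(pred_box, gt_box):
    """Loss in [0, 2): 0 for a perfect match, larger as the boxes move apart."""
    return 1.0 - giou(pred_box, gt_box)
```

Unlike a plain IoU loss, which is constant for all non-overlapping pairs, this loss still changes with the distance between disjoint boxes, so it can provide a training signal when the prediction does not yet touch the ground truth.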

C. Solving negatives-positives imbalance issue
Two-stage detectors have a well-designed first stage that produces proposals and filters out a large number of negative samples, so the proposal bounding boxes fed into the detector form a sparse set. In a one-stage detector, however, the network has no step for filtering out bad samples, so the dense sample sets are difficult to train on. The proportion of positive and negative samples is also extremely unbalanced. The typical solution is hard negative mining [86]. The popular hard mining method OHEM [38] helps drive the focus towards hard samples. Liu et al. [8] adopt a hard negative mining method that sorts all of the negative samples by the highest confidence loss of each pre-defined box and picks the top ones so that the ratio between negative and positive samples is at most 3:1. Considering hard samples is more effective for improving detection performance when training an object detector. Pang et al. [63] propose a novel hard mining method called IoU-balanced sampling. Yu et al. [87] concentrate on real-time requirements.
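The 3:1 selection rule described for [8] can be sketched in a few lines. This is a simplified illustration, not the SSD implementation: it assumes one confidence loss value per pre-defined box and a boolean flag marking positives, and the function name is hypothetical.

```python
def hard_negative_mining(conf_losses, is_positive, neg_pos_ratio=3):
    """Return indices of the negatives kept for training (at most ratio x positives)."""
    num_pos = sum(is_positive)
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # Hardest negatives first: those the classifier currently gets most wrong.
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    return sorted(negatives[: neg_pos_ratio * num_pos])


losses = [0.1, 2.0, 0.5, 0.9, 0.05, 1.5, 0.3]
labels = [True, False, False, False, False, False, False]
kept = hard_negative_mining(losses, labels)
```

With one positive box, only the three negatives with the highest loss (indices 1, 3 and 5) contribute to the classification loss, and the remaining easy negatives are dropped.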
Another effective way is to add terms to the classification loss function. Lin et al. [31] propose a loss function, called focal loss, which down-weights the loss assigned to well-classified or easy examples, focusing on the hard training examples and preventing the vast number of easy negative examples from overwhelming the detector during training. Chen et al. [88] design a novel ranking task to replace the conventional classification task, together with a new Average-Precision loss for this task, which alleviates the extreme negative-positive class imbalance issue remarkably.
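The down-weighting effect of focal loss can be seen in a scalar sketch. It uses the commonly stated form FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t) for binary classification; the default values gamma = 2 and alpha = 0.25 are assumptions taken from common usage, and the function name is illustrative.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one example; p is the predicted probability of class 1, y is 0 or 1."""
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy_negative = focal_loss(0.01, 0)  # background predicted with 99% confidence
hard_negative = focal_loss(0.90, 0)  # background mistaken for an object
```

With gamma = 0 the modulating factor disappears and the expression reduces to alpha-weighted cross entropy, whereas with gamma = 2 an example classified with probability 0.9 has its loss scaled by 0.01.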

D. Improving post-processing NMS methods
Only one detected object can be successfully matched to a ground truth object and preserved as a result, while the others matched to it are classified as duplicates. NMS (non-maximum suppression) is a heuristic method that keeps only the detection with the highest classification score and discards the overlapping ones. Hu et al. [34] use the intermediate results produced by the relation module to better determine which objects should be kept, so that NMS is not needed. NMS considers the classification score but ignores localization confidence, which makes the deletion of weak results less accurate. Jiang et al. [89] propose IoU-Net, which learns to predict the IoU between each detected bounding box and the matched ground truth. Because it takes localization confidence into account, it improves the NMS method by preserving accurately localized bounding boxes. Tychsen et al. [47] propose a novel fitness NMS method which considers both the estimated IoU overlap and the classification score of predicted bounding boxes. Liu et al. [90] propose adaptive-NMS, which applies a dynamic suppression threshold to each instance according to the target density. Bodla et al. [44] propose soft-NMS, an improved NMS method that requires no extra training and is simple to implement. He et al. [84] further improve the soft-NMS method. Jan et al. [91] feed the network with score maps resulting from NMS at multiple IoU thresholds. Hosang et al. [92] design a novel ConvNet which performs NMS directly, without a subsequent post-processing step. Yu et al. [87] use the final feature map to filter out easy samples so that the network concentrates on hard samples.
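The contrast between classical NMS and a score-decaying variant can be sketched as follows. The first function is standard greedy NMS; the second follows the linear form of soft-NMS as it is commonly described (the score of an overlapping box is multiplied by 1 - IoU instead of being removed). Both are minimal illustrations with assumed thresholds and hypothetical names, not the authors' implementations.

```python
def _iou(a, b):
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop boxes overlapping it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if _iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


def soft_nms_linear(boxes, scores, iou_threshold=0.3, score_threshold=0.001):
    """Linear soft-NMS: decay the scores of overlapping boxes instead of deleting them."""
    scores = list(scores)  # work on a copy
    remaining, keep = list(range(len(boxes))), []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        if scores[best] < score_threshold:
            break
        keep.append(best)
        for i in remaining:
            overlap = _iou(boxes[best], boxes[i])
            if overlap > iou_threshold:
                scores[i] *= 1.0 - overlap
    return keep, scores


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept_hard = nms(boxes, scores)
kept_soft, new_scores = soft_nms_linear(boxes, scores)
```

In this example greedy NMS discards the second box outright, while the soft variant keeps it with a reduced score, which can help when two real objects overlap heavily.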

E. Combining one-stage and two-stage detectors to make good results
In general, existing object detectors are divided into two categories: one is the two-stage detector, represented by [6]; the other is the one-stage detector, such as [7], [8]. Two-stage detectors have high localization and object recognition precision, while one-stage detectors achieve high inference and test speed. The two stages of a two-stage detector are divided by the RoI (Region of Interest) pooling layer. In the Faster R-CNN detector, the first stage, called the Region Proposal Network (RPN), proposes candidate object bounding boxes. In the second stage, the network extracts features from each candidate box using RoIPool and performs classification and bounding-box regression.
To fully inherit the advantages of one-stage and two-stage detectors while overcoming their disadvantages, Zhang et al. [33] propose a novel RefineDet, which achieves better accuracy than two-stage detectors and maintains efficiency comparable to that of one-stage detectors.

F. Complicated scene solutions
Object detection frequently meets challenges such as hard-to-detect small objects and heavy occlusion. Due to low resolution and noisy representation, detecting small objects is a very hard problem. Object detection pipelines [8] [31] detect small objects by learning representations of the objects at multiple scales. Some works [93][94] [95] improve detection accuracy on the basis of [8]. Li et al. [96] use a GAN model in which the generator transforms the perceived poor representations of small objects into super-resolved ones that are similar enough to real large objects to fool a competing discriminator. This makes the representation of small objects similar to that of large ones and thus improves accuracy without heavy computing cost. Some methods [45] [97] improve the detection accuracy of small objects by raising IoU thresholds to train multiple localization modules. Hu et al. [98] use feature fusion over an image pyramid to better detect small faces. Xu et al. [99] fuse high-level features carrying rich semantic information with low-level features via a Deconvolution Fusion Block to enhance the representation of small objects.
Target occlusion is another difficult problem in the field of object detection. Wang et al. [100] improve the recall of face detection in the occluded case without speed decay. Wang et al. [101] propose a novel bounding box regression loss specifically designed for crowd scenes, called repulsion loss. Zhang et al. [102] propose a newly designed occlusion-aware R-CNN (OR-CNN) to improve detection accuracy in crowds. Baqué et al. [103] [...] [71] adaptively sample regions from multiple scales of an image pyramid, conditioned on the image content. Secondly, researchers use convolutional filters of multiple scales on the feature maps. For instance, in [106], models of different aspect ratios are trained separately using different filter sizes (such as 5 × 7 and 7 × 5). Thirdly, pre-defined anchors with multiple scales and multiple aspect ratios serve as reference boxes for the predicted bounding boxes. Anchors were used for the first time in two-stage and one-stage detectors by Faster R-CNN [6] and SSD [8], respectively. Fig. 7 is a schematic diagram of the above three cases.
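The third case, pre-defined anchors with multiple scales and aspect ratios, can be sketched as below. The convention used here (a scale s is the side of a square of equal area, and a ratio r is height divided by width) and the specific values are illustrative assumptions; individual detectors parameterize their anchors differently.

```python
import math


def generate_anchors(cx, cy, scales, aspect_ratios):
    """Anchors (x1, y1, x2, y2) centred at (cx, cy), one per (scale, ratio) pair."""
    anchors = []
    for s in scales:
        for r in aspect_ratios:      # r = height / width
            w = s / math.sqrt(r)     # keeps the area equal to s * s
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = generate_anchors(50, 50, scales=[128, 256, 512], aspect_ratios=[0.5, 1.0, 2.0])
```

Three scales and three aspect ratios give nine reference boxes per location, and the regression branch then predicts offsets relative to these boxes rather than absolute coordinates.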

G. Anchor-free
While anchor-based object detectors, which include both one-stage and two-stage detectors such as SSD, Faster R-CNN, YOLOv2 and YOLOv3, are the mainstream approach and have brought significant performance improvements, they still suffer from some drawbacks.
(1) The pre-defined anchor boxes have a set of hand-crafted scales and aspect ratios, which are sensitive to the dataset and affect detection performance by a large margin.
(2) The scales and aspect ratios of the pre-defined anchor boxes are kept fixed during training, so the next step cannot obtain adaptively adjusted boxes. Meanwhile, detectors have trouble handling objects of all sizes.
(3) Anchor boxes must be placed densely to achieve high recall, especially on large-scale datasets, so the computation cost and memory requirements bring huge overhead during processing.
(4) Most of the pre-defined anchors are negative samples, which causes a great imbalance between positive and negative samples during training.
To address these drawbacks, a series of anchor-free methods have recently been proposed [...] [107]. CenterNet [109] localizes the center point, top-left point and bottom-right point of an object. Tian et al. [61] propose a localization method based on the four distance values between the predicted center point and the four sides of a bounding box. The general structure of the anchor-based approach is shown in Fig. 8. Anchor-free detection is still a novel direction for further research.
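The distance-based parameterization attributed to Tian et al. [61] can be illustrated with a short sketch. It only shows the geometry: a box is described by the distances (l, t, r, b) from a point inside it to its left, top, right and bottom sides. The function names are hypothetical and no network prediction is involved.

```python
def encode_ltrb(point, box):
    """Distances from a point (x, y) to the four sides of an (x1, y1, x2, y2) box."""
    x, y = point
    x1, y1, x2, y2 = box
    return (x - x1, y - y1, x2 - x, y2 - y)


def decode_ltrb(point, distances):
    """Recover the (x1, y1, x2, y2) box from a point and its (l, t, r, b) distances."""
    x, y = point
    l, t, r, b = distances
    return (x - l, y - t, x + r, y + b)


target = encode_ltrb((50, 40), (10, 20, 90, 100))  # regression target for this point
box = decode_ltrb((50, 40), target)                # recovers the original box
```

Because each location regresses its own box directly, no anchor scales or aspect ratios need to be chosen in advance.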

H. Training from scratch
Almost all state-of-the-art detectors use an off-the-shelf classification backbone pre-trained on a large-scale classification dataset [3] as their initial parameter set and then fine-tune the parameters to adapt to the new detection task. Another way to implement the training procedure is to train all parameters from scratch. Zhu et al. [114] train a detector from scratch and thus do not need a pre-trained classification backbone, thanks to the stable and predictable gradients brought by the batch normalization operation. Some works [115] [116] [117] [118] train object detectors from scratch using dense layer-wise connections.

I. Designing new architecture
Because classification and localization serve different purposes, there is a gap between classification networks and detection architectures. Localization needs fine-grained representations of objects, while classification needs high-level semantic information. Li et al. [14] propose a newly designed object detection architecture that focuses specifically on the detection task, maintains high spatial resolution in deeper layers and does not need to be pre-trained on a large-scale classification dataset.
Two-stage detectors are usually slower than one-stage detectors. By studying the structure of two-stage networks, researchers find that two-stage detectors like Faster R-CNN and R-FCN have a heavy head which slows them down. Li et al. [119] propose a light-head two-stage detector to keep time efficiency.

J. Speeding up detection
Platforms with limited computing power and memory resources, such as mobile devices, real-time devices, webcams and autonomous driving systems, encourage studies on efficient detection architecture design. The most typical real-time detector is the [

K. Achieving Fast and Accurate Detections
The best object detector needs both high efficiency and high accuracy, which is the ultimate goal of this task. Lin et al. [31] aim to surpass the accuracy of existing two-stage detectors while maintaining fast speed. Zhou et al. [125] combine an accurate (but slow) detector and a fast (but less accurate) detector, adaptively determining whether an image is easy or hard to detect and choosing the appropriate detector for it. Liu et al. [126] build a fast and accurate detector by strengthening lightweight network features using a receptive field block.

A. Typical application areas
Object detection has been widely used in many fields to assist people in completing tasks, such as the security, military, transportation and medical fields and daily life. We describe the typical and recent methods used in these fields in detail.
1) Security field: The most well-known applications in the security field are face detection, pedestrian detection, fingerprint identification, fraud detection, anomaly detection, etc.
Face detection aims at detecting human faces in an image, as shown in Fig. 9. Because of extreme poses, illumination and resolution variations, face detection is still a difficult task. Many works focus on designing precise detectors. R. Ranjan et al. [127] learn correlated tasks (face detection, facial landmark localization, head pose estimation and gender recognition) simultaneously to boost the performance of the individual tasks. He et al. [128] propose a novel Wasserstein convolutional neural network approach for learning invariant features between near-infrared (NIR) and visual (VIS) face images. Designing appropriate loss functions can enhance [...] [132] achieve great success in deep learning based face recognition. Deng et al. [133] propose an Additive Angular Margin Loss (ArcFace) to obtain highly discriminative features for face recognition. Please refer to [134] for more details.
Pedestrian detection focuses on detecting pedestrians in natural scenes. Braun et al. [52] release the EuroCity Persons dataset, which contains pedestrians, cyclists and other riders in urban traffic scenes. Complexity-aware cascaded pedestrian detectors [135][136][137] focus on real-time pedestrian detection. Please refer to a survey [138] for more details.
Anomaly detection plays a significant role in fraud detection, climate analysis and healthcare monitoring. Existing anomaly detection techniques [139][140][141][142] analyze the data on a point-wise basis. To point expert analysts to the interesting regions (anomalies) of the data, Barz et al. [143] propose a novel unsupervised method called Maximally Divergent Intervals (MDI), which searches for contiguous intervals of time and regions in space.
2) Military field: In the military field, remote sensing object detection, topographic survey and flyer detection are representative applications.
Remote sensing object detection aims at detecting objects in remote sensing images or videos, and it faces several challenges. First, the extremely large input size combined with small targets makes existing object detection procedures too slow for practical use and makes the targets hard to detect. Second, massive and complex backgrounds cause serious false detections. To address these issues, researchers adopt data fusion and focus on detecting small objects, which carry little information and for which a small deviation causes a large inaccuracy. Remote sensing images have characteristics that differ greatly from natural images, so strong pipelines such as Faster R-CNN, FCN, SSD and YOLO cannot transfer well to the new data domain. Designing detectors adapted to remote sensing datasets remains a research hot spot in this domain.
Cheng et al. [144] propose a CNN-based Remote Sensing Image (RSI) object detection model that deals with the rotation problem by introducing a rotation-invariant layer. Zhang et al. [145] propose a rotation and scaling robust structure to address the lack of rotation and scaling invariance in the RSI object detection domain. Li et al. [146] propose a rotatable region proposal network and a rotatable detection network that consider the orientation of vehicles. Deng et al. [147] propose an accurate-vehicle-proposal-network (AVPN) for small object detection. Audebert et al. [148] utilize accurate semantic segmentation results to obtain vehicle detections. Li et al. [149] address the large range of ship resolutions (from dozens of pixels to thousands) in ship detection. Pang et al. [150] propose a real-time remote sensing method. Pei et al. [151] propose a deep learning framework for synthetic aperture radar (SAR) automatic target recognition. Long et al. [152] concentrate on automatic and accurate localization of detected objects. Shahzad et al. [153] propose a novel framework containing automatic labeling and a recurrent neural network for detection.
3) Transportation field: License plate recognition, automatic driving and traffic sign recognition greatly facilitate people's lives.
With the widespread use of vehicles, license plate recognition is required for crime tracking, residential access control and traffic violation tracking. Edge information, mathematical morphology, texture features, sliding concentric windows and connected component analysis can make license plate recognition systems more robust and stable. Recently, deep learning-based methods [165][166][167][168][169] provide a variety of solutions for license plate recognition. Please refer to [170] for more details.
An autonomous vehicle (AV) needs an accurate perception of its surroundings to operate reliably. The perception system of an AV normally employs machine learning (e.g., deep learning) and transforms sensory data into semantic information that enables autonomous driving. Object detection is a fundamental function of this perception system. 3D object detection methods involve a third dimension that reveals more detailed information about an object's size and location, and they are divided into three categories: monocular, point-cloud and fusion. First, monocular image based methods predict 2D bounding boxes on the image and then extrapolate them to 3D; the lack of explicit depth information limits localization accuracy. Second, point-cloud based methods either project point clouds into a 2D image for processing or generate a 3D representation of the point cloud directly in a voxel structure; the former loses information and the latter is time consuming. Third, fusion based methods fuse both front view images and point clouds to generate robust detections; they represent the state-of-the-art detectors but are computationally expensive. Recently, Lu et al. [171] utilize a novel architecture that contains 3D convolutions and RNNs to achieve centimeter-level localization accuracy in different real-world driving scenarios. Song et al. [172] release a 3D car instance understanding benchmark for autonomous driving. Banerjee et al. [173] utilize sensor fusion to obtain better features. Please refer to a recent survey [174] for more details.
Both unmanned vehicles and autonomous driving systems must solve the problem of traffic sign recognition. For the sake of safety and rule compliance, real-time accurate traffic sign recognition assists driving by acquiring the temporal and spatial information of potential signs. Deep learning methods [175][176][177][178][179][180][181] solve this problem with high performance.
4) Medical field: In the medical field, medical image detection, cancer detection, disease detection, skin disease detection and healthcare monitoring have increasingly become a means of supplementing medical treatment. Computer Aided Diagnosis (CAD) systems can support physicians in classifying different kinds of cancer. In detail, after an appropriate acquisition of the images, the fundamental steps carried out by a CAD framework can be identified as image segmentation, feature extraction, classification and object detection. Due to significant individual differences, data scarcity and privacy, there usually exists a difference in data distribution between the source domain and the target domain. A domain adaptation framework [182] is needed for medical image detection.
Li et al. [77] incorporate the attention mechanism in a CNN for glaucoma detection and establish a large-scale attention-based glaucoma dataset. Liu et al. [183] design DeepMod, a bidirectional recurrent neural network (RNN) with long short-term memory (LSTM), to detect DNA modifications. Schubert et al. [184] propose cellular morphology neural networks (CMNs) for automated neuron reconstruction and automated detection of synapses. Codella et al. [185] organize a challenge on skin lesion analysis toward melanoma detection. Please refer to [186] for more details.
5) Life field: In the life field, intelligent home, commodity detection, event detection, pattern detection and rain/shadow detection are the most representative applications.
For densely packed scenes such as retail shelf displays, Goldman et al. [187] propose a novel precise object detector and a new SKU-110K dataset to meet this challenge.
Event detection aims to discover real-world events from the Internet, such as festivals, talks, protests, natural disasters and elections. With the popularity of social media and its new characteristics, the data types involved are more diverse than before. Multi-domain event detection (MED) provides comprehensive descriptions of events. Yang et al. [188] present an event detection framework to handle multi-domain data. Wang et al. [189] incorporate online social interaction features by constructing affinity graphs for event detection tasks. Schinas et al. [190] propose a multimodal graph-based system to detect events from 100 million photos/videos. Please refer to a survey [191] for more details.
Yang et al. [206] present a novel rain model accompanied by a deep learning architecture to address rain detection in a single image. Hu et al. [207] analyze the spatial image context in a direction-aware manner and design a novel deep neural network to detect shadows.

B. Object detection branches
Object detection has a wide range of application scenarios. The research of this domain contains a large variety of branches. We describe some representative branches in this part.
1) Weakly supervised object detection: Weakly supervised object detection (WSOD) aims at utilizing a few fully annotated images (supervision) to detect the large amount of non-fully annotated ones. Traditionally, models are learned from images labelled only with the object class and not with the object bounding box. Annotating a bounding box for each object in large datasets is expensive, laborious and impractical. Weakly supervised learning relies on incompletely annotated training data to learn detection models. Weakly supervised deep detection networks [208] are a representative work for weakly supervised object detection. Context information [209], instance classifier refinement [210] and image segmentation [211][212] are adopted to tackle hard-to-optimize problems. Yang et al. [213] show that the action depicted in the image can provide strong cues about the location of the associated object. Wan et al. [214] propose a min-entropy latent model optimized with a recurrent learning algorithm for weakly supervised object detection. Tang et al. [215] utilize an iterative procedure to generate proposal clusters and learn refined instance classifiers, which makes the network concentrate on the whole object rather than on parts of the object. Cao et al. [216] propose a novel feedback convolutional neural network for weakly supervised object localization. Wan et al. [217] propose continuation multiple instance learning to alleviate the non-convexity problem in WSOD.
2) Salient object detection: Salient object detection utilizes deep neural networks to predict saliency scores of image regions and obtain accurate saliency maps, as shown in Fig. 10. Salient object detection networks usually need to aggregate multi-level features of the backbone network. To gain speed without losing accuracy, Wu et al. [218] show that discarding the shallower layer features achieves fast speed and that the deeper layer features are sufficient to obtain a precise saliency map. Liu et al. [219] expand the role of pooling in convolutional neural networks. Wang et al. [220] utilize fixation prediction to detect salient objects. Wang et al. [221] utilize recurrent fully convolutional networks and incorporate saliency prior knowledge for accurate salient object detection. Feng et al. [222] propose an attentive feedback module to better explore the structure of objects. Video salient object detection datasets [223]
3) Highlight detection: Highlight detection is to retrieve a moment in a short video clip that captures a user's primary attention or interest, which can accelerate browsing many   [247] are domain-specific for they are tailored to a category of videos. All object detection tasks require a large amount of manually annotated data, and highlight detection is no exception. Xiong et al. [248] propose a weakly supervised method on shorter user-generated videos to address this issue.

4) Edge detection: Edge detection aims at extracting object boundaries and perceptually salient edges from images, which is important to a series of higher level vision tasks such as segmentation, object detection and recognition. Edge detection faces some challenges. First, edges in an image appear at a variety of scales, so both object-level boundaries and useful local region details are needed. Second, convolutional layers at different levels are specialized to predict different parts of the final detection, so each layer in a CNN should be trained with proper layer-specific supervision. To address these issues, He et al. [249] propose a Bi-Directional Cascade Network in which each layer is supervised by labeled edges, while adopting dilated convolution to generate multi-scale features. Liu et al. [250] propose an accurate edge detector which utilizes richer convolutional features.

5) Text detection: Text detection aims at identifying text regions in given images or videos, which is also an important prerequisite for many computer vision tasks, such as classification and video analysis. There have been many successful commercial optical character recognition (OCR) systems for the recognition of internet content and documentary texts. The detection of text in natural scenes is still a challenge due to complex situations such as blurring, uneven lighting, perspective distortion and various orientations. Some typical works [ [258] is a problem that needs to be solved. In general, deep learning based scene text detection methods can be classified into two categories. The first category treats scene text as a type of general object, following the general object detection paradigm and locating scene text by text box regression. These methods have difficulty dealing with the large aspect ratios and arbitrary orientations of scene text. The second category directly segments text regions, but mostly requires a complicated post-processing step. Methods in this category usually involve two steps, segmentation (generating text prediction maps) and geometric approaches (for inclined proposals), which is time-consuming. In addition, in order to obtain the desired orientation of text boxes, some methods require a complex post-processing step, so they are not as efficient as architectures that are directly based on detection networks.
Lyu et al. [257] combine the ideas of the two categories above while avoiding their shortcomings by localizing corner points of text bounding boxes and segmenting text regions in relative positions to detect scene text, which can handle long oriented text and only needs a simple NMS post-processing step. Ma et al. [258] develop a novel rotation-based approach and an end-to-end text detection system in which Rotation Region Proposal Networks (RRPN) generate inclined proposals with text orientation angle information.
6) Multi-domain object detection: Domain-specific detectors always achieve high detection performance on the specified dataset. To obtain a universal detector that is capable of working on various image domains, many recent works focus on training a multi-domain detector that does not require prior knowledge of the new domain of interest. Wang et al. [259] propose a universal detector which utilizes a new domain-attention mechanism and works on a variety of image domains (human faces, traffic signs and medical CT images) without prior knowledge of the domain of interest. They also establish a universal object detection benchmark consisting of 11 diverse datasets to better meet the challenges of generalization across domains.
To learn a universal representation for vision, Bilen et al. [260] add domain-specific BN (batch normalization) layers to a multi-domain shared network. Rebuffi et al. [261] propose adapter residual modules which achieve a high degree of parameter sharing while maintaining or even improving the accuracy of domain-specific representations. Rebuffi et al. [261] also introduce the Visual Decathlon Challenge, a benchmark containing ten very different visual domains. Inspired by transfer learning, Rebuffi et al. [262] empirically study efficient parameterizations and outperform traditional fine-tuning techniques.
Another requirement for multi-domain object detection is reducing annotation costs. Object detection datasets need heavy annotation work, which is time consuming and mechanical. Transferring pre-trained models from label-rich domains to label-poor datasets can solve label-poor detection tasks. One way is to utilize unsupervised domain adaptation methods to tackle the dataset bias problem. Recently researchers use adversarial learning to align the source and target distributions of samples. Chen et al. [263] utilize Faster R-CNN with a domain classifier trained to distinguish source and target samples, as in adversarial learning, while the feature extractor learns to deceive the domain classifier. Saito et al. [264] propose a weak alignment model to focus on similarity between different images from domains with large discrepancy rather than aligning images that are globally dissimilar. When manual annotations are available only in the source domain, unsupervised domain adaptation methods address this issue. Hauptmann et al. [265] propose an unsupervised domain adaptation method which models both the intra-class and the inter-class domain discrepancy.
7) Object detection in videos: Object detection in videos aims at detecting objects in videos, which brings additional challenges due to degraded image quality such as motion blur and video defocus, leading to unstable classifications for the same object across a video. Video detectors [ [269] first detect objects in each frame and then check them by linking detections of the same object in neighbor frames. Due to object motion, the same object in neighbor frames may not have a large overlap. On the other hand, the predicted object movements are not accurate enough to link neighbor frames. Tang et al. [276] propose an architecture which links objects in the same frame instead of neighboring frames.
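As an illustration of linking detections of the same object in neighbor frames, the sketch below greedily pairs boxes from two consecutive frames by intersection over union (IoU). It is a simplified stand-in, not the procedure of any cited detector; the function names, the `(x1, y1, x2, y2)` box format and the 0.5 overlap threshold are assumptions made for this example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_frames(prev_boxes, next_boxes, min_iou=0.5):
    """Greedily pair each box in the previous frame with the unmatched box
    in the next frame that overlaps it most, if the overlap reaches min_iou.

    Returns a list of (index_in_prev, index_in_next) pairs.
    """
    links = []
    used = set()
    for i, prev_box in enumerate(prev_boxes):
        best_j, best_overlap = None, min_iou
        for j, next_box in enumerate(next_boxes):
            if j in used:
                continue
            overlap = iou(prev_box, next_box)
            if overlap >= best_overlap:
                best_j, best_overlap = j, overlap
        if best_j is not None:
            used.add(best_j)
            links.append((i, best_j))
    return links


# Two slowly moving objects are linked across frames.
frame_t = [(0, 0, 10, 10), (20, 20, 30, 30)]
frame_t1 = [(21, 21, 31, 31), (1, 1, 11, 11), (100, 100, 110, 110)]
slow_links = link_frames(frame_t, frame_t1)

# A fast-moving object overlaps too little with its previous position.
fast_links = link_frames([(0, 0, 10, 10)], [(8, 0, 18, 10)])
```

The second call returns no link, which mirrors the failure case described above: when an object moves quickly, its boxes in neighbor frames overlap too little for overlap-based linking.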
8) Point clouds 3D object detection: Compared to image based detection, LiDAR point clouds provide reliable depth information that can be used to accurately localize objects and characterize their shapes. In autonomous navigation, autonomous driving, housekeeping robots and augmented/virtual reality applications, LiDAR point cloud based 3D object detection plays an important role. Point cloud based 3D object detection faces some challenges, because LiDAR point clouds are sparse and have highly variable point density, owing to non-uniform sampling of the 3D space, the effective range of the sensors, occlusion and relative pose variation. Engelcke et al. [277] first propose sparse convolutional layers and L1 regularization for efficient large-scale processing of 3D data. Qi et al. [278] propose an end-to-end deep neural network called PointNet, which learns point-wise features directly from point clouds. Qi et al. [279] improve PointNet so that it learns local structures at different scales. Zhou et al. [280] close the gap between RPN and point set feature learning for the 3D detection task by presenting a generic end-to-end 3D detection framework called VoxelNet, which learns a discriminative feature representation from point clouds and predicts accurate 3D bounding boxes simultaneously.
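The sparsity and variable density of point clouds can be seen by grouping points into a regular voxel grid. The sketch below shows only this grouping step with a hand-computed centroid per voxel; it is not the learned feature encoding of [280], and the function names and the voxel size of 1.0 are assumptions for illustration.

```python
import math
from collections import defaultdict


def voxelize(points, voxel_size):
    """Group (x, y, z) points by the integer index of the voxel containing them."""
    voxels = defaultdict(list)
    for x, y, z in points:
        key = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        voxels[key].append((x, y, z))
    return dict(voxels)


def voxel_centroids(voxels):
    """Summarize each non-empty voxel by the mean of its points.

    A learned detector would replace this hand-made summary with
    features produced by a network.
    """
    return {
        key: tuple(sum(coord) / len(pts) for coord in zip(*pts))
        for key, pts in voxels.items()
    }


cloud = [(0.1, 0.2, 0.3), (0.4, 0.4, 0.4), (1.5, 0.2, 0.1), (-0.2, 0.1, 0.1)]
grid = voxelize(cloud, voxel_size=1.0)
centroids = voxel_centroids(grid)
```

Only non-empty voxels are stored, so memory grows with the number of occupied cells rather than with the volume of the scene, and the per-voxel point counts expose the uneven density directly.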
In autonomous driving applications, Chen et al. [281] perform 3D object detection from a single monocular image. Chen et al. [282] take both LiDAR point clouds and RGB images as input and then predict oriented 3D bounding boxes for high-accuracy 3D object detection. An example 3D detection result is shown in Fig. 11.
9) 2D, 3D pose detection: Human pose detection aims at estimating the 2D or 3D pose location of the body joints and defining pose classes then returning the average pose of the top scoring class, as shown in Fig. 12 [291] propose an end-to-end architecture for joint 2D and 3D human pose estimation in natural images which predicts 2D and 3D poses of multiple people simultaneously. Benefiting from the full-body 3D pose, it can recover body part locations in cases of occlusion between different targets. Human pose estimation approaches can be divided into two categories, one-stage and multi-stage methods. The best performing methods [292][9][293][294] are typically based on one-stage backbone networks. The most representative multi-stage methods are convolutional pose machine [295], Hourglass network [286] and MSPN [296].

A. Conclusions
Deep learning based object detection has developed rapidly with the emergence of powerful computational devices. For deployment in applications that demand accuracy, the need for high-accuracy real-time systems is increasingly urgent. Because achieving detectors with both high accuracy and high efficiency is the ultimate goal of this task, researchers have pursued a series of directions, such as constructing new architectures, extracting rich features, exploiting good representations, improving processing speed, training from scratch, anchor-free methods, solving sophisticated scene issues (small objects, occluded objects), combining one-stage and two-stage detectors to obtain good results, improving the NMS post-processing method, solving the negative-positive imbalance issue, increasing localization accuracy and enhancing classification confidence. With increasingly powerful object detectors in the security, military, transportation, medical and life fields, the application of object detection is gradually becoming more extensive. In addition, a variety of branches in the detection domain have arisen.
Although this domain has achieved substantial progress recently, there is still much room for further development.
B. Trends
1) Combining one-stage and two-stage detectors: On the one hand, the two-stage detectors have a densely tailing process to obtain as many reference boxes as possible, which is time consuming and inefficient. To address this issue, researchers are required to eliminate much of this redundancy while maintaining high accuracy. On the other hand, the one-stage detectors achieve fast processing speed and have been used successfully in real-time applications. Although they are fast, their lower accuracy is still a bottleneck for high precision requirements. How to combine the advantages of both one-stage and two-stage detectors is still a big challenge.
2) Video object detection: In video object detection, motion blur, video defocus, motion target ambiguity, intense target movements, small targets, occlusion and truncation make it hard to achieve good performance in both real-life scenes and remote sensing scenes. Delving into moving targets and more complex data such as video is one of the key points for future research.
3) Efficient post-processing methods: In the three (for one-stage detectors) or four (for two-stage detectors) stage detection procedure, post-processing is an initial step for the final results. On most of the detection metrics, only the highest prediction result of one object can be sent to the metric program to calculate the accuracy score. Post-processing methods like NMS and its improvements may eliminate well localized but high classification confidence objects, which is detrimental to the accuracy of the measurement. A more efficient and accurate post-processing method is another direction for the object detection domain.
4) Weakly supervised object detection methods: Training on a high proportion of images labelled only with the object class but not with the object bounding box, in place of a large amount of fully annotated images, is highly efficient, and such images are easy to obtain. Weakly supervised object detection (WSOD) aims at utilizing a few fully annotated images (supervision) to detect the large amount of non-fully annotated ones. Thus developing WSOD methods is a significant problem for further study.
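To make the post-processing step discussed in trend 3) concrete, the following is a minimal sketch of standard greedy NMS. The detection format `(x1, y1, x2, y2, score)` and the IoU threshold of 0.5 are assumptions for this example, and the improved NMS variants mentioned in this survey are not reproduced here.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy non-maximum suppression.

    detections: list of (x1, y1, x2, y2, score) tuples.
    Repeatedly keeps the highest-scoring remaining detection and discards
    every other detection whose IoU with it exceeds iou_threshold.
    """
    remaining = sorted(detections, key=lambda d: d[4], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining if iou(best[:4], d[:4]) <= iou_threshold]
    return kept


candidates = [
    (0, 0, 10, 10, 0.9),
    (1, 1, 11, 11, 0.8),     # heavily overlaps the first box
    (50, 50, 60, 60, 0.7),   # separate object
]
kept = nms(candidates)
```

Because suppression is decided by the classification score alone, a box that fits the object better but scores lower than an overlapping neighbor is discarded, which is one reason post-processing remains an open direction.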
5) Multi-domain object detection: Domain-specific detectors always achieve high detection performance on the specified dataset. To obtain a universal detector that is capable of working on various image domains, a multi-domain detector that does not require prior knowledge of the new domain of interest can address this problem. Domain transfer is a challenging task for further study.
6) 3D object detection: With the advent of 3D sensors and diverse applications of 3D understanding, 3D object detection gradually becomes a hot research direction. Compared to 2D image based detection, LiDAR point cloud provides reliable depth information that can be used to accurately localize objects and characterize their shapes. LiDAR enables accurate localization of objects in the 3D space. Object detection techniques based on LiDAR data often outperform the 2D counterparts as well.
7) Salient object detection: Salient object detection (SOD) aims at highlighting salient object regions in images. Video object detection is to classify and locate objects of interest in a continuous scene. SOD is driven by and applied to a wide spectrum of object-level applications in various areas. Salient object regions of interest in each frame can assist accurate object detection in videos. Thus salient object detection is a vital preprocessing step for high-level recognition tasks and challenging detection tasks.
8) Unsupervised object detection: Supervised methods are time consuming and inefficient in the training process, because they need a well annotated dataset as supervision information. Annotating a bounding box for each object in large datasets is expensive, laborious and impractical. Developing automatic annotation technology to relieve human annotation work is a promising trend for unsupervised object detection. Unsupervised object detection is a future research direction for intelligent detection.
9) Multi-task learning: Aggregating multi-level features of the backbone network is a significant way to enhance detection performance. Furthermore, performing multiple computer vision tasks simultaneously, such as object detection, semantic segmentation, instance segmentation, edge detection and highlight detection, can enhance the performance of each separate task by a large margin because of the richer information. Adopting multi-task learning is a good way to aggregate multiple tasks in one network, and it presents great challenges to researchers to maintain processing speed while improving accuracy.
10) Multi-source information assistance: Due to the popularity of social media and the development of big data technology, multi-source information has become easy to access. Much social media content provides both pictures and textual descriptions of them, which can help the detection task. Fusing multi-source information is an emerging research direction with the progress of various technologies.
11) Constructing terminal object detection system: From the cloud to the terminal, the terminalization of artificial intelligence can help people deal with mass information and solve problems better and faster. With the emergence of lightweight networks, terminalized detectors are being developed into more efficient and reliable devices with broad application scenarios. Chip detection networks based on FPGA will make real-time application possible.
12) Medical imaging and diagnosis: The FDA (U.S. Food and Drug Administration) is promoting AI in medical devices and, in April 2018, first approved an AI software product, IDx-DR, a diabetic retinopathy detector that achieves higher than 87.4% precision. For customers, the combination of image recognition systems and mobile devices can make the cell phone a powerful family diagnostic tool. This direction is full of challenges and expectations.
13) Advanced medical biometrics: Utilizing deep neural networks, researchers have begun to study and measure atypical risk factors that had previously been difficult to quantify. Using neural networks to analyze retinal images and speech patterns may help identify the risk of heart disease. In the near future, medical biometrics will be used for passive monitoring.
14) Remote sensing airborne and real-time detection: Both military and agricultural fields require accurate analysis of remote sensing images. Automated detection software and integrated hardware will bring unprecedented development to these fields. Deploying deep learning based object detection systems on an SoC (System on Chip) enables real-time high-altitude detection.
15) GAN based detector: Deep learning based systems always require large amounts of data for training, while the Generative Adversarial Network (GAN) is a powerful structure for generating fake images and can produce as much data as needed. Training an object detector on a mixture of real-world scenes and simulated data generated by a GAN makes the detector more robust and gives it stronger generalization ability.
The research of object detection still needs further study. We hope that deep learning methods will make breakthroughs in the near future.