Exploring Deep Learning-Based Architecture, Strategies, Applications and Current Trends in Generic Object Detection: A Comprehensive Review

Object detection is a fundamental but challenging problem in generic image analysis; it plays an important role in a wide range of applications and has received special attention in recent years. Although numerous methods exist, an in-depth review of the literature on generic object detection is still lacking. This paper provides a comprehensive survey of recent advances in visual object detection with deep learning. Covering about 300 publications, we survey 1) region proposal-based object detection methods such as R-CNN, SPP-net, Fast R-CNN, Faster R-CNN, Mask R-CNN, R-FCN, and FPN; 2) classification/regression-based object detection methods such as YOLO (v2 to v5), SSD, DSSD, RetinaNet, RefineDet, CornerNet, EfficientDet, and M2Det; and 3) some of the latest detectors, such as the relation network for object detection, DCN v2, and NAS-FPN. Moreover, five publicly available benchmark datasets and their standard evaluation metrics are discussed. We mainly focus on the application of deep learning architectures to five major domains, namely object detection in surveillance, military, transportation, medical, and daily life. The survey covers in detail a variety of factors that affect detection performance, such as i) the wide range of object categories and intra-class variations, and ii) limited storage capacity and computational power. Finally, we conclude the survey by identifying fifteen current trends and promising directions for future research.


I. INTRODUCTION
Object detection combines image classification with precise object localization to provide a complete understanding of the image. Previously, object detection relied on manual feature extraction followed by shallow trainable architectures. With the advent of deep learning tools, which can learn semantic, deep-level features, many limitations of traditional object detection techniques have been overcome. Generic object detection is further divided into categories such as face detection [1], pedestrian detection [2], and skeleton detection [3]. It is a fundamental computer vision process that provides detailed semantic information about images and videos.
The associate editor coordinating the review of this manuscript and approving it for publication was Naveed Akhtar.
Object detection has many applications in various fields, such as human behavior analysis [4], face recognition [5], image classification [6], medical diagnosis, and autonomous driving [7], [8], and it has recently gained the attention of many researchers [9], [10]. Object detection comprises two operations: object localization, which determines where an object is in the image, and object classification, which determines the category to which the object belongs. Localization is difficult because of occlusions and significant variations in viewpoint, scale, pose, and lighting.
Traditional object detection models have three main modules: informative region selection, extraction of features, and classification.
INFORMATIVE REGION SELECTION is the process of locating objects that may appear at any position in the image, with varying aspect ratios and sizes. It uses a multi-scale sliding window to scan the whole image. Because it exhaustively considers all possible object locations, region selection is computationally expensive and produces many redundant windows, while a fixed set of sliding windows also yields many unnecessary regions.
FEATURE EXTRACTION is a fundamental step in object recognition that extracts visual features to provide a semantic and robust representation of the object. Representative features include SIFT [11], HOG [12], and Haar-like [13] features. Because of the diversity of appearance, illumination conditions, and backgrounds, it is difficult to manually design a robust feature descriptor that describes all kinds of objects perfectly.
VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
CLASSIFICATION is the process of distinguishing the target object from all other categories. It also makes the representation more informative, semantic, and hierarchical for visual recognition. Some effective classifiers are the support vector machine (SVM) [15], AdaBoost [14], and the deformable part-based model (DPM) [16].
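To make the cost of exhaustive region selection concrete, the following is a minimal Python sketch of a multi-scale sliding window. The window sizes, the stride, and the function name `sliding_windows` are illustrative assumptions and are not taken from the works cited above.

```python
from typing import Iterator, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def sliding_windows(img_w: int, img_h: int,
                    sizes: Sequence[Tuple[int, int]] = ((64, 64), (128, 64), (64, 128)),
                    stride: int = 32) -> Iterator[Box]:
    """Yield every window of each (width, height) that fits inside the image."""
    for win_w, win_h in sizes:
        for y in range(0, img_h - win_h + 1, stride):
            for x in range(0, img_w - win_w + 1, stride):
                yield (x, y, x + win_w, y + win_h)


# Hypothetical 256 x 256 image scanned with three window shapes.
windows = list(sliding_windows(256, 256))
print(len(windows))  # -> 119
```

Even this coarse setting yields 119 candidate windows for a small image, most of which overlap heavily, which illustrates why exhaustive scanning is redundant.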
The modern era of computer vision began with the development of the deep neural network (DNN), and a major breakthrough in object detection came with the introduction of Regions with CNN features (R-CNN). DNNs differ from traditional approaches in their deeper architectures, their ability to learn sophisticated features, and robust training algorithms that learn informative object representations without manually designed features [17].
Since the introduction of R-CNN, many improved models have been proposed for object detection, such as Fast R-CNN, which jointly optimizes the classification and bounding-box regression tasks [18].
Faster R-CNN generates region proposals with an additional sub-network [9], while YOLO performs object detection through fixed-grid regression [19]. All of these algorithms detect objects better and more accurately than the basic R-CNN, which makes real-time object detection more achievable.
Salient object detection relies on pixel-level segmentation and local contrast enhancement, while generic object detection relies on bounding box (BB) regression. Generic object detection is closely related to pedestrian and face detection, which adapt it through multi-scale processing and multi-feature fusion. Face and pedestrian images have regular geometric structures, whereas generic objects show complex variations in structure and layout.

II. CONTRIBUTIONS
Numerous surveys on generic object detection have been published in recent years. In this survey, we discuss many of the most advanced deep learning-based object detection models. The main differences between this article and previous studies are as follows:
1. This article gives a comprehensive overview of state-of-the-art object detectors in the light of technical assessment. The history of object detection covered here spans three decades (1990 to 2020). Most previous surveys focus on a limited historical period or on specific detection tasks, without considering technical evaluations. The historical road map presented in this survey not only helps readers build a complete understanding of the field but also helps them find future directions in this fast-growing area.
2. Unlike previous object detection surveys, this survey systematically and comprehensively reviews the key technologies of deep learning-based object detection methods. Following the development of the latest models, new trends have emerged around techniques such as bounding-box regression, hard negative mining, and multi-scale detection.
3. An in-depth analysis and discussion of various aspects of object detection is provided for the first time in the field.
The rest of the paper is organized as follows. Section II compares this survey with previous surveys. Section III gives a brief history of deep learning and a brief introduction to the underlying CNN architecture. Section IV outlines the latest methods for detecting generic objects, together with the backbone frameworks (used for base feature extraction), benchmark datasets, and performance evaluation metrics. Section V describes the role of object detection in five different fields. The last section presents some exciting trends and future directions.

III. COMPARISON WITH PREVIOUS SURVEYS
Many impressive surveys of object detection have been published, as shown in Table 1. Several of them address the detection of particular objects, such as text detection [26], face detection [27], [28], pedestrian detection [2], [29], [30], and vehicle detection [31]. Some recent surveys focus directly on generic object detection issues rather than working principles. However, most of the research reviewed in [32]-[34] dates from before 2012, that is, before the striking success of deep learning methods. By allowing systems to learn abstract, complex, and subtle representations, deep learning has led to significant advances in areas such as object detection, natural language processing, genomics, speech recognition, drug discovery, medical imaging, and visual recognition.
Although many deep learning-based surveys have been published, such as [17], [35], [36], recent improvements in the field of object detection still need to be brought together, especially for new researchers who want to work in computer vision. The scope of this paper is generic object detection; specific object detection tasks such as face recognition [37]-[39], pedestrian detection [40], [41], vehicle detection [42], and traffic sign detection [43] are not considered.
FIGURE 1. Object detection application domain: object detection has two main sub-domains, salient object detection and generic object detection, which further divides into two branches (face detection and pedestrian detection). Saliency detection is studied in the context of the visual system. Pedestrian detection is an essential task of any surveillance system, while face detection is used for security purposes.

IV. DEEP LEARNING: A BRIEF HISTORY
Before going into the details of deep learning-based object detection, it is essential to explore the benefits of deep learning-based architectures (i.e., CNNs). A neural network with a deep architecture is known as a deep model. The era of neural networks began in the 1940s [53]; the basic idea was to solve general learning problems by mimicking the human brain. The popularity of deep learning increased in the late 1980s and 1990s with the development of the backpropagation algorithm proposed by Hinton et al. [54].
In the early 2000s, the popularity of deep learning began to decline because of a lack of large datasets, high computational requirements, and performance that was insignificant compared with other machine learning tools. Deep learning regained popularity from 2006 onward, driven by surprisingly strong results in speech recognition [55]. Some of the factors behind this recovery are listed below:
1. The availability of large annotated training datasets such as ImageNet [56] is the main reason for its success.
2. High-performance parallel computing systems, such as GPU clusters, became available.
3. Deep learning model architectures and training strategies advanced significantly. Auto-Encoder (AE) [57] and Restricted Boltzmann Machine (RBM) [58] provide a good initialization through unsupervised, layer-wise pre-training. Overfitting during training can be reduced with data augmentation and dropout regularization [59], [60], and batch normalization (BN) speeds up the training of deep neural networks [61]. The era of high performance began with advances in network architecture, such as AlexNet [59], GoogLeNet [62], VGG [63], OverFeat [64], and ResNet [65].
The basis of deep learning models is a typical Convolutional Neural Network (CNN), such as VGG16 [17]. Each layer of a CNN is called a feature map. The feature map of the input layer is a 3D matrix of pixel intensities for the three color channels (red, green, and blue).
A feature map in an inner layer is a multi-channel image whose pixel values can be regarded as specific features. Each neuron is connected to a small group of adjacent neurons in the previous layer (its receptive field). Filtering and pooling transformations applied to feature maps create more robust feature representations [59], [66], [67]. The filtering transformation convolves a filter matrix with the values in a neuron's receptive field and obtains the final response by applying a non-linear activation function such as ReLU or sigmoid [68]. Pooling operations, such as max pooling, average pooling, L2 pooling, global pooling, and local contrast normalization [69], then summarize the responses to make the features more robust.
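The filtering, activation, and pooling steps described above can be sketched in plain Python on a single-channel feature map. This is a minimal illustration under simplifying assumptions (one channel, no padding, stride 1, a hand-picked 2 x 2 kernel); the function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Stride-1 filtering without padding (cross-correlation, as in CNN layers)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def relu(fmap: Matrix) -> Matrix:
    """Element-wise ReLU non-linearity."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool(fmap: Matrix, size: int = 2) -> Matrix:
    """Non-overlapping max pooling; incomplete border windows are dropped."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]


image = [[1, 2, 0, 1, 3],
         [0, 1, 2, 1, 0],
         [2, 0, 1, 3, 1],
         [1, 1, 0, 2, 2],
         [0, 3, 1, 0, 1]]
kernel = [[1, 0],
          [0, -1]]  # responds to differences along the main diagonal

pooled = max_pool(relu(conv2d_valid(image, kernel)))
print(pooled)
```

The 5 x 5 input becomes a 4 x 4 response map after filtering and a 2 x 2 map after pooling, which shows how each stage reduces resolution while keeping the strongest responses.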
Convolutional and pooling layers are interleaved to build the initial feature hierarchy, and multiple fully connected (FC) layers are added on top so that the network can be trained in a supervised manner for various visual tasks. Depending on the task, a task-specific activation function is applied to the output layer to obtain a conditional probability for each output neuron. Finally, the network is optimized by stochastic gradient descent (SGD) on an objective function such as mean squared error (MSE) or cross-entropy loss. Because the FC layers expect a fixed input size, cropping or rescaling operations are needed to handle images of different sizes.
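As a small illustration of the output activation, the objective functions, and an SGD update mentioned above, the sketch below uses plain Python. Treating the logits themselves as the trainable parameters is a simplifying assumption made only to keep the example short; in a real network the gradient would be propagated back to the weights. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: class probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: Sequence[float], target_index: int) -> float:
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[target_index])


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equally long vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def sgd_step_on_logits(logits: Sequence[float], target_index: int,
                       lr: float = 0.1) -> List[float]:
    """One SGD step; d(cross-entropy)/d(logit k) = p_k - 1[k == target]."""
    probs = softmax(logits)
    return [z - lr * (p - (1.0 if k == target_index else 0.0))
            for k, (z, p) in enumerate(zip(logits, probs))]


logits = [0.0, 0.0, 0.0]
loss_before = cross_entropy(softmax(logits), 2)
loss_after = cross_entropy(softmax(sgd_step_on_logits(logits, 2)), 2)
print(loss_before, loss_after)
```

A single step moves probability mass toward the true class, so the cross-entropy after the update is lower than before.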
Some of the advantages of CNNs over traditional methods are listed below:
1. The expressive capability of a deep neural network is far higher than that of conventional methods.
2. A deep neural network can automatically learn a hierarchy of features directly from data through its multi-stage structure, yielding a multi-level representation that ranges from pixels to high-level semantic features.
3. A CNN architecture can jointly optimize several related tasks in a multi-task learning manner, such as the bounding box regression and classification combined in Fast R-CNN [18].

V. GENERIC OBJECT DETECTION
Generic object detection is the process of localizing an object in the image with a rectangular bounding box, classifying it with a label, and assigning a confidence score to the detection. Generic object detectors are divided into two categories: region proposal-based detectors and regression/classification-based detectors.
The region proposal-based detector follows the traditional object detection pipeline: it first generates region proposals and then classifies each proposal into an object category. R-CNN [10], SPP-net [77], Fast R-CNN [18], Faster R-CNN [9], R-FCN [78], FPN [79], and Mask R-CNN [80] are examples of region proposal frameworks. The regression/classification-based detector treats object detection as a regression problem over spatially separated bounding boxes and their associated class probabilities. A single neural network predicts the bounding boxes and class probabilities directly from the whole image in one evaluation. Classification/regression-based frameworks mainly comprise MultiBox [81], AttentionNet [82], G-CNN [83], YOLO [19], SSD [84], YOLOv2 [85], DSSD [86], and DSOD [87]. The relationships among these methods are shown in FIGURE 2.

A. REGION PROPOSAL OBJECT DETECTOR
The region proposal-based object detection framework resembles the attention mechanism of the human brain: it first scans the entire scene and then focuses on regions of interest. Among earlier related work, OverFeat [64] showed the most promising performance. It was the first to apply a CNN in a sliding-window fashion, predicting bounding boxes directly from locations of the topmost feature map after obtaining the confidence of the underlying object categories.

1) R-CNN
Detection accuracy depends on improving the quality of candidate bounding boxes and on using a deep architecture to extract high-level features. In 2014, Ross Girshick proposed an object detection model called R-CNN to address both problems and achieved a 30% improvement over the previous best method (DPM HSC [88]) on PASCAL VOC 2012.
The R-CNN model consists of three modules: region proposal generation, CNN-based deep feature extraction, and classification/localization. The architecture of R-CNN is shown in FIGURE 3.

a: GENERATION OF REGION PROPOSAL
R-CNN uses selective search [89] to generate about 2,000 region proposals from each image. Selective search relies on bottom-up grouping and saliency cues to quickly provide more accurate candidate boxes of arbitrary size and to reduce the search space for object detection [22], [56].

b: DEEP FEATURE EXTRACTION BASED ON CNN
At this stage, each region proposal is warped or cropped to a fixed resolution, and the CNN module [59] extracts a 4096-dimensional feature from it. Owing to the large learning capacity, strong expressive power, and hierarchical structure of CNNs, a high-level, semantic, and robust feature representation is obtained for each region proposal.

c: LOCALIZATION AND CLASSIFICATION
At this stage, the region proposals are scored with pre-trained category-specific linear SVMs, with object regions treated as positives and background regions as negatives. The final object locations are obtained by adjusting the scored regions with bounding box (BB) regression and filtering them with non-maximum suppression (NMS). Pre-training is typically used to cope with scarce labeled data: instead of unsupervised pre-training, R-CNN first performs supervised pre-training on a large auxiliary dataset such as ILSVRC and then applies domain-specific fine-tuning.
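The filtering step can be illustrated with a minimal sketch of greedy non-maximum suppression based on intersection over union (IoU). The IoU threshold of 0.5 and the function names are assumptions chosen for illustration; the text above does not specify them.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # -> [0, 2]
```

The second box overlaps the first with an IoU of 81/119 (about 0.68), so it is suppressed, while the distant third box is kept.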
Aside from its significant improvement over traditional methods, R-CNN still has several disadvantages that need to be highlighted.
1. The fully connected (FC) layers require a fixed-size input, so the whole CNN must be recomputed for each region proposal, which increases the test time.
2. Training is a multi-stage pipeline, which is a major drawback of R-CNN. First, the convolutional network (ConvNet) is fine-tuned on region proposals. Then the softmax classifier learned during fine-tuning is replaced by SVMs fitted to the ConvNet features. Finally, the bounding-box regressors are trained.
3. The R-CNN training phase is expensive in time and space. The features extracted from each region proposal are stored on disk, so even a small dataset takes a long time to process with a deep network such as VGG16, and the storage required for these features is also considerable.
4. Region proposal generation with selective search is time-consuming and produces a large number of redundant regions.
Many strategies have been proposed to address these issues. MCG [90] explores different scales of the image to form multiple hierarchical segmentations and combines different regions to produce proposals. GOP [91] replaces the traditional graph cut approach with geodesic-based segmentation. Instead of producing distinct segments, the Edge Boxes method [92] favors bounding boxes that have fewer contours straddling their boundaries. DeepBox [93] and SharpMask [94] rerank or refine pre-extracted proposals to remove unnecessary ones.
Some researchers have also addressed inaccurate localization with better strategies. Gupta et al. [95] propose object detection on RGB-D images based on semantic segmentation, using a geocentric embedding to encode the pixels of depth images; combined with a superpixel classification framework, the detector gives promising results on semantic segmentation tasks. Zhang et al. [96] refine bounding boxes sequentially with a Bayesian optimization-based search algorithm and train class-specific CNN classifiers with a structured loss that penalizes localization inaccuracy. Ouyang et al. [97] propose a technique based on a deformable CNN that uses a deformation-constrained pooling layer (def-pooling) to impose a geometric penalty on the deformation of object parts.

2) SPP-Net
R-CNN warps or crops each region proposal because its fully connected layers accept only fixed-size input. Cropping can lose part of the object's content, and warping can introduce geometric distortion. Both effects can reduce detection accuracy, especially when object scales vary. To remove the fixed-size constraint, a CNN architecture named SPP-net, based on the theory of spatial pyramid matching (SPM) [98], [99], was proposed in [77]. Following SPM, SPP-net partitions the image into a number of divisions at several finer-to-coarser scales and aggregates the quantized local features into a mid-level representation.
The architecture of SPP-net is shown in FIGURE 4. It reuses the feature maps of the fifth conv layer (conv5) to project region proposals of arbitrary size onto fixed-length feature vectors. A comparison between R-CNN and SPP-net is shown in FIGURE 5.
These feature maps can be reused because they encode not only the strength of local responses but also their spatial positions [77]. The spatial pyramid pooling layer (SPP layer) is stacked after the final conv layer. With 256 feature maps in conv5 and a three-level pyramid (1 × 1, 2 × 2, and 4 × 4 bins), the final feature vector of each region proposal after the SPP layer has 256 × (1 + 4 + 16) = 5376 dimensions. SPP-net obtains better results when the scale of each region proposal is estimated accurately, and it also improves test-time efficiency by sharing computation across region proposals.
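The fixed-length property of spatial pyramid pooling can be checked with a short plain-Python sketch. It assumes max pooling inside each bin and a three-level pyramid with 1 × 1, 2 × 2, and 4 × 4 bins, which is the configuration implied by the 5376-dimensional vector; the bin-boundary rule (floor for the start, ceiling for the end) and the function name `spp` are illustrative assumptions.

```python
import math
from typing import List, Sequence

FeatureMaps = Sequence[Sequence[Sequence[float]]]  # indexed [channel][row][col]


def spp(feature_maps: FeatureMaps, levels: Sequence[int] = (1, 2, 4)) -> List[float]:
    """Max-pool every channel into n x n bins for each pyramid level n."""
    vector: List[float] = []
    for n in levels:
        for channel in feature_maps:
            h, w = len(channel), len(channel[0])
            for i in range(n):
                r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
                for j in range(n):
                    c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                    vector.append(max(channel[r][c]
                                      for r in range(r0, r1)
                                      for c in range(c0, c1)))
    return vector


# Two hypothetical conv5 outputs with 256 channels but different spatial sizes.
maps_a = [[[0.0] * 13 for _ in range(13)] for _ in range(256)]
maps_b = [[[0.0] * 9 for _ in range(6)] for _ in range(256)]
print(len(spp(maps_a)), len(spp(maps_b)))  # -> 5376 5376
```

Both inputs produce a vector of the same length, which is what allows the FC layers to accept region proposals of arbitrary size.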

3) FAST R-CNN
Although SPP-net has shown impressive improvements in efficiency and accuracy over R-CNN, it still has notable drawbacks. It follows a multi-stage pipeline (feature extraction, network fine-tuning, SVM training, and bounding-box regressor fitting) and therefore requires additional storage space. Furthermore, the fine-tuning algorithm in [77] cannot update the conv layers that precede the SPP layer, which can reduce the accuracy of very deep networks. To overcome these problems, a novel CNN architecture with a multi-task loss for classification and bounding box regression, called Fast R-CNN [18], was proposed.
SPP-net [77]: the SPP (spatial pyramid pooling) layer is inserted between the conv layers and the FC layers. Conv5, the last conv layer, contains 256 filters. SPP-net divides the input feature map into sub-images and extracts a patch feature from each sub-image.
Fast R-CNN has an architecture similar to that of SPP-net, except that its RoI (region of interest) pooling layer is a special case of the SPP layer with a single pyramid level, as shown in FIGURE 6. Fast R-CNN processes the whole image with conv layers to generate a feature map and then uses the RoI pooling layer to extract a fixed-length feature vector for each region proposal.
Each feature vector is fed into two consecutive fully connected layers before branching into two sibling output layers. One layer produces softmax probabilities over C + 1 categories (C object classes plus one background class), while the other outputs the refined bounding box position (four real-valued coordinates).
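The RoI pooling step that precedes the two output branches can be sketched as single-level max pooling over the part of the feature map covered by one region proposal. The sketch assumes a single channel, an RoI already expressed in integer feature-map coordinates with half-open bounds, and a 2 × 2 output grid; these choices and the function name `roi_max_pool` are illustrative only.

```python
import math
from typing import List, Sequence, Tuple


def roi_max_pool(fmap: Sequence[Sequence[float]],
                 roi: Tuple[int, int, int, int],
                 out_h: int = 2, out_w: int = 2) -> List[List[float]]:
    """Max-pool the RoI (x1, y1, x2, y2), end-exclusive, into an out_h x out_w grid."""
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        r0 = y1 + (i * h) // out_h
        r1 = y1 + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            c0 = x1 + (j * w) // out_w
            c1 = x1 + math.ceil((j + 1) * w / out_w)
            row.append(max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Hypothetical 4 x 6 feature map whose value at (row, col) is 6 * row + col.
fmap = [[6 * r + c for c in range(6)] for r in range(4)]
print(roi_max_pool(fmap, (1, 0, 6, 3)))  # -> [[9, 11], [15, 17]]
```

Whatever the size of the RoI, the output grid has the same shape, so every proposal yields a feature vector of fixed length for the FC layers.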
A multi-task loss is used to optimize the parameters in an end-to-end manner. For each RoI with ground-truth class u and ground-truth bounding box regression target v, the multi-task loss for classification and bounding box regression is defined as

$$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda [u \geq 1] L_{loc}(t^u, v) \quad (1)$$

where $L_{cls}(p, u) = -\log p_u$ is the log loss for the ground-truth class u, $p_u$ is taken from the discrete probability distribution $p = (p_0, \ldots, p_C)$ computed by the last FC layer over the C + 1 outputs, and $\lambda$ balances the two terms. $L_{loc}(t^u, v)$ is evaluated on the predicted offsets $t^u = (t^u_x, t^u_y, t^u_w, t^u_h)$, where x, y, w, and h denote the two coordinates of the bounding box center, the width, and the height, respectively. Each $t^u$ adopts the parameterization in [10], which specifies a scale-invariant translation and a log-space height/width shift relative to an object proposal. The Iverson bracket indicator function $[u \geq 1]$ omits all background RoIs (u = 0).
A smooth $L_1$ loss is used to fit the bounding box regressors; it is more robust to outliers and less sensitive to exploding gradients:

$$L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t^u_i - v_i) \quad (2)$$

where

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$

Back-propagation through the SPP layer is inefficient when the training instances (i.e., RoIs) come from different images. Fast R-CNN therefore samples mini-batches hierarchically: it first samples N images at random and then samples R/N RoIs from each image, where R is the total number of RoIs in the mini-batch. Critically, RoIs from the same image share computation and memory in the forward and backward passes. In contrast, computing the FC layers takes a large share of the forward-pass time [18].
FIGURE 6. Fast R-CNN framework [18]: it consists of a pre-trained CNN (trained on the ImageNet classification task) in which an RoI pooling layer replaces the final pooling layer, and two branches replace the final FC layers: 1. a softmax layer (K+1 categories), and 2. a bounding box regression branch.
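The multi-task loss of Fast R-CNN, with its log loss for classification and smooth L1 loss for localization, can be written as a short plain-Python sketch. The balancing weight `lam` (set to 1 here) and the function names are assumptions for illustration.

```python
import math
from typing import Sequence


def smooth_l1(x: float) -> float:
    """Quadratic near zero, linear for |x| >= 1, so large errors grow slowly."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def multitask_loss(p: Sequence[float], u: int,
                   t_u: Sequence[float], v: Sequence[float],
                   lam: float = 1.0) -> float:
    """L = L_cls(p, u) + lam * [u >= 1] * L_loc(t_u, v) for a single RoI.

    p   : class probabilities over C + 1 classes (index 0 is background)
    u   : ground-truth class index
    t_u : predicted box offsets (x, y, w, h) for class u
    v   : ground-truth regression targets (x, y, w, h)
    """
    l_cls = -math.log(p[u])
    l_loc = sum(smooth_l1(t - g) for t, g in zip(t_u, v))
    return l_cls + (lam * l_loc if u >= 1 else 0.0)


# Foreground RoI: both terms contribute.
print(multitask_loss([0.5, 0.5], 1, (0.5, 0.0, 0.0, 2.0), (0.0, 0.0, 0.0, 0.0)))
# Background RoI (u = 0): the localization term is switched off.
print(multitask_loss([0.8, 0.2], 0, (5.0, 5.0, 5.0, 5.0), (0.0, 0.0, 0.0, 0.0)))
```

In the first call the localization term is 0.125 + 1.5 = 1.625, which shows how an error of 2 is penalized linearly rather than quadratically.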
Truncated singular value decomposition (SVD) [101] can be used to compress the FC layers and accelerate testing. Fast R-CNN trains all layers of the network in a single stage with a multi-task loss. It avoids the additional storage cost and improves accuracy with a more effective training scheme.
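The compression obtained from truncated SVD can be stated compactly. The block below is a sketch under the assumption that an FC layer has a weight matrix $W$ of size $u \times v$ and that only the $t$ largest singular values are kept; the symbols $u$, $v$, and $t$ are introduced here for illustration and do not appear in the text above.

```latex
% Rank-t approximation of an FC weight matrix W (u x v)
W \approx U \, \Sigma_t \, V^{T},
\qquad U \in \mathbb{R}^{u \times t},\quad
\Sigma_t \in \mathbb{R}^{t \times t} \text{ (diagonal)},\quad
V \in \mathbb{R}^{v \times t}

% Parameter count before and after the factorization
uv \;\longrightarrow\; t\,(u + v)
```

The single FC layer is thus replaced by two smaller layers, one with weights $\Sigma_t V^{T}$ and one with weights $U$, which saves parameters and computation whenever $t$ is much smaller than $\min(u, v)$.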

4) FASTER R-CNN
Most state-of-the-art object detection models depend on separate region proposal methods such as Edge Boxes and selective search, and this dependence has become a bottleneck for further improvement. To address this issue, Ren et al. [9] proposed the Region Proposal Network (RPN), which shares full-image conv features with the detection network. RPN is a fully convolutional network that simultaneously predicts object bounding boxes and confidence scores at each position. Analogous to [89], RPN takes an image of arbitrary size and generates a set of rectangular object proposals. RPN operates on a specific conv layer, and the preceding layers are shared with the object detection network.
FIGURE 7 shows the architecture of RPN. A small network slides over the conv feature map and is fully connected to an n × n spatial window. Each sliding window is mapped to a low-dimensional vector, which is fed into two sibling FC layers: the bounding box (BB) regression layer (reg) and the classification layer (cls). This architecture is implemented with an n × n conv layer followed by two sibling 1 × 1 conv layers, with a non-linearity (ReLU) applied to the output of the n × n conv layer. Regression toward the true bounding box is computed relative to reference boxes (anchors). Faster R-CNN uses anchors of three scales and three aspect ratios. Its loss function has a multi-task form similar to (1):

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$

where $p_i$ is the predicted probability that the i-th anchor is an object. The ground-truth label $p_i^*$ equals one for a positive anchor and zero otherwise.
The four parameterized coordinates of the predicted bounding box are stored in $t_i$, whereas $t_i^*$ stores those of the ground-truth box that overlaps a positive anchor. $L_{cls}$ is a binary log loss and $L_{reg}$ is a smooth $L_1$ loss similar to (2). The two terms are normalized by the mini-batch size ($N_{cls}$) and the number of anchor locations ($N_{reg}$), respectively, and $\lambda$ is a balancing weight. Because it is based on a fully convolutional network, Faster R-CNN can be trained end to end with back-propagation and SGD. With the introduction of Faster R-CNN, region proposal-based CNN detectors can be trained in an end-to-end manner. However, RPN produces object-like regions (including backgrounds) rather than object instances, and it has difficulty dealing with objects of extreme scales or shapes.
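The anchors used by RPN can be illustrated with a minimal sketch that builds the reference boxes for one sliding-window position from three scales and three aspect ratios. The concrete scale values (128, 256, 512 pixels), the ratio values (0.5, 1, 2), and the function name `make_anchors` are assumptions for illustration; the text above states only that three scales and three aspect ratios are used.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def make_anchors(cx: float, cy: float,
                 scales: Sequence[float] = (128, 256, 512),
                 ratios: Sequence[float] = (0.5, 1.0, 2.0)) -> List[Box]:
    """Return len(scales) * len(ratios) anchors centered at (cx, cy).

    Each anchor has area scale**2 and aspect ratio height / width = ratio.
    """
    anchors: List[Box] = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(0.0, 0.0)
print(len(anchors))   # -> 9
print(anchors[1])     # -> (-64.0, -64.0, 64.0, 64.0)
```

The regression layer then predicts offsets relative to each of these nine boxes, so a single sliding-window position can propose objects of several sizes and shapes.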

5) R-FCN (REGION BASE FULLY CONVOLUTIONAL NETWORK)
R-FCN is motivated by the structure of RoI pooling-based detectors, which divide a deep network into two subnetworks: a shared fully convolutional subnetwork that is independent of RoIs, and an unshared RoI-wise subnetwork. This arrangement mirrors early classification architectures (e.g., AlexNet [59] and VGG16 [18]), which consist of a convolutional subnetwork and several fully connected (FC) layers separated by a specific spatial pooling layer. The newer state-of-the-art classification networks, such as Residual Nets (ResNet) [65] and GoogLeNet [62], [102], are fully convolutional. It is therefore natural to adapt these architectures by building a fully convolutional object detection network without an RoI-wise subnetwork, but this naive solution turns out to be inferior [65]. The inconsistency arises because image classification favors translation invariance, whereas object detection requires translation variance: shifting an object within the image should not affect the classification result, while any translation of the object within a bounding box is meaningful for the object detection process. Manually inserting the RoI pooling layer into the convolutions breaks translation invariance, but at the expense of additional unshared region-wise layers. Li et al. [78] therefore propose a region-based fully convolutional architecture, as shown in FIGURE 8.
FIGURE 7. The RPN in the framework of Faster R-CNN [9]. It aggregates the region proposal network with the CNN model. Faster R-CNN is composed of RPN and Fast R-CNN with shared conv layers. K pre-defined anchor boxes are convolved with each sliding window to produce fixed-length vectors, which are taken by the classifier and regressor layers to obtain the corresponding outputs [9].
R-FCN uses a final conv layer to produce k² position-sensitive score maps for each category, corresponding to a fixed k × k grid of relative positions, and a position-sensitive RoI pooling layer to aggregate the responses from these score maps. For each RoI, the k² position-sensitive scores are averaged to produce a (C + 1)-dimensional vector, from which the classification across categories is computed. Class-agnostic bounding boxes are obtained by appending another 4k²-dimensional conv layer. With R-FCN, more powerful fully convolutional classification networks can be adopted for object detection by sharing nearly all layers, and state-of-the-art results were obtained on both the PASCAL VOC [103] and Microsoft COCO [104] datasets at a test speed of 170 ms per image [105].
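Position-sensitive RoI pooling can be sketched in plain Python as follows. Each of the k × k bins of an RoI reads only from its own group of score maps, and the k² bin responses are then averaged into one score per class. The channel layout, the average pooling inside each bin, the integer half-open RoI coordinates, and the function names are assumptions made for this illustration.

```python
import math
from typing import List, Sequence, Tuple

Map2D = Sequence[Sequence[float]]


def bin_range(start: int, length: int, n_bins: int, idx: int) -> Tuple[int, int]:
    """Half-open [lo, hi) cell range of bin idx when length cells are split into n_bins."""
    lo = start + (idx * length) // n_bins
    hi = start + math.ceil((idx + 1) * length / n_bins)
    return lo, hi


def ps_roi_pool(score_maps: Sequence[Map2D], roi: Tuple[int, int, int, int],
                k: int, num_classes: int) -> List[float]:
    """Return C + 1 scores for one RoI.

    Assumed layout: score_maps[(i * k + j) * (C + 1) + c] is the map
    for grid bin (i, j) and class c, with class 0 as background.
    """
    x1, y1, x2, y2 = roi
    n_out = num_classes + 1
    votes = [0.0] * n_out
    for i in range(k):
        r0, r1 = bin_range(y1, y2 - y1, k, i)
        for j in range(k):
            c0, c1 = bin_range(x1, x2 - x1, k, j)
            n_cells = (r1 - r0) * (c1 - c0)
            for c in range(n_out):
                fmap = score_maps[(i * k + j) * n_out + c]
                total = sum(fmap[r][q] for r in range(r0, r1) for q in range(c0, c1))
                votes[c] += total / n_cells          # average pooling inside the bin
    return [v / (k * k) for v in votes]              # vote by averaging the k*k bins


# k = 2 and C = 1 give 2 * 2 * (1 + 1) = 8 maps; map m is constant with value m.
maps = [[[float(m)] * 4 for _ in range(4)] for m in range(8)]
print(ps_roi_pool(maps, (0, 0, 4, 4), k=2, num_classes=1))  # -> [3.0, 4.0]
```

Because each bin looks at a different score map, the pooled score is high only when every part of the object appears at its expected relative position, which restores the translation variance needed for detection.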

6) FPN
As shown in FIGURE 9(a), the scale invariance of object detection systems can be improved by building feature pyramids on image pyramids (featurized image pyramids) [16], [77]. However, this rapidly increases memory consumption and training time. Some techniques use only a single input scale to represent high-level semantics and to increase robustness to scale changes, as shown in FIGURE 9(b); the image pyramid is then built only at test time, which causes an inconsistency between training and testing [9], [18]. As shown in FIGURE 9(c), the in-network feature hierarchy of a deep ConvNet produces feature maps of various spatial resolutions, but the different depths introduce large semantic gaps. To avoid using low-level features, earlier methods built the pyramid starting from the middle layers or simply summed transformed feature responses, thereby missing the higher-resolution maps of the feature hierarchy.
The FPN [79] architecture consists of a bottom-up pathway, a top-down pathway, and several lateral connections, which combine low-resolution, semantically strong features with high-resolution features, as shown in FIGURE 9(d). The bottom-up pathway is the forward pass of the backbone ConvNet, which produces a feature hierarchy by downsampling the feature maps with a stride of 2. Layers that produce output maps of the same size are grouped into one network stage, and the output of the last layer of each stage is chosen as the reference set of feature maps. To build the top-down pathway, the feature maps of higher network stages are upsampled and then enhanced, through lateral connections, with the bottom-up feature maps of the same spatial size. A 1 × 1 conv layer reduces the channel dimension, and the maps are merged by element-wise addition. A 3 × 3 conv is then applied to each merged map to reduce the aliasing effect and to generate the final feature map. This process is iterated until the finest-resolution map is generated. As a result, a feature pyramid with rich semantics at all levels and scales is obtained; it can be trained end to end and achieves state-of-the-art representation without sacrificing speed or memory. Moreover, FPN is independent of the backbone CNN architecture and can be applied to different stages of object detection (such as region proposal generation) and to many other computer vision tasks (e.g., instance segmentation).
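One top-down merge step of a feature pyramid can be sketched in plain Python: the coarser map is upsampled by a factor of 2, the finer bottom-up map passes through a 1 × 1 conv, and the two are added element-wise. Nearest-neighbor upsampling, applying the 1 × 1 conv to the bottom-up map, and the function names are assumptions for this illustration, and the final 3 × 3 smoothing conv is omitted for brevity.

```python
from typing import List, Sequence

Tensor3D = List[List[List[float]]]  # indexed [channel][row][col]


def upsample2x(fmap: Tensor3D) -> Tensor3D:
    """Nearest-neighbor upsampling by a factor of 2 in both spatial dimensions."""
    return [[[row[x // 2] for x in range(2 * len(row))]
             for row in channel for _ in range(2)]
            for channel in fmap]


def conv1x1(fmap: Tensor3D, weights: Sequence[Sequence[float]]) -> Tensor3D:
    """1 x 1 conv: weights[o][i] mixes input channel i into output channel o."""
    h, w = len(fmap[0]), len(fmap[0][0])
    return [[[sum(weights[o][i] * fmap[i][y][x] for i in range(len(fmap)))
              for x in range(w)]
             for y in range(h)]
            for o in range(len(weights))]


def top_down_merge(top: Tensor3D, bottom: Tensor3D,
                   lateral_weights: Sequence[Sequence[float]]) -> Tensor3D:
    """Upsample the coarser map and add the channel-reduced finer map."""
    up = upsample2x(top)
    lat = conv1x1(bottom, lateral_weights)
    return [[[up[c][y][x] + lat[c][y][x] for x in range(len(up[c][y]))]
             for y in range(len(up[c]))]
            for c in range(len(up))]


top = [[[10]]]                       # 1 channel, 1 x 1 (coarse, semantically strong)
bottom = [[[1, 2], [3, 4]],          # 2 channels, 2 x 2 (fine resolution)
          [[1, 1], [1, 1]]]
lateral = [[1, -1]]                  # hypothetical weights reducing 2 channels to 1

print(top_down_merge(top, bottom, lateral))  # -> [[[10, 11], [12, 13]]]
```

Repeating this step from the coarsest stage downward yields one merged map per stage, each carrying the semantics of the higher stages at its own resolution.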

7) MASK R-CNN
Instance segmentation is a challenging task that combines two functions: object detection and pixel-level segmentation (semantic segmentation [106]) of each object instance in the image. Mask R-CNN [80] extends Faster R-CNN with an additional branch for pixel-to-pixel segmentation mask prediction, which runs in parallel with the two existing branches (classification and bounding box regression), as shown in FIGURE 10.
The segmentation mask branch encodes the explicit spatial layout of an object in an m × m mask. With fewer parameters, this fully convolutional architecture is more accurate than the model used in [106]. In Mask R-CNN, the multi-task loss is the sum of the classification loss, the bounding box regression loss, and the segmentation mask loss. The mask loss is defined only for the ground-truth class, while the predicted category is decided by the classification branch. RoI pooling, the core operation of Faster R-CNN, performs coarse spatial quantization when extracting features, which introduces misalignment between the RoI and the extracted features. Classification is robust to small translations and is barely affected, but the misalignment has a significant negative impact on pixel-to-pixel mask prediction. Mask R-CNN therefore uses the RoIAlign layer, which is simple and quantization-free and preserves the explicit per-pixel spatial correspondence. RoIAlign replaces the harsh quantization of RoI pooling with bilinear interpolation [107], computing the values of the input features at four regularly sampled locations in each RoI bin.
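To make the RoIAlign idea concrete, the following pure-Python sketch pools one RoI from a single-channel feature map. It is an illustration under stated assumptions rather than the code of [80]: pixel centres are assumed to lie at integer coordinates, the four sample points per bin are averaged, and the helper names `bilinear` and `roi_align` are hypothetical.

```python
import math


def bilinear(fmap, y, x):
    """Bilinearly interpolate a 2D feature map (list of rows) at real-valued (y, x)."""
    h, w = len(fmap), len(fmap[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp to the valid range
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(fmap, roi, out_size=2, samples=2):
    """Pool an RoI (y1, x1, y2, x2 in feature-map coordinates, never rounded)
    into an out_size x out_size grid, averaging samples x samples points per bin."""
    y1, x1, y2, x2 = roi
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    pooled = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            total = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    # Regularly spaced sample locations inside the bin.
                    yy = y1 + (by + (sy + 0.5) / samples) * bin_h
                    xx = x1 + (bx + (sx + 0.5) / samples) * bin_w
                    total += bilinear(fmap, yy, xx)
            row.append(total / (samples * samples))
        pooled.append(row)
    return pooled


# A 4x4 map whose value grows linearly with position, and an RoI with
# fractional coordinates that RoI pooling would have had to round.
fmap = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
pooled = roi_align(fmap, roi=(0.5, 0.5, 2.5, 2.5))
```

Because the RoI coordinates are never rounded, a sub-pixel shift of the RoI produces a correspondingly small change in the pooled values, which is the property needed for pixel-accurate masks.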
Despite its simplicity, this seemingly minor change improves mask accuracy considerably, especially under strict localization metrics. Adding the mask branch to Faster R-CNN incurs only a small computational overhead and can also benefit other object detection tasks. Mask R-CNN is an efficient and flexible framework that produces precise instance segmentation and object detection. It can easily be generalized to other tasks, such as human pose estimation [4], with minimal modification. In [108], Mask R-CNN was applied for the first time to scene instance segmentation for intelligent driving, while ensemble approaches can be applied to medical segmentation applications [109].

8) OTHER PRACTICAL WAYS TO DETECT OBJECTS
The networks described above yield promising results, but they struggle to localize small objects because of limited candidate box information and coarse feature maps. The problem becomes dramatically worse on the Microsoft COCO dataset, which contains less prototypical images and objects at various scales that require more precise localization. This issue can be tackled by gathering complementary information from multiple sources through multi-task learning [110], multi-scale representation [111], and context modeling [112].
FIGURE 10. Mask R-CNN [80]. It has two stages: the first stage generates region proposals of the object, and the second predicts the class, refines the BB, and creates the pixel-level mask. Both stages connect to the backbone structure.
Multi-task learning learns a useful representation for multiple correlated tasks from the same input [113], [114]. StuffNet [110] makes a reasonable effort to identify small objects more accurately by using Conv features trained for object segmentation and for 'stuff', i.e., amorphous categories such as ground and water. Dai et al. [106] proposed a multi-task network with three stages to address this issue: class-agnostic region proposal generation, pixel-level instance segmentation, and regional instance classification. Li et al. [115] proposed a multi-stage architecture based on region-based object detection that learns segmentation features from weakly supervised object segmentation cues.
Multi-scale representation combines the activations of multiple layers through skip connections to exploit semantic information at different spatial resolutions [79]. Yang et al. [25] used scale-dependent features by investigating scale-dependent pooling and layer-wise cascaded rejection classifiers (CRC). Cai et al. [116] proposed MS-CNN, which uses multiple scale-independent output layers to reduce the inconsistency between object sizes and receptive fields.
Context modeling is used to improve detection performance. It exploits features from or around the Region of Interest (RoI) with various support regions and resolutions to cope with occlusions and local similarities. Zhu et al. [117] proposed SegDeepM, a model that uses a Markov Random Field together with object segmentation to reduce the reliance on initial candidate boxes. Zeng et al. [118] proposed GBD-Net, a novel network based on message passing between different support regions, in which a gated function controls the message transmission.

B. REGRESSION/CLASSIFICATION OBJECT DETECTOR
The region proposal-based frameworks comprise several correlated stages, namely region proposal generation, feature extraction with a CNN, classification, and Bounding Box (BB) regression, which are usually trained separately. Even in the end-to-end Faster R-CNN, alternating training is still required to share convolution parameters between the RPN and the detection network. In real-time applications, the time spent handling these different components becomes a bottleneck. One-stage frameworks based on global regression/classification reduce this cost by mapping directly from image pixels to BB coordinates and class probabilities. In this section, some pioneering one-stage convolutional object detectors are discussed, such as YOLO (You Only Look Once) [19], Single Shot MultiBox Detector (SSD) [84], RetinaNet, RefineDet, M2Det, and DSSD.
Significant efforts have been made to model object detection as a regression/classification task. Erhan et al. [119] used CNN-based regression to produce binary masks of the test image and infer bounding boxes for the extracted objects. However, the approach has difficulty handling overlapping objects, and producing a bounding box by up-sampling is imprecise. Another CNN model for object detection is based on two parallel branches: the first branch generates a class-agnostic segmentation mask, and the second predicts the likelihood that the given patch is centered on an object. The model is efficient because the class score and the segmentation mask are obtained in a single model that shares most of its CNN operations. Yoo et al. [82] proposed an iterative end-to-end CNN model for object detection, called AttentionNet. Starting from the top-left and bottom-right corners of an image, AttentionNet predicts quantized weak directions pointing toward a target object and converges to an accurate object bounding box through an ensemble of iterative predictions. However, the model becomes quite inefficient when handling multiple object categories with its progressive two-step procedure.
Najibi et al. [83] proposed G-CNN, an iterative, proposal-free, grid-based object detector that models detection as finding a path from a fixed grid to boxes tightly surrounding the objects. Starting from a fixed multi-scale bounding box grid, G-CNN trains a regressor to move and scale the grid elements iteratively toward the target objects. However, small or heavily overlapping objects are challenging to detect using G-CNN.
FIGURE 11. The main idea behind YOLO (You Only Look Once) [19]. The architecture of YOLO has 24 Conv layers followed by two FC layers. Alternating 1 × 1 Conv layers reduce the feature space from the preceding layers. The Conv layers are pre-trained on the ImageNet classification task at half resolution, and the resolution is then doubled for detection. The first blocks are Conv layers, while the FC layers are shown as a red column in the diagram.
Figure: The architecture of SSD300 [84]. Multiple layers on top of a VGG16 backbone predict the offsets to default anchor boxes and their confidence scores. The standard FC layers of VGG16 are discarded and replaced by auxiliary convolutional layers. NMS is applied to the multi-scale refined bounding boxes to obtain the final detections.

1) YOLO: YOU ONLY LOOK ONCE
Redmon et al. [19] proposed a novel one-stage object detector that predicts bounding boxes and class probabilities directly from the topmost feature map. The idea behind YOLO is to divide the image into an S × S grid, in which each grid cell is responsible for predicting the objects whose center falls into that cell, as shown in FIGURE 11.
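Each grid cell predicts B bounding boxes and their corresponding confidence scores. The confidence score is defined as $\Pr(\text{Object}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}}$, where $\Pr(\text{Object}) \geq 0$ indicates how likely it is that an object is present in the grid cell and $\text{IOU}^{\text{truth}}_{\text{pred}}$ indicates the confidence of its prediction. Regardless of the number of bounding boxes, the conditional class probabilities $\Pr(\text{Class}_i \mid \text{Object})$ are predicted once in each grid cell. Note that only the grid cells that contain objects contribute to them. At test time, the class-specific confidence score of each box is the product of the conditional class probability and the individual box confidence prediction:
$$\Pr(\text{Class}_i \mid \text{Object}) \cdot \Pr(\text{Object}) \cdot \text{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \cdot \text{IOU}^{\text{truth}}_{\text{pred}}$$
This score takes into account both the probability that a class-specific object appears in the box and how well the predicted box fits the object. During training, the following loss function [19] is optimized:
$$
\begin{aligned}
& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i - \hat{C}_i\right)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
$$
The subscript i denotes the i-th grid cell; $(x_i, y_i)$ is the center of the bounding box relative to the grid cell, $(w_i, h_i)$ are the width and height normalized by the image size, $C_i$ is the confidence score, and $p_i(c)$ is the conditional probability of class c. Each quantity is compared with its hatted counterpart. $\mathbb{1}_{i}^{\text{obj}}$ indicates that an object exists in cell i, and $\mathbb{1}_{ij}^{\text{obj}}$ indicates that the j-th bounding box predictor in cell i is used for the prediction. $\lambda_{\text{coord}}$ and $\lambda_{\text{noobj}}$ weight the coordinate terms and the confidence term of boxes without objects (5 and 0.5 in [19]). The loss function penalizes classification error only if an object is present in the grid cell. Likewise, it penalizes bounding box coordinate error only for the predictor that is responsible for the ground truth box (i.e., the one that achieves the highest IoU of any predictor in that grid cell). The architecture of YOLO is composed of twenty-four Conv layers and two FC layers; some of the Conv layers form ensembles of inception modules with 1 × 1 reduction layers followed by 3 × 3 Conv layers.
Figure: RetinaNet uses ResNet-FPN as a backbone network to predict objects of different sizes [122]. The Conv-net feature hierarchy is arranged in a pyramidal shape. To build a feature pyramid with strong semantics at all scales, the low-resolution features are combined with high-resolution features through a top-down pathway and lateral connections.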
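The confidence definitions above can be illustrated with a small pure-Python sketch. The box format (x1, y1, x2, y2), the helper names, and the numeric values are assumptions chosen for illustration; they are not taken from [19].

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def class_scores(box_confidence, class_probs):
    """Test-time class-specific scores: Pr(Class_i | Object) * box confidence."""
    return [p * box_confidence for p in class_probs]


# Confidence as defined in the text: Pr(Object) * IOU(pred, truth).
pred, truth = (0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)
confidence = 1.0 * iou(pred, truth)  # Pr(Object) = 1 for a cell that contains an object

# Hypothetical network outputs for one box in a two-class problem.
scores = class_scores(box_confidence=0.8, class_probs=[0.9, 0.1])
```

A box that overlaps the ground truth poorly therefore receives a low confidence even when the cell certainly contains an object, and the class-specific score inherits that penalty.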
YOLO processes images in real time at 45 FPS, and a faster version (Fast YOLO) reaches 155 FPS with much better results than other real-time object detectors. Furthermore, YOLO produces fewer false positives (FP) on the background, which makes it a useful complement to Fast R-CNN. An improved version of YOLO [85] was later developed by adopting several powerful strategies, such as Batch Normalization, anchor boxes, dimension clusters, and multi-scale training. Detecting objects in real time is very challenging under limited memory and computation power. To address these challenges, Qi-Chao et al. [120] proposed Mini-YOLOv3, a lightweight network based on Darknet-53 with a multi-scale feature pyramid for multi-scale object detection.

2) YOLOv2
YOLOv2 is the second release of YOLO [19]. It provides impressive improvements in speed and precision by adopting a series of design decisions from previous work [85].
BATCH NORMALIZATION: Because SGD trains on mini-batches, normalizing activations over the entire training set is impractical. A batch normalization (BN) layer [61] instead estimates the mean and variance of each activation from the current mini-batch, so that the inputs of each layer keep a stable distribution. In YOLOv2, a BN layer is added to every convolutional layer to improve convergence and regularize the model, which increases mean AP by 2%.
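As a minimal illustration of the normalization step, the sketch below normalizes a single activation over one mini-batch and then applies a scale and shift. The function name, the default `gamma`, `beta`, and `eps` values, and the toy batch are assumptions; the running statistics used at inference time are not shown.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one activation over a mini-batch, then scale and shift it."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((v - mean) ** 2 for v in batch) / n  # biased (population) variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in batch]


# Values of the same activation for four samples in a mini-batch.
normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
```

After normalization the activation has approximately zero mean and unit variance over the mini-batch; the learnable `gamma` and `beta` let the network undo the normalization where that is useful.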
HIGH-RESOLUTION CLASSIFIER: In the original YOLO, the backbone classifier is trained at 224 × 224 and the input resolution is increased to 448 × 448 for detection. To handle this change in input resolution, YOLOv2 adds a fine-tuning stage for the classification network at the higher resolution for ten epochs on the ImageNet dataset, which increases the mAP by up to 4%.
CONVOLUTIONAL WITH ANCHOR BOXES: YOLO uses fully connected layers to generate the coordinates of the predicted boxes, whereas Faster R-CNN uses anchor boxes as references and predicts offsets to them. YOLOv2 removes the FC layers and adopts the prediction mechanism of Faster R-CNN, predicting class and objectness for every anchor box. This increases recall by 7%, while mAP decreases by 0.3%. Whereas Faster R-CNN selects the sizes and aspect ratios of its anchor boxes empirically, YOLOv2 runs k-means clustering on the bounding boxes of the training set to obtain better anchors.
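The anchor selection step can be sketched as follows. This is an illustrative pure-Python version that assumes the distance d(box, centroid) = 1 − IoU(box, centroid) on (width, height) pairs, as in the dimension clusters of [85]; the function names, the toy boxes, and the fixed random seed are hypothetical.

```python
import random


def wh_iou(wh_a, wh_b):
    """IoU of two boxes that share the same centre, given as (width, height)."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    return inter / (wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter)


def kmeans_anchors(boxes, k, iterations=50, seed=0):
    """Cluster (width, height) pairs with distance 1 - IoU and return k anchor shapes."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Smallest (1 - IoU) distance is the same as largest IoU.
            best = max(range(k), key=lambda i: wh_iou(box, centroids[i]))
            clusters[best].append(box)
        updated = []
        for i, members in enumerate(clusters):
            if members:
                updated.append((sum(b[0] for b in members) / len(members),
                                sum(b[1] for b in members) / len(members)))
            else:
                updated.append(centroids[i])  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)


# Toy training set with three small and three large ground-truth boxes.
boxes = [(10, 12), (11, 10), (9, 11), (50, 90), (55, 100), (45, 95)]
anchors = kmeans_anchors(boxes, k=2)
```

Using IoU instead of Euclidean distance keeps large boxes from dominating the clustering, since the error no longer grows with box size.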
FINE-GRAINED FEATURES & MULTI-SCALE TRAINING: High-resolution feature maps provide useful information for localizing small objects. YOLOv2 combines high-resolution features with low-resolution features by stacking adjacent features into different channels, similar to the identity mappings in ResNet. To make the network robust to varying input resolutions, a new image dimension size is randomly selected from {320, 352, ..., 608} every ten batches during training. YOLOv2 achieves 78.6% mAP at 40 FPS on high-resolution detection on PASCAL VOC 2007.
A novel backbone, DarkNet-19, is proposed for YOLOv2. It consists of 19 convolutional layers and five max-pooling layers, and it provides high accuracy while requiring few operations to process an image. YOLOv2 achieves 78.6% mAP at 40 FPS, while Faster R-CNN with a ResNet backbone achieves 76.4% mAP at 5 FPS and SSD500 achieves 76.8% mAP at 19 FPS.

3) YOLOv3
YOLOv3 improves on YOLOv2 in several ways [121]. It uses independent logistic classifiers for multi-label classification, which suits more complex datasets containing many overlapping labels. Bounding boxes are predicted on feature maps at three different scales; at each scale, the last convolutional layer predicts a 3D tensor encoding the bounding box, objectness, and class predictions. YOLOv3 also introduces a deeper and more robust feature extractor called Darknet-53, inspired by ResNet.
Experimental results on the MS COCO dataset and metrics show that YOLOv3 (AP: 33%) is three times faster than DSSD (AP: 33.2%) but less accurate than RetinaNet (AP: 40.8%). Under the older detection metric of mAP at IoU = 0.5, however, YOLOv3 achieves 57.9% mAP, compared with 53.3% for DSSD513 and 61.1% for RetinaNet [122]. Owing to its multi-scale predictions, YOLOv3 performs relatively well on small objects but comparatively worse on medium and large objects.

4) SINGLE SHOT MULTIBOX DETECTOR (SSD)
YOLO has difficulty generalizing to objects with unusual aspect ratios or configurations, and its multiple down-sampling operations produce relatively coarse features. Because of the strong spatial constraints imposed on its bounding box predictions, it also struggles to detect small objects.
To address these problems, Liu et al. [84] proposed the Single Shot MultiBox Detector (SSD), inspired by the anchors adopted in MultiBox [81], RPN [9], and multi-scale representation [111]. Instead of the fixed grid used in YOLO, SSD attaches a set of default anchor boxes with different aspect ratios and scales to specific feature maps in order to discretize the output space of bounding boxes. SSD handles objects of different sizes by combining the predictions from multiple feature maps with different resolutions. The architecture consists of a VGG16 backbone followed by several feature layers at the end of the network, which predict the offsets to default boxes of various scales and aspect ratios together with their confidence scores. The network is trained with a weighted sum of a localization loss (Smooth L1) and a confidence loss (Softmax). NMS is applied to the multi-scale refined bounding boxes to obtain the final detection result. By combining hard-negative mining, data augmentation, and a large number of carefully chosen default anchor boxes, SSD significantly outperforms Faster R-CNN on PASCAL VOC and COCO while being three times faster. SSD300, which uses 300 × 300 input images, runs at 59 frames per second and is more accurate and efficient than YOLO.
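The NMS post-processing step mentioned above can be sketched as a greedy procedure in pure Python. The box format (x1, y1, x2, y2), the threshold of 0.5, and the toy detections are assumptions for illustration and are not the settings of [84].

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping detections of one object and one separate detection.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In a multi-class detector this procedure is usually run per class, so that overlapping boxes of different classes do not suppress each other.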
However, SSD performs poorly on small objects. This can be mitigated by adopting better feature-extraction backbones such as ResNet101, by adding large-scale context through deconvolution layers with skip connections [86], and by designing better network structures such as the Dense Block [87] and the Stem Block. Much useful research has been conducted since the invention of SSD. Cheng et al. [86] proposed DSSD (Deconvolutional Single Shot Detector), an encoder-decoder hourglass structure that passes contextual information before prediction. It introduces large-scale context into object detection by combining ResNet101 (as the backbone) with deconvolution layers, which compensate for the shrinking resolution of feature maps in the CNN. Cheng et al. [123] proposed the Inception Single Shot MultiBox Detector (I-SSD), which uses a new Inception block inspired by the GoogLeNet Inception block and the deep residual network to improve accuracy without increasing model complexity or reducing speed.

5) DSSD
The Deconvolutional Single Shot MultiBox Detector (DSSD) is a modified version of SSD with two additional modules, namely the deconvolution module and the prediction module [86].
In the prediction module, a residual block is added to each prediction layer, and the outputs of the prediction layer and the residual block are combined by element-wise addition. The deconvolution module strengthens the features by increasing the resolution of the feature maps. Each deconvolution layer is followed by a prediction module that predicts objects of various sizes.
For training, the authors first pre-train the ResNet-101 backbone on the ILSVRC CLS-LOC dataset and then train the original SSD network on the detection dataset with 321 × 321 or 513 × 513 inputs. Finally, the deconvolution module is trained while the weights of the SSD module are frozen. Experimental results show that the DSSD513 model improves performance on both the PASCAL VOC and MS COCO datasets; in particular, the deconvolution and prediction modules bring a 2.2% improvement on the PASCAL VOC 2007 test set.

6) RETINANET
Lin et al. [122] proposed a unified one-stage object detector with a novel classification loss function called focal loss.
R-CNN-style detectors have two separate stages: the first generates a sparse set of region proposals, and the second classifies each candidate location. Because the first stage filters out the majority of negative locations, a two-stage detector can perform better than a one-stage detector, which has to process a dense set of candidate locations. The resulting extreme foreground-background class imbalance is the main reason why the training of one-stage detectors does not converge well. The proposed focal loss therefore down-weights the loss assigned to easy, well-classified examples.
During training, focal loss prevents the large number of easy negative examples from dominating and concentrates on the hard training examples. By coping with the imbalance between positive and negative instances while inheriting the speed of earlier one-stage detectors, RetinaNet largely eliminates the accuracy disadvantage of one-stage detectors.
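The down-weighting effect can be seen in a small pure-Python sketch of the focal loss for one binary prediction, FL(p_t) = −α_t (1 − p_t)^γ log(p_t). The defaults γ = 2 and α = 0.25 are the values reported as working best in [122]; the function names and the example probabilities are assumptions, and p is assumed to lie strictly between 0 and 1.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one binary prediction.

    p is the predicted foreground probability (0 < p < 1) and y is the label (0 or 1).
    """
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(p, y):
    """Standard binary cross-entropy, for comparison."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


# An easy background example (confidently classified) and a hard one (misclassified).
easy_negative = focal_loss(0.01, 0)
hard_negative = focal_loss(0.90, 0)
easy_ratio = easy_negative / cross_entropy(0.01, 0)
hard_ratio = hard_negative / cross_entropy(0.90, 0)
```

With these settings the easy example keeps only a tiny fraction of its cross-entropy loss, while the hard example keeps most of it, so the many easy negatives no longer dominate the total loss.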
Experimental results show that RetinaNet with ResNet-101-FPN improves AP by 6% compared with DSSD513 on the MS COCO test dataset. With ResNeXt-101-FPN, RetinaNet improves the AP by 9%. The improvements in detection precision are particularly large on small and medium objects.

7) TINY RETINANET (REAL-TIME DETECTION)
Chang et al. [124] proposed a novel one-stage detector with MobileNetV2-FPN as the backbone (feature extractor). Its architecture consists of a backbone network with a Stem block and SENet, followed by two task-specific subnets. This design improves accuracy and reduces information loss. It uses the RetinaNet focal loss as the classification loss. The model achieves 71.4% mAP and 73.8% mAP on PASCAL VOC 2007 and 2012, respectively.

8) M2Det
Zhao et al. [125] proposed a multi-level feature pyramid network (ML-FPN) that builds more effective feature pyramids to overcome the issue of scale variation across object instances. The final enhanced feature pyramid is obtained in three steps. First, multi-level features extracted from multiple layers of the backbone are fused into a base feature. Second, the base feature is fed into a block of alternating, jointly connected Thinned U-shape Modules (TUM) and Feature Fusion Modules, and the decoder layers of each TUM are taken as the features for the next step. Finally, decoder layers of equivalent scale are integrated to construct a feature pyramid consisting of multi-level features. In this way, features that are both multi-scale and multi-level are obtained. The rest of the network follows the SSD architecture to produce classification and bounding box localization results in an end-to-end manner.
M2Det, a one-stage detector with a VGG backbone, achieves 41.0% AP at 11.8 FPS with a single-scale inference strategy and 44.2% AP with a multi-scale inference strategy on the MS COCO test-dev dataset. It outperforms RetinaNet800 by 0.9% AP but is twice as slow. The multi-level feature pyramid network used in M2Det is shown in FIGURE 14.

9) REFINE-DET
RefineDet [126] consists of two inter-connected modules, the anchor refinement module and the object detection module. A transfer connection block links the two modules, transferring and enhancing features from the former to the latter to improve object prediction. The end-to-end training process involves three stages: pre-processing, detection (the two inter-connected modules), and NMS. Classic one-stage detectors such as SSD, YOLO, and RetinaNet use a one-step regression method to obtain the final results, whereas the two-step cascaded regression method of RefineDet better predicts hard-to-detect objects, especially small ones, and yields more precise object locations.

10) OBJECT AS POINTS
Although the image classification area has recently become less active, object detection research is not yet mature. In 2018, a paper entitled ''CornerNet: Detecting Objects as Paired Keypoints'' introduced a new perspective on detector training [127]. Since preparing anchor box targets is a cumbersome task, is it really necessary to use anchors at all? This new trend of ditching anchor boxes is called ''anchor-free'' object detection. CornerNet supports bounding box regression using heat maps produced for the box corners. The scheme is inspired by the Hourglass network, which uses heat maps to estimate human joints. In the objects-as-points formulation, the object center is described using a heat map, and the network regresses the height and width of the box directly from these centers.
Each pixel is treated as a grid cell. With the help of Gaussian-distributed heat maps, training converges more easily than in previous attempts to regress the bounding box size directly. Eliminating anchor boxes is also beneficial because previous detectors rely on the IoU between the ground truth and the anchor boxes to assign training targets. Several neighboring anchors may then receive a positive target for the same object, so the network learns to predict multiple boxes for the same object, and non-maximum suppression (a greedy algorithm) is needed to fix this issue. Without anchors, there is only one peak per object in the heat map. Since NMS is sometimes difficult to implement and slow to run, getting rid of it is a big advantage for applications that operate in a variety of environments with limited resources.

11) EFFICIENT-DET
EfficientDet is an exciting development in the object detection area [128]. This line of research has shown that the FPN structure is a powerful technique for improving detection performance at various scales, and different flavours of FPN are used in YOLOv3 and RetinaNet before regression and classification are applied. NAS-FPN and PANet showed that the plain FPN structure can benefit from further design optimization. EfficientDet proposes a new FPN structure called BiFPN, which allows features to be aggregated in both directions by adding cross-layer connections. To improve efficiency, it removes the less useful parts of the original PANet architecture. It also introduces weighted feature fusion, which adds learnable weights to the feature aggregation to improve on FPN. In addition, it introduces a principled way to scale an object detection network. It reaches the same accuracy as YOLOv3 with far fewer FLOPs.
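The weighted feature fusion can be illustrated with the fast normalized fusion rule described in [128], O = Σ_i w_i · I_i / (ε + Σ_j w_j), where a ReLU keeps each learnable weight w_i non-negative. The sketch below is pure Python on single-channel 2D maps; the function name and the example maps and weights are hypothetical, and the resizing and convolution that surround the fusion in BiFPN are not shown.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped 2D feature maps with non-negative, normalized weights."""
    w = [max(0.0, wi) for wi in weights]  # ReLU keeps every weight non-negative
    norm = eps + sum(w)
    height, width = len(features[0]), len(features[0][0])
    return [
        [sum(wi * f[y][x] for wi, f in zip(w, features)) / norm for x in range(width)]
        for y in range(height)
    ]


# Two input features of the same resolution with hypothetical learned weights.
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[3.0, 2.0], [1.0, 0.0]]
equal = fast_normalized_fusion([a, b], weights=[1.0, 1.0])
only_a = fast_normalized_fusion([a, b], weights=[1.0, -5.0])
```

A negative raw weight is clipped to zero, so the corresponding input is simply ignored rather than subtracted, and the normalization keeps the output in the same range as the inputs.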
LATEST DETECTORS: Relation Network for Object Detection: Hu et al. [129] proposed the relation network for object detection, which includes an adapted attention module that models the interaction between different objects in an image using both their appearance features and their geometry information. The relation module is inserted into the head of the detector, before the classifier and regressor, to produce enhanced features for more accurate classification and localization. It can also replace the NMS post-processing step and achieves higher accuracy than NMS. Adding the relation module to Faster R-CNN, FPN, and DCN improves their accuracy on the COCO test-dev dataset by 0.2%, 0.6%, and 0.2%, respectively.
DCNv2: Dai et al. [130] proposed the deformable convolutional network (DCN), which learns to adapt to geometric variations that are reflected in the effective spatial support region of the target. Regular ConvNets can only attend to features within a fixed square region determined by the kernel, so the receptive field may not properly cover all the pixels of a target object. To overcome this issue, deformable ConvNets produce a deformable kernel whose offsets from the initial fixed-size convolution kernel are learned by the network. Deformable RoI pooling likewise helps to localize objects of different shapes. Deformable ConvNets achieve 4% higher accuracy than three plain ConvNets and reach 37.5% mAP (mean Average Precision) under the strict COCO evaluation criteria. DCNv2 uses more deformable convolutional layers than DCNv1, and each deformable layer is modulated by a learnable scalar, which enhances both the deformable effect and the accuracy. Feature mimicking further improves detection accuracy by adding a feature-mimic loss that makes the per-RoI features of DCN similar to the good features extracted from cropped images. Experimental results show that DCNv2 [131] with strong backbones achieves a 5% improvement in mAP over DCNv1 on the COCO 2017 test-dev dataset under the strict MS COCO evaluation criteria.

NAS-FPN:
NAS-FPN [132] is a feature pyramid architecture discovered by authors from Google Brain using neural architecture search; it combines top-down and bottom-up connections to fuse features of different scales. During the search phase, the FPN architecture is repeated N times and concatenated into one large architecture, in which high-level feature layers pick features from arbitrary levels to imitate. All of the highly efficient architectures connect a high-resolution input feature map to an output layer in order to generate the high-resolution features needed to identify small objects. Adopting a high-capacity architecture, stacking more pyramid networks, and adding feature dimensions all significantly increase detection accuracy. Experimental results show that, with a ResNet-50 backbone and 256 feature dimensions, NAS-FPN improves mean average precision on the COCO test-dev dataset by 2.9% over the original FPN. With a strong backbone such as AmoebaNet and seven stacked FPNs with 384 feature dimensions, NAS-FPN achieves 48.0% mAP on the COCO test-dev dataset.

C. BACKBONE CNN ARCHITECTURE
CNN models such as AlexNet, ZFNet, VGGNet, GoogLeNet, the Inception series, ResNet, DenseNet, and SENet are used as backbones in detection frameworks; they are summarized in Table 3. A survey of recent advances in CNN architectures can be found in Gu et al. [133]. The current trend suggests that increasing depth improves the representational power of CNN architectures: AlexNet has eight layers and VGGNet16 [63] has 16 layers, whereas more recent networks such as ResNet and DenseNet exceed 100 layers. Architectures such as AlexNet [59], ZFNet [134], and VGGNet have a large number of parameters despite being only a few layers deep, because a large fraction of the parameters comes from the fully connected layers. Newer architectures such as Inception, ResNet, and DenseNet have great depth but fewer parameters, since they avoid FC layers. Compared with AlexNet, ZFNet, or VGGNet, GoogLeNet dramatically reduces the number of parameters through the carefully designed topology of its Inception modules [62]. Similarly, ResNet won the ILSVRC2015 classification task by using skip connections to train very deep networks with hundreds of layers. InceptionResNets [135] combine Inception networks with shortcut connections, which significantly speeds up network training. Huang et al. [136] extended ResNet with DenseNet, which is built from dense blocks that connect each layer to every other layer in a feed-forward fashion and provides compelling benefits such as feature reuse, parameter efficiency, and implicit deep supervision. Recently, Hu et al. [65] proposed Squeeze-and-Excitation (SE) blocks, which adaptively recalibrate channel-wise feature responses by explicitly modeling the interdependencies between convolutional feature channels; they enhance the performance of existing deep architectures at minimal additional computational cost and won the ILSVRC2017 classification task. Research on CNN architectures remains active, with emerging networks such as Hourglass [127], Dilated Residual Networks [137], Xception [138], DetNet [139], Dual Path Network (DPN) [140], FishNet [141], CBNet [142], DetNAS [143], and GLoRe [144].

D. DATASETS AND PERFORMANCE EVALUATION

1) DATASET
Recent advances in deep learning for computer vision have allowed object detection applications to evolve rapidly. Besides delivering significant improvements in performance, current approaches depend heavily on large-scale image datasets. Modern techniques use end-to-end pipelines to improve performance in real-time applications. Data is therefore of significant importance, whether it is used to measure and compare the performance of competing algorithms or to tackle increasingly complex and challenging problems. The availability of large amounts of annotated data is the main reason behind the tremendous success of deep learning techniques in object detection. The Internet plays a vital role in building comprehensive datasets, since it provides access to a wide range of images covering the vastness and diversity of objects. Five datasets are very popular in the field of generic object detection, namely PASCAL VOC 2007 [145], PASCAL VOC 2012 [103], ImageNet [56], Microsoft COCO [104], and OpenImages [146]. Some selected images from these benchmark datasets are shown in FIGURE 15, and Table 4 summarizes their specifications and attributes. Creating large annotated datasets involves three steps, typically supported by crowdsourcing strategies: first, define the set of target object categories; second, collect from the Internet a diverse set of images that represent the selected categories; and finally, annotate the collected images.
Each dataset is associated with its own object detection challenge, which comprises a publicly available annotated dataset, an annual competition, standardized evaluation software, and a corresponding workshop. Statistics for these datasets, such as the total number of images and the sizes of the training, validation, and test sets, are given in Table 5. [...] increased every year. Previous images have also been retained for test results, which are compared by year.
The popularity of PASCAL VOC is slowly waning due to the availability of other improved datasets.
Russakovsky et al. proposed a dataset derived from ImageNet [56], increasing the number of classes and images and scaling up the training and evaluation standards of object detection tasks relative to PASCAL VOC. The ImageNet1000 subset contains over 1.2 million images in 1000 object categories and provides a standardized benchmark for the ILSVRC image classification challenge.

c: MICROSOFT COCO
Lin et al. [104] proposed the MS COCO dataset, built from familiar objects in complex, natural everyday scenes to support richer image understanding. Objects are labeled with full instance segmentations to allow precise detector evaluation. The dataset contains about three hundred thousand fully segmented images, with an average of seven object instances per image across 80 categories. Several properties make MS COCO more challenging than PASCAL VOC 2012: fewer iconic objects, objects amid clutter or heavy occlusion, a wide range of scales with a high percentage of small objects [149], and an evaluation metric that demands accurate object localization. Detection performance is evaluated using AP at different IoU thresholds and for different object sizes. The MS COCO object detection challenge comprises two detection tasks, one using bounding box output and the other using instance segmentation output. MS COCO has now become the standard benchmark for object detection, as ImageNet was in its time.

d: THE OPEN IMAGE CHALLENGE OBJECT DETECTION(OICOD)
Kuznetsova et al. [150] proposed the largest publicly available dataset, derived from Open Images V4 (the current release, from 2019, is V5). OICOD provides a significant increase in the number of classes, images, bounding boxes, and instance segmentation masks, and it uses an annotation process that differs substantially from earlier object detection datasets such as ILSVRC and MS COCO. Open Images V4 applies classifiers to the images and sends only the labels with sufficiently high scores for human verification, whereas ILSVRC and MS COCO are exhaustively annotated. In OICOD, object instances are annotated only for human-confirmed positive labels.
Average Precision (AP) is the performance measure computed for each category, derived from precision and recall, while the mean Average Precision (mAP) is the average of AP over all object categories. Details of the performance metrics can be found in [148], [154], [155]. For a test image $I$, the standard output of a detector is a set of predicted detections $\{(b_j, c_j, p_j)\}_j$ indexed by $j$, where $b_j$ is the predicted bounding box (BB), $c_j$ is the predicted category, and $p_j$ is the confidence score. A predicted detection $(b, c, p)$ is considered a True Positive (TP) if the predicted label $c$ equals the ground truth label $c_g$ and the overlap ratio, i.e., the Intersection over Union (IoU) between the predicted BB $b$ and the ground truth BB $b_g$, is not smaller than a predefined threshold $\varepsilon$:
$$\mathrm{IoU}(b, b_g) = \frac{\mathrm{area}(b \cap b_g)}{\mathrm{area}(b \cup b_g)} \geq \varepsilon,$$
where $\cap$ and $\cup$ denote intersection and union, respectively. Otherwise, the detection is counted as a False Positive (FP); a typical value of $\varepsilon$ is 0.5. The confidence $p$ is compared with a threshold $\beta$ to decide whether the predicted class label $c$ is accepted.
Comparative results of various detection algorithms on PASCAL VOC show that a more robust backbone network produces better predictions (compare R-CNN with VGG16 or with AlexNet, and SPP-net with ZF-Net [134]). Object detection performance improved with the introduction of the end-to-end multi-task architecture (FRCN) [78], the SPP layer (SPP-Net) [79], and the RPN (Faster R-CNN). The importance of data augmentation is increasing with the demand for robust multi-level features in deep learning-based models. Other factors also have a substantial impact on detector performance, such as hard negative sample mining (e.g., OHEM), multi-scale representation (e.g., ION), contextual information (e.g., StuffNet, HyperNet), modified classification networks (e.g., NOC), and multi-region and multi-scale feature extraction (e.g., MR-CNN).
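The IoU-based matching rule and the AP measure described above can be sketched in a few lines of Python. This is a simplified illustration for a single category and a single image, assuming boxes in `(x_min, y_min, x_max, y_max)` form, greedy matching of each detection to the best still-unmatched ground truth box, and a non-interpolated AP; the benchmark toolkits use interpolated variants, so the function names and details here are assumptions rather than any official protocol.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truths, eps=0.5):
    """Non-interpolated AP for one category.

    detections: list of (box, confidence); ground_truths: list of boxes.
    """
    matched = set()
    flags = []  # True for TP, False for FP, in order of decreasing confidence
    for box, _ in sorted(detections, key=lambda d: d[1], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truths):
            overlap = iou(box, gt)
            if idx not in matched and overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= eps:
            matched.add(best_idx)  # each ground truth box is matched once
            flags.append(True)
        else:
            flags.append(False)
    # Average the precision measured at the rank of every true positive.
    ap, tp = 0.0, 0
    for rank, is_tp in enumerate(flags, start=1):
        if is_tp:
            tp += 1
            ap += tp / rank
    return ap / len(ground_truths) if ground_truths else 0.0
```

Averaging `average_precision` over all categories would give mAP; raising `eps` makes the same detections harder to count as true positives.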
YOLO performs very poorly at localizing objects at high IoU on PASCAL VOC 2012. Its localization errors can be corrected with complementary information from Fast R-CNN (YOLO+FRCN) and with strategies such as batch normalization, anchor boxes, and fine-grained features (YOLOv2). Since the introduction of MS COCO, special consideration has been given to bounding box localization accuracy rather than to a single fixed IoU threshold.
This dataset is more challenging than PASCAL VOC 2012 because of its less iconic objects, diverse object scales, and stricter requirements on object localization. For MS COCO, detection performance is evaluated using average precision at different IoU thresholds. Detection performance and localization can be improved by multi-scale training and testing, supported by complementary information from related tasks and additional information at different resolutions (R-FCN). Some algorithms, such as DSSD and FPN, build improved feature pyramids to obtain multi-scale representations. Regression/classification-based detectors (such as SSD and YOLO) perform worse than region proposal-based methods such as Faster R-CNN and R-FCN because of their larger localization errors. Context is very beneficial for identifying small objects, as it provides additional information from nearby objects and surroundings (multi-path and GBD-Net).
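Evaluation at several IoU thresholds and object sizes can be sketched as follows. The threshold range (0.50 to 0.95 in steps of 0.05) and the size boundaries (32² and 96² pixels) follow the commonly used COCO evaluation convention and are not stated in this section; box area is used here as a simplification of the segmentation area, and `ap_at_threshold` is a hypothetical callable that returns AP for a given IoU threshold.

```python
def coco_iou_thresholds():
    """IoU thresholds 0.50, 0.55, ..., 0.95 (assumed COCO convention)."""
    return [round(0.50 + 0.05 * i, 2) for i in range(10)]


def size_bucket(box):
    """Size category of a box (x_min, y_min, x_max, y_max) by pixel area."""
    area = (box[2] - box[0]) * (box[3] - box[1])
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


def average_over_thresholds(ap_at_threshold):
    """Mean of AP values computed at every IoU threshold."""
    thresholds = coco_iou_thresholds()
    return sum(ap_at_threshold(t) for t in thresholds) / len(thresholds)
```

A detector whose boxes are only loosely aligned scores well at the 0.50 threshold but loses AP at the stricter thresholds, which is why this averaged metric rewards accurate localization.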
MS COCO contains many non-standard objects that reduce detector performance. However, performance can be improved with robust backbone models (e.g., ResNeXt [156]) and other useful strategies such as multi-task learning [80], [130]. The performance evaluation metrics for PASCAL VOC, ILSVRC, and MS COCO object detection are summarized in Table 6, together with the metric modifications for the Open Images Challenge proposed by Kuznetsova et al. [150]. Table 9 shows the time analysis of various object detection algorithms on an NVIDIA Titan X, except for selective search, which was run on an Intel i7-6700K CPU.

3) CHALLENGES OF GENERIC OBJECT DETECTION
High accuracy and high efficiency are the two main competing objectives of an ideal generic object detector. High efficiency requires that the entire detection task run in real time with acceptable memory and storage requirements, whereas high-quality detection requires accurate recognition and localization of objects in images or video frames.
A wide range of object categories and intra-class variations are the two main challenges for detection accuracy. Intra-class variations are of two types: intrinsic factors and imaging conditions. Intrinsic factors are the possible variations among object instances of a particular category in terms of material, texture, color, shape, and size, together with different poses and non-rigid deformations. Variations in imaging conditions arise from unconstrained environments, including weather, lighting, camera models, physical locations, backgrounds, occlusion, and viewing distance. These factors cause significant variations in object appearance, such as scale, clutter, pose, illumination, blur, occlusion, shading, and motion. Poor resolution, noise corruption, digitization artifacts, and filtering distortions further increase the difficulty of object detection, as shown in FIGURE 17. In practice, current object detectors focus primarily on structured object categories, such as the twenty categories in PASCAL VOC [148], the two hundred in ILSVRC [154], and the ninety-one in MS COCO [104].
The demand for visual data analysis grows with the prevalence of mobile and wearable devices and social media networks. Because these devices have limited storage capacity and computational power, efficient object detection becomes critical on them. The efficiency challenge grows with the wide range of object categories, locations, and scales that may occur within a single image. An object detector should be able to handle high data rates, previously unseen objects, and unknown situations.
Manual annotation becomes infeasible as the number of images and categories grows, which motivates weakly supervised strategies.

VI. GENERIC OBJECT DETECTION IN DIFFERENT FIELDS
Humans rely on AI, and computer vision in particular, to perform many daily tasks in areas such as security, the military, transportation, medicine, and daily life. The methods and techniques used in these fields are described below.

A. OBJECT DETECTION IN SURVEILLANCE
Pedestrian detection, face detection, fraud detection, anomaly detection, and fingerprint detection are some of the well-known applications in surveillance.
FACE DETECTION aims to detect human faces in an image or video, but variations in illumination, pose, and resolution make it difficult. Many notable innovations have appeared in the past few years; for example, the approach in [157] performs multiple tasks (facial landmark localization together with detection and head pose estimation) simultaneously without degrading the performance of any individual task. A novel approach named the Wasserstein convolutional neural network (WCNN) learns invariant features between visible and near-infrared face images [158]. The architecture comprises low-level layers (trained on a broad set of visible-spectrum face images) and high-level layers (divided into three parts: the NIR, VIS, and hybrid NIR-VIS layers). The design of an appropriate loss function can further enhance the discriminative power of DCNN-based large-scale face recognition; in this respect, cosine-based softmax losses [159]-[161] provide better results in deep learning-based face recognition.
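As an illustration of how a cosine-based softmax loss operates, the sketch below normalizes the feature and the class weights, subtracts a margin from the cosine of the target class, rescales, and applies cross-entropy. It is one common margin-based form chosen here as an assumption, not necessarily the exact formulation of [159]-[161]; the `scale` and `margin` defaults are hypothetical.

```python
import math


def cosine_margin_loss(feature, class_weights, target, scale=30.0, margin=0.35):
    """Cross-entropy over scaled cosine logits with a margin on the target class."""

    def normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector]

    unit_feature = normalize(feature)
    # Cosine similarity between the feature and every class weight vector.
    cosines = [
        sum(a * b for a, b in zip(unit_feature, normalize(weights)))
        for weights in class_weights
    ]
    # Penalize only the target class so it must win by at least the margin.
    logits = [
        scale * (cos - margin if k == target else cos)
        for k, cos in enumerate(cosines)
    ]
    # Numerically stable log-sum-exp for the softmax denominator.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[target]
```

Because the margin lowers the target logit during training, features of the same identity are pushed closer together in angle than a plain softmax would require.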
Highly discriminative features were obtained for face recognition using an Additive Angular Margin Loss (ArcFace) [162]. Gue et al. [163] proposed a fuzzy sparse auto-encoder, an innovative technique for face recognition from a single image per person.
[Figure caption: Imaging conditions such as lighting, camera models, weather, occlusion, physical location, and viewing distance significantly change the appearance of the same class (a-h). Variations in pose, blur, motion, shading, clutter, occlusion, and scale add further challenges. Intra-class variation is shown in (i), and inter-class examples are shown in (j). Most pictures are from ImageNet [154] and MS COCO [104].]
PEDESTRIAN DETECTION aims to detect pedestrians in natural scenes. A benchmark dataset for pedestrian detection is the EuroCity Persons dataset, which includes pedestrians, cyclists, and other riders in urban traffic scenes [164]. Complexity-aware cascaded pedestrian detectors use a cascaded approach for real-time pedestrian detection [165], [166]. For more details, please refer to the survey on deep learning-based pedestrian detection [167].
ANOMALY DETECTION is an instrumental tool in fraud detection, climate analysis, and healthcare monitoring. Many anomaly detection techniques analyze the data point by point [168]-[170]. An unsupervised method named ''Maximally Divergent Intervals'' (MDI) instead searches for contiguous intervals of time and regions in space [171].

B. OBJECT DETECTION IN MILITARY
Remote sensing object detection, topographic survey, and flyer detection are some of the applications in the military field. REMOTE SENSING OBJECT DETECTION is the challenging task of detecting objects in remote sensing images or videos. Existing object detection techniques are slow on remote sensing data because the inputs are enormous while the targets are small, which makes them impractical and the targets hard to detect.
Another hurdle is the extensive and complex background, which leads to severe false detections. Researchers adopt data fusion approaches to address these issues. These strategies focus mainly on small targets, for which the lack of information and small deviations cause significant inaccuracy. Remote sensing images have different characteristics from natural images; thus, transferring robust architectures such as Faster R-CNN, FCN, SSD, and YOLO to this new domain does not work well for remote sensing detection. Designing detectors specific to remote sensing datasets remains an active research topic in this domain. Zhang et al. [172] proposed a rotation- and scaling-robust structure to address the lack of rotation and scaling invariance in remote sensing image (RSI) object detection. Cheng et al. [173] proposed a CNN-based RSI object detection model with a rotation-invariant layer to deal with rotation problems. Another effective method learns a rotation-invariant and Fisher discriminative CNN model to address object rotation, within-class variability, and between-class similarity [174].
Furthermore, [175] uses rotation-invariant and Fisher discrimination regularizers to optimize a new objective function and improve the performance of the existing framework. Shahzad et al. [176] proposed a novel detection model based on automatic labeling and recurrent neural networks. Real-time remote sensing methods are proposed in [177]. Long et al. [178] proposed a framework that concentrates on locating objects automatically and accurately. Li et al. [179] proposed a novel framework based on an RPN (to handle the multi-scale and multi-angle features of geospatial objects) and a local-contextual feature fusion network (to address the appearance ambiguity problem).

C. OBJECT DETECTION IN TRANSPORTATION
Deep learning greatly assists humans in many transportation applications, such as autonomous driving, traffic sign recognition, and license plate recognition.
LICENSE PLATE RECOGNITION is gaining prominence with the growth of the automobile industry, with uses in crime tracking, residential access control, and traffic violation tracking. License plate recognition models have become more robust and stable through the use of edge information, sliding concentric windows, connected component analysis, texture features, and mathematical morphology. In addition, many deep learning methods for license plate recognition provide useful assistance in daily life [188], [189].
An AUTONOMOUS DRIVING vehicle needs accurate estimates of its surroundings to operate reliably. It is therefore beneficial to use deep learning methods to transform sensory data into semantic information. 3D object detection methods, which provide information about object size and location, fall into three groups: monocular, point-cloud, and fusion-based. Monocular image-based methods predict 2D bounding boxes and then extrapolate them to 3D, which limits localization accuracy. Point-cloud-based methods either project the point cloud into a 2D image or generate a 3D representation directly in a structured form, which is time-consuming.
Fusion-based techniques fuse front-view images and point clouds to produce robust detections. Lu et al. [190] proposed a novel architecture based on 3D convolutions and RNNs that achieves centimeter-level localization accuracy in different real-world driving scenarios. 3D car instance understanding and sensor fusion techniques are also notable in autonomous driving [191], [192]. For further study, please refer to the recently published survey [193].
TRAFFIC SIGN RECOGNITION is an essential part of autonomous driving. Accurate real-time traffic sign recognition assists driving by acquiring temporal and spatial information about potential signs. The literature contains several effective deep learning methods, such as [194]-[197].

D. OBJECT DETECTION IN MEDICAL
Medical image detection (X-rays, CT images, MRI, fundus images), tumor detection, dental disease detection, skin disease detection, and healthcare monitoring are some of the active areas of medicine to which deep learning is contributing. Novel viruses are a significant issue for global public health, and technology can assist medical practitioners in identifying possible causes. This is particularly beneficial for viral diseases like COVID-19, which are easily transmitted and have asymptomatic infectious periods. Hemdan et al. [198] used seven different CNN architectures in COVIDx-Net, including VGG19 and Google MobileNet v2. Each model analyzes an X-ray image to classify the patient's status as infected or not.
A COMPUTER-AIDED DIAGNOSIS (CAD) SYSTEM can assist doctors in classifying and diagnosing different types of cancer. The main steps of a CAD framework include image segmentation, feature extraction, classification, and object detection. Because of data privacy and scarcity, the data distribution usually differs between the source and target domains; therefore, medical image detection needs a domain adaptation framework [199].
Deep learning has shown great promise in the medical field, which has abundant data in the form of images and numerical records. Li et al. [200] proposed an attention mechanism within CNN frameworks for glaucoma detection and built a large-scale attention-based glaucoma dataset. A framework for detecting DNA modifications (DeepMod) was built with a bidirectional RNN and long short-term memory (LSTM) [201]. Schubert et al. [202] proposed cellular morphology neural networks (CMNs) for automated neuron reconstruction and synapse detection. For further detail, please refer to the surveys [203], [204].

E. OBJECT DETECTION IN DAILY LIFE
Event detection, pattern detection, intelligent homes, commodity detection, image caption generation, rain/shadow detection, and species identification are some of the applications in daily life. Goldman et al. [205] proposed a novel object detector for densely packed scenes such as retail shelf displays and built the SKU-110K dataset to address this challenge.
EVENT DETECTION aims to discover real-world events on the Internet, such as festivals, talks, protests, natural disasters, and elections. Multi-domain event detection (MED) provides full details of the events. Yang et al. [206] proposed an event detection framework for detecting real-world events from multi-domain data. Wang et al. [207] designed a novel event detector that uses online social interaction features to construct affinity graphs. Schinas et al. [208] used 100 million photos and videos to develop a multimodal graph-based system. For detailed information, please refer to the surveys on event detection [209], [210].
PATTERN DETECTION faces challenges such as pose variation, varying illumination, scene occlusion, and sensor noise. Research on repeated pattern or periodic structure detection provides stable baselines for both 2D images [211], [212] and 3D point clouds [213]-[216].
IMAGE CAPTION GENERATION is a process in which a computer understands the semantics of an image and automatically generates a natural language caption for it. The process involves both computer vision and natural language processing, which are difficult to integrate. Multimodal embedding [217], encoder-decoder frameworks [218], [219], attention mechanisms [220], [221], and reinforcement learning [222], [223] are widely used to address this issue. Yao et al. [224] proposed a novel framework using a Graph Convolutional Network and LSTM (GCN-LSTM) to explore the connections between objects in the spatial and semantic domains. For detailed information, please refer to the survey on image caption generation [225].
Rain detection, shadow detection, and species identification are further applications in which deep learning performs well. Yang et al. [226] proposed a novel joint rain detector to detect raindrops in a single image. Zheng et al. [227] proposed a Distraction-aware Shadow Detection Network (DSDNet) that explicitly learns and integrates the semantics of visual distraction regions. Accurate identification of species is the basis of taxonomic research. Handegard et al. [228] used a deep learning model to automatically classify the species present in an image.

VII. DISCUSSION
The following are some of the vital factors in generic object detection:

A. REGION-BASED VERSUS CLASSIFICATION/REGRESSION-BASED FRAMEWORKS
A significant drawback of region-based detectors is their high computational cost. Still, their structure is more flexible and better suited to region-based classification than that of the unified framework.
One-stage detectors (YOLO and SSD) require less time than two-stage frameworks because they use lightweight backbone networks, avoid pre-processing algorithms, predict from fewer candidate regions, and use a fully convolutional classification subnetwork. The feature extractor (backbone network) is the most time-consuming step in object detection [9], [127].
A fully convolutional pipeline, sliding windows over different layers of the backbone together with the combination of their information, and the exploration of complementary data from other correlated tasks are some of the crucial choices in designing a better detection framework.
The two-stage framework is the future of object detection in terms of the speed-accuracy trade-off, given the success of cascades for object detection [230]-[233] and for instance segmentation on COCO [234].

B. BACKBONE NETWORKS
The backbone network plays a vital role in the performance of object detection. Deeper backbone networks such as ResNet [65], ResNeXt [156], Inception-ResNet [135], and Darknet53 generally require high computational power and large amounts of training data to perform well. Some backbone networks, such as MobileNet [235], are designed specifically for speed rather than accuracy.

C. ROBUSTNESS IN OBJECT RECOGNITION DATASET
Real-world images vary widely in brightness, capture angle, blur, deformation, background clutter, occlusion, resolution, noise, and camera distortion, which makes object detection more challenging. Object size (scale) is particularly significant. Different techniques are used to handle these variations and the challenge of small object detection, such as image pyramids, which enlarge small images and shrink large ones. Furthermore, small object detection can benefit from predictions on independent convolutional feature maps (SSD [84]), dilated convolutions [139], [236], anchors with different scales and aspect ratios (with more parameters), and up-scaling [237], [238]. Super-resolution techniques still do not play an essential role in improving the detection accuracy of small objects compared to large ones. Besides that, some applications, such as autonomous driving, require only a general identification of the existence of small objects rather than their localization over a vast region.
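The use of anchors with different scales and aspect ratios can be sketched as follows. This is a minimal illustration under the assumption that each anchor keeps an area of `scale ** 2` while its width-to-height ratio changes; the function name and the example values are hypothetical and are not taken from any particular detector.

```python
import math


def make_anchors(center_x, center_y, scales, aspect_ratios):
    """Anchor boxes (x_min, y_min, x_max, y_max) centred on one location.

    Each anchor has area scale ** 2 and width / height equal to the ratio.
    """
    anchors = []
    for scale in scales:
        for ratio in aspect_ratios:
            width = scale * math.sqrt(ratio)
            height = scale / math.sqrt(ratio)
            anchors.append((
                center_x - width / 2.0,
                center_y - height / 2.0,
                center_x + width / 2.0,
                center_y + height / 2.0,
            ))
    return anchors
```

For example, `make_anchors(100.0, 100.0, [32.0, 64.0], [0.5, 1.0, 2.0])` yields six boxes per location; adding smaller scales increases coverage of small objects at the cost of more candidate boxes to score.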
A spatial transformer network can be used to handle occlusion, deformation, and similar factors: a deformation field is obtained by regression, and the feature map is warped according to it [130]. A deformable part-based model [239] takes spatial constraints into account when finding the maximum response to a part filter [97], [100], [240]. Little research has addressed rotation invariance and occlusion in generic object detection, because the well-known benchmark datasets (PASCAL VOC, COCO, and ImageNet) contain little such variation. In contrast, occlusion handling has been studied intensively in face detection.

D. DETECTION PROPOSAL
Detection proposals significantly reduce the search space. However, future detection proposals still need better localization accuracy, recall, speed, and repeatability [241]. RPN is the dominant CNN-based method for generating detection proposals. It is recommended that future detection proposal methods be evaluated in terms of object detection performance rather than by assessing the proposals alone.

E. OTHER FACTORS
Other factors, such as novel training strategies, data augmentation, different combinations of backbone networks, and multiple detection frameworks, can affect the quality of object detection. Some real-world challenges remain unresolved, such as object detection in other modalities, including 3D point clouds, video, remotely sensed imagery, and RGB-D images. Even with advances in technology, object detection still yields unsatisfactory results under some constraints, such as poorly labeled data or annotations with few bounding boxes, unseen object categories, wearable devices, and the need to adapt to environmental changes in order to detect objects in the open world. The future research directions on these challenges are as follows:
1. In general, object detection algorithms cannot detect objects outside the training dataset. The ultimate goal is to develop an object identification framework capable of localizing and recognizing thousands of novel object categories in open-world scenes with accuracy and efficiency [242], [243]. Larger-scale datasets with significantly more classes need to be developed, as existing benchmark datasets cover only a few hundred object categories, far fewer than humans can recognize.
2. The success of generic object detection depends mainly on the detection framework. A unified framework is more straightforward and faster, while a region-based detector is more accurate and efficient.
3. Network acceleration [244]-[248] and the design of compact, lightweight networks for object detection are new and growing research areas [235], [249]-[253]. Deeper CNNs require more computational power, parameters, training data, and GPUs.
4. Segmenting object instances at the pixel level requires a richer and more detailed understanding of image content [80], [104], [254].
5. Currently, most state-of-the-art object detectors are fully supervised models that lack scalability due to the absence of fully annotated datasets. The data annotation process is laborious and becomes harder as the dataset grows [104], [147], [154].
6. The success of object detectors depends largely on very large annotated training datasets. In contrast, humans can learn visual concepts very quickly from a few examples and often generalize well [242], [255], [256]. Few-shot and zero-shot object detection is therefore a very appealing task [242], [257]-[261].
7. Emerging applications such as autonomous vehicles, robotics, and unmanned aerial vehicles [262]-[264], video [265], [266], and point clouds [267], [268] are some of the challenging settings in which object detection can play a distinct role.
The field of generic object detection still requires substantial research effort. However, the last five years have been a golden period for object detection, and we are optimistic about future developments and opportunities in the field.

VIII. FUTURE DIRECTION AND CURRENT TRENDS
A. HYBRID APPROACH
The two-stage detector is time-consuming and inefficient because it uses a dense process to obtain as many reference boxes as possible. The challenge is to eliminate this redundancy while maintaining high accuracy. The one-stage detector, on the other hand, is fast and therefore well suited to real-time applications, but its lower accuracy remains a barrier for applications with high precision requirements. The two approaches need to be combined to take advantage of both, but how to bring them together is a major challenge.

B. OBJECT DETECTION IN VIDEO (DYNAMIC TARGETS)
It is challenging to achieve excellent video object detection performance in real-life and remote scenes because of video defocus, motion target ambiguity, motion blur, small objects, occlusion, truncation, and intense target movements. Future research can focus on more complex source data and dynamic targets.

C. EFFICIENT POST-PROCESSING METHODS
Post-processing is the step that produces the final results in the three-stage (one-stage detector) or four-stage (two-stage detector) detection procedure. The accuracy score of a detector is computed by sending only the highest-scoring prediction for each object to the metric program. Post-processing methods such as NMS and its variants can eliminate well-located objects with high classification confidence. Experimenting with more efficient and accurate post-processing methods is another direction for researchers.
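The NMS post-processing step mentioned above can be sketched as a greedy procedure. This is a minimal illustration of standard greedy NMS, assuming boxes in `(x_min, y_min, x_max, y_max)` form and a hypothetical default threshold of 0.5; it is not the variant used by any specific detector.

```python
def _iou(box_a, box_b):
    """Intersection over Union of two boxes (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept by greedy non-maximum suppression."""
    # Visit boxes from the highest to the lowest confidence score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too strongly.
        order = [
            i for i in order
            if _iou(boxes[best], boxes[i]) <= iou_threshold
        ]
    return keep
```

The sketch also shows the weakness noted above: a box is discarded purely because it overlaps a higher-scoring neighbour, even when it is itself well located and confidently classified, as can happen with two adjacent objects.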

D. WEAKLY SUPERVISED OBJECT DETECTION METHODS
Because weakly labeled images are more readily available, it is more efficient for network training to replace a significant portion of fully annotated images with images that have only class labels and no object bounding boxes. In addition, weakly supervised object detection (WSOD) uses a limited number of fully annotated images to detect objects in images that are not fully annotated. The availability of large amounts of non-annotated data therefore directs attention to the development of WSOD methods.

E. OBJECT DETECTION IN MULTI-DOMAIN
A domain-specific detector always performs well in its own domain (dataset). There is therefore a need for a universal, multi-domain detector that works on images from various domains without prior knowledge of the new domain. However, domain transfer without loss of performance is difficult.

F. 3D OBJECT DETECTION
3D object detection has become an active research direction with the advent of 3D sensors and the diverse applications of 3D understanding. LiDAR point clouds provide reliable depth information that can be used to locate objects accurately and describe their shapes. It may also be feasible to apply detection techniques developed for LiDAR data to 2D data.

G. SALIENCY DETECTION
Salient object detection (SOD) highlights the significant object regions in an image, whereas video object detection classifies and locates the objects of interest in a continuous scene. SOD can be applied to a broad spectrum of object-level applications in various areas. It can also assist in detecting objects accurately by providing a salient region of interest in each video frame. It can therefore help in high-level recognition tasks, challenging detection tasks, and highlighted target detection.

H. UNSUPERVISED OBJECT DETECTION
Supervised methods for object detection require a well-annotated dataset for training, which is time-consuming and inefficient to produce. Annotating a bounding box for every object in a large dataset takes a significant amount of time and effort and is often impractical. Automatic annotation strategies need to be developed to eliminate the need for human annotation in object detection.

I. FEATURE FUSION & MULTI-TASK LEARNING
Feature fusion improves detection performance by aggregating features from multiple levels. Furthermore, performing several tasks simultaneously, such as semantic and instance segmentation along with object detection, can benefit each task because richer information is available. Maintaining processing speed while improving accuracy during multi-task learning remains a challenge for researchers.
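
The aggregation of multi-level features can be sketched as a top-down pathway in which each coarser map is upsampled and added to the next finer one. This is a simplified illustration in the spirit of pyramid-style fusion, not the exact design of any detector discussed here: it assumes that all levels share the same channel count and that each level has half the resolution of the previous one, and it omits the learned lateral and smoothing convolutions that real networks use. The names `upsample2x` and `fuse_top_down` are hypothetical.

```python
import numpy as np


def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)


def fuse_top_down(features):
    """Fuse feature maps ordered from fine to coarse.

    Each output level is its own input plus the upsampled fused map
    from the next coarser level; the coarsest level is passed through.
    """
    fused = [features[-1]]
    for feat in reversed(features[:-1]):
        fused.append(feat + upsample2x(fused[-1]))
    return fused[::-1]  # restore fine-to-coarse order


# Toy pyramid with constant values so the sums are easy to follow.
c3 = np.ones((4, 8, 8))
c4 = np.full((4, 4, 4), 2.0)
c5 = np.full((4, 2, 2), 3.0)
p3, p4, p5 = fuse_top_down([c3, c4, c5])
print(p3.shape, p3[0, 0, 0], p4[0, 0, 0], p5[0, 0, 0])  # (4, 8, 8) 6.0 5.0 3.0
```

After fusion, the finest level carries information from every coarser level, which is the property that lets shallow, high-resolution maps benefit from deeper semantic features.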

J. MULTI-SOURCE INFORMATION ASSISTANCE
Thanks to the development of big data technology and the popularity of social media, multi-source information is easy to access. Many social media sources provide textual descriptions along with pictures, which can assist object detection tasks. The fusion of such multidisciplinary information is a promising direction for future research.

K. TERMINAL OBJECT DETECTION SYSTEM
AI terminalization can help process massive amounts of information and solve problems better and faster. Lightweight networks have emerged from efforts to develop more efficient and reliable terminal detectors for a variety of applications. FPGA-based detection networks are highly feasible for real-time applications.

L. MEDICAL IMAGING AND DIAGNOSIS
AI-based medical devices are gaining recognition because of their promising accuracy. In April 2018, the FDA (U.S. Food and Drug Administration) approved AI-based software called IDX-DR for detecting diabetic retinopathy, with a reported accuracy of more than 87.4%. The combination of image recognition and smart devices makes the cell phone a powerful family diagnostic tool. Current epidemics such as COVID-19 increase the need for such technology. This direction is full of challenges and expectations.

M. ADVANCED MEDICAL BIOMETRICS
Medical risk factors that were previously difficult to quantify can be studied and monitored more effectively using deep neural networks. Medical images such as retinal (fundus) images, as well as speech patterns, may help identify the risk of heart disease. Similarly, X-ray images, CT images, and immune pattern monitoring may help diagnose other significant disorders. Passive monitoring may soon become possible with medical biometrics.

N. REAL-TIME DETECTION AND REMOTE SENSING AIRBORNE
Precise analysis of remote sensing images is highly beneficial for agriculture and military defense. Automatic detection software combined with integrated hardware can open new opportunities for countries in these fields.

O. GAN BASED DETECTOR
Data augmentation generally helps in deep learning. Deep learning-based systems require a massive number of images for training, and a Generative Adversarial Network (GAN) is a powerful data augmentation technique that can generate synthetic images close to reality. An object detector becomes more robust and gains stronger generalization ability when trained on a combination of real-world scenes and GAN-generated simulated data.