Object Detection Using Deep Learning, CNNs and Vision Transformers: A Review

Object detection remains one of the most fundamental and challenging tasks in computer vision and image understanding. Significant advances in object detection have been achieved through improved object representation and the use of deep neural network models. This paper examines how object detection has evolved in the era of deep learning over the past years. We present a literature review of various state-of-the-art object detection algorithms and the underlying concepts behind these methods. We classify these methods into three main groups: anchor-based, anchor-free, and transformer-based detectors, which differ in the way they identify objects in an image. We discuss the insights behind these algorithms and provide experimental analyses comparing quality metrics, speed/accuracy trade-offs, and training methodologies. The survey compares the major convolutional neural networks for object detection, covers the strengths and limitations of each object detector model, and draws significant conclusions. We provide simple graphical illustrations summarising the development of object detection methods under deep learning. Finally, we identify directions for future research.


I. INTRODUCTION
Research and breakthroughs in object detection fall into two main periods: before 2014, marked by traditional detection models, and after 2014, marked by models based on deep learning. Due to the successful application of deep neural networks (DNNs) and convolutional neural networks (CNNs) [1], especially in recent years, many artificial intelligence fields have improved considerably. As a result, significant progress has been made in computer vision tasks such as classification, segmentation, and object detection [2]. Object detection builds on image classification [1] and relates to semantic and instance segmentation [2], [3]. Visual object detection combines image classification [3] and localization. This task is more complex than simple image classification or classification with localization, as an image usually contains several objects of different categories. It consists of locating the instances of an object in a given image and assigning each object instance a matching class label from a wide range of predefined classes. Deep learning-based object detection models using convolutional neural networks and transformers now play a pivotal role in the evolution of this domain. These models can provide vital information for the semantic understanding of images and videos, and object detection has been adopted rapidly across a variety of sectors. (The associate editor coordinating the review of this manuscript and approving it for publication was Bing Li.)
Examples include support for autonomous cars to navigate safely in traffic [4], [5], [6], [7], detection of abusive behavior [8], [9], facial detection [10], [11], human behavior analysis [12], [13], medical imaging such as cancer detection [14], [15], robotics [16], [17], general image processing techniques such as cropping, orientation detection, and contrast enhancement [18], [19], [20], [21], remote sensing applications [214], [215], [216], and many other use cases [217], [218], [219]. Regarding future use cases for object detection, the possibilities are endless. To develop algorithms that can detect objects in a scene, we need to look beyond shallow and deep CNNs. For a better understanding of the dynamics and interactions between objects in these visual scenes, it is necessary to use sequential and relational information modeling to connect objects in both time and space. Before introducing these advanced techniques, however, it is worthwhile first to understand the evolution of state-of-the-art object detectors, their limitations, and how these limitations can be addressed. This paper presents an in-depth review of several approaches to the object detection task. We explore and discuss the different frameworks used for object detection and the primary datasets and metrics applied to evaluate detection. We describe the advantages and limitations of the most widely used convolutional neural networks, which serve as backbones for the leading object detection models. First, we cover algorithms from the anchor-based family, including two- and one-stage object detectors. We then review more sophisticated and faster algorithms based on anchor-free and transformer-based object detection approaches. Next, we elaborate on each approach's strengths and weaknesses by comparing the methods mentioned in the paper. Finally, we provide a discussion of some future directions and prospects.

A. COMPARISON WITH PREVIOUS REVIEWS
Previous studies [22], [35] were limited to an overview and comparison of a small number of object detection models, although other models were available at the time. Most previous surveys followed the same method of dividing the models into two categories: two-stage and one-stage detectors. Moreover, some focused on only one aspect of object detection. For example, some studied the detection of salient objects [26], [30], others the detection of small objects [33], [34], and still others tiny objects [31]. In [32], the authors review the learning strategies of object detector models. In this paper, we cover the detection models and approaches based on deep learning from 2013 to 2022, including the more recently published transformer-based object detection models. No previous work has comprehensively covered and analyzed the number of models we have listed. We also divide the detection models into four categories: the first concerns two-stage anchor-based models, the second one-stage anchor-based models, the third anchor-free methods, and the last transformer-based models.

B. OUR CONTRIBUTIONS
The primary motivation of this work is to provide a comprehensive, detailed, and simplified overview, through tables and figures, of the past and current state of the field of object detection. This paper can be a starting point for researchers and engineers seeking to gain knowledge in this field, especially those beginning their careers; they can learn about the current situation and contribute to advancing the field. Our contribution differs from previous ones in its focus and in the number of models covered. Understanding any domain and developing new concepts necessitates knowledge of existing concepts, including their pros and cons, particularly in a fast-developing field such as object detection. Our work therefore brings added value to the field of object detection: it provides researchers, especially those starting in this field or those interested in applying these techniques in other disciplines, such as healthcare, with an up-to-date, state-of-the-art overview of object detection.

1) We propose an up-to-date survey that covers both older and more recently published object detection models.
2) We present the first review that covers almost all deep learning-based object detection models.
3) We compare the different backbone networks used by object detectors through their strengths, features, and limitations.
4) We suggest a research study outlining and investigating generic object detection approaches from the perspective of anchors and transformers.
5) We summarise the evolution and categories of object detection with deep learning in simplified charts, diagrams, and tables.
6) We outline promising future directions in the field of object detection.

II. TRADITIONAL OBJECT DETECTION METHODS
The first notable strides in object detection and image recognition began in 2001, when Paul Viola and Michael Jones designed an effective facial detection algorithm [36]: a robust binary classifier built from multiple weak classifiers. Their demonstration of faces detected in real time on a webcam was the most impressive illustration of computer vision at the time. In 2005, a new paper by Navneet Dalal and Bill Triggs was published. Their approach, based on the Histograms of Oriented Gradients (HOG) feature descriptor [37], outperformed existing pedestrian detection algorithms. In 2009, Felzenszwalb et al. developed the Deformable Part Model (DPM) [38], another crucial feature-based model. DPM proved highly successful in object detection applications in which bounding boxes were applied to localize objects, as well as in template matching and other well-known object detection approaches used at the time. Several methods have been developed to extract patterns from images and detect objects [39], [42]. All traditional methods tend to involve three parts: 1) The first step consists of inspecting the entire image at multiple positions and scales to generate candidate boxes, using methods such as the sliding window [43], [44], max-margin object detection, or region proposals such as the selective search algorithm [45]. With sliding windows, capturing several thousand windows in each image is usually necessary, so any costly calculation at this first level results in a prolonged process of scanning the entire image. Especially during training, several iterations over the training set are often necessary to include the selected
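The sliding-window step can be sketched as follows; this is a toy illustration (the window sizes, stride, and image dimensions are arbitrary choices), showing why even modest settings generate thousands of candidate boxes per image:

```python
def sliding_windows(img_w, img_h, win_w, win_h, stride):
    """Yield (x, y, w, h) candidate boxes tiled across the image."""
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            yield (x, y, win_w, win_h)

def multiscale_windows(img_w, img_h, scales, stride):
    """Repeat the tiling at several window scales, as traditional detectors do."""
    boxes = []
    for s in scales:
        boxes += list(sliding_windows(img_w, img_h, s, s, stride))
    return boxes

# Even a modest 640x480 image and three scales yield thousands of candidates,
# each of which must be classified — hence the slowness of this first stage.
boxes = multiscale_windows(640, 480, scales=[64, 128, 256], stride=16)
print(len(boxes))  # 2133
```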

1) MEAN AVERAGE PRECISION
The mAP value is the mean of the average precision over all K classes. The average precision (AP) is derived from the precision-recall curve, calculated over all unique recall levels. The method of computing AP in the PASCAL VOC challenge changed in 2010: the challenge now interpolates through all data points instead of only 11 equidistant recall points. mAP jointly evaluates regression (localization) and classification accuracy.
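The all-point interpolation used by PASCAL VOC since 2010 can be sketched as follows; this is a minimal illustration of the idea, not the official evaluation code:

```python
def average_precision(recalls, precisions):
    """All-point interpolated AP (PASCAL VOC 2010+ style).

    `recalls` must be sorted ascending. The precision curve is first made
    monotonically non-increasing from right to left (the "envelope"), then
    AP is the area under that envelope.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Envelope: precision at recall r_i = max precision at any recall >= r_i.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Area under the stepwise curve over every recall interval.
    ap = 0.0
    for i in range(1, len(r)):
        ap += (r[i] - r[i - 1]) * p[i]
    return ap

# Two detections: one true positive at recall 0.5 (precision 1.0),
# the second reaching recall 1.0 at precision 0.5.
print(average_precision([0.5, 1.0], [1.0, 0.5]))  # 0.75
```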

2) MEAN AVERAGE RECALL
The mAR value is the mean of the average recall over all K classes. Like AP, the average recall (AR) is a numerical metric for comparing detector efficiency. AR is the mean recall over all IoU thresholds in the [0.5, 1] interval and can be calculated as twice the area under the recall-IoU curve.

A standard object detection model is divided into four main parts: the input, the backbone, the neck, and the head. The input can be a single image, a patch, or a pyramid of images. The backbone [57] can be a convolutional neural network such as VGG [58], ResNet [59], EfficientNet [60], SpineNet [61], CSPDarkNet [62], etc. The neck is a network placed on top of the backbone; it is usually composed of several bottom-up and top-down paths such as FPN [63], NAS-FPN [64], ASFF [65], PAN [66], and BiFPN [67], or of additional blocks such as SPP [68], RFB [69], and SAM [70]. The heads can be classified into two categories: those responsible for dense prediction, such as RetinaNet [71], YOLO [72], SSD [73], CornerNet [74], and FCOS [75], and those responsible for sparse prediction, such as Faster R-CNN [76], Mask R-CNN [77], and RepDet [78].

IV. BACKBONE NETWORKS FOR OBJECT DETECTION
One of the most important factors in building a robust object detector is the design of the backbone network. The backbone for object detection is a convolutional neural network that provides the foundation for the detector. Its primary purpose is to extract features from the images before they are submitted to further steps, such as the localization phase. Several standard convolutional neural network backbones are used by object detectors, including VGGNets, ResNets, and EfficientNets, which are typically pre-trained on classification tasks.

A. ALEXNET
AlexNet [3] is a convolutional neural network (CNN) architecture developed in 2012. It consists of eight layers: five convolutional layers, two fully connected hidden layers, and one fully connected output layer with a 1000-way softmax classifier. AlexNet was the first CNN to win the ImageNet Large Scale Visual Recognition Challenge and became a foundational architecture for object detection backbones. It uses ReLU activation functions and local response normalization layers.

B. VGGNETS
VGGNet [58] is a convolutional neural network architecture developed in 2014. It uses a deep architecture with multiple convolutional and fully connected layers: five convolutional blocks followed by three fully connected layers. The VGGNet architecture is known for its use of small convolutional filters (3 × 3) and a very deep network with 16 to 19 layers. It uses ReLU activation functions and finishes with a softmax classifier. The main idea behind this architecture is to use very small filters (3 × 3) to capture fine details in the images and to stack multiple layers to increase the depth of the network, allowing it to learn more complex features.
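The receptive-field arithmetic behind stacking small filters can be checked with a short sketch: two stacked 3 × 3 convolutions (stride 1) see the same 5 × 5 input region as a single 5 × 5 filter, and three see 7 × 7, but with fewer parameters and more non-linearities in between:

```python
def receptive_field(num_layers, kernel=3, stride=1):
    """Receptive field of `num_layers` stacked convolutions (same kernel
    and stride at every layer), using the standard rf/jump recurrence."""
    rf, jump = 1, 1
    for _ in range(num_layers):
        rf += (kernel - 1) * jump   # each layer widens the field
        jump *= stride              # stride grows the step between outputs
    return rf

# The VGG insight: two 3x3 convs cover 5x5, three cover 7x7.
print(receptive_field(2), receptive_field(3))  # 5 7
```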

C. RESNETS
ResNet (Residual Network) [59] is an architecture designed and published in 2015. It is known for its ability to train very deep networks without the vanishing gradient problem, a common issue in such networks. The original ResNet paper proposed five model sizes: 18, 34, 50, 101, and 152 layers. Since then, many other variants of ResNet have been developed, such as ResNeXt and Wide Residual Networks (WRN). For example, ResNet-34 uses a plain network architecture inspired by VGG-19, with shortcut connections added. These shortcut connections allow the model to skip layers without affecting performance. The critical innovation of ResNet is the introduction of residual connections, which allow the network to learn the residual mapping between the input and the output of a layer rather than the original mapping. The residual connections allow the network to propagate gradients more effectively and enable the training of much deeper networks. The ResNet architecture uses a building block called the ''Residual Block,'' which contains multiple convolutional and batch normalization layers. The final layer is connected to a fully connected layer to classify the images.
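The residual mapping can be illustrated with a minimal numerical sketch; the two-layer branch with toy weight matrices below stands in for the real convolutional block:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def residual_block(x, w1, w2):
    """y = relu(F(x) + x): the block learns the residual F(x) = y - x.

    If w1 and w2 are zero, the block reduces to an identity (on positive
    inputs), so stacking many blocks cannot make the network worse — the
    property that makes very deep ResNets trainable.
    """
    f = relu(x @ w1) @ w2   # two-layer residual branch F(x)
    return relu(f + x)      # identity shortcut added before the final ReLU

x = np.array([1.0, -2.0, 3.0])
zeros = np.zeros((3, 3))
# With zero weights the block passes its input through (ReLU-clipped):
print(residual_block(x, zeros, zeros))  # [1. 0. 3.]
```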

D. INCEPTION-RESNET
Inception-ResNet [228] is a convolutional neural architecture that builds on the Inception family of architectures developed by Google in 2016 but incorporates residual connections similar to the ResNet architecture to improve the flow of gradients and allow for the training of deeper networks. The Inception architecture is known for its use of multiple parallel convolutional and pooling layers, also called ''Inception modules.'' Those modules extract features at different scales and then concatenate them before passing them to the next layer. It is 164 layers deep and trained on over a million images from the ImageNet database. The final layers are connected to a fully connected layer to classify the images. The network has a similar architecture schema to Inception-v4, but the difference lies in their stems, Inception, and Residual blocks. The model has achieved excellent performance at a relatively low computational cost.
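The branch-and-concatenate pattern of an Inception module can be sketched as follows; the branch functions here are toy stand-ins for the real 1 × 1, 3 × 3, and 5 × 5 convolution branches:

```python
import numpy as np

def inception_module(x, branches):
    """Apply parallel branch functions to the same input and concatenate
    their outputs along the channel axis (axis 0 here: C x H x W)."""
    return np.concatenate([b(x) for b in branches], axis=0)

x = np.ones((3, 8, 8))              # toy C x H x W feature map
identity = lambda t: t              # stand-in for a 1x1 conv branch
reduce_  = lambda t: 0.5 * t[:2]    # stand-in for a channel-reducing branch
out = inception_module(x, [identity, reduce_])
# The output channel count is the sum of the branch channel counts:
print(out.shape)  # (5, 8, 8)
```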

E. EFFICIENTNETS
EfficientNet [60] is a convolutional neural network and scaling method published in 2019 that uniformly scales all dimensions of depth, width, and resolution using a compound scaling approach. This allows the network to better balance accuracy and computational efficiency. It uses a building block called the mobile inverted bottleneck convolution (MBConv), which combines depthwise and pointwise convolutional layers. It is similar to MobileNetV2 and MnasNet but slightly larger due to an increased FLOP budget. The final layers are connected to a fully connected layer to classify the images. EfficientNet-B0 is the base model; as the scale increases, as in EfficientNet-B7, the architecture becomes more complex, with more layers, more filters, and higher-resolution input images.
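The compound scaling rule can be illustrated numerically; the coefficients α = 1.2, β = 1.1, γ = 1.15 are those reported in the EfficientNet paper, chosen so that each increment of the compound coefficient φ roughly doubles the FLOPs:

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Compound scaling: depth x alpha^phi, width x beta^phi,
    resolution x gamma^phi. FLOPs grow roughly with depth * width^2 *
    resolution^2, so the constraint alpha * beta**2 * gamma**2 ~ 2 means
    each unit of phi approximately doubles the compute budget."""
    return alpha ** phi, beta ** phi, gamma ** phi

d, w, r = compound_scale(1)
flops_factor = d * w ** 2 * r ** 2
print(round(flops_factor, 2))  # 1.92, i.e. close to the target factor of 2
```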

F. GOOGLENET
GoogLeNet [120], also known as Inception v1, is a convolutional neural network architecture based on the Inception design that Google developed in 2014. It utilizes Inception modules, allowing the network to choose between multiple filter sizes for a given input. GoogLeNet is 22 layers deep (27 when pooling layers are counted) and consists of nine Inception modules arranged in three groups with max-pooling between them, followed by global average pooling at the end. These modules extract features at different scales and concatenate them before passing them to the next layer. The GoogLeNet architecture won the 2014 ImageNet Large Scale Visual Recognition Challenge (ILSVRC).

G. CSPRESNEXT
CSPResNeXt [62] is a convolutional neural network in which the Cross Stage Partial Network (CSPNet) approach is applied to ResNeXt. CSPNet uses cross-stage partial connections that let part of the feature map bypass some of the network's layers, improving gradient flow and allowing the training of deeper networks. ResNeXt is a residual architecture built from a ''ResNeXt Block,'' which contains multiple parallel branches of convolutional layers (its cardinality) with different numbers of filters, allowing the network to learn features at different scales and increasing its capacity. CSPResNeXt is used as a feature extractor in YOLOv4 and partitions the feature map into multiple stages, allowing for better learning capability in CNNs.

H. DENSENETS
DenseNet [229] is a network that uses dense connections between layers through Dense Blocks, which contain multiple convolutional layers, batch normalization, and ReLU activations. The final layers are connected to a fully connected layer for image classification. The architecture is characterized by its dense connectivity pattern: each layer is connected directly to every subsequent layer in a feed-forward fashion, drawing representational power from feature reuse instead of from extremely deep or wide architectures. This pattern allows the network to propagate gradients more efficiently and effectively, enabling the training of deeper networks, and leads to a significant reduction in the number of parameters compared to traditional architectures.
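The dense connectivity pattern can be sketched as follows; the layer function is a toy stand-in that always emits a fixed number of new channels (the growth rate):

```python
import numpy as np

def dense_block(x, layers):
    """Each layer receives the concatenation of the block input and all
    previous layers' outputs, so features are reused rather than recomputed."""
    features = [x]
    for layer in layers:
        concat = np.concatenate(features, axis=0)  # channel axis
        features.append(layer(concat))
    return np.concatenate(features, axis=0)

x = np.ones((4, 8, 8))                  # 4 input channels
growth = lambda t: t[:2] * 0.0 + 1.0    # toy layer: always emits 2 channels
out = dense_block(x, [growth, growth, growth])
# Channels grow linearly: 4 input + 3 layers x growth rate 2 = 10.
print(out.shape)  # (10, 8, 8)
```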

I. SENET
SENet [230] (Squeeze-and-Excitation Network) is a convolutional neural network architecture published in 2017. The architecture employs squeeze-and-excitation blocks to enable the network to perform dynamic channel-wise feature recalibration. This feature improves the feature representation capabilities of CNNs. The architecture uses a building block called ''SE block,'' which contains two sub-layers: a ''Squeeze'' layer, which reduces the dimensionality of the feature maps, and an ''Excitation'' layer, which adaptively recalibrates the feature maps. The Squeeze layer applies global average pooling to the feature maps to obtain a channel descriptor, which is then passed through a fully connected layer, also called a bottleneck layer, to reduce the dimensionality of the descriptor.
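The squeeze-and-excitation mechanism can be sketched in a few lines; the weight matrices here are toy placeholders for the learned bottleneck parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(x, w_reduce, w_expand):
    """Squeeze: global average pooling gives a per-channel descriptor.
    Excite: a bottleneck FC + sigmoid produces gates in (0, 1) that
    rescale each channel (channel-wise feature recalibration)."""
    squeeze = x.mean(axis=(1, 2))                          # (C,) descriptor
    gates = sigmoid(np.maximum(squeeze @ w_reduce, 0) @ w_expand)
    return x * gates[:, None, None]                        # reweight channels

x = np.ones((4, 8, 8))
w_r = np.zeros((4, 2))      # bottleneck: C=4 -> C/r=2
w_e = np.zeros((2, 4))
out = se_block(x, w_r, w_e)
# With zero weights every gate is sigmoid(0) = 0.5, halving each channel:
print(out[0, 0, 0])  # 0.5
```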

J. HOURGLASS
The Hourglass [231] architecture is a convolutional neural network (CNN) used for human pose estimation, object detection, and semantic segmentation tasks. The architecture is characterized by its repeated bottom-up and top-down processing, resembling an hourglass shape, which allows the network to learn both fine and coarse features of the input. The Hourglass architecture consists of several modules stacked on top of each other. Each hourglass module is a sub-network that first reduces resolution with convolutional and pooling layers and then reconstructs the feature maps with up-sampling and convolutional layers. These modules are connected with ''skip'' or ''residual'' connections, allowing information to flow from one module to the next.

L. CSPDARKNET
CSPDarknet [199] is a convolutional neural network and backbone for object detection developed in 2020. It is based on the Darknet architecture and employs a CSPNet strategy to partition the feature map, combined with an activation function and an attention mechanism. The architecture uses a series of CSP blocks with an increasing number of layers; the output of each block is concatenated with the output of the corresponding block in the previous stage, which allows the network to learn both fine and coarse features of the input. The final output is obtained by applying convolutional layers to the feature maps generated by the last CSP block.

M. CONVNEXT
ConvNeXt [232] is a pure convolutional model inspired by the design of Vision Transformers. ConvNeXt is built entirely from standard ConvNet modules. It retains the efficiency of standard ConvNets while being fully convolutional for both training and testing, making it simple to implement. ConvNeXt has fewer activation functions and normalization layers than other backbone networks and uses a separate downsampling layer. The model was evaluated on various vision tasks such as ImageNet classification and object detection and showed strong performance on all major benchmarks. ConvNeXt makes heavy use of depthwise convolutions, which operate on a per-channel level and mix information only in the spatial dimensions. Depthwise convolutions are grouped convolutions in which the number of groups equals the number of input channels; they were popularized by MobileNet and later EfficientNet, and ConvNeXt adopts them as well.
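A depthwise convolution can be sketched naively as follows (a direct, unoptimized loop purely for illustration): each channel is filtered by its own kernel, and no information crosses channels:

```python
import numpy as np

def depthwise_conv(x, kernels):
    """Depthwise convolution: one kernel per channel, no channel mixing —
    information moves only in the spatial dimensions (valid padding)."""
    c, h, w = x.shape
    kh, kw = kernels.shape[1:]
    out = np.zeros((c, h - kh + 1, w - kw + 1))
    for ch in range(c):                      # each channel filtered alone
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[ch, i, j] = (x[ch, i:i+kh, j:j+kw] * kernels[ch]).sum()
    return out

x = np.ones((2, 5, 5))
k = np.ones((2, 3, 3))     # one 3x3 kernel per channel
print(depthwise_conv(x, k)[0, 0, 0])  # 9.0: sum over a 3x3 window of ones
```

In a framework this is the `groups == in_channels` case of a grouped convolution; channel mixing is then restored by a following pointwise (1 × 1) convolution.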

V. DATA AUGMENTATION
During training, models adopt different learning strategies such as localization refinement, data augmentation, cascade learning, and imbalance sampling. These strategies help models improve accuracy and execution time for both localization and classification. Data augmentation, for example, is one of the most efficient techniques for improving model results because it adds no inference complexity. However, it is still not commonly used to its full potential, owing to the difficulty of designing methods that can efficiently handle both localization and classification by transferring strategies between the two tasks. Augmentation techniques include color-space transformations, cropping, rotation, translation, kernel filters, random erasing, noise injection, color jittering, image mixing, and flipping.

A. COLOR SPACE
Color-space augmentation is a popular method when dealing with digital RGB images. The augmentation is performed on the R, G, and B channels, for example by isolating a single channel and replacing the other two with zero matrices. It is also possible to apply other color enhancements, such as contrast, brightness, and saturation adjustments, by manipulating the values of the RGB matrices.

B. ROTATION
The rotation method rotates the image to the left or right by an angle between 1° and 359°. The degree of rotation is an essential factor in this method and is chosen according to the image type and the problem to be solved. For example, a slight rotation of about 20° is used in applications related to digit recognition, such as MNIST, whereas a large rotation risks destroying the label of the image.

C. TRANSLATION
The basic concept of this augmentation is to shift the image in one of four directions: up, down, right, or left. This technique preserves the spatial dimensions of the image by filling the space left after translation with constant values such as 0 or 255, or with random or Gaussian noise.

D. CROPPING
The cropping technique is often used when a dataset contains images of different heights and widths; it crops the central patch of each image. Random cropping is similar to translation, except that translation preserves the spatial dimensions of the image, whereas random cropping reduces its size.

E. KERNEL FILTERS
This augmentation technique is commonly used to clean or blur images by applying filters. The process of kernel filters is similar to that of convolutional neural networks. The idea is to either slide a matrix with a Gaussian blur filter onto the image to produce a blurred image or to slide the matrix with a high-contrast vertical or horizontal edge filter to produce a sharper image.

F. RANDOM ERASING
Random erasing [222] is an augmentation method inspired by the dropout regularization mechanism and aims to solve the occlusion problem encountered in image recognition problems. Random erasing assists the model in avoiding the occlusion problem and overfitting by forcing the model to learn the defining features of the image. The operating mechanism of this method involves randomly selecting a patch in the image and masking it with average pixel values, zeros, or 255s.
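A minimal sketch of random erasing; the patch-size bounds and fill value are arbitrary choices for illustration:

```python
import random
import numpy as np

def random_erasing(img, min_size=2, max_size=4, fill=0):
    """Mask a randomly placed rectangular patch with a constant value,
    forcing the model to rely on the remaining context (occlusion-robust).
    `fill` could also be 255 or the dataset's mean pixel value."""
    h, w = img.shape[:2]
    eh = random.randint(min_size, max_size)
    ew = random.randint(min_size, max_size)
    y = random.randint(0, h - eh)
    x = random.randint(0, w - ew)
    out = img.copy()
    out[y:y+eh, x:x+ew] = fill
    return out

random.seed(0)
img = np.ones((8, 8))
erased = random_erasing(img)
print(int(img.sum() - erased.sum()))  # number of erased pixels
```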

VI. ANCHOR-BASED DETECTORS
Anchor boxes are a predefined collection of bounding boxes whose widths and heights reflect those of the objects in the training dataset, covering the various aspect ratios and scales found in the data. When detecting, the predetermined anchor boxes are arranged in a tiled pattern over the image, and the same anchors are proposed for every image. Instead of predicting the boxes directly, the network predicts the probability and other attributes, such as background, intersection over union (IoU), and offsets, for each tiled anchor box, returning a separate set of predictions for each one. Generating bounding boxes can be described as follows: (1) Create thousands of candidate anchor boxes that best describe the objects' size, location, and shape.
(2) Predict the offsets for each bounding box. (3) Compute a loss function for each anchor box based on the ground truth. (4) For each anchor box, compute the intersection over union (IoU) to check which object's bounding box has the largest IoU. (5) When the probability is greater than 0.5, assign the anchor box to the object with the highest IoU and factor the prediction into the loss function. (6) If the probability is slightly below 0.5, the anchor box does not learn from this sample, since the prediction is ambiguous; if the probability is well below 0.5, the anchor box should predict that no object is present. This process ensures that the model learns to identify only true objects. Using anchor boxes allows a network to detect multiple objects, objects of different scales, and overlapping objects. In object detection, anchor-based detectors define anchor boxes at each position in the feature map. The network predicts the probability of an object in each anchor box and then adjusts the size of the anchor box to fit the object. However, anchors require careful design and application in object detection frameworks. (a) The coverage ratio of the instances' location space is among the most critical factors in anchor design. To ensure a good recall rate, anchors are thoroughly engineered based on statistics computed from the training/validation set [79], [80]. (b) Some design choices based on a particular dataset may not apply to other applications, which affects generality [81]. (c) During the learning phase, anchor-based approaches rely on intersection over union (IoU) to define positive/negative samples, adding extra computation and hyper-parameters to the object detection system [82].
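The IoU computation at the heart of steps (4)-(6) can be sketched as follows, for boxes given in (x1, y1, x2, y2) form:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty when the boxes do not overlap).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

anchor = (0, 0, 10, 10)
gt     = (5, 0, 15, 10)
# Intersection 50, union 150: IoU = 1/3, so this anchor would be a
# positive sample under a 0.3 threshold but negative under 0.5.
print(iou(anchor, gt))
```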
Anchor-based object detection frameworks generally fall into two groups: two-stage, proposal-based detectors and one-stage, proposal-free methods. 1) Two-stage object detection. 2) One-stage object detection. The anchors serve as regression references and classification candidates, used to predict proposals in two-stage detectors and final bounding boxes in one-stage detectors.

A. TWO-STAGE METHODS
Region-based object detection algorithms were among the most widely used techniques for detecting objects in images. The first object detection models intuitively start by searching regions and then performing classification. The two-stage methods derive from the R-CNN family, which extracts RoIs using a selective search method [83] and then classifies and regresses them. Faster R-CNN [76] is the best-known two-stage anchor-based reference detector. It uses a separate region proposal network (RPN), which generates RoIs by modifying predefined anchor boxes, and a region-based prediction network (R-CNN) [84], [85] to detect objects. Many models were subsequently introduced to improve its performance. For example, Mask R-CNN [77] replaces the RoIPool layer with the RoIAlign layer, using bilinear interpolation. Other models target different aspects to improve performance: the whole architecture [86], [89], multiscale learning and testing [90], [91], feature fusion and enhancement [63], [92], new loss functions and training schemes [93], [95], and better proposals and balance [96], [97], while others apply context and attention mechanisms or different learning strategies and loss functions.

1) REGION-BASED CONVOLUTIONAL NETWORK (R-CNN)
Support vector machines (SVMs) [50] are assigned to each class to classify the object's occurrence in the proposed candidate region given a feature vector. R-CNN also has a linear regressor that predicts four offset values to enhance the accuracy of the selected bounding boxes and minimize localization errors. R-CNN consists of three simple steps: scan the input image for possible objects using the selective search algorithm, proposing approximately 2000 candidate boxes; apply a CNN to each candidate box for feature extraction; and feed the output of each CNN to an SVM for classification and to a linear regressor to refine the object's bounding box.
R-CNN is conceptually simple but very slow.

2) SPATIAL PYRAMID POOLING NETWORK (SPPNet)
Based on the concept of spatial pyramid matching (SPM) [98], SPPNet [68] is mainly an improved version of R-CNN [85]. SPPNet introduces a specific procedure known as Spatial Pyramid Pooling (SPP) between the convolutional layers and the fully connected layer: instead of the single pooling layer often used as standard in other methods, it proposes multiple pooling layers at different scales. Like R-CNN, SPPNet applies a selective search algorithm to generate about 2,000 region proposals per image. It then extracts features from the whole image using ZFNet [99], only once. At the final convolutional layer, the feature maps delineated by each region proposal pass through the SPP layer, followed by the fully connected layer. Each bounding box has its own SVM and bounding box regressor. SPPNet uses the SPP layer to pool each region proposal's features from the global feature volume into a fixed-length representation. SPP thus removes the need to crop the image to a fixed size before entering the CNN, as in R-CNN with VGG [58], where image sizes are fixed at 224 × 224; with SPP, images of different shapes can be used. In contrast to R-CNN, which runs the convolutional layers on roughly 2000 regions per image, SPPNet processes the image through the convolutional layers only once. As shown in Table 4, SPPNet is much faster and more accurate than R-CNN.
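The fixed-length pooling can be sketched as follows; with pyramid levels 1 × 1, 2 × 2, and 4 × 4 (the levels here are illustrative), any input resolution yields a vector of length C × 21:

```python
import numpy as np

def spatial_pyramid_pool(fmap, levels=(1, 2, 4)):
    """Max-pool a C x H x W feature map into a fixed-length vector:
    each pyramid level splits the map into n x n bins, so the output
    length depends only on C and the levels, never on H or W."""
    c, h, w = fmap.shape
    pooled = []
    for n in levels:
        ys = np.linspace(0, h, n + 1, dtype=int)   # bin boundaries
        xs = np.linspace(0, w, n + 1, dtype=int)
        for i in range(n):
            for j in range(n):
                bin_ = fmap[:, ys[i]:ys[i+1], xs[j]:xs[j+1]]
                pooled.append(bin_.max(axis=(1, 2)))
    return np.concatenate(pooled)   # length = C * (1 + 4 + 16)

# Two inputs of different spatial sizes give the same output length:
v1 = spatial_pyramid_pool(np.random.rand(8, 13, 17))
v2 = spatial_pyramid_pool(np.random.rand(8, 32, 24))
print(v1.shape, v2.shape)  # (168,) (168,)
```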

3) FAST REGION-BASED CONVOLUTIONAL NETWORK (FAST R-CNN)
To solve some of the problems of R-CNN and SPPNet and to develop a faster object detection algorithm, Ross Girshick published a new paper named Fast R-CNN [84]. Comparing Fast R-CNN with SPPNet, the SVM classifiers have been removed and a regression and classification layer connected to the network; VGGNet is used instead of ZFNet, and a region of interest (RoI) pooling layer replaces the SPP layer. On the other hand, Fast R-CNN is similar to the original R-CNN in many ways, but two major changes improved its detection speed: the image features are extracted before regions are proposed, rather than forwarding each region proposal to the feature extractor.
Thus, a single CNN is applied to the entire image rather than 2,000 CNNs on 2,000 regions. Second, the SVMs are replaced with a softmax layer, extending the neural network to make predictions rather than building a new model. The primary CNN, with several convolutional layers, takes the entire image as input rather than applying a CNN to every region proposal. As a result, the region proposals are based on the last feature map, and a single CNN can be built for the entire image. Regions of interest (RoIs) are obtained by projecting the selective search proposals onto the produced feature maps. Each proposal region is then resized by an RoI pooling layer into a fixed-size region of interest that can be fed into a fully connected layer. Fast R-CNN uses a softmax layer instead of many different SVMs to predict the class directly for each region proposal, together with the offset values for the bounding boxes. Therefore, only one neural network has to be trained, compared to one neural network plus many SVMs. Fast R-CNN uses a multi-task loss function that combines classification and regression losses: the classification loss is computed using the log loss for the true class, and the regression loss is computed using the smooth L1 loss function.
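The RoI pooling idea, producing a fixed-size output from an arbitrarily shaped proposal, can be sketched as follows. This is a simplified, single-channel max-pooling version with illustrative names, not the library implementation:

```python
def roi_pool(feature_map, roi, out_size=2):
    """Max-pool the RoI region of a feature map down to a fixed
    out_size x out_size grid, regardless of the RoI's shape."""
    r0, c0, r1, c1 = roi                     # RoI bounds, end-exclusive
    h, w = r1 - r0, c1 - c0
    pooled = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # floor/ceil binning so the bins cover the whole RoI
            a0, a1 = r0 + (i * h) // out_size, r0 + -(-(i + 1) * h // out_size)
            b0, b1 = c0 + (j * w) // out_size, c0 + -(-(j + 1) * w // out_size)
            row.append(max(feature_map[r][c]
                           for r in range(a0, a1)
                           for c in range(b0, b1)))
        pooled.append(row)
    return pooled
```

Whatever the proposal's width and height, the pooled grid always has the same shape, so a single fully connected head can serve every proposal.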

4) FASTER REGION-BASED CONVOLUTIONAL NETWORK (FASTER R-CNN)
The three algorithms mentioned above, R-CNN, SPPNet, and Fast R-CNN, rely on selective search to identify region proposals. Selective search is a slow and time-consuming method that impacts network performance and proved to be the bottleneck of the entire process. Thus, the authors of Faster R-CNN [76] proposed a framework for object detection that replaces the selective search algorithm and lets the network discover region proposals itself. The key insight of Faster R-CNN is that the region proposals can depend on the image features already calculated during the forward pass of the CNN (the first step in classification). They developed a region proposal network (RPN) [76] to generate region proposals directly and then predict bounding boxes. Faster R-CNN thus combines an RPN with a Fast R-CNN model [58]. Faster R-CNN takes the CNN feature maps and forwards them to the region proposal network. The RPN utilizes a 3 × 3 sliding window that moves across these feature maps. Each sliding-window location generates multiple potential regions and scores based on k fixed-ratio anchor boxes. The resulting bounding boxes, in various shapes and sizes, are passed to the RoI pooling layer. Since region proposals may have no classes assigned to them after the RPN step, each proposal is cropped so that it contains an object; that is what the RoI pooling layer is for. It extracts fixed-size feature maps for each proposal. These feature maps are then transmitted to a fully connected layer comprising a softmax and a linear regression layer, which classifies the objects and predicts the bounding boxes for the detected objects. Only one CNN is applied in Faster R-CNN for both region proposals and classification. Faster R-CNN is optimized for a multitask loss function comprising classification and regression losses.

The Region Proposal Network (RPN) is a convolutional neural network that proposes regions, while the second network is a Fast R-CNN that extracts features and outputs the bounding boxes and class labels. The RPN is likewise optimized for the given multitask loss function.
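The k fixed-ratio anchors evaluated at every sliding-window position can be generated as in the following sketch. The base size, ratios, and scales follow commonly used defaults, and the function name is hypothetical:

```python
def make_anchors(base_size=16, ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
    """Generate k = len(ratios) * len(scales) anchor boxes centred at
    the origin, as (x0, y0, x1, y1) corners. At inference these are
    shifted to every sliding-window position on the feature map."""
    anchors = []
    for ratio in ratios:
        for scale in scales:
            area = (base_size * scale) ** 2
            w = (area / ratio) ** 0.5        # width/height satisfy h/w = ratio
            h = w * ratio
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return anchors
```

With 3 ratios and 3 scales this yields the k = 9 anchors per location used in the original Faster R-CNN configuration.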

5) REGION-BASED FULLY CONVOLUTIONAL NETWORK (R-FCN)
R-CNN-based detectors, such as Fast R-CNN or Faster R-CNN, detect objects in two phases: first, generate region proposals (RoIs); then, classify and localize objects from the RoIs. These detectors save valuable time by sharing the computation of convolutional features between object classification and region proposals. However, Faster R-CNN still contains several unshared fully connected layers inherited from R-CNN that must be computed for each of the hundreds of proposals. The Region-based Fully Convolutional Network (R-FCN) [100] is a framework that combines the two main phases in a single model to account for both the detection of the object and its position simultaneously. It contains only convolutional layers, providing full backpropagation for training and inference. As in the methods mentioned above, the region proposals are mainly generated by an RPN. RoI pooling is performed, and the result passes through fully connected (FC) layers to classify and regress the bounding boxes. The post-RoI-pooling computation is not shared between RoIs and takes a long time, and the FC layers add more parameters to the model, which leads to more complexity. In R-FCN, there is still an RPN for region proposals; however, unlike the R-CNN series, the FC layers after RoI pooling are eliminated. As an alternative, almost all the computation is moved before the RoI pooling to generate feature maps, each dedicated to detecting a category at a specific location. For instance, one feature map is dedicated to detecting a dog, another to detecting a car, etc. These feature maps encode an object's spatial localization and are called position-sensitive score maps. After the RoI pooling, all the region proposals use the same score maps to carry out average voting, a simple computation. Consequently, there is no learnable layer after the RoI layer; in other words, R-FCN is significantly faster than Faster R-CNN and has a highly reliable mAP.
R-FCN operates as follows: the input image is processed by the backbone ResNet-101 [59] to generate feature maps. These feature maps are transmitted on the one hand to an RPN to produce RoIs and, on the other hand, to a fully convolutional layer that generates a bank of k²(C + 1) position-sensitive score maps, where k² is the number of relative positions used to split an object in a grid and C + 1 is the number of classes plus the background. Afterward, each RoI is split into the same k² bins or sub-regions as the score maps. For each bin, the score bank is checked for the map corresponding to that relative position of the object. In the upper-left bin, for instance, the score maps that match the upper-left corner of an object are consulted, and their values are averaged over that region of the RoI. The system repeats this procedure for each class. After each of the k² bins has a corresponding value for each class, the k² bins are averaged to produce a single score per class. The RoI is then classified with a softmax over the resulting (C + 1)-dimensional vector. Convolution filters generate the k²(C + 1) score maps used for classification; an additional convolution filter generates 4k² maps from the same features for bounding box regression. The loss function of R-FCN is defined on each RoI as the sum of the cross-entropy loss and the box regression loss. The classification loss (Lcls) and bounding box regression loss (Lreg) are used with online hard example mining (OHEM).
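The position-sensitive average-voting step can be sketched as follows. This is a simplified, integer-grid version with illustrative names; the real R-FCN pools over continuous bins and shares the computation across all RoIs:

```python
def ps_roi_vote(score_maps, roi, k, num_classes):
    """Position-sensitive average voting (simplified sketch).

    score_maps holds k*k*(num_classes+1) single-channel maps, ordered so
    the map for grid bin (i, j) and class c sits at index
    (i*k + j)*(num_classes+1) + c. Each RoI bin average-pools only from
    its own dedicated map; the k*k bin scores are then averaged into one
    score per class (background included)."""
    r0, c0, r1, c1 = roi                         # RoI bounds, end-exclusive
    h, w = r1 - r0, c1 - c0
    scores = []
    for cls in range(num_classes + 1):
        total = 0.0
        for i in range(k):
            for j in range(k):
                m = score_maps[(i * k + j) * (num_classes + 1) + cls]
                a0, a1 = r0 + (i * h) // k, r0 + -(-(i + 1) * h // k)
                b0, b1 = c0 + (j * w) // k, c0 + -(-(j + 1) * w // k)
                cells = [m[r][c] for r in range(a0, a1) for c in range(b0, b1)]
                total += sum(cells) / len(cells)   # average pool inside the bin
        scores.append(total / (k * k))             # average vote over the bins
    return scores
```

Because the voting is a plain average, no learnable layer follows the RoI step, which is the source of R-FCN's speed advantage.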

6) FEATURE PYRAMID NETWORKS (FPN)
The FPN [63] is not an object detector in itself. It is a feature extractor that operates in combination with object detectors. For instance, with FPN, we can extract multiple feature map layers and feed them into an RPN to detect objects. Compared to the feature extractor used in frameworks like Faster R-CNN, FPN generates more layers of feature maps, multi-scale feature maps, and higher-quality information than the standard feature pyramid used for object detection. Using FPN allows objects to be detected at various scales. The FPN consists of a bottom-up and a top-down pathway. The bottom-up pathway is the traditional convolutional network for feature extraction and uses a ResNet [59]. The spatial resolution decreases as we move upwards, and as more high-level features are detected, the semantic value of each layer is enhanced. The output of the last layer of each stage is used as a reference set of feature maps to enhance the top-down pathway through lateral connections. The top-down pathway reconstructs higher-resolution layers from a semantically rich layer. While the reconstructed layers are semantically strong, the locations of objects are imprecise after all the down-sampling and up-sampling. The authors add lateral connections between the reconstructed layers and the associated bottom-up feature maps to address this problem and predict locations more accurately. In the top-down pathway, the spatial resolution is upsampled by a factor of 2 using nearest-neighbor interpolation for simplicity. For each lateral connection, feature maps of the same spatial size from the bottom-up and top-down pathways are merged. In more detail, the feature maps of the bottom-up pathway undergo 1 × 1 convolutions to reduce the channel dimensions, and the bottom-up and top-down feature maps are combined by element-wise addition.
Then, a 3 × 3 convolution is applied to each merged map to compute the final feature map, which reduces the aliasing effect of upsampling. The final set of feature maps is called P2, P3, P4, and P5, corresponding to C2, C3, C4, and C5 of the same spatial sizes, respectively.
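The spatial bookkeeping behind the pyramid can be checked with a small sketch, assuming the usual ResNet strides of 4, 8, 16, and 32 for stages C2 to C5 (function names are illustrative):

```python
def pyramid_sizes(image_size, strides=(4, 8, 16, 32)):
    """Spatial sizes of backbone stages C2-C5 (and hence P2-P5, which
    match them) for a square input, given the per-stage strides."""
    return [image_size // s for s in strides]

def upsampling_aligns(sizes):
    """Check that nearest-neighbour x2 upsampling of each coarser level
    reproduces the size of the next finer level, which is what allows the
    top-down map to be added element-wise to the lateral 1x1 output."""
    return all(sizes[i + 1] * 2 == sizes[i] for i in range(len(sizes) - 1))
```

For a 224 × 224 input this gives levels of 56, 28, 14, and 7, each exactly twice the resolution of the next, so the ×2 nearest-neighbor upsampling and the element-wise addition line up at every lateral connection.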

7) PANET
The Path Aggregation Network (PANet) [66] is a method mainly developed for instance segmentation, which inserts an additional upward path aggregation network on top of FPN. It provides adaptive feature pooling that shortens the distance between the lowest and topmost feature levels by grouping the features of all feature levels and avoiding arbitrarily assigned outputs. PANet allows the network to decide which features are useful. It uses a complementary path to enhance the features of each proposal by providing accurate localization signals from the lower layers, generating a bottom-up augmentation. PANet obtained an accuracy of 41.4% on the MS-COCO dataset, compared to Mask R-CNN, which achieved only 36.4%. PANet uses ResNeXt-101 as a backbone.

8) TRIDENTNET
The TridentNet model [89] proposes an approach to deal with scale variations in object detection by generating in-network scale-specific feature maps with uniform representational power. The authors build a parallel multi-branch architecture and apply scale-aware training, where each branch shares the same transformation parameters but has a different receptive field. The model applies a fast inference method with only one major branch to boost performance without additional parameters or computations. TridentNet achieved an mAP of 48.4% on the MS-COCO dataset with ResNet-101 as a backbone.

10) COPY-PASTE
In [101], the authors applied the copy-paste data augmentation strategy and proved its effectiveness for object detection and instance segmentation. The copy-paste method chooses two images randomly and applies random scale jittering and a horizontal flip. It generates new data by pasting objects from one image onto another. In the final stage, they adjust the ground-truth bounding box annotations by removing fully occluded objects. The authors also provide a self-training copy-paste scheme, in which a supervised model trained on labeled data produces pseudo labels on unlabeled data. Combined with Cascade Eff-B7 NAS-FPN, they achieved an AP of 55.9% on the MS-COCO dataset. Table 4 lists a chronological comparison of the strengths and limitations of the two-step anchor-based detection methods mentioned earlier in this paper.

C. ONE-STAGE METHODS
One-stage anchor-based detectors are characterized primarily by their computational and runtime efficiency. These models directly classify and regress predefined anchor boxes instead of using regions of interest. The SSD was this category's first well-known object detector [73]. The major challenge encountered in this type of detector is the imbalance between positive and negative samples. Several approaches and mechanisms have been implemented to overcome this problem, such as anchor refinement and matching [102], training from scratch [103], [104], multi-layer context information fusion [105], [107], and feature enrichment and alignment [69], [108], [111]. Other works have been directed toward developing new loss functions [79], [112] and new architectures [113], [114].

1) YOLOv2
YOLOv2, or YOLO9000 [80], published in 2017, is an object detection model capable of detecting more than 9,000 object categories in real-time. It has many updated features that fix the problems of the first version. The main improvements in YOLOv2 compared to YOLOv1 [72] are the application of batch normalization over all the convolutional layers and, besides training on 224 × 224 images, the use of 448 × 448 images to fine-tune the classification network for ten epochs on ImageNet [115]. Using 416 × 416 images during training, it eliminates a pooling layer for better output resolution and removes all fully connected layers, replacing them with anchor boxes for predicting bounding boxes. The model achieved 69.2% mAP and 88% recall with the anchor boxes; without them, it achieved 69.5% mAP and 81% recall. Although the mAP is slightly reduced, the recall increases by a high margin. As in Faster R-CNN [76], the anchor box sizes and scales were pre-set. YOLO9000 relies on k-means clustering to obtain good IoU scores because the standard Euclidean-distance-based k-means often makes additional errors when dealing with larger boxes. With nine anchor boxes, Faster R-CNN's hand-picked anchors obtained an average IoU of 60.9%, whereas YOLO9000's IoU-based clustering achieved 67.2%. In YOLOv2, the location is constrained by a logistic activation, restricting the value to between 0 and 1, whereas YOLOv1 had no constraints on the location prediction. YOLOv2 predicts multiple bounding boxes per grid cell. To compute the loss for the true positives, only one of them should be responsible for the object; for this purpose, the one with the highest IoU (intersection over union) with the ground truth is selected. The YOLOv2 loss function has three parts: finding the bounding-box coordinates, the bounding-box score prediction, and the class-score prediction. All of them are mean-squared-error losses, modulated by a scalar meta-parameter or by the IoU score between the prediction and the ground truth.
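The dimension-clustering idea, k-means on box widths and heights with d = 1 − IoU as the distance, can be sketched as follows. A naive initialization is used for brevity, and the names are illustrative:

```python
def iou_wh(a, b):
    """IoU of two boxes given as (w, h), assuming they share a centre,
    as in YOLOv2's dimension clustering."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_iou(boxes, k, iters=20):
    """k-means on (w, h) pairs where each box joins the centroid it
    overlaps most (equivalently, the one with the smallest 1 - IoU)."""
    centroids = list(boxes[:k])               # naive initialisation
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda i: iou_wh(box, centroids[i]))
            clusters[best].append(box)
        for i, members in enumerate(clusters):
            if members:                       # move centroid to the cluster mean
                centroids[i] = (sum(b[0] for b in members) / len(members),
                                sum(b[1] for b in members) / len(members))
    return centroids
```

Because the distance is scale-invariant (a 10% size mismatch costs the same for small and large boxes), the resulting priors track the dataset's box shapes better than Euclidean k-means.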

2) YOLO v3
The YOLO [72] algorithm uses a softmax function to convert the scores into probabilities that sum to one. YOLOv3 [116] applies multi-label classification: the softmax layer is replaced with independent logistic classifiers that calculate the probability of the input belonging to a particular label. Rather than applying the mean squared error to compute the classification loss, YOLOv3 applies a binary cross-entropy loss for every label. In addition, it reduces the computational cost by bypassing the softmax function. It also provides further minor enhancements. It performs prediction at three scales, obtained by downsampling the input image dimensions by 32, 16, and 8, respectively. Darknet, in this version, has been extended to 53 convolutional layers. Detection at several layers helps solve the problem of small object detection, a common concern with YOLOv2. YOLOv3 uses a total of 9 anchor boxes, three per scale, and relies on k-means clustering to generate all nine anchors. The anchors are then sorted in descending order of size: the first scale is allocated the three largest anchors, the second the following three, and the third the last three. YOLOv3 predicts more bounding boxes than YOLOv2. For the same 416 × 416 image, YOLOv2 predicted 13 × 13 × 5 = 845 boxes (at every grid cell, 5 boxes were detected using 5 anchors), whereas YOLOv3 predicts boxes at three distinct scales, totaling 10,647 predicted boxes for an image of size 416 × 416; in other words, it predicts more than ten times the total predicted by YOLOv2. For each scale, every grid cell predicts three boxes using three anchors; since there are three scales, nine anchor boxes are used in total.
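The box-count arithmetic above can be verified directly:

```python
def yolo_box_count(grid_sizes, boxes_per_cell):
    """Total predicted boxes: boxes_per_cell anchors at every grid cell,
    summed over all detection scales."""
    return sum(s * s * boxes_per_cell for s in grid_sizes)
```

YOLOv2 on a 416 × 416 input uses a single 13 × 13 grid with 5 anchors, while YOLOv3 uses 13 × 13, 26 × 26, and 52 × 52 grids with 3 anchors each.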
The loss function of YOLOv3 is defined from three aspects: the bounding box position error, the bounding box confidence error, and the classification prediction error between the ground truth and the predicted boxes. YOLOv3 predicts an objectness score for each bounding box using logistic regression. The first aspect, the bounding box position error, is calculated by summing the squared differences between the predicted and true values of a bounding box's x, y, w, and h coordinates, multiplied by a lambda coefficient that controls its importance relative to the other losses. The second aspect is the bounding box confidence error, which measures how confident YOLOv3 is that there is an object in a given bounding box; this term uses binary cross-entropy loss to calculate how well the model predicts whether or not there is an object in a given cell. Finally, the classification prediction error measures how well YOLOv3 predicts an object's class, using binary cross-entropy loss for each label.

3) SSD
Single Shot MultiBox Detector (SSD) [73] is an object detection framework published after R-CNN and YOLO. It was developed by W. Liu et al. to predict bounding boxes and class probabilities in a single pass using an end-to-end CNN architecture, and it is typically faster than Faster R-CNN. SSD detects several objects in the image in one shot instead of the two shots required by the region-proposal-network methods listed in the previous section. As a consequence, SSD saves considerable time compared to region-based approaches. An image is introduced as input to a VGG-16 [58] network to extract feature maps. Several convolutional layers are added with different output sizes (19 × 19, 10 × 10, 5 × 5, 3 × 3, and 1 × 1). These, together with the 38 × 38 feature map produced by conv4_3 of VGG, are the feature maps that 3 × 3 convolution filters will process at each cell to make predictions. There are k bounding boxes for each location in the feature maps, and these k boxes have various sizes and aspect ratios. For each bounding box, the C class scores and four offsets relative to the default box's original shape are computed. Each box thus has four parameters and a probability vector corresponding to the confidence given to each object class. SSD uses negative sampling to identify poor predictions and, like YOLO, applies the non-maximal suppression technique at the end of the model to keep the most appropriate boxes. Afterward, the hard-negative mining (HNM) method is applied to ensure faster and more stable training: the negative examples are sorted according to the highest confidence value assigned to each default box, and the top ones are selected so that the negative-to-positive ratio is at most 3:1.
The SSD loss function combines localization and confidence losses. The localization loss is the mismatch between the ground-truth box and the predicted boundary box; SSD only penalizes predictions from positive matches, and negative matches are ignored. The confidence loss is a softmax loss over multiple class confidences (c). During training, the choice of default boxes and scales for detection is essential. SSD uses smooth L1 loss as its regression loss function, a particular case of the Huber loss with δ = 1. Smooth L1 combines the L1 and L2 losses: when |a| is less than or equal to 1, it behaves like an L2 loss, and otherwise like an L1 loss. One-hot encoding turns the label y into a probability distribution.
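The smooth L1 term can be written out explicitly (the δ = 1 case used by SSD):

```python
def smooth_l1(a):
    """Smooth L1 loss (Huber loss with delta = 1) on a single residual:
    quadratic for |a| <= 1 (L2-like), linear beyond (L1-like)."""
    return 0.5 * a * a if abs(a) <= 1.0 else abs(a) - 0.5
```

The quadratic region gives small, stable gradients near zero error, while the linear region caps the gradient at 1 for large errors, making the regression less sensitive to outlier boxes than a pure L2 loss.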

4) RETINANET
RetinaNet [79] is a single-stage object detector, such as SSD and YOLO, that offers almost the same performance as two-stage detectors such as Faster R-CNN. The paper's significant contribution is a new classification loss function called the focal loss, which has significantly increased accuracy. RetinaNet is a single, composite network consisting of a backbone, a Feature Pyramid Network built on ResNet (ResNet-50 or ResNet-101), and two task-specific sub-networks. The backbone network calculates the convolutional feature maps for the entire input image. The first subnetwork classifies the backbone's output, while the second performs bounding box regression on it. Because of its fully convolutional structure, RetinaNet can take an image of arbitrary size and generates feature maps of proportional sizes at several levels of the feature pyramid. In the classification sub-network, a fully convolutional network is attached to each level of the FPN. For each of the A anchors and K object classes, it predicts the probability that an object is present at each spatial position. There are four 3 × 3 convolution layers with 256 filters, each followed by ReLU activation [117]. A further 3 × 3 convolutional layer with K × A filters is applied, followed by sigmoid activation at the outputs. The focal loss is applied as the loss function. The subnetwork's parameters are shared across all levels. As a result, the output feature map has the dimensions (W, H, K × A), where W and H are the feature map width and height and K × A combines the numbers of object classes and anchor boxes. The regression subnetwork is attached to each FPN feature map in parallel to the classification subnetwork.
The regression subnetwork is designed the same way as the classification subnetwork; the only differences are that the parameters are not shared and that the last 3 × 3 convolution layer uses 4 filters per anchor. Therefore, the output feature map has the shape (W, H, 4A).
RetinaNet utilizes a focal loss function to address class imbalance during training. The focal loss adds a modulating factor (1 − pt)^γ to the standard cross-entropy loss, which down-weights the loss assigned to well-classified examples so that training focuses on the hard, misclassified ones.
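The focal loss for a single binary prediction can be sketched as follows, using the α-balanced form with the paper's defaults α = 0.25 and γ = 2 (the function name is illustrative):

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss for one prediction: p is the predicted probability of
    the positive class, y the 0/1 label. The (1 - p_t)^gamma factor
    shrinks the loss of easy, well-classified examples."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With γ = 0 and α = 0.5 (up to scale) this reduces to ordinary cross-entropy; increasing γ suppresses the contribution of the abundant easy negatives that dominate dense detection.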

5) MEGDET
MegDet [118] is a model that tackles the object detection task from the perspective of batch size. The authors propose a large mini-batch size of 256, instead of 16, during training.
They use cross-GPU batch normalization with 128 GPUs and a warmup learning rate policy to train the whole network in a suitable time. MegDet achieved an mAP of 50.6% on the MS-COCO dataset using ResNet-50 as a backbone and the OHEM technique, and the model training was finished in four hours. The MegDet paper does not describe a specific loss function by name; however, it mentions that the shape of the regression loss function (the parameters of the smooth L1 loss) is adjusted in MegDet.

6) EFFICIENTDET
EfficientDet [67] is an object detection model that relies on pretrained EfficientNet [60] backbones, a weighted bidirectional feature network, and a customized compound scaling technique. The bidirectional feature network takes the level-3 to level-7 features from the EfficientNet backbone and applies top-down and bottom-up bidirectional feature fusion. The class and box network weights are shared across all feature levels. EfficientDet-D7 achieved an AP of 52.2% on the MS-COCO dataset using EfficientNet-B7 as a backbone. EfficientDet uses a focal loss function for dense object detection. The EfficientDet paper also mentions that the detection head and loss function can be replaced with a segmentation head and loss function to perform segmentation tasks.

9) YOLOv7
YOLOv7 [233] is a faster and more accurate real-time object detector for computer vision tasks. Like Scaled-YOLOv4 [199], YOLOv7 does not use ImageNet pre-trained backbones. YOLOv7 weights are trained from scratch on Microsoft's COCO dataset, with no other datasets or pre-trained weights used. The official paper demonstrates how the improved architecture surpasses all previous YOLO versions, and all other object detection models, in terms of speed and accuracy. YOLOv7 improves speed and accuracy by introducing several architectural reforms. The larger models in the YOLOv7 family, YOLOv7-X, YOLOv7-E6, YOLOv7-D6, and YOLOv7-E6E, were obtained by applying the proposed compound scaling method to scale up the depth and width of the entire model. Table 5 lists a chronological comparison of the strengths and limitations of the one-stage anchor-based detection methods mentioned earlier in this paper.

VII. ANCHOR-FREE DETECTORS
A. YOLOv1
YOLO [72] has a different approach to object detection. It processes the complete image in a single pass and predicts both the coordinates of the bounding boxes (regression) and the class probabilities with only one network in one evaluation; hence its name, YOLO: you only look once. This design makes real-time prediction possible. The input image is split into an S × S grid of cells to perform detection. A single grid cell, the one in which the object's center falls, is responsible for predicting each object in the image. Each cell predicts B potential bounding boxes together with class probability values for C classes, giving a total of S × S × B boxes. Since most of these boxes have relatively small probabilities, the algorithm excludes those that fall below a minimum probability threshold. A non-maximal suppression procedure is applied to the remaining boxes, removing possible multiple detections and keeping the most accurate ones. The CNN is based on the GoogLeNet [120] model; reduction layers of 1 × 1 filters followed by 3 × 3 convolutional layers replace GoogLeNet's inception modules. The network architecture includes 24 convolutional layers and two fully connected layers. The final layer outputs a tensor of S × S × (C + B × 5) values, corresponding to the predictions of each grid cell: C is the number of class probabilities, and B is the number of boxes per cell, each with four coordinates and a confidence value. YOLO has three loss terms, one for the objectness score and two others for the coordinate and classification errors; the latter is calculated when the objectness score is greater than 0.5. The YOLOv1 loss function is thus divided into three parts: one responsible for finding the bounding-box coordinates, the bounding-box score prediction, and the class prediction.
The final loss function is a sum of these three parts.
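The size of the final prediction tensor follows directly from the grid arithmetic; with the PASCAL VOC settings S = 7, B = 2, and C = 20, it can be checked as follows:

```python
def yolov1_output_size(S=7, B=2, C=20):
    """Number of values in YOLOv1's final tensor: each of the S x S grid
    cells predicts B boxes (4 coordinates + 1 confidence each) plus C
    class probabilities."""
    return S * S * (C + B * 5)
```

Each grid cell thus emits a 30-dimensional vector in the VOC configuration, and the whole output is a 7 × 7 × 30 tensor.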

B. CORNERNET
CornerNet [74] is an object detection model that uses keypoints to detect the object bounding box. It uses a convolutional neural network to detect objects as paired keypoints: the top-left and bottom-right corners. These corners are represented as heatmaps, one for the top-left corners and the other for the bottom-right corners. Each corner has only one ground-truth positive location, while all remaining locations are treated as negative. This technique frees the model from the traditional anchors employed in other object detectors. The authors also propose a new type of pooling layer named corner pooling that aims to localize corners efficiently. CornerNet uses the Hourglass-104 backbone and achieved an accuracy of 40.5% on the MS-COCO dataset, and 42.1% with multi-scale testing on the same dataset. CornerNet uses associative embedding, where the network predicts similar embeddings for corners belonging to the same object, with a grouping loss similar to the triplet loss. In addition, it proposes a variant of the focal loss that reduces the penalty given to negative locations lying close to a ground-truth corner.
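Top-left corner pooling can be sketched as follows (a single-channel, pure-Python version with illustrative names; the bottom-right variant looks left and up instead of right and down):

```python
def top_left_corner_pool(fmap):
    """Top-left corner pooling: each output cell is the max over all
    values to its right in the same row plus the max over all values
    below it in the same column, letting a corner location 'see' the
    object extent that lies to its lower right."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            right_max = max(fmap[r][c:])                       # scan right
            below_max = max(fmap[rr][c] for rr in range(r, h))  # scan down
            row.append(right_max + below_max)
        out.append(row)
    return out
```

The intuition is that a top-left corner usually has no local object evidence at its own position, so the pooling aggregates evidence from the directions where the object actually lies.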

C. EXTREMENET
ExtremeNet [121] uses a bottom-up approach to detect objects. It uses a standard keypoint estimation network to identify the object's center point and its four extreme points: top-most, left-most, bottom-most, and right-most. These four extreme points are grouped into the object bounding box in a purely geometric way. The model uses Hourglass-104 as a backbone and obtained accuracies of 40.2% and 43.7% on the MS-COCO dataset for single-scale and multi-scale testing, respectively. The ExtremeNet paper does not describe a specific loss function by name.

D. REPPOINTS
RepPoints [78] stands for representative points, a technique that represents objects as a set of sample points. Since traditional bounding boxes provide only coarse localization and feature extraction, RepPoints uses points to localize and identify objects. The RepPoints technique does not use anchors to sample the space of bounding boxes. Instead, it learns automatically from the ground-truth localization and recognition targets by bounding the spatial extent of an object and identifying semantically relevant local areas. The authors' proposed object detection model, RPDet [78], is based on the RepPoints representation combined with deformable convolution. RPDet used ResNet-101-DCN as a backbone and obtained accuracies of 42.8% and 46.5% with multi-scale training and multi-scale testing, respectively. The RepPoints paper describes two sets of RepPoints: one driven by the points distance loss alone and the other by a combination of the points distance loss and the center-ness loss.

E. FSAF
The authors propose a Feature Selective Anchor-Free (FSAF) module [122] to solve two problems faced in anchor-based single-shot detectors with feature pyramids: heuristic-guided feature selection and overlap-based anchor sampling. The FSAF module applies online feature selection while training the multi-level anchor-free branches, improving baselines with tiny inference overhead. Each instance is linked to the most appropriate feature level to optimize the network. The model encodes those instances following an anchor-free approach to learn the parameters for classification and regression. The authors experiment with applying the FSAF module alongside anchor-based branches, such as the RetinaNet baseline, and obtain excellent results with negligible inference overhead. The proposed model achieved an accuracy of 44.6% on the MS-COCO dataset. The FSAF paper does not describe a specific loss function by name; however, the FSAF module uses focal loss for non-ignoring regions and a 4-channel feature map for the bounding box regression subnet.

F. FCOS
In addition to being an anchor-free detector, the Fully Convolutional One-Stage detector (FCOS) [75] is also a proposal-free detector. Similar to semantic segmentation, FCOS relies on per-pixel prediction to detect objects, avoiding all the hyperparameters and overlap computations that anchor boxes introduce during training. FCOS uses non-maximum suppression (NMS) for post-processing and filtering the bounding boxes, which improves accuracy. FCOS achieves an accuracy of 44.7% on MS-COCO using ResNeXt-64×4d-101-FPN as a backbone. The authors also use FCOS as a region proposal network in two-stage object detectors, such as Faster R-CNN. The loss function used in FCOS combines three losses: focal loss for classification, IoU loss for regression, and center-ness loss. The focal loss addresses the class imbalance problem by down-weighting the easy examples so that training focuses on the hard ones. The IoU loss measures the overlap between predicted bounding boxes and ground-truth boxes. The center-ness loss encourages the network to predict more accurate bounding boxes by penalizing predictions far from the objects' centers.
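The center-ness target can be written directly from the FCOS definition, given a location's distances l, t, r, b to the left, top, right, and bottom sides of its ground-truth box:

```python
def centerness(l, t, r, b):
    """FCOS centre-ness target: sqrt((min(l,r)/max(l,r)) *
    (min(t,b)/max(t,b))). It equals 1.0 at the box centre and decays
    towards 0 for locations near the box edges."""
    return ((min(l, r) / max(l, r)) * (min(t, b) / max(t, b))) ** 0.5
```

At test time this score multiplies the classification score, so low-quality boxes predicted by off-center locations are ranked down before NMS.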

G. ATSS
Adaptive Training Sample Selection (ATSS) [123] is a method developed to deal with the gap between center-based anchor-free and one-stage anchor-based detectors, which stems from how they define the positive and negative training samples. The proposed model automatically defines the positive and negative training samples based on the objects' statistical characteristics. Both positive and negative samples are used for classification, while only the positive ones are used for regression. The Adaptive Training Sample Selection technique has almost no hyperparameters compared to previous techniques. The authors also show that, with an appropriate sample selection strategy, tiling multiple anchors per location is not crucial for object detection. ATSS used ResNets as backbones and obtained its highest accuracies with ResNeXt-64×4d-101-DCN: 47.7%, and 50.7% with multi-scale testing. ATSS automatically selects positive and negative samples using statistical characteristics (the mean and standard deviation of candidate IoUs) to calculate dynamic thresholds. However, the ATSS paper does not describe a specific loss function by name.
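The statistical selection rule can be sketched as follows. This is a simplified version showing only the mean-plus-standard-deviation threshold; the candidate pre-filtering by center distance and the inside-box check used by ATSS are omitted, and the names are illustrative:

```python
def atss_threshold(ious):
    """ATSS-style dynamic threshold for one ground-truth box: the mean
    plus the standard deviation of the IoUs between that box and its
    candidate anchors."""
    n = len(ious)
    mean = sum(ious) / n
    std = (sum((x - mean) ** 2 for x in ious) / n) ** 0.5
    return mean + std

def select_positives(ious, candidates):
    """Keep the candidates whose IoU reaches the dynamic threshold."""
    thr = atss_threshold(ious)
    return [c for c, iou in zip(candidates, ious) if iou >= thr]
```

A high mean indicates many good candidates (so the bar rises), while a high standard deviation indicates one feature level clearly fits the object better than the others, which is exactly the case where only the best candidates should become positives.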

H. OTA
The authors of Optimal Transport Assignment (OTA) [124] formulate label assignment using optimization theory. The model treats the label-assigning stage in object detection as an Optimal Transport problem by defining the cost of transporting labels between each anchor and ground-truth pair. Labels are transported efficiently from ground-truth objects and the background to the anchors using the Sinkhorn-Knopp iteration [125]. Based on the Intersection-over-Union values between the predicted bounding boxes and each ground truth, the authors provide a simple estimation strategy to decide how many positive labels each ground truth supplies. Unlike previous one-stage detectors, which resolve ambiguous anchors with hand-crafted rules, OTA handles their assignment automatically through the global optimization. OTA achieves excellent results on the MS-COCO dataset, with an accuracy of 49%, and 51.5% with multi-scale testing. The authors tested their method with several backbones and obtained the best results using ResNeXt-64 × 4d-101-DCN. OTA is thus a label-assigning procedure rather than a new loss function: it decides which labels from ground-truth objects are assigned to which anchor boxes.
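The Sinkhorn-Knopp iteration that OTA relies on can be sketched as entropy-regularized optimal transport. The tiny cost matrix, the supply/demand budgets, the regularization value, and the iteration count below are illustrative assumptions, and the background "supplier" of negative labels is omitted for brevity.

```python
import numpy as np

def sinkhorn(cost, supply, demand, eps=0.1, n_iter=50):
    """Entropy-regularised optimal transport via Sinkhorn-Knopp.
    cost: (m, n) matrix between suppliers (ground truths) and
    demanders (anchors); supply/demand are the marginal label budgets."""
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones(len(supply))
    v = np.ones(len(demand))
    for _ in range(n_iter):          # alternate marginal scaling
        u = supply / (K @ v)
        v = demand / (K.T @ u)
    return np.diag(u) @ K @ np.diag(v)  # transport plan

# Two ground truths, each supplying 2 positive labels to 4 anchors.
cost = np.array([[0.1, 0.9, 0.8, 0.7],
                 [0.9, 0.2, 0.3, 0.8]])
plan = sinkhorn(cost, supply=np.array([2.0, 2.0]), demand=np.ones(4))
print(plan.round(2))  # mass concentrates on the low-cost pairs
```

Each anchor is then assigned to the ground truth that transports the most label mass to it, which is how ambiguous anchors are resolved globally rather than by hand-crafted rules.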

I. DSLA
DSLA [126] stands for Dynamic Smooth Label Assignment, a recent anchor-free detector published to solve inconsistency problems in previous detectors. It smooths the transition between positive and negative samples by improving the center-ness representation suggested in FCOS and providing an interval relaxation strategy. The Intersection-over-Union, predicted dynamically during training, is used as a smooth label with a value between 0 and 1 to supervise the classification branch, which is merged with the quality-estimation branch, resulting in a simpler anchor-free model with good localization quality. The authors tested the DSLA model with several backbones, such as ResNet-50, ResNet-50-DCN, ResNeXt-101-64 × 4d-DCN, and Swin-S. With the Swin-S backbone, they achieved a remarkable 49.2% on the MS-COCO dataset. By assigning labels adaptively, DSLA selects more positive samples with high-quality predicted boxes, improving detection performance and lowering the bounding-box loss on those positives.
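The smooth-label idea can be illustrated with a toy sketch: inside a relaxation interval, the IoU itself serves as the soft classification target instead of a hard 0/1 label. The interval bounds `low` and `high` are hypothetical values for illustration, not taken from the paper.

```python
def smooth_label(iou, low=0.1, high=0.9):
    """Toy interval-relaxation sketch (thresholds are hypothetical):
    IoUs below `low` map to a hard negative (0.0), above `high` to a
    hard positive (1.0), and values in between keep the IoU itself
    as a smooth classification target in [0, 1]."""
    if iou <= low:
        return 0.0
    if iou >= high:
        return 1.0
    return iou

for iou in (0.05, 0.5, 0.95):
    print(iou, "->", smooth_label(iou))
```

Supervising the classification branch with such a continuous target keeps the classification score consistent with localization quality, which is the inconsistency DSLA targets.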

J. YOLOv8
YOLOv8 is a state-of-the-art object detection, image classification, and instance segmentation model developed by Ultralytics. It is designed to be fast, accurate, and easy to use. YOLOv8 builds upon the success of previous YOLO versions and introduces new features and improvements to further boost performance and flexibility. It can be trained on large datasets and runs on various hardware platforms, from CPUs to GPUs. One key feature of YOLOv8 is its extensibility: the framework supports all previous versions of YOLO, making it easy to switch between versions and compare their performance. This makes YOLOv8 an ideal choice for users who want to take advantage of the latest YOLO technology while still being able to use their existing YOLO models. YOLOv8 also includes numerous architectural and developer-convenience features, making it an appealing choice for a wide range of object detection and image segmentation tasks. The architecture of YOLOv8 evolved from a simple version to a more complex one, with new convolutional layers and a new detection head; compared to YOLOv5, it replaces the C3 module with the C2f module. Table 6 lists a chronological comparison of the strengths and limitations of the anchor-free object detection methods mentioned earlier in this paper.

VIII. TRANSFORMER-BASED DETECTORS
A. VIT
ViT, published in [127] and inspired by the success of transformers in NLP tasks [128], [129], was the first model to apply a pure transformer directly to images instead of combining convolutional neural networks with transformers. ViT splits the image into patches and provides the sequence of linear embeddings of these patches as input to a Transformer, processing patches the way word tokens are processed in Natural Language Processing. The patches are flattened and mapped with a trainable linear projection to a constant latent vector dimension that is kept through all the transformer layers. For classification, the authors use an MLP head with one hidden layer at pre-training time and a single linear layer at fine-tuning time. When first published, ViT achieved its highest performance when pre-trained on larger datasets. The base ViT model outputs raw hidden states without any task-specific head on top; it serves as a building block for various computer vision tasks, such as image classification, rather than defining a detection-specific loss.
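The patch-sequence construction ViT starts from can be sketched with plain array reshapes; the trainable linear projection, the class token, and the position embeddings are omitted here.

```python
import numpy as np

def patchify(img, patch):
    """Split an (H, W, C) image into a sequence of flattened
    non-overlapping patches of shape (patch*patch*C,), i.e. the
    tokens a ViT feeds to its trainable linear projection."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0
    img = img.reshape(h // patch, patch, w // patch, patch, c)
    img = img.transpose(0, 2, 1, 3, 4)  # group the two grid axes first
    return img.reshape(-1, patch * patch * c)

# The standard ViT setting: 224x224 RGB image, 16x16 patches.
tokens = patchify(np.zeros((224, 224, 3)), patch=16)
print(tokens.shape)  # (196, 768): 14*14 tokens of dimension 16*16*3
```

Each of the 196 rows then plays the role of one "word" in the transformer's input sequence.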

B. DETR
The DEtection TRansformer (DETR), presented in [130], is the first end-to-end object detection model based on transformers. It consists of a pretrained CNN backbone and a transformer. A ResNet backbone generates the lower-dimensional features, which are formatted into a single set, added to a positional encoding, and fed into the transformer, yielding an end-to-end trainable detector. The transformer is based on the original architecture [131]: it consists of an encoder and a decoder and removes hand-crafted modules such as anchor generation. The encoder takes the image features and position encodings as input and directs its result to the decoder, which processes a fixed number of learned object queries; multi-head attention in the decoder combines these queries with the encoder embeddings, and the results are passed through multi-layer perceptron prediction heads, each producing a class and a bounding box, all computed in parallel. DETR approaches object detection as a direct set prediction problem: a set-based global loss, the sum of a classification loss and a bounding-box regression loss, forces unique predictions by finding the optimal one-to-one bipartite matching between the detector's outputs and the padded ground truth.
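The bipartite matching step can be illustrated with a brute-force minimum-cost matching on a toy cost matrix. DETR itself solves the same problem with the Hungarian algorithm, which scales to its 100 queries, and its real matching cost combines class-probability and box terms; the numbers below are purely illustrative.

```python
from itertools import permutations

def bipartite_match(cost):
    """Exhaustive minimum-cost one-to-one matching between N predictions
    and N (padded) ground truths. Only viable for tiny N; DETR uses the
    Hungarian algorithm for the same optimum in polynomial time."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return list(best)

# cost[i][j]: matching cost of prediction i against ground truth j.
cost = [[0.1, 0.9, 0.8],
        [0.7, 0.2, 0.9],
        [0.6, 0.8, 0.3]]
print(bipartite_match(cost))  # [0, 1, 2]
```

Because each ground truth is matched to exactly one prediction, duplicates receive no positive supervision, which is why DETR needs no NMS.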

C. SMCA
The SMCA model [132], published in 2021, was proposed to improve the convergence of DETR, which needs about 500 epochs of training from scratch to achieve its best results. SMCA introduces a mechanism called Spatially Modulated Co-Attention that only replaces the co-attention mechanism in the DETR decoder with location-aware co-attention. This new mechanism constrains co-attention responses to be high near the initially estimated bounding-box locations. Training SMCA takes only 108 epochs, achieves better results than the original DETR, and demonstrates the potential of processing global information.

D. SWIN
The Swin Transformer [133] seeks to provide a transformer-based backbone for computer vision tasks. The name Swin stands for Shifted window: the model was the first to bring the shifted-window scheme, which echoes the locality of CNNs, into transformers. Like ViT, it splits the input image into multiple non-overlapping patches and converts them into embeddings. Swin Transformer blocks are then applied to the patches in four stages, each successive stage reducing the number of patches to build a hierarchical representation, in contrast to ViT, which keeps patches at one size. The patches are linearly projected into C-dimensional vectors. Self-attention is computed only within local windows, as each transformer block comprises local multi-headed self-attention modules operating on patch windows that alternate between regular and shifted positions in successive blocks. Local self-attention makes the computational complexity linear in image size, while the shifted window enables cross-window connections; at each block, the attention window is shifted relative to the previous layer. Swin uses comparatively more parameters than convolutional models.
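The non-overlapping window partition and the cyclic shift can be sketched with array reshapes. This sketch ignores the attention masks Swin applies to the wrapped-around regions after the cyclic shift, and the window size and feature map below are illustrative.

```python
import numpy as np

def window_partition(x, win):
    """Partition an (H, W, C) feature map into non-overlapping
    win x win windows; self-attention is computed inside each window."""
    h, w, c = x.shape
    x = x.reshape(h // win, win, w // win, win, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, c)

def cyclic_shift(x, win):
    """Cyclically shift the map by win//2 so that the next block's
    windows straddle the previous window boundaries, giving the
    cross-window connection described above."""
    return np.roll(x, shift=(-(win // 2), -(win // 2)), axis=(0, 1))

feat = np.arange(8 * 8 * 1).reshape(8, 8, 1).astype(float)
print(window_partition(feat, 4).shape)                # (4, 16, 1)
print(window_partition(cyclic_shift(feat, 4), 4).shape)  # same layout
```

Attention cost then grows with the fixed window area rather than with the whole image, which is what makes the overall complexity linear in image size.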

E. ANCHOR DETR
In [134], the authors propose an end-to-end transformer-based object detection model with a novel query design based on anchor points. This design addresses the lack of explicit physical meaning of learned object queries, which makes the optimization process difficult. Anchor points were used before in CNN-based detectors, and applying this mechanism lets each object query focus on the objects near its anchor point. The Anchor DETR model can also predict multiple objects at one position. To reduce complexity, the authors use an attention variant, Row-Column Decoupled Attention, which lowers the memory cost without sacrificing accuracy. The primary model uses ResNet-101 as the backbone with a DC5 feature and achieves an accuracy of 45.1% on MS-COCO with considerably fewer training epochs than DETR. The authors also proposed anchor-free, RAM-free, and NMS-free variants.

F. DESTR
DESTR [135], published recently, proposes solutions to problems of previous transformer detectors, such as the cross- and self-attention mechanisms and the initialization of the decoder's content queries. The authors propose a new Detection Split Transformer that divides the content embedding estimation of cross-attention into two independent parts, one for classification and the other for box-regression embedding, letting each cross-attention branch deal with its specific task. For content-query initialization, they use a mini-detector, equipped with heads for classification and regression embeddings, to learn the content and initialize the positional embeddings of the decoder. Finally, to account for pairs of adjacent object queries in the decoder, they augment self-attention with the spatial context of the other query in the pair. Table 7 lists a chronological comparison of the strengths and limitations of the transformer-based detection methods mentioned earlier in this paper.

IX. PERFORMANCE ANALYSIS AND DISCUSSION
This section tests and compares all the object detection models on the three benchmark datasets of the object detection field: Pascal VOC 2007, Pascal VOC 2012, and MS-COCO. The column ''data'' in the following tables refers to the training data.

A. PASCAL VOC 2007
The results of the tests are listed in Table 8.

B. PASCAL VOC 2012
The results of the tests are listed in Table 9.

C. MS-COCO
The results of the tests are listed in Table 10.

D. TESTING CONSUMPTION
All the frameworks listed below were tested on an Nvidia Titan X GPU (Maxwell architecture), facilitating speed comparison with earlier experiments that used the same GPU.

1) PASCAL VOC07
The results of the tests are presented in Table 11.

2) MS-COCO
The results of the tests are presented in Table 12.

E. DISCUSSION
As we can observe in this survey, most of the tests were performed on the MS-COCO dataset: its large size and rich annotations allow models to be evaluated on a wide range of images and give a clear picture of how they generalize. Table 8 shows that all the models achieving the highest mAP on Pascal VOC 2007 are anchor-based detectors. The leading five models all belong to the two-stage approach, except for ScratchDet++, which follows the one-stage approach. Copy-Paste achieved an mAP of 88.6% by combining EfficientNet-B7 and NAS-FPN as its backbone, and reached an mAP of 89.3% when pre-trained on MS-COCO, highlighting the importance of copy-and-paste data augmentation. SNIPER, ScratchDet, CoupleNet, and Faster R-CNN achieved mAPs of 86.9%, 86.3%, 85.7%, and 85.6%, respectively. Except for the Copy-Paste model, which uses EfficientNet-B7 with NAS-FPN as its backbone, all the other leading models use ResNets, Root-ResNets, or VGGNets, confirming the strong performance of these networks.
From Table 9, we remark that anchor-based detection methods scored the best mAPs on Pascal VOC 2012. We also notice that the one-stage anchor-based detectors now surpass the two-stage anchor-based detectors, the opposite of the earlier situation. RefineDet512++ achieved the best mAP of 86.8% with pre-training on the MS-COCO dataset using VGGNet-16. In contrast, the highest mAP without pre-training on MS-COCO belongs to RetinaNet500 using AP-Loss and ResNet-101 as the backbone, with an mAP of 84.5% when applying multi-scale testing. ScratchDet300+, FSSD512, and BlitzNet obtained mAPs of 86.3%, 84.2%, and 83.8%, respectively. Similar to the Pascal VOC 2007 results, the main backbones achieving the best results were VGG networks, residual networks, and Root-ResNets.
For the models tested on the MS-COCO dataset, we can notice the intense competition between the different approaches: the first four positions belong to four different object detection families. So far, the transformer-based Swin V2-G model with the HTC++ backbone is the winner, with an mAP of 63.1%. Ranking second is Copy-Paste, from the anchor-based family, with an mAP of 56.0%, using a combination of Cascade Eff-B7 and NAS-FPN. In third place is YOLOv4-P7, an anchor-free detector with an mAP of 55.5% using the CSP-P7 network as its backbone. In fourth place is EfficientDet-D7x, a one-stage anchor-based detector that achieved an mAP of 55.1% with the EfficientNet-B7 backbone. On MS-COCO, the backbones behind mAPs greater than 50.0% are ResNets, ResNeXts, EfficientNets, SpineNet, CSP, and HTC++. Table 11 shows that, for real-time deployment, all the fast object detection algorithms belong to the one-stage anchor-based family. However, achieving high accuracy at many frames per second is difficult: Fast YOLO reaches 155 FPS while obtaining only 55.7% mAP. Some models manage a balance, such as EFIPNet, which achieved an mAP of 80.4% at an impressive 111 FPS using VGGNet-16 as its backbone, and RefineDet320, which achieved an mAP of 80.0% at 40 FPS with a VGGNet backbone.
According to Table 12, all the fast object detection models belong to the anchor-based one-stage family. In addition, some models have successfully balanced detection accuracy and runtime speed: YOLOv4, which uses CSPDarknet-53, achieved an mAP of 41.2% at 54 FPS, and EfficientDet-D2, which uses the EfficientNet-B2 backbone, achieved an mAP of 43.0% at 41.7 FPS. Furthermore, no two-stage object detector model has performed well in real time (FPS > 30); RDSNet reaches 17 FPS with an mAP of 36.0%. In comparison, anchor-free detectors such as CornerNet and ATSS attain only 4.4 and 7 FPS, respectively. Therefore, we conclude that the anchor-based one-stage detectors are still the fastest. Figure 5 shows the evolution of accuracy on the three datasets VOC07, VOC12, and MS-COCO between 2013 and 2022, together with the winning detection model for each year on each dataset. For VOC07 and VOC12, accuracy is presented as mAP, while for MS-COCO it is mAP[.5,.95]. The chart shows that accuracy on VOC07 evolved from 58.5% in 2013 with the R-CNN BB model to 89.3% in 2021 with the Copy-Paste model, an increase of more than 30%; VOC12 shows a similar increase of over 33% during the same period. On MS-COCO, accuracy improved by 40% between 2015, with a value of 23.6% from ION, and 2022, with 63.1% from the SwinV2-G model; we also note that accuracy on MS-COCO improved every year. In contrast, on VOC12 the accuracy has stayed the same since 2017, remaining at the 86.8% realized by RefineDet; likewise, on VOC07 the accuracy has increased by only 2.4% since 2018, with the introduction of Copy-Paste. Figure 6 shows the evolution of the different types of object detection models on the MS-COCO dataset between 2015 and 2022.
It can be seen that anchor-based two-stage models were the first to be evaluated on MS-COCO in 2015, followed by anchor-based one-stage models in 2016, anchor-free models in 2017, and transformer-based models in 2020. So far, the most successful family is the transformer-based with SwinV2-G, followed by the anchor-based two-stage with SoftTeacher, then the anchor-based one-stage with DyHead, and finally the anchor-free one-stage detectors with YOLOv4-P7. We note a gap of more than 7% between the best transformer-based detector, SwinV2-G, and the best anchor-free detector, YOLOv4-P7.
The anchor-based two-stage family improved by 26%, starting from ION with an accuracy of 33.1% in 2015 and reaching 59.1% in 2021 with the SoftTeacher model. Among the anchor-based one-stage detectors, SSD achieved an accuracy of 28.8% in 2016, and in 2021 DyHead achieved 58.7%, representing an enhancement of 30%. DetNet101, a model of the anchor-free detector family, reached an accuracy of 33.8% in 2017, and in 2021 YOLOv4-P7 increased the accuracy by more than 21%, reaching 55.5%. The most recently published transformer-based detectors achieved the best results, with SwinV2-G reaching 63.1% in 2022, while the first pure transformer-based model, DETR, achieved only 44.9% in 2020. Figure 7 illustrates the number of detection models evaluated on MS-COCO by each detector family between 2015 and 2022. We find that 2018 was the most productive year, with more than 30 published models, half of them anchor-based two-stage and the other half anchor-based one-stage, with only one anchor-free model published. We also notice that anchor-based two-stage methods dominated the literature between 2015 and 2018, with more than 36 published models, whereas between 2018 and 2020 more than 36 anchor-based one-stage models were published. Anchor-based models grew from 2015 to 2018; after 2018, they started losing ground to other detection families, such as anchor-free and transformer-based detectors. For example, more than 15 different anchor-based two-stage models were introduced in 2018, while just one year later only five were released, and in 2020 only two, against more than six anchor-free detectors in the same year. Transformer-based detectors have expanded continuously since their appearance in 2020.
Figure 8 shows that about half of the deep learning-based detection models evaluated on the MS-COCO dataset were introduced between 2018 and 2019. After 2019, the number of published models decreased yearly, to 14% in 2020, 11.6% in 2021, and 3.3% in 2022.

X. CONCLUSION AND FUTURE DIRECTIONS
In this paper, we presented an overview of the current state of object detection based on deep learning, providing a detailed survey covering dozens of object detection models. We divided the models into four main approaches: two-stage anchor-based detectors, one-stage anchor-based detectors, anchor-free detectors, and transformer-based detectors. We compared and evaluated all the models on the major object detection benchmarks, Pascal VOC and MS-COCO. We determined that single-stage detectors have improved and now rival the accuracy of two-stage detectors. Furthermore, with the emergence of transformers in vision tasks, transformer-based detectors have achieved peak results, such as Swin-L and Swin V2, which reached mAPs of 57.7% and 63.1%, respectively, on the MS-COCO dataset.
Object detection is an active area of research that is constantly evolving, and there are several promising future directions that researchers are exploring.
1) Speed-accuracy trade-off: Increasing the accuracy of an object detection algorithm requires more computational resources and longer processing times, while decreasing the accuracy can lead to faster processing at the cost of detection performance. Therefore, researchers consistently aim to improve both the accuracy and the speed of object detection algorithms through more efficient architectures and training methods, enabling real-time and low-power applications, especially in complex scenes with occlusions or cluttered backgrounds.
2) Tiny object detection: Tiny object detection is a specific case of object detection focusing on detecting and localizing very small objects in images or videos. It remains challenging because extracting information from objects covering only a few pixels is difficult; such objects may be barely visible or partially occluded by other objects in the scene. Tiny object detection has many potential applications, such as detecting small animals in wildlife monitoring, identifying minor defects in manufacturing processes, and medical imaging.
3) 3D object detection: With the increasing availability of 3D sensors, there is a growing interest in 3D object detection. Unlike 2D object detection, which estimates the location and size of objects in a two-dimensional image, 3D object detection involves estimating objects' position, orientation, and dimensions in three-dimensional space. It can be helpful in applications such as augmented reality, robotics, and autonomous driving, where accurate knowledge of the 3D environment is necessary for navigation and interaction with the physical world.
4) Multi-modal object detection: This involves detecting objects from multiple visual and textual sources, such as images, videos, and audio, enabling more comprehensive and accurate object detection in complex scenarios. Multi-modal detection can be helpful in applications such as autonomous driving, where multiple sensors detect objects around a vehicle.
5) Few-shot learning: Few-shot learning is an area of research that aims to develop algorithms that learn to detect objects from just a few examples. This is particularly useful when collecting large amounts of labeled data is difficult or expensive; such models will work with limited data or in low-resource settings.
Overall, the future of object detection using deep learning is promising, with many exciting developments for future research.