Toward Performing Image Classification and Object Detection With Convolutional Neural Networks in Autonomous Driving Systems: A Survey

Nowadays, Convolutional Neural Networks (CNNs) are employed in a wide range of industrial technologies across sectors such as medical, automotive, aviation, agriculture, and space. This paper reviews the state-of-the-art in both CNNs for image classification and object detection and Autonomous Driving Systems (ADSs) in a synergetic way. Layer-based details of CNNs, along with calculations of the number of parameters and floating-point operations, are outlined. Using an evolutionary approach, the majority of the outstanding image classification CNNs published in the open literature are introduced with a focus on their accuracy, number of parameters, model size, and inference speed, highlighting the progressive developments in convolutional operations. Results of a novel investigation of the convolution types and operations commonly used in CNNs are presented, including a timing analysis aimed at assessing their impact on CNN performance. This extensive experimental study provides new insight into the behaviour of each convolution type in terms of training time, inference time, and layer-level decomposition. Building blocks for CNN-based object detection, such as backbone networks and baseline types, are also discussed, and then representative state-of-the-art designs are outlined. Experimental results from the literature are summarised for each of the reviewed models. This is followed by an overview of recent ADS-related works and current industry activities, aiming to bridge academic research and industry practice on CNNs and ADSs. Design approaches targeted at solving the problems automakers face in achieving real-time implementations are also proposed, based on a discussion of design constraints, human vs. machine evaluations, and a trade-off analysis of accuracy vs. size.
Current technologies, promising directions, and expectations from the literature on ADSs are introduced including a comprehensive trade-off analysis from a human-machine perspective.


I. INTRODUCTION
In recent years, Deep Learning (DL) techniques have been extensively utilised in a large variety of fields, and Convolutional Neural Networks (CNNs) are among the most frequently used of them for solving real-time computer vision problems, as they deliver highly accurate results. The concept behind CNNs drew inspiration from the structure of biological visual systems. In the 1960s, it was theorised that cells in the cat's visual cortex are sensitive to small sub-regions of the visual field, called receptive fields [1].
The associate editor coordinating the review of this manuscript and approving it for publication was Yonggang Liu.
Two decades later, Fukushima [2] introduced the first CNN model, based upon [1] and referred to as Neocognitron, which was simulated and implemented on a computer. This network, designed by using multi-layer artificial neuron connections for the transformation of images, is still considered the source of inspiration for CNNs. Then, LeCun et al. [3], [4] created a CNN model, referred to as LeNet-5, which specifically classifies handwritten digits and can be trained to recognize patterns from raw pixels by using the backpropagation algorithm [5].
Despite all the innovative approaches, LeNet-5 [4] was ineffective for complex problems, such as image classification, which require large training datasets and powerful processors for computation. Advancements in hardware accelerators, such as Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), etc., made these devices a suitable implementation choice for Machine Learning (ML) techniques [6]. CNN models, along with efficient training methods, were released and implemented on GPUs [7]-[9], becoming widespread in the early 2010s. Following that revolution, many CNN algorithms have been proposed to address the needs of different areas, with computer vision and natural language processing seen as the major application fields. The computer vision tasks to which CNNs have mostly been applied are image classification, object detection, face recognition, scene labelling, action recognition, human pose estimation, and document analysis. As for natural language processing, the two most prevalent tasks are speech recognition and text classification. Nowadays, CNNs are employed in a wide range of industrial technologies across sectors such as medical, automotive, aviation, agriculture, and space.

A. CNN DEPLOYMENT IN AUTOMOTIVE
Along with recent technological advancements, consumer expectations of vehicles have evolved towards extending digital lifestyles into vehicle cabins and connecting with the outside world. Concurrently, growing traffic congestion, a rise in the number of new drivers, and changing priorities have directed the focus towards safety-based systems as well. As a result, automakers have started to transition from horsepower to digital technology as a standard feature, enabling self-driving capabilities in production automobiles; such systems are referred to as Autonomous Driving Systems (ADSs) in the rest of the paper.
In the near future, the next generation of ADSs promises an advanced level of self-driving experience. To enable this advance, the industry needs to combine multiple technologies into a single system capable of communicating, collaborating, and eventually performing human functions in almost all driving scenarios. According to Intel [10], these capabilities need numerous sensors, such as radar, lidar, GPS, and advanced camera technologies, collecting the necessary information at a rate of ∼1 GB/second. This data should be processed in real time in order to make safe decisions for the vehicle on the road. Despite the advantages of lidars, such as high precision and 3D mapping, their extreme cost prevents their commercial deployment in the industry. For example, existing lidars on the market cost as much as $75k [11]. Therefore, some companies, such as Tesla [12] and Mobileye [13], [14], have announced that optical cameras and radars are their preferred choice of sensing devices in the vision-based parts of ADSs. From this point of view, CNNs, as an image-based technique, have been effectively and widely utilised by automakers in the object detector modules of automobiles.
Although there have been many advances in object detection with CNNs for various applications, ADSs remain notably challenging and are still under experimentation because of the stringent real-time constraints on both safe operational decision-making and data-processing latency. In fact, designing an end-to-end ADS remains an unsolved research topic [15]. Furthermore, the computational pipeline of ADSs includes various modules executing multiple tasks, and the performance of every module needs to be improved [16].

B. PAPER CONTRIBUTIONS
This paper reviews the state-of-the-art in the field of CNNs for image classification and object detection, and in the field of ADSs, in a synergetic way. The main contributions of the paper are as follows. The first main contribution of this paper is a detailed review of the literature on CNNs for both image classification and object detection, comparing them with regard to their performance and model size. In recent years, mainstream CNN research on image classification has focused on the design of two types of algorithms: heavyweight and lightweight. The former prioritizes accuracy regardless of other factors, whereas the latter aims to reduce the model size and the computational load while achieving satisfactory accuracy. Therefore, these types of CNNs are reviewed as separate classes in the order of their historical evolution. The CNNs for object detection are also divided into two classes, two-stage detectors and one-stage detectors, which are likewise discussed in historical order.
The second main contribution of the paper is a novel investigation of the convolution types and operations commonly used in CNNs for image classification that distinguish one model from another. The purpose is to establish their effect on CNN performance. To that end, their training and inference performances were evaluated using a CPU as well as two identical hardware accelerators (2× NVIDIA Pascal P100 GPUs). The evaluation was based on a reference CNN model, ResNet [17], for a fair assessment. The strategy pursued was to place and test different types of filters in the ResNet modules. By means of experiments with two well-known datasets, CIFAR-10 and CIFAR-100 [18], a runtime timing analysis was carried out by decomposing the architectures into basic components.
The third main contribution of this paper is an overview of recent ADS-related works and current industry activities, in which an attempt is made to bridge academic research and industry practice on CNNs and ADSs. Design approaches aimed at solving the problems automakers face in achieving real-time implementations are also proposed, based on a discussion of design constraints, human vs. machine evaluations, and a trade-off analysis of accuracy vs. size.
In other words, this paper is an up-to-date application-based review of CNNs. Unlike other published reviews, ours involves a full end-to-end investigation. The difference between our review and the previously published ones is shown in Table 1, where it can be seen that our paper is the most comprehensive study in terms of the scope of the work covered.
VOLUME 10, 2022
The rest of the paper is structured as follows. Section II introduces the main components of CNNs in terms of their mathematical and architectural properties, as well as the datasets and libraries frequently used within the scope of the survey. Section III provides a review of CNN models for image classification and examines the different convolution and operation types in an evolutionary manner using experimentally derived execution times, while Section IV reviews CNN models for object detection. The most recent ADS-related works concerned with CNN-based classification and detection, the industry status, and ADS system architectures are outlined in Section V, which also includes a discussion of design constraints and promising new directions. Finally, Section VI concludes the paper.

II. OVERVIEW OF CNNs
CNNs are a type of layered Deep Neural Network (DNN) composed of artificial neurons. They have several distinctive properties compared to other neural network types, such as local receptive fields, spatial subsampling, and shared weights. CNNs operate through multiple sequential stages, which consist of convolution, padding, and pooling operations, nonlinearities added via activation functions, and Fully Connected Layers (FCLs).
The input and output of each convolutional layer are known as feature maps. For an image input, each feature map is a 2D array per colour channel; for a video input, the feature map is a 3D array; and for an audio input, it is a 1D array. At every stage along the network, new features are extracted from the input of that stage, and the convolutional operations detect increasingly distinguishing features in the later layers.

A. CNN ARCHITECTURE
As an example of a typical CNN, the entire architecture of the popular CNN AlexNet [28] is illustrated in Fig. 1, which includes five convolutional layers with padding operations, three max-pooling layers, and three FCLs. In Fig. 1, $I_D$, $O_D$, and $K_D$ denote the spatial height and width dimension of the inputs, outputs, and kernels, while $I_C$, $O_C$, and $K_C$ are their respective channel numbers. In addition, $F$ is the number of filters, which is equal to the number of output channels $O_C$, and the kernel dimension in the pooling operation is symbolised by $K_P$. A notable term that will be used in the layer-based definitions is tensor, which in its simplest form is a generalization of matrices to n-dimensional space.
The following sub-sections define all the architectural components of a CNN, based on the reference architecture AlexNet in Fig. 1. Weight number calculation and areas of usage are also discussed.

1) INPUT LAYER
Input layers introduce the input data into the network, which normally represents an image structured as an array of pixel values. Before the image is fed into a designed model, the spatial and channel dimensions of the input have to be reshaped according to the specifications of the model or of the deep learning library used.
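As an illustration of the reshaping step described above, the following minimal sketch converts an image stored as height × width × channel into a channel-first layout and scales the pixel values to [0, 1]. The function name `to_channels_first`, the 0-255 value range, and the choice of target layout are assumptions made for this example only; the layout actually required depends on the model and library in use. Plain Python lists are used to keep the sketch dependency-free.

```python
def to_channels_first(image_hwc, scale=255.0):
    """Convert an H x W x C image (nested lists of 0-255 values)
    into a C x H x W nested list of floats in [0, 1].

    Hypothetical helper for illustration; not part of any library.
    """
    height = len(image_hwc)
    width = len(image_hwc[0])
    channels = len(image_hwc[0][0])
    return [[[image_hwc[r][c][ch] / scale for c in range(width)]
             for r in range(height)]
            for ch in range(channels)]


# A 1 x 2 RGB image: one red pixel and one green pixel.
image = [[[255, 0, 0], [0, 255, 0]]]
chw = to_channels_first(image)

# Resulting shape: channels, height, width
print(len(chw), len(chw[0]), len(chw[0][0]))
```

In practice an array library would perform the same transposition in a single call; the explicit loops here only make the index order visible.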

2) CONVOLUTIONAL LAYER
The convolutional layers are the core of a CNN. The learnable parameters in these layers form the kernels, and the collection of the kernels in a layer is called a filter, which is convolved with the input through its full channel depth. A kernel defines the field of view of the convolution operation, whereas a filter, as a tensor, is the channel-wise stack of the kernels in a layer. For instance, a filter with a size of $m \times n \times c$ includes $c$ kernels with an $m \times n$ kernel dimension.
Via the convolution operation, a feature map is obtained with a new spatial dimension and channel number based on the input and filter dimensions. Depending on the parameter values of the kernels used, outputs with different features are extracted. In Fig. 1, all the convolutional layers are composed of standard spatial 2D convolutions. Consider a spatial convolution that takes a tensor $I \in \mathbb{R}^{I_D \times I_D \times I_C}$ as the input, with the filter of the convolution being the tensor $K \in \mathbb{R}^{K_D \times K_D \times I_C \times F}$. For simplicity, we assume the spatial dimensions of the input and output to be identical, which means that $O_D = I_D$, and the stride (i.e., the convolution kernel's step size) to be 1. Then, the spatial 2D convolution outputs a tensor $O \in \mathbb{R}^{I_D \times I_D \times F}$, computed as
$$O_{k,l,n} = \sum_{i,j,m} K_{i,j,m,n} \, I_{k+\hat{i},\, l+\hat{j},\, m},$$
where $\hat{i} = i - \lfloor K_D/2 \rfloor$ and $\hat{j} = j - \lfloor K_D/2 \rfloor$ denote the recentred spatial indices; $k, l$ and $i, j$ are the indices over the spatial dimensions; and $m, n$ index the channels and filters, respectively.
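The summation above can be traced with a direct, unoptimised sketch. It assumes zero-based indices, an odd kernel size, stride 1, and zero padding at the borders so that the output keeps the input's spatial size; the function name `conv2d_same` and the nested-list layout are choices made for this illustration only.

```python
def conv2d_same(inp, kernels):
    """Spatial 2D convolution with stride 1 and zero padding.

    inp:     nested list indexed [k][l][m]     -> I_D x I_D x I_C
    kernels: nested list indexed [i][j][m][n]  -> K_D x K_D x I_C x F
    returns: nested list indexed [k][l][n]     -> I_D x I_D x F
    """
    i_d, i_c = len(inp), len(inp[0][0])
    k_d, f = len(kernels), len(kernels[0][0][0])
    half = k_d // 2  # offset that recentres the kernel indices
    out = [[[0.0] * f for _ in range(i_d)] for _ in range(i_d)]
    for k in range(i_d):
        for l in range(i_d):
            for n in range(f):
                acc = 0.0
                for i in range(k_d):
                    for j in range(k_d):
                        row, col = k + i - half, l + j - half
                        # Positions outside the input count as zeros.
                        if 0 <= row < i_d and 0 <= col < i_d:
                            for m in range(i_c):
                                acc += kernels[i][j][m][n] * inp[row][col][m]
                out[k][l][n] = acc
    return out


# 3 x 3 single-channel input and one 3 x 3 kernel filled with ones.
image = [[[1.0], [2.0], [3.0]],
         [[4.0], [5.0], [6.0]],
         [[7.0], [8.0], [9.0]]]
ones = [[[[1.0]] for _ in range(3)] for _ in range(3)]
result = conv2d_same(image, ones)
print(result[1][1][0])  # centre output: sum of all nine pixels
```

Library implementations compute the same quantity with heavily optimised routines; the six nested loops are shown only to mirror the indices of the formula.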

3) PADDING OPERATION
Padding is the placement of extra pixels around the input's spatial plane to control the output's spatial dimensions. When the input and output tensors are required to have equal dimensions, the padding operation makes this possible. As seen in Fig. 1, the output tensor's spatial dimensions are controlled by the padding operation. The effect of the padding operation on the output tensor's dimension ($O_D \times O_D$) is defined by
$$O_D = I_D - K_D + 2P + 1, \quad (1)$$
where $P$ is the number of pixels padded on each side of the input.

4) STRIDE OPERATION
The stride value indicates that the filter, which is the weight tensor for the convolution process, slides over the input tensor in steps of one or more pixels. This is another parameter that directly affects the output size. The dimension of the output tensor ($O_D \times O_D$) is defined by
$$O_D = \left\lfloor \frac{I_D - K_D + 2P}{S} \right\rfloor + 1, \quad (2)$$
where $P$ is the padding size and $S$ is the stride step parameter.
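The combined effect of padding and stride on the output size can be checked numerically. The sketch below assumes the commonly used relation $O_D = \lfloor (I_D - K_D + 2P)/S \rfloor + 1$, with $P$ denoting the number of pixels padded on each side (a symbol assumed here). The example values (a 227 × 227 input, an 11 × 11 kernel, and stride 4) are the commonly quoted figures for the first AlexNet layer and are likewise an assumption, not values taken from the text.

```python
def conv_output_dim(i_d, k_d, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer.

    i_d: input height/width, k_d: kernel height/width,
    stride: step size S, padding: pixels added per side P.
    """
    return (i_d - k_d + 2 * padding) // stride + 1


# Assumed first AlexNet layer: 227 x 227 input, 11 x 11 kernel, stride 4.
print(conv_output_dim(227, 11, stride=4))             # 55
# A 3 x 3 kernel with one pixel of padding preserves the input size.
print(conv_output_dim(32, 3, stride=1, padding=1))    # 32
# Pooling uses the same relation with K_D replaced by K_P.
print(conv_output_dim(55, 3, stride=2))               # 27
```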

5) ACTIVATION FUNCTIONS
CNNs apply various activation functions, such as Rectified Linear Units (ReLU), Sigmoid, TanH, etc., to the feature map obtained through the convolution operation. These functions introduce nonlinearities into the layers, which increases learnability. In particular, ReLU has been frequently preferred in CNNs because it enables training that is several times faster than with the other activation functions.
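For reference, the three activation functions named above can be written element-wise as follows. This is a minimal sketch using only the Python standard library; deep learning libraries apply the same functions to whole tensors at once.

```python
import math


def relu(x):
    """Rectified Linear Unit: passes positive values, zeroes the rest."""
    return max(0.0, x)


def sigmoid(x):
    """Logistic sigmoid: squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Hyperbolic tangent: squashes any real value into (-1, 1)."""
    return math.tanh(x)


# Apply each function to the same small feature vector.
features = [-2.0, 0.0, 3.0]
print([relu(v) for v in features])
print([round(sigmoid(v), 4) for v in features])
print([round(tanh(v), 4) for v in features])
```

ReLU involves only a comparison, whereas the other two require an exponential, which is one reason it is cheap to evaluate.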

6) POOLING OPERATION
Pooling layers take small rectangular blocks, defined by $K_P$, from the output feature map and subsample them to produce a single maximum, minimum, or average output from every block. A new subsampled feature map is thereby formed, which is then used as the input for the next convolutional layer. By reducing the spatial size of the feature map, the pooling layers reduce the amount of data passed to the subsequent layers and help to control overfitting. For instance, in Fig. 1, following the convolutional layers and the application of the activation functions, there are three maximum pooling layers, represented as light amber layers. Additionally, global average pooling, used in [29], is another pooling method; it pools over the whole feature map at once without any sliding.
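The two pooling variants described above can be sketched for a single-channel feature map as follows. The function names, the square-window assumption, and the example values are illustrative only.

```python
def max_pool2d(fmap, k_p, stride):
    """Max pooling over k_p x k_p blocks of a square 2D feature map."""
    size = len(fmap)
    out_dim = (size - k_p) // stride + 1
    return [[max(fmap[r * stride + i][c * stride + j]
                 for i in range(k_p) for j in range(k_p))
             for c in range(out_dim)]
            for r in range(out_dim)]


def global_avg_pool(fmap):
    """Global average pooling: one value for the whole feature map."""
    values = [v for row in fmap for v in row]
    return sum(values) / len(values)


fmap = [[1, 3, 2, 4],
        [5, 6, 1, 2],
        [7, 2, 9, 0],
        [3, 4, 1, 8]]

pooled = max_pool2d(fmap, k_p=2, stride=2)
print(pooled)                 # [[6, 4], [7, 9]]
print(global_avg_pool(fmap))  # 3.625
```

The 4 × 4 map shrinks to 2 × 2 under max pooling and to a single number under global average pooling, and neither operation has any learnable parameters.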

7) FULLY CONNECTED LAYER
The CNN architecture performs feature learning via the operations introduced above until the output of the last convolutional and pooling layer is derived. The final CNN stage consists of FCLs and performs a high-level classification. It is placed after the last convolution or pooling operation. The flattening operation in Fig. 1 provides the transition from the convolutional layer to the FCL by converting the data to a 1-dimensional array. Essentially, a conventional neural network classification process takes place in this stage of the architecture.
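The flattening step and a single fully connected layer can be sketched as below. The helper names `flatten` and `dense`, as well as the weights and biases, are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def flatten(tensor):
    """Flatten an H x W x C nested list into a 1-dimensional list."""
    return [v for plane in tensor for row in plane for v in row]


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per neuron.

    weights holds one row of input weights for each output neuron.
    """
    return [sum(w * x for w, x in zip(w_row, inputs)) + b
            for w_row, b in zip(weights, biases)]


# A 2 x 2 x 1 feature map flattened to four values.
feature_map = [[[1.0], [2.0]],
               [[3.0], [4.0]]]
vector = flatten(feature_map)

# Two neurons, each connected to all four inputs (illustrative weights).
weights = [[1.0, 0.0, 0.0, 0.0],
           [0.5, 0.5, 0.5, 0.5]]
biases = [0.0, 1.0]
print(vector)                          # [1.0, 2.0, 3.0, 4.0]
print(dense(vector, weights, biases))  # [1.0, 6.0]
```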

8) OUTPUT LAYER
This is the final stage, in which the predicted result for the input image is obtained and evaluated via a cost function. For instance, in Fig. 1, a Softmax multi-class classifier [30] is deployed. A cost is thus produced for a prediction by comparing the predicted result with the real data from the training set.
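A minimal sketch of this stage is given below: the raw network outputs are turned into class probabilities with softmax, and a cost is computed against the true label. The use of cross-entropy as the cost is an assumption made for the example, since the text does not name a specific cost function; the score values are also illustrative.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the maximum avoids overflow
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_class):
    """Cost of a prediction: negative log-probability of the true class."""
    return -math.log(probs[true_class])


logits = [2.0, 1.0, 0.1]           # illustrative scores for three classes
probs = softmax(logits)
predicted = probs.index(max(probs))
print(predicted)                   # index of the most probable class
print(cross_entropy(probs, true_class=0))
```

The cost is small when the true class receives a high probability and grows as that probability approaches zero.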

B. FULLY CONVOLUTIONAL NETWORKS
Fully Convolutional Networks (FCNs) are neural networks composed only of convolutional layers, without any FCLs. They differ from the CNNs described above in that all learning is done with convolution-based filters, even for decision making in the last layer, whereas those CNNs include FCLs at the end of their architecture, as in Fig. 1. Another difference is that FCNs learn and make decisions based on local spatial inputs, whereas appending FCLs allows a network to learn from global information.
A few examples of this type of network are included in Sections III and IV.

C. NUMBER OF CNN PARAMETERS
Defining the dimension of the output tensors and calculating the number of parameters, memory accesses, and FLOPs in CNNs are crucial for understanding the advantages of the different types of convolution filters discussed in the rest of the paper. This section, therefore, introduces the widely used method to calculate the number of parameters for frequently used 2D convolutions, based on the AlexNet architecture [28] in Fig. 1. The output dimension of a layer with padding size $P$ and stride $S$ is given by
$$O_D = \left\lfloor \frac{I_D - K_D + 2P}{S} \right\rfloor + 1, \quad (3)$$
which is employable in convolutional and pooling layers, with the difference that $K_D$ needs to be replaced by the kernel size of the pooling operation, $K_P$. As for the FCLs, the output is a one-dimensional vector, and thus the output size is equal to the number of neurons in the particular layer.
The total number of CNN parameters can be represented as the sum of the number of parameters in three different CNN operations: the convolutional layers, the transition from convolutional layer to FCL, and the transition from FCL to FCL. Pooling layers do not include any parameters, as they are aimed only at reducing the size of the outputs.
The number of parameters in the convolutional layers can be computed by
$$W_C = K_D \times K_D \times I_C \times F + B_C, \quad (4)$$
where $B_C$ denotes the bias number in the convolutional layer. In the transition from convolutional layer to FCL, the number of parameters is
$$W_{CF} = I_D \times I_D \times I_C \times N + B_{CF}, \quad (5)$$
where $N$ is the neuron number in the first FCL; $I_C$ and $I_D$ are the channel number and spatial dimension of the tensor previous to the first FCL; and $B_{CF}$ represents the bias number in the layer. In the transition from FCL to FCL, the number of parameters is
$$W_{FF} = F_P \times N + B_{FF}, \quad (6)$$
where $N$ is the neuron number in the current layer, and $F_P$ and $B_{FF}$ are the neuron number in the previous layer and the biases, respectively. Consequently, the total number of parameters in AlexNet, calculated by means of (4)-(6) defined above, is 62,378,344, as detailed in Table 2.
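The AlexNet total quoted above can be reproduced with a short calculation. The sketch assumes one bias per filter or neuron and uses the commonly cited AlexNet layer shapes without grouped convolutions (kernel size, input channels, and filters per convolutional layer; a 6 × 6 × 256 tensor before the first FCL; and FCLs of 4096, 4096, and 1000 neurons). These shapes are assumptions, since Table 2 is not reproduced here, and the function names are hypothetical.

```python
def conv_params(k_d, i_c, f):
    """Weights plus biases of a convolutional layer (one bias per filter)."""
    return k_d * k_d * i_c * f + f


def conv_to_fc_params(i_d, i_c, n):
    """Parameters between the last feature map and the first FCL."""
    return i_d * i_d * i_c * n + n


def fc_to_fc_params(f_p, n):
    """Parameters between two consecutive fully connected layers."""
    return f_p * n + n


# (kernel size, input channels, filters) for the five convolutional layers.
conv_layers = [(11, 3, 96), (5, 96, 256), (3, 256, 384),
               (3, 384, 384), (3, 384, 256)]

total = sum(conv_params(*layer) for layer in conv_layers)
total += conv_to_fc_params(6, 256, 4096)   # 6 x 6 x 256 tensor -> 4096
total += fc_to_fc_params(4096, 4096)
total += fc_to_fc_params(4096, 1000)
print(total)  # 62378344
```

Under these assumptions the first FCL alone accounts for 37,752,832 parameters, more than half of the total, which illustrates why the fully connected stage dominates the model size.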

D. CNNs FOR IMAGE CLASSIFICATION AND OBJECT DETECTION
Before delving into the recent studies of CNNs for classification and object detection, it is essential to explain these two imaging tasks. In general, classifying an object in an image aims at assigning it to a certain category, while object detection involves localization and classification of all present objects. Fig. 2 illustrates the outcome of the two operations, where as a result of the image classification task, the image in Fig. 2 (a) is identified as a car; and as a result of object detection in Fig. 2 (b), cars, persons, and traffic lights are detected, and their positions and sizes are established by drawing a bounding box around the objects of interest in the image.
There is an obvious overlap between the two tasks, which requires that designs of high-performance object detection CNNs incorporate high-quality image classification properties in their architectures. In other words, an effective CNN architecture for object detection depends directly on the image classification quality of that CNN architecture.
In the rest of this section, frequently used datasets and libraries for the implementation of CNNs for image classification and object detection are introduced.
Representative CNN architectures for image classification and object detection are discussed in Sections III and IV respectively.

1) DATASETS
The learning capabilities of CNNs are obtained during training and heavily rely on the suitability and comprehensiveness of the available training datasets. In fact, features of datasets, such as collection scenarios, class numbers, resolution, etc., are crucial for CNN performance. In this section, we introduce the most widely cited datasets that are related to the scope of the paper along with their distinctive characteristics. For clarity of presentation, we divide the existing datasets into two classes: domain-general and domain-specific. Domain-general datasets here represent datasets that can be used for the training of CNNs for image classification or object detection, which then could be utilised for a particular application domain. For instance, ResNet-50 [17] has been trained on a domain-general dataset, ImageNet [31], and deployed for image classification tasks in various domains, such as medical [32], agriculture [33], autonomous driving [34], and so on. Contrary to domain-general datasets, domain-specific datasets are originally created for a particular domain with the aim of facilitating the learning of specialised domain-related features. The main features of renowned examples of both classes are summarised in Table 3, where the names in the left-most column are shaded light grey for domain-general datasets and dark grey for domain-specific datasets.
There exist two prominent domain-general datasets for image classification, which are CIFAR-10 & 100 [18] and ImageNet [31]. As intermediate-level datasets for image classification, CIFAR-10 & 100 were introduced by one of the creators of AlexNet [28], Alex Krizhevsky. Both consist of a 50k training set and a 10k test set of fixed-size 32 × 32 colour images. In CIFAR-10, each class is represented by 6k images, whereas in CIFAR-100 there are 600 images per class. As for ImageNet [31], it is regarded as an advanced-level dataset, which has been developed with around 14M labelled high-resolution images organised according to the WordNet hierarchy with 22k subcategories. A CNN competition based on this dataset, referred to as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), has also been held since 2010. Since the 2012 ILSVRC competition, the dataset has been kept unchanged and involves 1k subcategories, ∼1.3M training images, 50k images for validation, and 100k images for testing. In addition, the number of images for each class, termed 'synset' [31], ranges between 732 and 1300. The CNNs newly proposed by the academic community for domain-general image classification have mostly been trained on these two datasets. The main features of these datasets are summarised in Table 3 and further details are provided in Section III.
Frequently used domain-general object detection datasets according to the open literature are Pascal VOC (Visual Object Classes) [35], MS COCO (Microsoft Common Objects in Context) [36], and ILSVRC 2014 [31]. Pascal VOC is a standardised dataset for object recognition, and popular challenge competitions based on it were run from 2005 to 2012. Pascal VOC 2012 has 11,530 images with 27,450 annotations and 20 object categories. The currently most popular object detection dataset, MS COCO, is a large-scale dataset consisting of 328k images. It is suitable for multiple tasks, such as object detection, segmentation, keypoint detection, and captioning. Another large-scale dataset widely used in detector training is ILSVRC 2014, which is a part of the 2014 ILSVRC competition [31]. It comprises 450k training images with 200 object categories, 20k validation images, and 40k test images.
The main features of the datasets, outlined above, are summarised in Table 3 and further details are included in Section IV.
While there is no doubt about the usefulness of ImageNet [31] and MS COCO [36], their images often represent single-object scenes, which are not representative enough of what is encountered in real-time self-driving situations. Besides that, these domain-general datasets lack the necessary content to support the learning capabilities of ADSs with regard to specific self-driving scenarios. For example, adverse weather conditions could lead to poor CNN detection performance in ADSs if weather-related information has not been present in the training data. In consequence, various domain-specific datasets have been created including different self-driving scenarios to facilitate CNN classification and detection learning capabilities for ADSs.
The most cited self-driving-specific datasets developed in recent years, the main features of which can be found in Table 3, are briefly overviewed below:
(i) KITTI Vision Benchmark [37] has been the most widely cited dataset [50] among researchers and developers since its publication in 2012. It offers a relatively large number of self-driving scenes with 2D, 3D, and bird's-eye-view object detection datasets together with annotations.
(ii) UC Berkeley DeepDrive, BDD100K [38] is a recent dataset with 100k annotated videos and 10 different tasks for evaluation of self-driving image recognition algorithms. The dataset comprises more than 1k hours of driving experience and 1M frames. It also provides a large context diversity for self-driving by being strengthened with all-weather, geographic, and environmental conditions in the day/night time.
(iii) CityScapes [39] is a large-scale dataset and focuses on the semantic understanding of urban street scenes with 25k annotated images and 30 classes.
(iv) A2D2 [40] has been released by Audi and includes 41k labelled data with 38 features and 390k unlabelled data. Its size is around 2.3 TB and is split by annotation types, such as semantic segmentation, 3D bounding boxes, etc.
(v) nuScenes [41] has been created by a full sensor suite mounting six 360° cameras, five radars, a 36-beam LIDAR, and a Global Positioning System and Inertial Measurement Unit (GPS-IMU). It contains over 1.4M images with a diversity of driving manoeuvres, traffic situations and unexpected behaviours.
(vi) ApolloScape [42] is a part of the Apollo project, which is an evolving and ever-growing research project across all aspects of autonomous driving. The project website offers various simulation tools, over 80k lidar point clouds, 100k street views, and 1000 km of trajectories for city traffic.
(viii) Leddar PixSet [45] was released in 2021. It consists of 29k frames in 97 sequences with more than 1.3M 3D bounding box annotations, collected by a full Autonomous Vehicle (AV) sensor suite (LIDARs, cameras, radars, and GPS-IMU). What makes this dataset novel is the deployment of a flash LIDAR with the inclusion of the full-waveform data.
(ix) Oxford RobotCar [46] contains over 100 records of a consistent route during a year in Oxford, UK, which brings a long-term diversity approach in data collection. Its sensor suite comprises six cameras, LIDAR, and Global Positioning System and Inertial Navigation System (GPS-INS).
(x) Waymo [47] was collected by Waymo self-driving cars. More details about it are presented in Section V.
Apart from the above, more sophisticated types of training datasets are also available, for example, naturalistic driving datasets, such as NUDrive [51], euroFOT [52], etc., which focus on the human element in autonomous driving: the driver.

2) LIBRARIES AND FRAMEWORKS
In this subsection, the most frequently used libraries and frameworks in the CNN field are reviewed. Compared to other programming languages, Python-based machine learning libraries have been the go-to choice for researchers and developers as they offer a variety of features and flexibility, which can increase productivity and quality of code writing, implementation and integration.
The CNN coding process can be divided into two main stages: 1) data pre-processing and 2) building of the CNN algorithm. Several general-purpose libraries can be employed at the first stage to load and prepare datasets for the second stage, such as the multi-dimensional array-processing library NumPy [53], the scientific computing library SciPy [54], the data frame processing library Pandas [55], etc. A few powerful plotting libraries, such as matplotlib [56], seaborn [57], and plotly [58], are available as well.
As regards the second stage, there are a large number of frequently used libraries and frameworks. The most popular ones are summarised in Table 4, and some of them are outlined below:
(i) Google's TensorFlow [59] is an open-source ML library, available since 2015, which has been widely used. It enables the development of computationally intensive ML tasks by using data flow graphs in which edges signify tensors. Through this, a single Application Programming Interface (API) can distribute the load between multiple nodes, such as GPUs and CPUs.
(ii) Theano [60] is suited to numerical computations and simplifies code writing in deep learning. It also provides tight integration with NumPy [53], resulting in quite accurate calculations.
(iii) Keras [62] is another widely used library that can be run on top of Theano or TensorFlow, as it is designed as a high-level API. As such, it offers a productive and user-friendly interface, reducing developers' cognitive load.
(iv) PyTorch [63] has been a fast-growing library in recent years. It has been upgraded by the Facebook AI Research Lab (FAIR) with many outstanding novel features, such as scalability for distributed training [28] and cloud support, offering design simplicity from research to production. In addition, it smoothly integrates with the Python data science library NumPy. In fact, PyTorch has been ranked as the most-used framework in ML implementations reported in the open literature since March 2019 [66], as shown in Fig. 3. However, Keras still has a better real-time performance than PyTorch.
(v) Caffe [65] is a research-based library written in C++ with a Python interface by Berkeley AI Research (BAIR) and community contributors. Its remarkable features are speed and switching between CPU and GPU by a single flag. It also offers seamless integration with GPU training on image-based datasets.
Other powerful Python-based libraries and frameworks, such as MXNet [64], Chainer [61], and ScikitLearn [67], have also been commonly used.

III. REVIEW OF CNNs FOR IMAGE CLASSIFICATION
One of the primary objectives of image classification algorithms is to be deployed in systems aimed at solving different computer vision tasks. For instance, they can act as a backbone network, also known as a base network, in various object detection modules. In this respect, a distinctive classification network can play a crucial role in enhancing the performance of detection systems.
This section looks comprehensively at outstanding CNN models for image classification by grouping them into two classes based on their model size: heavyweight and lightweight designs. It can be said that improving and designing a new CNN architecture is based on understanding the design and historical relations of previous architectures. Therefore, the CNN architectures discussed below are considered in chronological order taking into account the architectural influences inspired by previous designs.
There are certainly numerous qualities that affect the CNN design process. However, the types of convolutions and specially designed operations could be seen as among the most significant design parameters in convolutional networks. In this section, we attempt to briefly introduce the architectures of the majority of the outstanding image classification CNNs, published in the open literature, with a focus on their accuracy performance, parameter number, model size, and inference speed. The aim is to provide evolutionary insight, highlighting the progressive developments in convolutional operations. The focal point is mainly on the most popular convolutional processes shown in Fig. 4, even though numerous other types of convolutions and operations have been proposed, such as 3D [68], dilated [69], spatially separable [70], flattened [71], etc. The parameters in Fig. 4 are introduced in Section II.A and the spatial 2D convolution in Fig. 4 (a) is discussed in Section II.A.2) above. The convolutional processes shown in Fig. 4 (b)-(f) are described in Sections A and B below. In addition, Section D presents an evaluation of the convolution types in Fig. 4 using ResNet [17] as the reference network to achieve a fair comparison. Layer-based timing performance is also provided through the decomposition of the network layers and modules, where the operations take place.
The performances of the reviewed heavyweight and lightweight CNN designs are summarised in Table 5, where the Top-1 and Top-5 accuracy metrics are used. The Top-1 accuracy means that the highest probability prediction of the CNN model for an image must be exactly the expected answer or otherwise it fails, while the Top-5 accuracy requires that the top five highest probability predictions of the model must include the expected answer or otherwise it fails.
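The two metrics can be illustrated with a minimal sketch in plain Python. The function name `top_k_accuracy` and the toy scores below are hypothetical and only serve to show how a prediction counts as correct under each metric.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for class_scores, label in zip(scores, labels):
        # Class indices sorted from the highest to the lowest predicted probability.
        ranked = sorted(range(len(class_scores)),
                        key=lambda c: class_scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy predictions over six classes for three images (hypothetical values).
scores = [
    [0.05, 0.60, 0.10, 0.10, 0.10, 0.05],  # true class 1 is ranked first
    [0.30, 0.05, 0.25, 0.20, 0.15, 0.05],  # true class 2 is ranked second
    [0.40, 0.25, 0.15, 0.10, 0.07, 0.03],  # true class 5 is ranked sixth
]
labels = [1, 2, 5]

top1 = top_k_accuracy(scores, labels, k=1)
top5 = top_k_accuracy(scores, labels, k=5)
print(f"Top-1: {top1:.3f}, Top-5: {top5:.3f}")
```

Only the first image counts under Top-1, whereas the first two count under Top-5, so the Top-5 accuracy can never be lower than the Top-1 accuracy.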

A. HEAVYWEIGHT CNN DESIGNS
1) AlexNet
As introduced in Section I.A, the concept of CNNs was first proposed in 1998 by LeCun et al. [3], [4] under the name LeNet-5. The model was a multilayer artificial neural network including two convolutional, two pooling, and three fully connected layers, and had 60k parameters. As the ReLU activation function had not yet been adopted at that time, the TanH activation function was mostly used in the learning process, whereas the sigmoid function was only used in the output layer. Average pooling was employed rather than max pooling in the pooling layers. The training process was slow, taking many days on the processors of the time; nowadays this has improved significantly with the use of new-generation GPUs.
LeNet-5 was designed to classify handwritten digits from MNIST (Modified National Institute of Standards and Technology) [72], which is regarded as the ancestor of basic-level image classification datasets, with 60k training and 10k test fixed-size images. Trained with the backpropagation algorithm [5], it could learn patterns directly from raw pixels, eliminating the need for a separate feature detection stage. The method achieved a 0.95% error rate on the dataset. Despite all these advantages, LeNet-5 was ineffective when applied to complex problems such as video classification, which needs a large set of training data and powerful processors for the computation. However, the novelty of the LeNet-5 design is enormous, as it has provided a standard 'template' for almost all subsequent CNNs.
Following the improvements on both compute-bound and memory-bound computing platforms, Krizhevsky et al. [28] proposed the AlexNet architecture shown in Fig. 1, which achieved the best score in the 2012 ILSVRC competition based on the ImageNet [31] dataset. Although it essentially added three more convolutional layers to the LeNet design, there were four notable architectural contributions: (i) the first implementation of the ReLU activation function, enabling six times faster training than the TanH function; (ii) the first deployment of the overlapping pooling technique in the pooling layers; (iii) the application of a data augmentation technique [73]; and (iv) the introduction of the grouped convolution.
AlexNet was implemented on two Nvidia GTX 580 GPUs with 3GB of memory per node to allow efficient network training, as shown in Fig. 5. The design features two separate spatial 2D convolution paths, grouping the convolution operation into two filter groups, which represents model-parallelization across two GPU nodes, also known as a model-distributed training method. In addition, a compact and efficient version of AlexNet, named CaffeNet, was executed on a single-node NVIDIA GeForce GTX 1080Ti graphics card.
The 2D spatial and grouped convolutions are illustrated in Fig. 4 (a) and (d), respectively. In a grouped convolution, the number of kernel channels $K_C$ is equal to the number of input channels $I_C$, and the filters are separated into equally sized sub-groups, each of which performs a conventional spatial 2D convolution over its own share of the channel depth. Unlike the AlexNet design with two filter groups in Fig. 5, we divide the kernel channels and the filter number into three sub-groups, as seen in Fig. 4 (d), whereby each filter group could be defined by the tensor $[K_D \times K_D \times K_C/3 \times F/3]$. As a result, each group creates $O_C/3 = F/3$ output channels. Overall, the three groups create $3 \times O_C/3 = 3 \times F/3$ output channels, equal to $O_C = F$. The use of grouped convolutions in networks brings three essential benefits. The first is that the CNN model can be parallelised, which allows the convolutions to be executed over multiple GPUs, so that the network can be fed with more images per step compared with a single-GPU implementation. The second is that each filter group can learn a unique representation of the fed data, achieving more structured learning. The third is the improved efficiency, since the number of parameters decreases when the number of filter groups increases. This can be easily shown analytically. In a spatial convolution, the number of parameters is calculated by the expression $F \times K_C \times K_D^2$, whereas in a grouped convolution with three filter groups it could be expressed as $(F/3 \times K_C/3 \times K_D^2) \times 3 = (F \times K_C \times K_D^2)/3$. Thus, it can be seen that with three filter groups the parameter number decreases by two thirds, or 67%.
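The parameter saving of grouped convolutions can be checked numerically. The following is a minimal sketch that counts weights only (biases are ignored); the helper `conv_params` and the layer dimensions are hypothetical values chosen for illustration.

```python
def conv_params(filters, kernel_channels, kernel_dim, groups=1):
    """Weight count of a 2D convolution split into `groups` filter groups."""
    if filters % groups or kernel_channels % groups:
        raise ValueError("filters and kernel_channels must be divisible by groups")
    per_group = (filters // groups) * (kernel_channels // groups) * kernel_dim ** 2
    return per_group * groups


# Hypothetical layer: F = 96 filters, K_C = 48 kernel channels, K_D = 3.
F, K_C, K_D = 96, 48, 3
spatial = conv_params(F, K_C, K_D)            # F * K_C * K_D^2
grouped = conv_params(F, K_C, K_D, groups=3)  # (F/3 * K_C/3 * K_D^2) * 3
reduction = 1 - grouped / spatial

print(f"spatial: {spatial}, grouped: {grouped}, reduction: {reduction:.1%}")
```

For any layer whose dimensions are divisible by the group number g, the grouped weight count is 1/g of the spatial one, which gives the two-thirds reduction for three groups.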
It can be observed from the above that grouped convolutions bring substantial advantages compared to standard spatial convolutions, representing another notable contribution of the AlexNet model.

2) ZFNet
One year later, Zeiler and Fergus [74] proposed ZFNet, the winner of the 2013 ILSVRC competition [31], which significantly reduced the classification error of AlexNet by visualizing the convolutional networks layer by layer and by adjusting layer hyper-parameters, such as filter sizes and strides.
Although there had been some observation and improvement of shallow layer features, it was the ZFNet authors who investigated deeper features of the pixel domain. Using the deconvolution and un-pooling techniques, they were able to visualize the convolutional networks layer by layer, as shown in Fig. 6. By means of the deconvolution technique, they also analysed and rearranged several hyper-parameters of their algorithm, which led to reducing the error rate.
Through the deconvolution-based visualization, two problems were identified in the first two layers. The first layer contained only high- and low-frequency information, and the scarcity of mid-frequencies was causing a chain effect in the learning of some mid-frequency features. In the second layer, an aliasing problem, which occurs when the sampling frequency is too low, was detected; this was due to the large stride in the convolution operation of the first layer. The researchers fixed the problems by reducing the filter size from 11 × 11 to 7 × 7 in the first layer and changing the stride step of the first layer to 2 instead of 4. As a result, a valuable performance increase was achieved.

3) VGGNet
VGGNet [75] is another heavyweight CNN, which was developed by the Visual Geometry Group (VGG) at the University of Oxford. It comes in several configurations, such as VGG-11, VGG-11 (LRN), VGG-13, VGG-16, VGG-16 (Conv1), and VGG-19. Unlike [28] and [74], which used 11 × 11 and 7 × 7 kernels respectively, it was based on stacking small 3 × 3 convolution kernels with 2 × 2 pooling windows, and its VGG-16 configuration comprises thirteen convolutional and three fully-connected layers with the ReLU activation function. The aim was to go deeper into the layers in order to improve the accuracy. This work was highly important as it allowed the error rates in CNN image classification to be reduced. However, that entailed a significant increase in the parameter number and model size, amounting to 138 million (M) and 575 megabytes (MB), respectively, where the model size specifies the required storage space for all parameters of the trained model. Nevertheless, it was the winner of [31] in the localization task and the runner-up in the classification task.
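The 138M figure can be reproduced with a short count. The sketch below assumes the commonly used VGG-16 layout (the channel widths, the 224 × 224 RGB input, the 4096-wide fully-connected layers, and the 1000 output classes are assumptions not listed in the text above), and counts both weights and biases.

```python
# Output channels of the thirteen 3 x 3 convolutional layers (assumed VGG-16 layout);
# "M" marks a 2 x 2 max-pooling step that halves the spatial size.
VGG16_LAYOUT = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                512, 512, 512, "M", 512, 512, 512, "M"]


def vgg16_parameter_count(input_channels=3, input_size=224, classes=1000):
    total, channels, size = 0, input_channels, input_size
    for item in VGG16_LAYOUT:
        if item == "M":
            size //= 2  # pooling has no parameters
        else:
            total += 3 * 3 * channels * item + item  # weights + biases
            channels = item
    features = channels * size * size  # flattened input of the first FC layer
    for width in (4096, 4096, classes):
        total += features * width + width  # weights + biases
        features = width
    return total


total = vgg16_parameter_count()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions, the three fully-connected layers account for roughly 124M of the parameters, which explains why later designs replace them with global pooling.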

4) SENet
The CNN designs introduced so far have focused on improving the spatial components to strengthen the representational effectiveness in each layer, even though the central blocks of CNNs might have been used to obtain informative features. From that point of view, Hu et al. [76] from Oxford University designed and proposed SENet as a novel architecture that enhances channel-wise feature responses. Although SENet improved the accuracy level compared with the former winners of the ImageNet competitions [31], it also slightly increased the computational cost, bringing the number of parameters to 146M. Its Top-5 error rate was measured as 3.7%, which secured its winning place in 2017, and ever since it has been the best performing classification CNN model.

5) NASNet
Zoph et al. [77] from Google Brain proposed NASNet. Their method first searched for an architectural building block on a small dataset, CIFAR-10 [18], and then applied it to a larger dataset, ImageNet [31]. A novel regularization technique, ScheduledDropPath, was also proposed, which significantly improved the generalization of NASNet. Their architecture was composed of two types of cells, called "normal" and "reduction" cells, whereby the former returned a feature map of the same dimension and the latter returned a feature map whose height and width were reduced by a factor of two. As a consequence, NASNet reached the same Top-1 accuracy of 82.7% as SENet on the ImageNet dataset with a lower parameter number.

6) ResNeXt
Until the emergence of ResNeXt [78], the state-of-the-art image classification models had relied on supervised pretraining [79], for which the ImageNet dataset had been widely used. In [78], Mahajan et al. presented a novel pre-training scheme that predicts hashtags for social media images. Through the application of this transfer learning technique, they demonstrated a remarkable performance of ResNeXt on ImageNet classification, and thus, it was listed as the winner of the 2018 competition [31].

7) EfficientNet-L2
In EfficientNet-L2 [80], a semi-supervised learning approach [81], named Noisy Student Training, was introduced under the assumption of an abundance of labelled data. The work draws on the machine learning method of self-training [82] and the model compression technique of distillation [83]. (More information about machine learning methods and compression techniques is provided in the Appendix.) Firstly, the EfficientNet [84] model is trained with labelled images on ImageNet, and then the model is used as a teacher to produce pseudo labels for 300M unlabelled images. Next, a larger EfficientNet model is trained as a student on the combination of the labelled images and the pseudo-labelled images produced in the first stage. Following that, the same process is applied iteratively, adding different types of noise to the student, such as dropout, stochastic depth, and data augmentation, so that it can learn better than the teacher. Thus, the performance of this model on the ImageNet [31] dataset proved to be outstanding in 2020 with a Top-1 accuracy of 85.5%.

8) FixEfficientNet-L2
FixEfficientNet-L2 [85] is an upgraded model of EfficientNet-L2 [80], reported by Touvron et al. from the Facebook AI Research group, and it utilises the augmentation technique proposed by the same researchers in [86], which is aimed at reducing the discrepancy between objects by using different resolutions at testing and training time. The method has resulted in notable performance enhancements in different existing algorithms, for example achieving Top-1 accuracy in ResNet-50 [

9) VIT-H/14
Dosovitskiy et al. [87] from Google Research have applied the Transformer architecture [88], which is mainly used in natural language processing tasks, to image classification tasks. Their model, ViT-H/14, has shown a striking result with an 88.6% Top-1 accuracy on the ImageNet [31] dataset. The model's accuracy is thus ranked as the highest of 2020 in Table 5.

10) LambdaResNet200
LambdaResNet200 [89] is a quite recent CNN architecture, which was reported at the 2021 ICLR conference. It introduces novel lambda layers, which enable a 4.5× faster training on modern ML accelerators compared to EfficientNet-L2 [80]. Along with the training performance, its Top-1 accuracy performance on [31] has been listed as 84.3%, which is excellent as it keeps a low model parameter number.

11) META PSEUDO LABELS
Google AI [90] introduced a novel semi-supervised learning method, named Meta Pseudo Labels, which has enabled the highest Top-1 accuracy on [31] to date. The idea improves on the student-teacher concept in [80], whereby the teacher network generates pseudo labels on unlabelled data [91] to teach a student. The difference is that the teacher is not fixed, but is constantly adapted by means of feedback that comes from the performance of the student network on a labelled dataset [31]. That produces better pseudo labels compared with the method in [80]. In this way, EfficientNet-B6-Wide and EfficientNet-L2 [80], when trained with the method, achieved the highest Top-1 accuracy scores of 90.0% and 90.2%, respectively. However, the parameter number is also comparable to those reported in [80] and [87], which are the highest among the algorithms introduced above.

B. LIGHTWEIGHT CNN DESIGNS
1) INCEPTION-v1
Inception-v1, also known as GoogLeNet, proposed by Szegedy et al. [29], is a distinctive and lightweight architecture, which is based on [92], [93]. The authors introduced the inception modules, which comprise convolutional kernels of different sizes operating over the same input, and then stack all the outputs from the different kernels. Using convolution kernels of different sizes, the architecture, which contains 22 layers with 6.8M parameters, was able to capture features at several scales. It was the first architecture to use 1 × 1 convolution kernels in the middle of the network to reduce dimensionality, parameter numbers, and computational budget in layers, while at the same time increasing nonlinearity. Global average pooling was only used at the end of the network instead of every layer. ReLU was the deployed activation function. By going deeper with these inception modules, it was ranked as the winner of the 2014 ImageNet competition [31].
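The effect of a 1 × 1 reduction layer on the parameter number can be sketched with a simple weight count. The channel numbers below (192 input channels, a reduction to 16 channels, and 32 output filters) are illustrative assumptions rather than values taken from the text, and biases are ignored.

```python
def conv_weights(in_channels, out_channels, kernel):
    """Weight count of a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_channels * out_channels


# Illustrative (assumed) channel numbers for one 5 x 5 branch of an inception module.
in_ch, reduce_ch, out_ch = 192, 16, 32

# 5 x 5 convolution applied directly to the input feature map.
direct = conv_weights(in_ch, out_ch, 5)

# 1 x 1 convolution that first reduces the depth, followed by the 5 x 5 convolution.
reduced = conv_weights(in_ch, reduce_ch, 1) + conv_weights(reduce_ch, out_ch, 5)

print(f"direct: {direct}, with 1x1 reduction: {reduced}, "
      f"ratio: {reduced / direct:.3f}")
```

With these numbers, the branch needs roughly one tenth of the weights, while the extra layer adds a further nonlinearity.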
Inception-v1 also introduced two auxiliary classifiers to improve the classification performance in the lower stages of the classification process and to strengthen the backpropagated gradient signal, as well as to provide additional regularization. The auxiliary networks are only employed for training and are not used in testing or at inference time.

2) INCEPTION-v3
Following the publication of Inception-v1 [29], Szegedy et al. [94] designed a new architecture called Inception-v3. The study first introduced the Inception-v2 model with a deep learning method, called batch normalization, which normalizes the value distribution prior to the next layer, increasing the accuracy performance and training speed. Keeping that novelty and introducing an additional factorization technique led to the release of the third version, Inception-v3.
Inception-v2 had some weaknesses in its loss function, in the absence of batch normalisation in its auxiliary classifiers, and in its optimization method. (Readers are referred to the Appendix for information on optimization algorithms.) Inception-v3 addressed these weaknesses and avoided computational problems by using the factorization method, which replaces large kernels with smaller ones, for example by factorising n × n kernels into 1 × n and n × 1 asymmetric kernels. This led to a reduction in the parameter number of the architecture without compromising network efficiency. For instance, by using a 5 × 5 kernel in a layer, the number of parameters is equal to 5 × 5 = 25. However, by using two layers of 3 × 3 kernels, the number of parameters amounts to 3 × 3 + 3 × 3 = 18. Thus, the number of parameters is reduced by 28%.
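The arithmetic above, and the corresponding saving for asymmetric kernels, can be verified with a few lines. This is a minimal sketch for a single input and output channel; the helper names are hypothetical.

```python
def kernel_weights(height, width):
    """Weights of one single-channel kernel of the given size."""
    return height * width


# One 5 x 5 kernel versus two stacked 3 x 3 kernels (same receptive field).
single_5x5 = kernel_weights(5, 5)
two_3x3 = 2 * kernel_weights(3, 3)
saving_stacked = 1 - two_3x3 / single_5x5


def asymmetric_saving(n):
    """Relative saving when an n x n kernel is replaced by a 1 x n and an n x 1 kernel."""
    factorised = kernel_weights(1, n) + kernel_weights(n, 1)
    return 1 - factorised / kernel_weights(n, n)


print(f"5x5 -> two 3x3: {single_5x5} -> {two_3x3} weights "
      f"({saving_stacked:.0%} fewer)")
for n in (3, 5, 7):
    print(f"{n}x{n} -> 1x{n} + {n}x1: {asymmetric_saving(n):.0%} fewer")
```

The saving of the asymmetric factorisation equals 1 - 2/n, so it grows with the kernel size.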
Other specific features in Inception-v3 were the use of only one auxiliary classifier, aimed for regularization and batch normalization, and employing efficient grid size-reduction techniques instead of using a pooling layer. Due to its new efficient properties, the Inception-v3 architecture was able to achieve increased accuracy levels on [31], as detailed in Table 5.

3) ResNet
As can be observed from the architectures discussed until now, increasing the number of network layers leads to better accuracy results. Taking this on board, the Microsoft researchers He et al. designed the ResNet model [17], including five networks with layer numbers varying from 18 to 152.
A ResNet architecture comprising 50 layers is illustrated in Fig. 7, where bottleneck modules comprise two 1 × 1 and one 3 × 3 spatial convolutions with varied channel numbers. There are residual (identity/projection) connections, which link sequential modules to increase feature reuse. The primary contribution of the residual connections was the prevention of the vanishing gradient problem in the first layers. By means of going deeper in the network, ResNet-50 was ranked as the winner of 2015 ILSVRC [31] with a 93.3% Top-5 accuracy performance.

4) INCEPTION-v4
The high performances of the Inception models [29], [94] and the high impact on the performance of the residual connections in ResNet [17], have raised the question as to whether there is an advantage of combining the Inception models and the residual connections. Inspired by that idea, Szegedy et al. [95] proposed the Inception-v4 architecture, which is an empirical study demonstrating that residual connections considerably accelerate the training performance of Inception networks. Moreover, it was proved that the training process might be facilitated by deploying an activation function scaling technique. As a result, Inception-v4 achieved a 3.8% Top-5 error rate in ImageNet ILSVRC competition [31]. Inception-ResNet-v2 was another model proposed as part of the same study, which even outperformed Inception-v4.

5) TRIMPS-SOUSHEN
The Trimps-Soushen model was the winner of ILSVRC 2016 [31], with a 2.99% Top-5 error rate. However, the creators of the Trimps-Soushen architecture did not publish any technical report or paper. They only released the testing results of their model on the ImageNet dataset in a presentation at the 2016 ECCV workshop. Nevertheless, the results have been approved and listed by ImageNet.

6) SqueezeNet
Iandola et al. [96] proposed a new compact model, called SqueezeNet, based on the AlexNet architecture [28]. They achieved a remarkably low model size of 0.5MB, keeping the AlexNet accuracy levels. In order to obtain that result, three steps were followed. Initially, most of the 3 × 3 spatial 2D filters were replaced with 1 × 1 filters, providing lower parameter numbers. Following that, squeeze layers were used to reduce the number of input channels to the 3 × 3 filters, and finally, in order to keep large activation maps in the convolutional network, a late downsampling technique was applied.
As a result, the same level of accuracy as in [28] was achieved with 50× fewer parameters, and thus, SqueezeNet has been one of the most striking lightweight algorithms.

7) MobileNets
In 2017, Howard et al. [70] from Google Inc. introduced the MobileNets CNN, which was focused on reducing model size and complexity, targeting mobile and embedded vision applications. The depth-wise separable convolutions in Fig. 4 (b) and (c) form the basis of MobileNets. Compared with an equivalent spatial convolution structure, the parameter numbers were reduced by means of depth-wise separable convolutions. However, MobileNets outperformed neither the model size of SqueezeNet [96] nor its parameter number of around 1.2M. In terms of the accuracy performance, on the other hand, MobileNets has shown a distinct enhancement with 13.1% higher Top-1 accuracy than SqueezeNet.
For a simple representation of a spatial convolution like the one expressed by (1), we suppose that the input/output spatial dimensions are identical, and the stride step is 1. Thus, we can compute the outputs of the depth-wise convolution, $\hat{O}_{k,l,m}$, in Fig. 4 (b) by
$$\hat{O}_{k,l,m} = \sum_{i,j} \hat{K}_{i,j,m} \cdot I_{k+i-1,\,l+j-1,\,m}$$
where $\hat{K} \in \mathbb{R}^{K_D \times K_D \times I_C}$ denotes a depth-wise convolution kernel. Likewise, the point-wise convolution, $O_{k,l,n}$, in Fig. 4 (c) is defined by
$$O_{k,l,n} = \sum_{m} \bar{K}_{1,1,m,n} \cdot \hat{O}_{k,l,m}$$
where $\bar{K} \in \mathbb{R}^{1 \times 1 \times I_C \times F}$ is a point-wise convolution kernel. Similar to grouped convolutions (Sec. III.A.1)), the reason why depth-wise convolutions are selected is related to their efficiency, as they considerably reduce the parameter number and computational cost. This is because, in the standard spatial convolutions, the parameter number is defined by
$$F \times I_C \times K_D^2$$
whereas the parameter number in a depth-wise convolution followed by a point-wise convolution is calculated by
$$(I_C \times K_D^2) + (I_C \times F)$$
which is considerably smaller. As to the computational cost comparison, we define the FLOPs numbers of the spatial and the depth-wise separable convolutions as
$$F \times I_C \times K_D^2 \times I_D^2 \quad \text{and} \quad (I_C \times K_D^2 + I_C \times F) \times I_D^2$$
respectively, where $I_D$ is the spatial dimension of the feature map. These two expressions can be evaluated by using identical dimensions. They could also be compared by the ratio of the number of FLOPs between the depth-wise separable convolution and the 2D convolution, which is defined by
$$\frac{(I_C \times K_D^2 + I_C \times F) \times I_D^2}{F \times I_C \times K_D^2 \times I_D^2} = \frac{1}{F} + \frac{1}{K_D^2}$$
It is evident from the above expression that the ratio has a fractional value, which shows that depth-wise separable convolutions are significantly more efficient than spatial convolutions in terms of parameter number and computational budget.
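The comparison can be made concrete with a small calculation. The sketch below assumes that one FLOP is counted per kernel weight and output position, that biases are ignored, and that the input and output spatial dimensions are identical; the function names and the layer dimensions are hypothetical.

```python
def spatial_cost(k_d, i_c, f, i_d):
    """(parameters, FLOPs) of a standard spatial 2D convolution."""
    params = f * i_c * k_d ** 2
    return params, params * i_d ** 2


def separable_cost(k_d, i_c, f, i_d):
    """(parameters, FLOPs) of a depth-wise followed by a point-wise convolution."""
    depth_wise = i_c * k_d ** 2   # one k_d x k_d kernel per input channel
    point_wise = i_c * f          # f kernels of size 1 x 1 x i_c
    params = depth_wise + point_wise
    return params, params * i_d ** 2


# Hypothetical layer: 3 x 3 kernels, 64 input channels, 128 filters, 56 x 56 maps.
k_d, i_c, f, i_d = 3, 64, 128, 56
spatial_params, spatial_flops = spatial_cost(k_d, i_c, f, i_d)
separable_params, separable_flops = separable_cost(k_d, i_c, f, i_d)

ratio = separable_flops / spatial_flops
closed_form = 1 / f + 1 / k_d ** 2

print(f"parameters: {spatial_params} vs {separable_params}")
print(f"FLOPs ratio: {ratio:.4f} (closed form 1/F + 1/K_D^2 = {closed_form:.4f})")
```

For 3 × 3 kernels the ratio approaches 1/9 as the filter number grows, i.e. the separable form needs roughly eight to nine times fewer operations.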

8) XCEPTION
Xception is another model developed by Google [97], which was inspired by Inception-v3 [94], discussed in Section III.B.2) above. It employs depth-wise separable convolutions that generally consist of depth-wise and point-wise convolutions, as illustrated in Fig. 4 (b) and (c). The Inception modules of Inception-v3, which are built from spatial 2D convolutions, were replaced with depth-wise separable convolutions, which improved its performance. In spatial convolutions, 1 × 1 convolution kernels provide for cross-feature map correlations, whereas 3 × 3 and 5 × 5 convolution kernels support spatial correlations. In depth-wise separable convolutions, these correlations are provided without even using intermediate activations, and due to that Xception has outperformed Inception-v3.

9) ShiftNet
The popularity and extensive usage of depth-wise separable convolutions in CNNs led to several further research works, including a remarkable approach proposed by Wu et al. [98]. In theory, depth-wise separable convolutions improve the parameter number and computational cost of CNNs. In practice, however, they are difficult to implement efficiently, which could be demonstrated by the ratio between computational cost and memory access. This ratio is expressed in spatial 2D convolutions as follows:
$$\frac{K_D^2 \cdot I_C \cdot F \cdot I_D^2}{I_D^2 \, (I_C + F) + K_D^2 \cdot I_C \cdot F} \qquad (10)$$
whereas the ratio for the depth-wise convolution is the following:
$$\frac{K_D^2 \cdot I_C \cdot I_D^2}{2 \, I_D^2 \cdot I_C + K_D^2 \cdot I_C} \qquad (11)$$
where the input/output dimensions are identical.
A lower ratio indicates that more time is spent on memory accesses, which are several orders of magnitude slower and consume more energy than floating-point operations. Thus, this drawback prevents I/O-bound devices from performing to their maximum computational ability. Based upon this viewpoint, a novel convolutional operation, named shift, was proposed.
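The gap between the two ratios can be illustrated numerically. The sketch below assumes the formulation commonly attributed to Wu et al. [98], in which the computation is the number of multiply-accumulate operations and the memory access is the sum of the input activations, output activations, and kernel weights; the function names and layer dimensions are hypothetical.

```python
def spatial_ratio(k_d, i_c, f, i_d):
    """Computation-to-memory-access ratio of a standard spatial 2D convolution."""
    computation = k_d ** 2 * i_c * f * i_d ** 2
    memory = i_d ** 2 * (i_c + f) + k_d ** 2 * i_c * f  # activations + weights
    return computation / memory


def depth_wise_ratio(k_d, i_c, i_d):
    """Computation-to-memory-access ratio of a depth-wise convolution."""
    computation = k_d ** 2 * i_c * i_d ** 2
    memory = 2 * i_d ** 2 * i_c + k_d ** 2 * i_c  # activations + weights
    return computation / memory


# Hypothetical layer: 3 x 3 kernels, 64 input channels, 64 filters, 56 x 56 maps.
k_d, i_c, f, i_d = 3, 64, 64, 56
standard = spatial_ratio(k_d, i_c, f, i_d)
depth_wise = depth_wise_ratio(k_d, i_c, i_d)

print(f"spatial 2D: {standard:.1f} operations per memory access")
print(f"depth-wise: {depth_wise:.1f} operations per memory access")
```

Under these assumptions the depth-wise convolution performs only a handful of operations per memory access, so its runtime tends to be dominated by data movement rather than arithmetic.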
The shift operation, illustrated in Fig. 4 (f), can be viewed as a special type of depth-wise convolution, the output of which, $\tilde{O}_{k,l,m}$, could be expressed by
$$\tilde{O}_{k,l,m} = \sum_{i,j} \tilde{K}_{i,j,m} \cdot I_{k+i-1,\,l+j-1,\,m}$$
where the kernel of the operation is a tensor $\tilde{K} \in \mathbb{R}^{K_D \times K_D \times I_C}$ and every kernel can be represented as follows:
$$\tilde{K}_{i,j,m} = \begin{cases} 1, & \text{if } i = i_m \text{ and } j = j_m \\ 0, & \text{otherwise} \end{cases} \qquad (12)$$
In (12), the $i_m$ and $j_m$ indices depend on the channel and set exactly one of the values in $\tilde{K}_{:,:,m} \in \mathbb{R}^{K_D \times K_D}$, called the shift matrix, to 1. Thus, $K_D^2$ possible shift matrices exist and each of them corresponds to a shift direction.
The shift operation could be seen as a bundle of memory operations shifting the input tensor channels into certain directions by convolving them with shift kernels. In contrast to spatial 2D and depth-wise separable convolutions, the shift operation itself does not bring any overhead in terms of parameter or FLOPs cost, and it is fused with the point-wise convolution that follows it. This allows the point-wise convolution to directly fetch the shifted data from the cache. Thus, the shift operation has allowed a lightweight CNN model to achieve a remarkable performance on both parameter number and accuracy in [31], as seen from Table 5.
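The behaviour of the shift operation can be sketched in plain Python. In this minimal illustration each channel is moved by one offset (di, dj), which plays the role of the single non-zero entry of its shift matrix; zero padding at the borders and the particular directions chosen are assumptions made for the example.

```python
def shift_channel(channel, di, dj):
    """Return out with out[k][l] = channel[k + di][l + dj], zero-padded at the borders."""
    height, width = len(channel), len(channel[0])
    out = [[0] * width for _ in range(height)]
    for k in range(height):
        for l in range(width):
            src_k, src_l = k + di, l + dj
            if 0 <= src_k < height and 0 <= src_l < width:
                out[k][l] = channel[src_k][src_l]
    return out


def shift(feature_map, directions):
    """Apply one shift direction per channel; no weights and no multiplications."""
    return [shift_channel(ch, di, dj)
            for ch, (di, dj) in zip(feature_map, directions)]


channel = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]
feature_map = [channel, channel, channel]

# For a 3 x 3 shift kernel, di and dj each take a value in {-1, 0, 1}.
directions = [(0, 0), (0, 1), (-1, 0)]
shifted = shift(feature_map, directions)

for direction, out in zip(directions, shifted):
    print(direction, out)
```

A point-wise convolution applied afterwards mixes the differently shifted channels, which is how spatial information is aggregated without any spatial kernel weights.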

10) MobileNetV2
A year later, the MobileNets creators, Howard et al. [99], proposed a new algorithm, MobileNetV2, improving the performance of MobileNets by revising the residual structure, where skip connections were added between the bottleneck layers. Via those structural changes, MobileNetV2 improved the Top-1 accuracy performance to 72.0% and reduced the parameter number to 3.4M in comparison with MobileNets. In addition, MobileNetV2 was not only improved for image classification but was also designed for object detection tasks.

11) ShuffleNet
ShuffleNet [100] was designed as a computation-efficient algorithm for mobile devices. It is inspired by and linked to the grouped [28] and depth-wise [70], [97] types of convolutions. In a nutshell, shuffled grouped convolutions involve a grouped convolution combined with a channel shuffling.
As explained earlier in Section III.A.1), grouped convolutions significantly reduce the total number of operations. Nevertheless, there is a drawback that each filter group can only operate over a certain fixed portion of the information from the previous layers, as shown in Fig. 4 (d). As such, filter groups are limited to learning from a few features, which weakens the representation and information flow across different channel groups. Channel shuffling overcomes that problem by mixing up the information between channels. In Fig. 4 (e), the feature map $I \in \mathbb{R}^{I_D \times I_D \times I_C}$, obtained from the first grouped convolution via the three filter groups in Fig. 4 (d), is first divided into several subgroups, and then these subgroups are mixed up. Following the shuffling, the usual second grouped convolution is performed, with the difference that the representation and information flow between channel groups are strengthened.
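Channel shuffling can be sketched as a reshape, transpose, and flatten over the channel axis. The minimal example below operates on channel labels rather than on actual feature maps; the function name and the labels are hypothetical.

```python
def channel_shuffle(channels, groups):
    """Interleave channels so that each new group receives channels from every old group."""
    if len(channels) % groups:
        raise ValueError("number of channels must be divisible by groups")
    per_group = len(channels) // groups
    # "Reshape" into (groups, per_group) ...
    grouped = [channels[g * per_group:(g + 1) * per_group] for g in range(groups)]
    # ... then "transpose" to (per_group, groups) and flatten.
    return [grouped[g][i] for i in range(per_group) for g in range(groups)]


# Nine channels produced by three filter groups a, b, and c.
channels = ["a0", "a1", "a2", "b0", "b1", "b2", "c0", "c1", "c2"]
shuffled = channel_shuffle(channels, groups=3)
print(shuffled)

# Each group of the next grouped convolution now sees one channel from a, b, and c.
next_groups = [shuffled[i:i + 3] for i in range(0, len(shuffled), 3)]
print(next_groups)
```

The operation only reorders channels, so it adds no parameters and no arithmetic to the network.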
In ShuffleNet, the 1 × 1 point-wise convolutions are also grouped, in contrast to the dense 1 × 1 convolutions employed in [70], [97]. The idea behind that was the high computational cost of the dense 1 × 1 point-wise convolutions. The operation is identical to the grouped convolutions, with a minor alteration of the kernel size used.
As a result, the model utilized grouped, depth-wise, and point-wise convolutions, enabling a computation-efficient and lightweight algorithm while maintaining an accuracy level of 73.6% on [31], as seen from Table 5. Thus, it has gained popularity in CNNs for mobile devices.

12) SqueezeNext
After the SqueezeNet model [96] dramatically reduced the parameter number of AlexNet [28] while keeping similar accuracy levels, Gholami et al. [101] introduced a new family of CNN architectures, SqueezeNext, which enables the same accuracy as AlexNet with 112× fewer parameters. There are different types of SqueezeNext models with parameter numbers varying between 0.54M and 1.5M. One of them, the 1.0-SqNxt-44 model, has achieved a 5% better Top-1 and Top-5 performance compared to SqueezeNet with the same number of parameters.
Their design was based on lower rank filters and on the compression of the redundant parameters of SqueezeNet, which do not affect the accuracy. Moreover, the architecture used a final bottleneck layer to reduce the input channel size of the last FCL in SqueezeNet. The application of these microarchitecture-level strategies has allowed a considerable reduction of the parameter number. (Details about micro/macro architecture design parameters are introduced in the Appendix.)

13) ColorNet
Another interesting finding has been reported by Gowda and Yuan [102]. They have focused on the importance of the colour spaces of RGB images in datasets and shown that colour spaces, especially transformations of RGB images, are able to significantly improve classification accuracy. Based on that idea, their architecture, named ColorNet, takes RGB images as input and converts the images into 7 different colour spaces. Following that, every colour space is used as an input to individual DenseNet modules [103]. By applying this method, the 84.6% Top-1 accuracy performance of ColorNet on ImageNet [31] was listed as the winner in 2019 in Table 5.

C. DESIGN TRENDS ON IMAGENET DATASET
The performance of the most prominent CNNs for image classification is summarised in Table 5, in which the names of the lightweight models are shaded in light grey, while the names of the heavyweight models are shaded in dark grey. For each model, the following data are provided: literature source, year of introduction, Top-1 accuracy, Top-5 accuracy, number of parameters, number of FLOPs, model size in MB, and number of layers.
It can be observed from Table 5 that the heavyweight models achieve better accuracy levels mostly by increasing the layer and parameter numbers without regard for the model size or energy consumption. For instance, ViT-H/14 [87], which has achieved the best ever Top-5 accuracy of 99.0%, has a high layer number of 132, and a very high parameter number of 632M as well. Similarly, MPL-EfficientNet-L2 [90], which is a deeper CNN model than ViT-H/14, has obtained the second-best score on Top-5 accuracy of 98.8% and it has reached the highest Top-1 accuracy ever of 90.2% by including 154 network layers and having 480M parameters. A major downside of the heavyweight CNN designs is that their large parameter numbers prevent them from being implemented on computational devices with limited off-chip memory, such as FPGAs, ASICs, SoCs, etc.
On the other hand, as can be seen from Table 5, the lightweight CNNs aim to reduce the model sizes by conceding their accuracy performance, for example SqueezeNet [96], MobileNets [70], ShiftNet [98], and SqueezeNext [101]. In SqueezeNet, the layer number has been reduced to 19, the parameter number to 1.2M, and the model size to 0.5MB by utilising 1 × 1 2D spatial convolutions. SqueezeNet was implemented in an FPGA with 10MB on-chip memory without using an off-chip memory [104]. By taking advantage of depth-wise and point-wise convolutions, MobileNets obtained a parameter number of 4.2M. As the best performing lightweight algorithm in Table 5, ShiftNet has benefitted from a parameter-free shift operation in its layers, achieving AlexNet accuracy [28] with only 0.8M parameters. The aforementioned CNN developments reveal the importance and impact of enhancing the convolution operations and kernels, which have substantially contributed to these breakthroughs. However, it is beyond dispute that the accuracy performances of the lightweight CNN designs are considerably lower compared to the heavyweight models. Nevertheless, both low and high parameter numbers/model sizes have pluses and minuses, as discussed in Section V.

D. EVALUATION OF COMMON CONVOLUTION PROCESSES
Due to the use of CNNs for image classification in a wide range of applications, many researchers have worked on techniques to reduce their high storage overhead and computational cost, resulting in compact and accurate model designs. As part of that, it has been recognised that carefully designed convolutional operations, which serve as a basic component of the network layers, such as the ones shown in Fig. 4, can bring significant benefits. Some of them are accuracy-oriented, whereas others improve efficiency. In this section, the training and runtime performances of each of the convolutional operations in Fig. 4 are examined and compared by employing them in the well-known CNN architecture ResNet [17].
As explained in Section II.A.2), the spatial 2D convolution in Fig. 4 (a) is frequently used especially in heavyweight models [28], [29], [75]. The rest of the convolution types in Fig. 4 are proposed to enhance the performance of the spatial convolutions. For example, the grouped convolution [28] in Fig. 4 (d) brings three essential benefits, which are: enabling of model-parallelization; more structured learning with unique representations of data; and computational efficiency by reducing multiply-add operation number. Furthermore, the shuffle operation [100] in Fig. 4 (e) allows improving the accuracy of the grouped convolution kernels. On the other hand, the depth-wise separable convolutions in Fig. 4 (b) and (c), which are used in designing lightweight architectures [70], [97] reduce the parameter and FLOPs numbers. However, despite their superiority in terms of lowering the computational cost, their fragmented memory footprints prevent efficient implementations in practice as shown by expressions (10) and (11). To overcome that constraint, the shift operation [98] in Fig. 4 (f) presents an alternative by cooperating with the point-wise convolutions in Fig. 4 (c) and aggregating the spatial information for free.
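To make the parameter savings of these convolution types concrete, the following minimal sketch counts the weights of a single layer under the standard textbook formulas (bias terms ignored). The function names and the 3 × 3, 64-to-64-channel example are illustrative assumptions and are not taken from the experiments reported in this paper.

```python
def conv_params(k, c_in, c_out, groups=1):
    """Weights of a k x k convolution with the given number of groups (no bias)."""
    assert c_in % groups == 0 and c_out % groups == 0
    # Each output channel only sees c_in / groups input channels.
    return k * k * (c_in // groups) * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Depth-wise k x k convolution followed by a 1 x 1 point-wise convolution."""
    depthwise = conv_params(k, c_in, c_in, groups=c_in)  # one k x k filter per channel
    pointwise = conv_params(1, c_in, c_out)
    return depthwise + pointwise


def shift_pointwise_params(c_in, c_out):
    """Shift operation (no learnable weights) followed by a 1 x 1 point-wise convolution."""
    return conv_params(1, c_in, c_out)


if __name__ == "__main__":
    k, c_in, c_out = 3, 64, 64  # hypothetical layer dimensions
    print("spatial 2D:          ", conv_params(k, c_in, c_out))
    print("grouped (2 groups):  ", conv_params(k, c_in, c_out, groups=2))
    print("depth-wise separable:", depthwise_separable_params(k, c_in, c_out))
    print("shift + point-wise:  ", shift_pointwise_params(c_in, c_out))
```

The counts only reflect storage; as noted above, fewer parameters do not by themselves guarantee a faster layer, because memory access patterns also matter.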

1) OUTLINE OF EXPERIMENT
The approach is to train separately different convolutions using the same reference architecture (ResNet-50 [17]). The six different convolution types in Fig. 4 are placed in the bottleneck module of the ResNet-50 architecture in Fig. 7 for a fair comparison. As the model in Fig. 7 is originally designed with spatial 2D convolutions, it forms the first experimental CNN with the required dimension adjustments arising from the used dataset, and four more designs are constructed using the remaining convolution types in Fig. 4. While the second is shaped with two-grouped convolutional kernels, the third design is created by three-grouped shuffle operations. As an exception in the fourth, 3 × 3 and 1 × 1 spatial convolutions in the original model are replaced by depth-wise and pointwise convolutional filters, respectively as they are usually deployed together [97], [70]. Regarding the final design, we remove the 3×3 spatial convolutions and place shift operations to form the fifth model. Then, we train the designed models, test them and analyse their accuracy and inference time on the CIFAR-10 and CIFAR-100 [18] datasets. Moreover, by decomposing the architectures into basic components they are analysed to find their individual execution times on compute-bound (CPU) and memory-bound (GPU) computation platforms during inference.
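As a rough illustration of the shuffle step used in the third design, the sketch below permutes channel indices in the way commonly described for ShuffleNet-style channel shuffling: the channels are viewed as a (groups × channels-per-group) grid, transposed, and flattened. It operates on a plain list of channel labels rather than on tensors, and the function name is an assumption made for this example.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups: reshape to (groups, n), transpose, flatten."""
    total = len(channels)
    assert total % groups == 0, "channel count must be divisible by the group count"
    per_group = total // groups
    return [
        channels[g * per_group + i]
        for i in range(per_group)   # position inside each group
        for g in range(groups)      # walk across the groups
    ]


if __name__ == "__main__":
    # Six channels in three groups: [0, 1 | 2, 3 | 4, 5]
    print(channel_shuffle(list(range(6)), groups=3))
```

After the shuffle, every group of the next grouped convolution receives channels originating from all previous groups, which is what lets information flow between groups.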
In the training, two NVIDIA Tesla P100 GPUs, part of the ALICE High-Performance Computing Facility at the University of Leicester, are employed according to the requirements of the five different models, with a mini-batch size of 128 and a base learning rate of 0.1. The weight decay and momentum are 0.0001 and 0.9, respectively. The training for each architecture stops after 32k iterations. In addition, the learning rate decays by a factor of 10 after 16k and 24k iterations. The inference time of the trained models is tested on both a CPU (Intel Core i5-8500) and a GPU (Tesla P100) by decomposing it into slices proportional to the time taken by the basic components of the architecture.
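The step schedule described above can be sketched as a small function. The base learning rate, the decay factor, and the milestones are those stated in the text; treating the milestone iteration itself as already decayed is an assumption of this sketch, as is the function name.

```python
def learning_rate(iteration, base_lr=0.1, milestones=(16_000, 24_000), factor=10.0):
    """Step schedule: divide the learning rate by `factor` at each passed milestone."""
    # Assumption: the decay takes effect at the milestone iteration itself.
    passed = sum(1 for m in milestones if iteration >= m)
    return base_lr / (factor ** passed)


if __name__ == "__main__":
    for it in (0, 15_999, 16_000, 24_000, 31_999):
        print(it, learning_rate(it))
```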

2) EVALUATION RESULTS
The individual performances of the evaluated convolution types are presented in Table 6 and Fig. 8.
As can be seen from Table 6, different convolutions perform differently when deployed in the same CNN model.
The grouped convolution and shuffle operation reduced the parameter and computational overheads compared to the spatial 2D kernels as well as enabled a better accuracy, which confirms the findings presented in the AlexNet and ShuffleNet sections above. Notably, as shown in Table 6, there is a sharp drop in the parameter and FLOPs number when the depth-wise separable and shift operations are being employed. Moreover, they achieve a similar accuracy performance compared to the others.
It should be mentioned that the accuracy levels of the evaluated models in Table 6 are lower than those reported in [17], [98], [100]. One reason for that is the smaller number of iterations in our experiment, as the aim here is to assess the relative performance of the different convolution types. It can also be seen from Table 6 that the accuracy rates on the CIFAR-10 dataset are higher compared to CIFAR-100 for all convolution types, which is due to the difference in the class number, as detailed in Section II.D.1).
Fig. 8 presents a summary of the timing analysis of the ResNet-50 [17] model during the inference stage for the evaluated convolutions and operations in Fig. 4. For a clear demonstration, element-wise operations, such as ReLU, batch normalization, tensor addition and concatenation, etc., are grouped under a single label (Elwise.). Also, the pre-processing time for the data feeding is not represented here. Results are obtained under PyTorch and averaged over 30 runs.
First of all, it can be seen from Fig. 8 that the element-wise operations (Elwise.) in each ResNet-50 model take more time on the GPU than on the CPU during inference, as they are better suited to compute-bound (CPU) computation platforms. The reason is that their memory access cost is heavy even though they have small FLOPs. As an expected outcome, the slices of the convolution (Conv.) operations in Fig. 8 (b) and (c) are lower compared to (a), as grouped convolutions reduce computational overhead. It can also be observed from Fig. 8 (c) that the slice of the shuffling operation (Shuffle) equals 12% of the runtime of the GPU, which means that its parallel hardware is limited by the shuffling operation. Nevertheless, the accuracy level and efficiency of the shuffling operation outperform spatial convolutions as per the results in Table 6.
The pie charts in Fig. 8 (d) verify expressions (10) and (11). Even though depth-wise convolution filters require fewer parameters and lower computational overhead in theory, unlike spatial convolutions, their memory access demand dominates computations and limits the performance of memory-bound computation platforms such as GPUs. As can be seen in Fig. 8 (d), the depth-wise convolution operations (Depthwise.) occupy a substantial part of the runtime on the GPU, amounting to 53%. This drawback also indicates higher energy consumption per floating-point operation, thereby leading to an inefficient computation. Lastly, Fig. 8 (e) shows that the shift operation goes a step further than the depth-wise separable convolution in terms of performance. Despite its considerable runtime, caused by the large number of memory accesses instigating bottlenecks, the shift operation requires a lower number of parameters and FLOPs and occupies a smaller slice of the GPU runtime compared to the depth-wise convolution in Fig. 8 (d), i.e. 25% versus 53%.
In conclusion, based on a fair comparison, Table 6 and Fig. 8 practically confirm the introduced theoretical findings for the popular convolution types and reveal invisible timing aspects related to their runtime execution.

IV. REVIEW OF CNNs FOR OBJECT DETECTION
Object detection has been a long-standing theme in computer vision research. With regards to self-driving vehicles, object detection is an essential part of modules in the ADSs pipeline, such as scene understanding and object tracking. The aim is to identify the locations and sizes of the objects present in images taken by cameras at the viewing angle of the car. These could be both static objects, e.g., traffic lights, road signs, etc., and dynamic objects, e.g., vehicles, pedestrians, etc.
Looking back over the last two decades of detection methods, until the early 2010s the limitations on computing resources, datasets, and the mostly theoretical nature of DNN development had led to employing traditional detectors, such as DPM [106], Selective Search [107], Oxford-MKL [108], HOG [109], NLPR-HOGLBP [110], SIFT [111], VJ Det [112], Bag of Words [113], etc. A modular data-flow diagram and a historical publication timeline of these popular detectors are depicted in Fig. 9 and Fig. 10, respectively.
The purpose of the Region Selector module in Fig. 9 is to prepare sliding windows of different sizes and to slide them over the image from left to right and top to bottom by keeping a certain step size. Then, cropped image blocks are produced by the sliding windows and converted to images with uniform dimensions. In the Feature Extractor module, features are extracted from the images by the way of deploying different algorithms such as HOG [109], SIFT [111], etc. The last module in Fig. 9 is a Classifier, which is used to identify the category of objects, extracted in the previous step via different algorithms like SVM [114] and Adaboost [115]. Despite their popularity, the traditional detectors were relatively mature and had multiple drawbacks, such as high window redundancy, computing complexity, etc.
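The Region Selector step can be sketched as a generator that enumerates windows from left to right and top to bottom with a fixed step. The image size, window sizes, and step in the usage lines are hypothetical values chosen only to show how quickly the number of candidate windows grows, which is the window redundancy mentioned above.

```python
def sliding_windows(img_w, img_h, win_sizes, step):
    """Yield (x, y, w, h) windows, scanning left to right and top to bottom."""
    for w, h in win_sizes:
        for y in range(0, img_h - h + 1, step):       # top to bottom
            for x in range(0, img_w - w + 1, step):   # left to right
                yield (x, y, w, h)


if __name__ == "__main__":
    # Hypothetical setting: a 640 x 480 image, two window sizes, step of 8 pixels.
    windows = list(sliding_windows(640, 480, [(64, 64), (128, 128)], step=8))
    print("number of candidate windows:", len(windows))
    print("first window:", windows[0])
```

Each yielded window would then be cropped, resized to a uniform size, and passed to the Feature Extractor and Classifier modules.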
Following the emergence of high-performance computing devices and inclusive datasets as well as the staggering performance of AlexNet [28], in the last decade, the attention in the field gradually turned to CNN-based solutions, as illustrated in Fig. 10.
In the rest of this section, first, building blocks for CNN-based object detection are discussed, such as backbone networks and baseline types, then representative state-of-the-art designs are outlined, dividing them into two categories: one-stage detectors and two-stage detectors. Table 7 presents experimental results from the literature for each of the reviewed models based on particular image datasets.

A. BACKBONE NETWORKS
Object detection models deploy a classification algorithm as a backbone or a base network that acts as a basic feature extractor. The CNN models for image classification, introduced in Section III above, such as ResNet [17], Xception [97], SqueezeNet [96], MobileNets [70], ShuffleNet [100], can be directly adopted or improved with new features to be used as a backbone for object detection tasks. In several publications [116]- [118], specific requirements have been stated for existing classification algorithms to perform better in a detection pipeline. Additionally, it is recognised that even if detection speed is a key factor, e.g., in real-time applications, high precision and accuracy hold at least an equal importance as well. That means that there is a need for a good trade-off between speed and accuracy [19] when selecting backbone networks for detectors. The recently published high-performance CNN architectures for classification may help to solve the problem, as quality feature extraction raises up the whole detection performance. For example, He et al. [119] achieved a remarkable performance following this approach. Further details are included in the descriptions of the individual detection algorithms in Sections C and D below.

B. TYPICAL BASELINES IN CNN-BASED DETECTOR DESIGN
In CNN-based detector design, there are two commonly used baseline schemes: one-stage and two-stage detection pipelines. Among the two-stage detectors, the Region-based Convolutional Neural Network (R-CNN) series [120]-[122] are the most prevalent ones, whereas YOLO [116] and SSD [123] are the CNNs of choice for the one-stage detectors. The functional diagrams of the two baselines are illustrated in Fig. 11. The two-stage detector pipeline is shown in Fig. 11 (a), where the region proposal network block (Proposal Generation) feeds region proposals into the classifier and localization modules. These detectors differ from the one-stage detectors, depicted in Fig. 11 (b), in the operation of the RoI (Region of Interest) pooling layer. Unlike the two-stage detectors, where the prediction of the bounding boxes is carried out by the region proposal stage (Proposal Generation), in the one-stage detectors it is implemented directly from the input images, which allows them to perform faster. This is why one-stage detectors are preferred in real-time applications despite their lower accuracy compared to the two-stage ones. Weighing the merits of the two schemes, it can be said that the accuracy of object recognition and localization is higher in two-stage detectors, whereas the inference speed is better in one-stage detectors [124]. Fig. 10 displays the entire timeline of both the traditional object detectors and the CNN-based designs of the last decade. It can be seen from Fig. 10 that the two-stage CNN-based detectors preceded the development of the one-stage detectors. Therefore, in the following review of the existing object detectors with CNNs, first, the two-stage designs are introduced in Section C, and then the one-stage detectors in Section D.

C. TWO-STAGE CNN DETECTORS

1) R-CNN
In 2014, Girshick et al. [120] introduced a detection algorithm called Region-based Convolutional Neural Network (R-CNN), which featured a high object detection performance. Fig. 12 illustrates the architectures of the three best-known R-CNN models: R-CNN [120], Fast R-CNN [121] and Faster R-CNN [122] during both the testing and the training phase. The diagrams related to the testing phase are depicted using only amber colour. The diagrams of the training phase are depicted using both amber and purple colour for the Fast R-CNN and the Faster R-CNNs in Fig. 12 (b) and Fig. 12 (c), respectively, which include multi-task loss functions.
As shown in Fig. 12 (a), the R-CNN method starts by identifying ∼2k category-independent region proposals containing potential objects, and then passes each region to the backbone network AlexNet [28] to extract 4096-dimensional feature representations. Lastly, while an SVM is deployed for classification, fine adjustments of the bounding boxes are provided by a bounding-box regression and a greedy non-maximum suppression (NMS) method [125]. Using such a design, R-CNN achieved an improved mean average precision (mAP) of 58.5% on the Pascal VOC dataset [35]. Despite its accuracy performance, R-CNN had a few design problems. For instance, it was slow in processing the images, taking a second per image or 1 frame per second (fps), and required many GPU days to be trained, as ∼2k region proposals per image had to be classified in the network. The algorithm, therefore, could not be utilised in real-time. In order to solve these drawbacks, new versions of R-CNN were proposed afterwards.
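The greedy NMS step mentioned above can be sketched as follows. This is a generic formulation, not the exact procedure of [120] or [125]: boxes are assumed to be given as (x1, y1, x2, y2) corner coordinates, and the 0.5 IoU threshold is an illustrative default.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes, dropping any box that overlaps a kept one too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep  # indices of the surviving boxes, best score first


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(greedy_nms(boxes, scores))
```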

2) SPP-NET
He et al. [126] proposed a new algorithm, SPP-Net, with a strategy referred to as spatial pyramid pooling (SPP), which eliminates the need to pass each of the ∼2k region proposals individually through the backbone network as in R-CNN. SPP-Net first computes a convolutional feature map of the whole input image and then classifies each object proposal by taking advantage of the spatial pyramid pooling layer. Thus, SPP-Net accelerated the testing time of R-CNN by between 10 and 100 times. Another advantage of SPP-Net was that it could be used for increasing the performance of all CNN-based tasks, like image classification. ZF-Net [74] was used as a backbone to measure the mAP performance of SPP-Net on the Pascal VOC [35] dataset and it achieved 60.9%, which was better than the R-CNN performance with a lower computational load. In spite of its advantages, SPP-Net also shared several drawbacks of R-CNN [120]. The training was a multistage pipeline that included extracting features from the input image, which required fine-tuning of the network with log loss, using SVMs, and providing bounding boxes for the regression stage. All of these signalled the need for new approaches in the field.

3) FAST R-CNN
In an attempt to resolve the problems of R-CNN [120] and SPP-Net [126], outlined in Sections 1) and 2) above, Girshick [121], from Microsoft Research, proposed a new algorithm called Fast R-CNN at the ICCV 2015 conference. By introducing several innovations in the Fast R-CNN design, the training and testing speed of R-CNN was improved, and its detection accuracy was increased as well. A novel RoI pooling layer was fitted to the network, acting as a single-level version of the spatial pyramid pooling layer in SPP-Net. Fig. 12 (b) represents the architecture, where region proposals are first created by a Selective Search algorithm [107] and then mapped onto the feature maps. Following that, the RoI Pooling layer converts the different feature regions into fixed-size feature vectors, which are then used as inputs to the FCLs. Finally, object categories are predicted by the softmax operation [127] with log loss [128], in contrast to R-CNN, which employs an SVM [114], whereas object locations are predicted by a bounding-box regression with a smooth L1 (absolute) loss [129]. Its performance on the Pascal VOC 2007, 2010, and 2012 datasets when trained with the backbone network VGG-16 [75] was measured as 70.0%, 68.8%, and 68.4% mAP, respectively. Fast R-CNN was trained on the VGG-16 base network 9× faster than R-CNN and was 213× faster during testing. Compared to SPP-Net, Fast R-CNN was 3× faster in training, 10× faster in testing, and more accurate when training on VGG-16. Besides improving training, testing, and accuracy performance, there were three other contributions: end-to-end training, no need for disk storage, and updates on all network layers.
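For reference, the smooth L1 loss used for bounding-box regression is commonly defined as 0.5x² for |x| < 1 and |x| − 0.5 otherwise, which makes it quadratic near zero and linear for large errors. The sketch below applies it to the four box offsets; the function names and the example offsets are illustrative assumptions.

```python
def smooth_l1(x):
    """Smooth L1: quadratic for small errors, linear for large ones."""
    ax = abs(x)
    return 0.5 * ax * ax if ax < 1.0 else ax - 0.5


def box_regression_loss(predicted, target):
    """Sum of smooth L1 terms over the four box offsets (e.g. tx, ty, tw, th)."""
    assert len(predicted) == len(target) == 4
    return sum(smooth_l1(p - t) for p, t in zip(predicted, target))


if __name__ == "__main__":
    # Hypothetical predicted and target offsets for a single RoI.
    print(box_regression_loss((0.5, 0.0, 2.0, 0.0), (0.0, 0.0, 0.0, 0.0)))
```

The linear tail means that a single badly localised proposal contributes less gradient than it would under a squared (L2) loss.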

4) FASTER R-CNN
In R-CNN [120] and Fast R-CNN [121], region proposals were generated by separate blocks, rather than by a single convolutional neural network. That was time-consuming, i.e., ∼2.3s for making predictions and ∼2s for the generation of 2k RoIs, representing a bottleneck in system performance. To address this drawback, Ren et al. [122] proposed a novel Region Proposal Network (RPN), shown in Fig. 12 (c), which was constructed using CNN-based layers and in which the region proposals are generated right after the backbone network. Unlike in its predecessors, the number of RoIs is not a constant value and is defined by the size of the feature map. Thus, the region proposals were generated on GPUs at nearly no additional computational cost compared to previous baselines. The performance results of the algorithm are detailed in Table 7.

5) RFCN
Dai et al. [130] addressed the basic shortcoming of Faster R-CNN [122], which was related to the RoI layer being located between the backbone network and the object detection modules. Such a design caused a lack of translational invariance in the detector, as the RoI layer converted the multi-dimensional feature map to fixed-size FCLs. To overcome the problem, they proposed position-sensitive score maps in the design of RFCN, based on a fully convolutional network. Thus, all learnable parameters were convolutional and sharable in the kernels, which increased the translational invariance. It achieved 80.5% mAP on the Pascal VOC [35] dataset with a ResNet101 [17] backbone network. Further details are presented in Table 7.

6) MASK R-CNN
Mask R-CNN [119] is a result of extending Faster R-CNN [122]. It has been primarily designed as a framework to perform a segmentation task in addition to object detection. The Mask network, a modernised version of FCNs [131], is placed in parallel to the detection pipeline to generate segmentation masks for each RoI. Due to an alignment problem between the feature map and the original image in the RoI pooling layer, which uses integer quantization, a RoIAlign layer based on a bilinear interpolation method was employed. This preserves pixel-level alignment of the exact spatial locations between the feature map and the input image. It also enables better detection accuracy. Thus, by deploying ResNeXt-101 [78] as a base network, a rise in performance was achieved compared to the previous detectors, as shown in Table 7.
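The difference between integer quantization and bilinear interpolation can be illustrated on a tiny feature map. The sketch below samples a single fractional location; it is a simplified illustration of the interpolation idea only, not the full RoIAlign layer, and the 2 × 2 feature map is a hypothetical example.

```python
import math


def bilinear_sample(fmap, x, y):
    """Sample a 2D feature map (list of rows) at a fractional location (x, y)."""
    h, w = len(fmap), len(fmap[0])
    assert 0 <= x <= w - 1 and 0 <= y <= h - 1
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def quantized_sample(fmap, x, y):
    """Integer quantization: truncate the coordinates before reading the value."""
    return fmap[int(y)][int(x)]


if __name__ == "__main__":
    fmap = [[0.0, 10.0],
            [20.0, 30.0]]
    print("quantized:", quantized_sample(fmap, 0.5, 0.5))
    print("bilinear: ", bilinear_sample(fmap, 0.5, 0.5))
```

Truncation snaps the sample to the top-left cell and discards the sub-pixel offset, whereas interpolation blends the four neighbouring values according to that offset.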

7) FPN
In 2017, Lin et al. [117] proposed the Feature Pyramid Network (FPN), which fuses features and enhances the system detection performance. The algorithms proposed until then were either detecting on a top-level feature or performing an independent detection in each feature layer. This prevents classification and location information from being combined. FPN, however, introduced a way of laterally connecting layers top-down and bottom-up, evoking a pyramidal operation. FPN was incorporated in Faster R-CNN [122] and achieved state-of-the-art results on the MS COCO dataset [36], as detailed in Table 7.

8) NASNet
NASNet [77] was not solely designed for the purpose of image classification. It was also targeted at object detection tasks, as the learned features from the classification process can be used in detectors. Thus, NASNet used the Faster R-CNN [122]

9) DETR
Carion et al. [132] from the Facebook AI team developed a new CNN model, called DETR, which was designed as a solution to the direct set prediction problem for object detection. The model does not include anchor generation [133] and non-maximum suppression [125] as they prevent the use of prior information in the detection pipeline.
There are two main contributions. The first one is a set-based global loss, which enables powerful predictions through bipartite matching, and the other one is a transformer encoder-decoder architecture. Via these strategies, the model has further improved the accuracy rate of the two-stage detectors to 44.9% mAP at 10 fps on the COCO dataset [36].
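The matching step behind a set-based loss can be sketched as a one-to-one assignment between predictions and ground-truth objects that minimises a total cost. In practice this is typically solved with the Hungarian algorithm; the brute-force search below is a stand-in that is only feasible for a handful of objects, and the cost matrix is a hypothetical example rather than an actual DETR matching cost.

```python
from itertools import permutations


def match_predictions(cost):
    """Find the one-to-one assignment of predictions to ground truths with minimal cost.

    cost[i][j] is the cost of assigning prediction i to ground-truth j,
    with at least as many predictions as ground truths.
    Returns (assignment, total), where assignment[j] is the prediction matched to j.
    """
    n_pred, n_gt = len(cost), len(cost[0])
    assert n_pred >= n_gt
    best, best_total = None, float("inf")
    for perm in permutations(range(n_pred), n_gt):  # brute force, illustration only
        total = sum(cost[p][g] for g, p in enumerate(perm))
        if total < best_total:
            best, best_total = perm, total
    return list(best), best_total


if __name__ == "__main__":
    # Hypothetical costs: three predictions, two ground-truth objects.
    cost = [[0.9, 0.1],
            [0.2, 0.8],
            [0.5, 0.5]]
    assignment, total = match_predictions(cost)
    print("prediction matched to each ground truth:", assignment)
```

Predictions left unmatched would be trained to output a "no object" class, which is how duplicate detections are discouraged without a separate suppression step.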

10) DYNAMIC R-CNN
In 2020, Zhang et al. [134] published an advanced R-CNN algorithm. Their idea was based on addressing the inconsistencies between the dynamic training process and the fixed network adjustments, which significantly affect performance. To tackle the problem, Dynamic R-CNN provides an automatic and dynamic adjustment on both the shape of the regression loss function, e.g., the parameters of Smooth L1 Loss [129], and the label assignment criteria, e.g., the IoU threshold [135], by using the region proposal statistics in training. Moreover, no extra cost has been added to the model. Thus, the model performance was improved considerably achieving 49.2% mAP on the COCO dataset.

D. ONE-STAGE CNN DETECTORS

1) YOLO
Even though the proposed RPN in Faster R-CNN [122] reduces the region proposal number and the overlaps between them, the repetitive computations due to inevitable overlaps still cause an essential bottleneck in the performance of these detectors. To deal with the problem, Redmon et al. [116] introduced a new hybrid CNN-based architecture, named YOLO (You Only Look Once), which is illustrated in Fig. 13 (a). It combines the region proposals and detection branches into one single stage in contrast to R-CNN [120], Fast R-CNN [121] and Faster R-CNN [122], as seen in Fig. 13 (a), where the classification and bounding-box regression are responsible for detecting the object centres in the grid cells.
YOLO detectors first divide the input images into S × S grids, and then each grid cell is responsible for predicting C class probabilities, B bounding boxes, and the confidence values for those boxes. Thus, the predictions are encoded as a tensor of dimension S × S × (5B + C). The end-to-end architecture is built from 24 convolutional layers and 2 FCLs. As a result, YOLOv1 has achieved a high processing performance of 45 fps, which satisfies real-time implementation requirements. Further details are presented in Table 7.
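The size of the S × S × (5B + C) tensor can be worked out with a few lines of code. The values S = 7, B = 2, and C = 20 used below are the commonly cited Pascal VOC configuration and are an assumption here, as they are not stated in the text above; the helper names are also illustrative.

```python
def yolo_output_shape(S, B, C):
    """Shape of the prediction tensor: each cell holds B boxes (x, y, w, h, confidence) and C class scores."""
    return (S, S, 5 * B + C)


def responsible_cell(cx, cy, S):
    """Grid cell (row, col) containing an object centre given in normalised [0, 1] coordinates."""
    col = min(int(cx * S), S - 1)
    row = min(int(cy * S), S - 1)
    return (row, col)


if __name__ == "__main__":
    shape = yolo_output_shape(S=7, B=2, C=20)  # assumed Pascal VOC setting
    print("output tensor shape:", shape)
    print("values per image:", shape[0] * shape[1] * shape[2])
    print("cell for a centred object:", responsible_cell(0.5, 0.5, S=7))
```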
Despite its success, the YOLO model has also exhibited a few limitations. As a major problem, its detection ability on dense small objects is poor, since each grid cell predicts at most two bounding boxes from the same class. Another shortcoming is its limited generalization ability when objects appear with different aspect ratios in different images during testing. Lastly, the loss function is in need of improvement, as it affects the detection score.

2) SSD
As discussed in Sections C) and D.1) above, the R-CNN series [120]- [122] and YOLO [116] come with their superior and inferior features, trading-off between accuracy and speed. Liu et al. [123] proposed the Single Shot MultiBox Detector (SSD), which has advantages on both sides. In the SSD design, VGG-16 [75] is used as the backbone network, in which the fully connected sixth and seventh layers are replaced with convolutional layers as well as four more convolutional layers are placed at the end.
The entire design, which comprises six stages, is hierarchical and represents a single forward pass network. The reason for such a design is to provide hierarchical extraction of features, where each hierarchical layer provides object classification and bounding-box detection with different semantic information levels, as can be seen from Fig. 13 (b). Moreover, each stage applies a fast non-maximum suppression (NMS) technique [125], aimed at post-processing of redundant bounding boxes. The purpose is to eliminate overlaps in bounding boxes at every stage, and it also leads to reducing the amount of computation without compromising the accuracy. Thus, SSD512 outperforms Faster R-CNN on both accuracy and speed, and SSD300 shows a processing performance of 59 fps, which is higher than that of YOLO.

3) SqueezeDet
In 2017, another CNN model was proposed by the creators of SqueezeNet [96], named SqueezeDet [136], which is a fully convolutional neural network designed for object detection. In fact, it received inspiration from both YOLO [116] and SqueezeNet. SqueezeDet is aimed at satisfying real-time constraints of embedded deployment in terms of model size, energy efficiency, as well as a high level of accuracy, as discussed in Section V.D. below.
Due to its single forward pass neural network pipeline, the SqueezeDet architecture is fast, small, accurate and energy efficient. While keeping the same accuracy as previous baselines, such as YOLO, it achieves a model size of 8MB, which is 30× smaller than Fast R-CNN + AlexNet [121]. Its energy consumption of 1.4J per fps on an NVIDIA Titan X GPU is 35× lower than that of [121], and its inference processing speed of 57 fps is 20× faster than that of [121]. It also requires fewer DRAM (Dynamic Random-Access Memory) accesses and achieves the best average precision in all three difficulty levels of cyclist detection on the KITTI dataset [37], which is introduced in Section II.D.1).

4) YOLOv2
To address the problems of YOLO outlined in Section D.1) above, the YOLOv2 [137] model introduces a number of improvements on YOLO. First, it enhances the generalization capability by means of batch normalization [138], which speeds up the optimization process. The anchor idea [122], aimed at increasing the generalization capability in different aspect ratios, is also incorporated in YOLOv2, and thus more scales and aspect ratios can be predicted by each grid cell. The second improvement is that it trains high-resolution classifiers to handle images with higher resolutions. Third, in order to increase the detection ability, the K-means clustering algorithm [139] is deployed to automatically find the bounding boxes. Lastly, it handles the instability of the YOLO model [116] by limiting the ground truth offset with regard to the grid coordinates. By upgrading YOLO with these advancements and designing a new base network, Darknet-19, based on VGG-16 [75], YOLOv2 has achieved a 78.6% mAP on the Pascal VOC 2007 dataset [35] and a higher number of learned object categories.
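The clustering of bounding-box shapes can be sketched as k-means over (width, height) pairs, using 1 − IoU as the distance so that large boxes do not dominate the result. The sketch below is a simplified illustration with hand-picked initial centroids and a fixed number of iterations; the box dimensions are hypothetical and the function names are assumptions.

```python
def wh_iou(a, b):
    """IoU of two boxes given as (width, height), assuming they share the same centre."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes, centroids, iterations=10):
    """Cluster (w, h) pairs; the closest centroid is the one with the highest IoU (distance 1 - IoU)."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for box in boxes:
            best = max(range(len(centroids)), key=lambda k: wh_iou(box, centroids[k]))
            clusters[best].append(box)
        # Move each centroid to the mean shape of its cluster (keep it if the cluster is empty).
        centroids = [
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c)) if c else centroids[k]
            for k, c in enumerate(clusters)
        ]
    return centroids


if __name__ == "__main__":
    boxes = [(10, 10), (12, 12), (50, 100), (54, 96)]  # hypothetical box sizes in pixels
    print(kmeans_anchors(boxes, centroids=[(10, 10), (50, 100)]))
```

The resulting centroids would serve as the anchor shapes from which each grid cell predicts its boxes.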

5) MobileNetV2-SSDLite
MobileNetV2-SSDLite is based on the classification CNN MobileNetV2 [99], which was discussed in Section III.B.10) above. It includes an architecture for object detection using a mobile variant of the SSD (Single Shot Detector) model [123], whose main idea is the separation of regular convolutions. All the spatial 2D convolutions in the SSD prediction layers are replaced with separable convolutions (i.e., depth-wise convolutions, followed by point-wise convolutions). As a result, its performance was even better than that of YOLOv2 [137] with 10× fewer parameters and 20× less computational load.

6) YOLOv3
YOLOv3 [141] inherits the strengths of the previous YOLOv1 and YOLOv2 models and addresses their shortcomings by balancing accuracy and speed. To achieve this, YOLOv3 combines residual networks [17], FPN [117], and binary cross-entropy/log loss [128]. As a result, the model can detect complex multi-size objects with more categories, achieving a performance of 28.2% mAP at 22ms or 33.0% mAP at 51ms on the COCO dataset [36].

7) CenterNet
In 2019, Duan et al. [142] proposed an algorithm, CenterNet, whose motivation comes from the realisation that faults in the detection of bounding boxes are the main problem in detector architectures, and that accuracy could improve if the cropped regions were examined again. CenterNet, therefore, focused on minimising that kind of error. The framework was inspired by CornerNet [140], which is a typical one-stage key-point based detector. Unlike CornerNet, which uses a pair of key-points, CenterNet detects each object as a triplet of key-points, which improves precision and recall.
As seen from Fig. 14, a convolutional backbone network is used to define a two-corner feature map and centre key-point features on the map by employing cascade corner and centre pooling modules, respectively. By using these feature maps, CenterNet detects the final bounding boxes. On the COCO dataset, it outperformed state-of-the-art detectors such as Faster R-CNN [122], YOLO [116], SSD [123] and CornerNet, achieving an accuracy performance of 47.0% mAP.

8) YOLOv4
Bochkovskiy et al. [144] introduced YOLOv4 at CVPR 2020 as the latest version of the YOLO series. It incorporates several existing universal and unique strategies, such as DropBlock regularization, CIoU loss, Mosaic data augmentation, Mish activation, etc., into the architecture. CSPDarknet53 [145] is used as the backbone network, while SPP [126] and PAN [147] are employed as neck layers. These are transition layers between the backbone and head blocks (classification and bounding boxes), which are inserted to improve detection accuracy. Consequently, it achieves 43.5% mAP at a real-time processing speed of 65 fps, which is better than the previous YOLO version, YOLOv3 [141].

E. DESIGN TRENDS ON MS COCO AND PASCAL VOC
The performance of the leading CNN-based detectors is summarized in Table 7, in which the names of the two-stage models are shaded in light grey, while the names of the one-stage models are shaded in dark grey. For each model, the following data are provided: literature source, backbone network, year of introduction, model size in MB, processing speed in frames per second (fps), implementation platform, and mean average precision (mAP) on the Pascal VOC [35] and MS COCO [36] datasets.
It can be observed from Table 7 that the models that have performed best in terms of accuracy in their baseline class are DETR [132] and Dynamic R-CNN [134] for two-stage detectors and CenterNet [142] and YOLOv4 [144] for one-stage detectors. It can also be seen from Table 7 that the model sizes of these detectors are among the largest. Thus, it seems that heavyweight detectors have provided higher accuracies on the test datasets, like the image classification models discussed in Section III.
As for the low model size detectors, the winners are SqueezeDet [136] and MobileNetV2-SSDLite [99] having the outstandingly small size of around 8MB and 18MB, respectively, according to the literature. However, their accuracy levels are remarkably low for any real-time implementation, compared to other state-of-the-art detectors. So, lightweight CNN-based object detection solutions have shown the same tendency as classification CNNs.
It is not surprising that the evolution of the object detection CNN-based models and the CNNs for image classification have shown similar tendencies. This is because they have a close historical and architectural relationship.

V. AUTONOMOUS DRIVING SYSTEMS
The accumulation of knowledge on vehicle dynamics, the ground-breaking improvements on computer vision through the emergence of deep learning, and the advent of newly designed sensor modalities energized the R&D activities among researchers and developers of Autonomous Driving Systems (ADSs). From the first large-scaled automated driving competitions, DARPA Grand [148] and Urban [149] Challenges, to this day, numerous approaches have been proposed, and common system architectures have been established. In addition, substantial tasks in ADSs have been divided into subcategories, and explicit dominance of deep learning (DL) models has been seen in a number of subcategories [150]. Nevertheless, robust ADSs in urban environments have not been implemented yet [151].
Owing to the extraordinary advances in CNNs as a subbranch of DL, they have become a promising choice for the implementation of visual tasks in different modules of Autonomous Driving Systems, aimed at reducing human interventions in driving. This section first presents a review of the most recently published ADS related works that are focused on CNN-based image classification and object detection. Then, an outline of the current industry status is given. Next, the architecture and components of computational pipelines in the latest ADSs are discussed. Following that, application constraints are introduced, and, lastly, future directions are summarised.

A. REVIEW OF ADSs RELATED WORKS ON OBJECT CLASSIFICATION AND DETECTION
This review is targeted at the perception part of ADSs, which is one of the subsystems of self-driving cars, among others, as detailed in Section V.C. below. It is responsible for understanding scenes of the driving environment that include objects from multiple different categories. Consequently, multi-label image classification [152] is employed in ADSs rather than the traditional single-label image classification, which treats each image as containing a single object category. Multi-label image classification is also referred to as object classification in the rest of the paper.
The image classification and object detection algorithms in Tables 5 and 7 are specifically designed for general-purpose datasets, such as ImageNet [31], Pascal VOC [35], MS COCO [36], which can be utilised in different fields, as discussed in Section II.D.1). To enhance the object classification and detection capabilities of ADSs, a diverse set of naturalistic driving scenarios has to be addressed, e.g., all weather conditions, day/night time, pedestrians, traffic lights, cyclists, traffic density, etc., which are not properly represented in domain-general datasets. Thus, it is necessary to use domain-specific datasets (exemplified in Section II.D.1)).
In this section we overview ADSs related works on object classification and detection published in the last three years, analysing them in terms of dataset collection scenarios, sensors, and detection types.

1) SELF-DRIVING SCENARIOS
Understanding the complexity of naturalistic road scenes is vitally important when there are diverse self-driving scenarios, such as day/night time and weather conditions. In fact, handling diverse driving conditions represents the most challenging research part of ADS design. Hence, enhancements in object classification accuracy of such scenes can be seen as a foremost solution to overcoming the difficulties.
In this direction, Li et al. [153] proposed a deep adaptive neural network for multi-label image classification, ML-ANet, which enhances the accuracy performance in cross-domain adaptations. Aiming at an effective knowledge transfer between similar but different domains, the proposed approach exploits the technique of transfer learning from a fully labelled to a limited or unlabelled domain. In order to achieve this, the domain discrepancies are reduced by aligning the feature distributions of the source and target domains via a multiple-kernel variant of the maximum mean discrepancy (MK-MMD) loss. The experimental work is primarily focused on domain changes arising from the conversion between clear and hazy weather conditions. It is shown that the proposed multi-label classifier network outperforms existing state-of-the-art methods [154]- [156] on three commonly used domain-specific datasets: KITTI [37], Cityscapes [39], and Foggy Cityscapes [157]. The ML-ANet neural network is also capable of adapting to round-the-clock illumination in diverse weather conditions. In addition, it offers a reduced development cycle, as there is no need for fully labelled training data.
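To make the role of the MK-MMD term more concrete, a minimal sketch of a multiple-kernel MMD estimate between two batches of feature vectors is given below. It is not the implementation of [153]: the function names, the Gaussian kernel family, the bandwidth values, and the equal kernel weights are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel matrix between the rows of x and the rows of y."""
    sq_dists = (
        np.sum(x ** 2, axis=1)[:, None]
        + np.sum(y ** 2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clamp tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth ** 2))

def mk_mmd2(source, target, bandwidths=(0.5, 1.0, 2.0, 4.0)):
    """Biased estimate of the squared MMD, averaged over several kernels.

    Equal kernel weights are an assumption of this sketch.
    """
    total = 0.0
    for bw in bandwidths:
        k_ss = gaussian_kernel(source, source, bw).mean()
        k_tt = gaussian_kernel(target, target, bw).mean()
        k_st = gaussian_kernel(source, target, bw).mean()
        total += k_ss + k_tt - 2.0 * k_st
    return total / len(bandwidths)
```

In a domain adaptation setting, a term of this kind would typically be added to the classification loss, so that the features of the source domain (e.g., clear weather) and the target domain (e.g., hazy weather) are pushed towards the same distribution.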
Three models aimed at the detection of pedestrians in hazy weather, which are based on the YOLO CNNs (Sections IV.D.1) and IV.D.4)), are proposed in [158]. In one of the models, named MNPrioriBoxes-Yolo, the detection precision is increased by employing a new weighted combination layer. The network features a high speed and low computational cost due to the use of depth-wise separable convolutions and linear bottlenecks. In addition, a modified priori boxes method [122] is employed to enhance precision and detection speed. Evaluation results based on a purpose-built hazy weather pedestrian dataset, which was created using six different augmentation techniques, have shown that the proposed model outperforms state-of-the-art methods in terms of accuracy and speed.
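The saving offered by depth-wise separable convolutions can be illustrated with a simple parameter count. The sketch below ignores bias terms, and the layer dimensions are hypothetical values chosen for illustration; they are not taken from [158].

```python
def standard_conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights of a k x k depth-wise convolution followed by a 1 x 1
    point-wise convolution (bias ignored)."""
    return k * k * c_in + c_in * c_out

# Hypothetical layer: 3 x 3 kernel, 128 input channels, 256 output channels.
k, c_in, c_out = 3, 128, 256
std = standard_conv_params(k, c_in, c_out)        # 294912
sep = depthwise_separable_params(k, c_in, c_out)  # 33920
reduction = std / sep                             # roughly 8.7x fewer weights
```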
Fog is also an adverse weather condition that can occur in the driving environment. The object detection performance of Faster R-CNN [122] in four levels of foggy weather: clear (no fog), light, medium, heavy is analysed in [159]. Other works could be found in [160]- [162].
A large number of works have been addressing rainy and snowy weather conditions. The performance of the Faster R-CNN [122] and YOLO-v3 [141] detectors under clear and rainy conditions is analysed in [163]. The impact on performance of mitigating the effect of rain is investigated using various techniques, such as image translation [164], domain adaptation [165], and deraining (i.e., raindrop removal) [166]. The BDD100K dataset [38] is used in the experiments, as it offers image tagging for weather types. Thus, each image weather condition, such as foggy, rainy, and so on, is labelled. The evaluation results show that the mitigating techniques have a positive effect on the rainy weather detection performance of the employed CNNs.
Regarding snowy weather, Bernuth et al. [167] apply an adverse weather augmentation approach to reuse existing, well-organised, and labelled datasets, KITTI [37] and Cityscapes [39]. The approach produces lifelike and physically correct images to be added to existing ready-to-use datasets created with real-world images. The scenes are rendered with snowflakes using OpenGL to obtain realistic images. Then, these generated images are evaluated in the proposed object detection algorithm to verify their efficiency. Other recently published works regarding rainy and snowy weather are [162] and [168].
Driving at night-time under low-light conditions causes a lack of details in the driver's field of vision and increases accident risks. Thus, many research works have been targeted at finding possible solutions. Li et al. [169] proposed a CNN-based image enhancement network, LE-net, which specifically addresses exceedingly low-light conditions at night-time, such as rural areas without street lighting. The study follows four sequential steps: (i) building a novel pipeline to generate low-light images from existing daytime images, extracted from BDD100K [38]; (ii) generation of naturalistic image pairs via the proposed pipeline for model development; (iii) training and validation of the designed network with the generated low-light images; (iv) testing of the network, LE-net, on real night surroundings with variable low-lights. It is shown that on the BDD100K dataset, the proposed network outperforms other models [170]- [172]. The EuroCity [44] dataset is used as a second evaluation benchmark, whereby a better performance is also observed in comparison to other studies [171]- [173].
To enhance the accuracy performance of object classification tasks at night, a dataset is created in [174] by combining two well-known datasets, MMSP [175] and BDD100K [38]. Following that, the created dataset samples are subclassified into daytime and night-time, and similar samples in the dataset are removed to prevent overfitting. The evaluation results have shown an improved night-time object classification performance. Other works addressing night-time object classification and detection are [176] and [162].

2) SENSORS
The works introduced in the section above mostly use camera-based image data, which are susceptible to changes in lighting conditions due to different seasons, intemperate weather, and shifting shadows. Changes in illumination could lead to failure of algorithms, badly affecting the quality of perception. Alternative sensors for perception tasks in ADSs are lidar and radar; however, lidar struggles with foggy and snowy weather [177], while radar lacks sufficient resolution for perception tasks [178].
Sensor fusion is currently employed to improve precision and prevent any single point of failure [179]. Furthermore, dynamic conditions in camera-only systems [180] and low lighting conditions in thermal infrared imaging [181] are dealt with by data processing techniques that extract lighting invariant features [182] and assess feature quality [183]. Despite that, perception quality remains a central issue and prevents the ADSs from becoming prevalent in vehicles.
In addition to illumination changes, another issue is the image space for camera-based perception, as the scale of the image scene is unknown in advance. In fact, making use of scale information in dynamic tasks, such as obstacle avoidance is feasible by means of a single camera [184], although multi-view or stereo systems are more preferable in terms of robustness [185]. However, they lead to a considerable amount of computational load to an already complicated perception pipeline. As a relatively new and alternative sensing method for 3D perception, 3D lidars exist to solve the scale problem as they depend far less on illumination changes and intemperate weather. Nevertheless, their owning cost remains as a noteworthy problem as stated in Section I.A.
In view of the above, numerous recent works have been concerned with enhancing the perception quality of self-driving scenes on different collection scenarios and improving the object classification and detection performances in real-time. For instance, Gao et al. [186] have fused lidar and vision data in an object classification task. Point clouds of lidar data and RGB images from the KITTI [37] dataset are used in the study. They initially upsample and convert the point cloud data into depth feature maps at pixel-level, and the RGB images are also converted to depth feature maps. Following that, the integrated depth and RGB data are fed into a CNN. Experiments on the KITTI [37] dataset exhibit superior object classification accuracy compared to using only depth or RGB data. Moreover, acceleration in feature learning and convergence has been achieved using lidar data.
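As a rough illustration of feeding integrated depth and RGB data into a CNN, the sketch below builds a four-channel RGB-D input tensor. It is not the method of [186]: the nearest-neighbour upsampling, the 80 m normalisation range, the function name, and the requirement that the image size is an integer multiple of the depth map size are all assumptions of this sketch.

```python
import numpy as np

def fuse_rgb_depth(rgb, depth_lowres, max_depth=80.0):
    """Stack an RGB image and an upsampled depth map into an (H, W, 4) array.

    rgb:          (H, W, 3) uint8 image.
    depth_lowres: (h, w) depth map in metres, with H and W integer
                  multiples of h and w (assumption of this sketch).
    """
    height, width, _ = rgb.shape
    fy = height // depth_lowres.shape[0]
    fx = width // depth_lowres.shape[1]
    # Nearest-neighbour upsampling of the coarse depth map to pixel level.
    depth = np.repeat(np.repeat(depth_lowres, fy, axis=0), fx, axis=1)
    # Normalise both modalities to the range [0, 1].
    depth = np.clip(depth / max_depth, 0.0, 1.0).astype(np.float32)
    rgb_norm = rgb.astype(np.float32) / 255.0
    return np.concatenate([rgb_norm, depth[..., None]], axis=-1)
```

The resulting array can be passed to a CNN whose first convolutional layer accepts four input channels instead of three.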
Another fusion-based system [166], which employs a cheaper 4-beam lidar, rather than an expensive 64-beam one, and a stereo camera for 3D object detection, handles the cost problem of lidars. Due to a significant enhancement of the depth estimation technique, the proposed method has shown an improved 3D object detection performance on the KITTI dataset [37].
In [187], the 3D object detection performance obtained with lidar point clouds is compared with that of other sensor types by simulating a transition from clear to foggy weather. Details about more sophisticated recently published works can be found in VoxelNet [188], SECOND [189], IPOD [190], PointPillars [191], PointRCNN [192], F-PointNet [193].

3) DETECTION TYPES
So far, the review of ADS related works has focused on the overall content of the self-driving scenes, or in other words, the big picture. On the other hand, enhancements of object classification and detection of specific features of the scene, such as pedestrian, cyclist, traffic sign and light, pavement marking, etc. are also quite important, e.g. from a safety point of view. Here we call these sub-classification and detection types.
In a recent study on pedestrian detection, Boyuan and Muqing [194] added a novel Spatial Pyramid Pooling (SPP) network to the YOLOv4 [144] detector to improve the detection accuracy, where a Mish activation function [195] is used instead of Leaky ReLU. Then, the anchor parameters of YOLOv4 were optimized with a K-means clustering algorithm. As a result, an excellent pedestrian detection accuracy of 84.7% mAP was obtained at a real-time speed of 36.6 fps on the Caltech [43] dataset (introduced in Table 3). In [196], thermal imaging, enabling day/night time and illumination-independent data collection, was used to implement a robust pedestrian and cyclist detection using the Faster R-CNN [122]. Details of other related approaches are available in [197], [198], and [199].
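Anchor optimisation with K-means can be sketched as follows. The sketch clusters the (width, height) pairs of ground-truth boxes using IoU as the similarity measure, which is a common choice for YOLO-style detectors; whether [194] uses exactly this distance is not stated here, and the function names and parameters are hypothetical.

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between boxes and anchors given as (width, height) pairs,
    assuming that all of them share the same centre."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0])
             * np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    area_boxes = (boxes[:, 0] * boxes[:, 1])[:, None]
    area_anchors = (anchors[:, 0] * anchors[:, 1])[None, :]
    return inter / (area_boxes + area_anchors - inter)

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster box sizes into k anchors, returned in order of increasing area."""
    boxes = np.asarray(boxes, dtype=float)
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        # Assign every box to the anchor with which it overlaps most.
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)
        new_anchors = anchors.copy()
        for j in range(k):
            members = boxes[assign == j]
            if len(members):
                new_anchors[j] = members.mean(axis=0)
        if np.allclose(new_anchors, anchors):
            break
        anchors = new_anchors
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]
```

For pedestrian datasets, the resulting anchors tend to be tall and narrow, which matches the typical aspect ratio of a standing person better than general-purpose anchors.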
Detection of traffic signalization, comprising elementarily traffic signs, traffic lights, and road surface markings in the surrounding of self-driving cars allows the ADSs to make correct decisions according to the traffic rules. There are several most recent studies regarding this research topic. Henchri and Mtibaa [200] propose a two-stage approach for traffic sign detection. In the first stage, only the shape of the signs, which are circular or triangular, are detected and classified by using HOG [109] features and support vector machines (SVM). Then, the second stage by means of a CNN attempts to classify these detected shapes into their own subclasses. Finally, the approach is tested on the GTSDB benchmark [201] with improved results.
Another significant object for ADSs in road scenes is traffic lights. An SSD [123] detector is exploited for an adaptation study to detect traffic lights and small objects in [202]. The Inception-v3 CNN (Sec. III.B.2)) is employed as the base network instead of the originally used VGG CNN (Sec. III.A.3)) to increase speed and accuracy. The study adapts the prior box generation to allow smaller strides in the later network layers, which enables the detection of smaller objects. Non-maximum suppression (NMS) is also adapted to prevent multiple detections for a single object. Finally, an additional block is inserted to classify the states of the traffic lights (i.e., red, amber, or green). The model showed a good performance on the DriveU [203] traffic light dataset.
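For reference, a minimal sketch of the standard greedy NMS procedure is given below. It shows the generic algorithm rather than the adapted variant of [202]; the box format [x1, y1, x2, y2] and the IoU threshold of 0.5 are assumptions of this sketch.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression.

    boxes:  (N, 4) array of [x1, y1, x2, y2] corners.
    scores: (N,) array of confidence scores.
    Returns the indices of the kept boxes, highest score first.
    """
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(np.asarray(scores))[::-1]
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # Intersection of the best box with all remaining boxes.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_best = ((boxes[best, 2] - boxes[best, 0])
                     * (boxes[best, 3] - boxes[best, 1]))
        area_rest = ((boxes[rest, 2] - boxes[rest, 0])
                     * (boxes[rest, 3] - boxes[rest, 1]))
        iou = inter / (area_best + area_rest - inter)
        # Discard boxes that overlap the best box too strongly.
        order = rest[iou <= iou_threshold]
    return keep
```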
Road surface markings as an important component of traffic signalization have been also studied in recent years. Ye et al. [204] have addressed partly distorted and worn road markings by means of a two-stage model based on YOLO-v2 [137]. The first stage is responsible for the detection of initial road markings with coordinates and class confidences of bounding boxes by using the YOLO-v2 CNN (Sec. IV.D.4)). In the second stage a novel, lightweight, and transformation-invariant classification network, RM-Net, is used for road markings detection, which is able to tackle the distortions and surface wears. An annotated road marking detection dataset for public use is developed, consisting of ∼12k high-resolution images, grouped into 13 classes, which are collected under various weather conditions in day/night time. The model achieved 86.5% mAP on this dataset, outperforming other existing frameworks.
Details of other most recently performed works could be found in [205] and [206] for traffic sign detection, [207] and [208] for traffic light detection, [34] and [209] for road surface marking detection. Readers are referred to [23] for information about more works on traffic signalisation.

B. CURRENT INDUSTRY ACTIVITIES

1) TAXONOMY OF VEHICLE AUTOMATION
In 2017, the US National Highway Traffic Safety Administration [210] released guidelines for ADSs, in which six levels of automation were specified [211]. These are summarised in Table 8 and are outlined below.
• LEVEL 0: NO AUTOMATION All driving tasks must be completed by a human driver, which also means zero autonomy.
• LEVEL 1: DRIVER ASSISTANCE The vehicle is mainly controlled by a driver. Only under limited driving conditions, some driving assistance features, such as steering, acceleration/deceleration, are shared with the automated system. There is no task assigned to the human driver.
In brief, as seen from Table 8, whereas in levels 1 and 2 drivers are still responsible for a few tasks, in levels 3-5 the responsibility for all driving tasks, even lane change, is supposed to be handled automatically. Fig. 15 shows recent updates to the level definitions above by the Society of Automotive Engineers (SAE) [211].

2) CURRENT STATUS QUO IN INDUSTRY
This subsection presents the results of a literature survey on the current status and activities of industry leaders in terms of the achieved level of automation as well as used sensors and computing platforms in the vehicle, in an attempt to establish where automakers stand. A summary of the findings is presented in Table 9.
It could be seen from Table 9 that even Tesla [12], Audi [212], and Mobileye [13], [14], which are leading automotive companies, have only achieved automation levels 2 or 3, in which the driver must be highly involved in completing the driving tasks. It would appear that the production process relating to ADSs of many automakers is still under experimentation. In addition, some of them have imposed certain conditions on the use of their ADSs, i.e., driving in confined areas such as specific cities, particular highways, etc. Currently, levels 3-4 can be operated only in limited Operational Design Domains (ODDs) like highways, and Audi claims to be the first company to have produced a level-3 vehicle under the condition that it is driven on a restricted highway [213].
FIGURE 15. SAE recent updates on automation levels [211].
In 2020, Volvo and Lidar-maker Luminar [214] stated that they aim to deliver genuine hands-free level-3 driving without monitoring surroundings by drivers in the real-world environment, rather than under certain conditions, by 2023. Although, as shown in Table 9, Waymo [215] have reached level 4 driving under certain conditions, there has not been yet any production of a vehicle satisfying levels 3-5 driving on any road in an urban environment. Furthermore, the Toyota Research Institute [216] have stated that there is no one in the industry that has been even close to attaining level 5. Thus, the details in Table 9 show that there are still challenges in the production of highly automated vehicles, especially at levels 3-5, which motivates the research community to actively investigate this emerging area.
As for the used sensors, as seen from Table 9, Nvidia/Audi [212], Waymo [215], and Navya [217], who have achieved automation levels 3-4 differ from the others by using lidars, which enable high precision as sensing devices send light beams to surroundings. However, the reason why the industry tends to mount cameras instead of lidars is their overwhelming cost, as mentioned in Section I.A. There is also a general tendency for using a combination of SoCs and GPUs based high-performance computing platforms, as shown in Table 9, as these can support high-speed processing of large volumes of sensor data.

1) CONNECTIVITY IN ADSs ARCHITECTURES
The idea behind connected systems is based on a communication network among the vehicles (agents). It is expected that this type of system would allow autonomous driving to advance in a more hierarchical way. Even though there is no operational system of this type at present, it is widely believed in academia that this emerging technology will account for a high proportion of future ADSs [219]- [221]. Current ever-developing design approaches to connected systems are: Vehicular Ad hoc NETworks (VANETs) [219], vehicle to everything (V2X) [237], Information-Centric Networking (ICN) [219], and Internet of Vehicles (IoV) [219].
Contrary to connected systems depending on other vehicles on the roads, the ego-only system brings a different approach, whereby all the necessary automated driving operations are managed by a self-sufficient vehicle. At present, this approach is the most preferred system type [26], [218], [46], [178], [222]- [226], as it may be that the enormous development challenges force the industry to focus on this system type for now.

2) ALGORITHMIC DESIGNS FOR ADSs ARCHITECTURES
Modular systems consist of separate linked components that form the pipeline of ADSs, including connections from the sensory inputs to the actuator outputs [150]. It is referred to as a mediated approach in [228]. Fig. 16 (a) demonstrates a state-of-the-art modular system design, which is based on recent publications [15], [16], [238]. The individual modules in Fig. 16 (a) are described in Section 3) below.
Unlike modular systems, end-to-end systems generate ego-motion information by directly using the input data provided by the sensors. Ego-motion is defined as either the continuous operations on the steering wheel and driving pedals or a set of discrete actions, e.g., acceleration and turning left/right [16]. It is also referred to as a direct reception in [228]. The pipeline of an end-to-end system is represented in Fig. 16 (b). Neuroevolution [235], [236], direct supervised deep learning [228]- [232], and deep reinforcement learning [233], [234] are the three main approaches for these systems. Table 10 summarises the pros and cons of the two algorithmic designs. The key reason why automakers prefer modular designs to end-to-end designs is the existing body of literature, which enables easier production even though there are still many shortcomings, primarily error propagation. End-to-end designs still need considerable effort before near-future production. In particular, hard-coded safety measures and interpretability remain crucial topics for these designs.

3) MODULES
A variety of sensors are used in the onboard data processing pipelines of cutting-edge ADSs. Commonly used types of sensors are exteroceptive sensors, e.g., ultrasonic, radar, lidar, and cameras, which are responsible for perceiving the vehicle surroundings. They are employed in both modular and end-to-end designs, as seen from Fig. 16 (a) and (b). As previously stated in Section I.A., many leading companies, such as Tesla [12] and Mobileye [13], [14] have mainly focused on camera-based approaches. However, others have used lidar sensors as well, e.g., the Uber car (XC90) [225] deploys 20 cameras and 8 lidars, and VisLab's BRAiVE [222] has 10 cameras and 5 lidars. The data-flow diagram for modular systems is shown in Fig. 16 (a).
Data collected from sensors are utilised for two different purposes: (i) perception and (ii) localisation & mapping. The first requires detector and tracker modules, while the second requires a localizer module, as illustrated in Fig. 16 (a). The perception modules perform the crucial task of perceiving the vehicle environment and extracting meaningful information from the objects. They also play a significant role in a vehicle's navigation system. Perception has some core subtasks which are object detection [116], semantic segmentation [119], road and lane detection [243], [244], and object tracking [245]- [248]. Information extracted by the detector and tracker in the pipeline is combined with the localizer in the fusion module. Then, this information is used in the behaviour prediction module, also known as path planning.
The localization & mapping modules are responsible for finding the vehicle's ego-position relative to a map or a reference frame taken in the surroundings [249]. By means of localization, the vehicle finds its precise relationship with all of the objects on the map. This task is crucial for ADSs, as it is for any other mobile robotic system [250]. This is because vehicles need to estimate their ego-position with a centimetre-level precision [16]. In the industry, mainly the following five localization techniques are used: SLAM [251], absolute positioning sensors [252], odometry/dead reckoning [253], GPS-IMU fusion [254], and prior map-based [255]- [257], and many of them are lidar-based approaches. There have been suggestions that the future of localization may be shaped by camera-based approaches as their deployment is more cost-efficient [16].
Lastly, a significant part of the ADSs pipeline is responsible for behaviour prediction and planning, which are produced in a route planner module as shown in Fig. 16 (a). Planning modules realise two sub-tasks, global planning and local planning. As the name suggests, a global planner finds a route from the point of departure to the point of arrival by using the road network, whereas local planning tries to execute a defined global plan with as low an error as possible. In other words, ADSs strive to reach the final destination by finding trajectories avoiding obstacles and satisfying optimization criteria.
In addition, many modern cars are equipped with navigation systems, which plot a global route by utilising GPS or offline maps. Both the academic community and industry have proposed a number of global planning methods, such as goal-directed path [258], separator based [259], hierarchical [260], bounded-hop [261]. In addition to these methods, some hybrid studies have been published as well [262], [263]. Recently, several methods on local planning have been proposed too, following different approaches, e.g., graph search [264], sampling-based [265], [266], curve interpolation [267], [268], numerical optimization [269]. Apart from these traditional methods, new approaches to planning have emerged based on DL and reinforcement learning [269], [270], which still have many shortcomings that need to be overcome such as interpretability [239], hard-coded safety measures [242], generalization, training data, etc.
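The global planning sub-task can be illustrated with a classical shortest-path search over a road network graph. The sketch below uses Dijkstra's algorithm as a plain baseline and is not one of the specific global planning methods cited in this survey; the graph representation and the node names are hypothetical.

```python
import heapq

def shortest_route(road_network, start, goal):
    """Dijkstra's algorithm on a dict {node: [(neighbour, length_km), ...]}.

    Returns (total_length_km, list_of_nodes), or (inf, []) if the goal
    cannot be reached.
    """
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, length in road_network.get(node, []):
            if neighbour not in visited:
                heapq.heappush(
                    queue, (cost + length, neighbour, path + [neighbour])
                )
    return float("inf"), []

# Hypothetical road network with junctions A-D and road lengths in km.
roads = {
    "A": [("B", 2.0), ("C", 5.0)],
    "B": [("C", 1.0), ("D", 6.0)],
    "C": [("D", 2.0)],
    "D": [],
}
route_length, route = shortest_route(roads, "A", "D")
```

A local planner would then refine each leg of such a route into a drivable trajectory that avoids obstacles.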
In contrast to modular systems, end-to-end systems aim to produce a similar result in one step, as shown in Fig. 16 (b). However, these systems are still an ongoing research topic and need a lot of development efforts.

D. ADSs DESIGN CONSTRAINTS
According to official information, 94% of the total number of road accidents are due to human errors [271]. Against these grim statistics, ADSs promise to reduce traffic accidents, driving-related stress, emissions, and many others [272]. However, ADSs have been the cause of accidents as well. For instance, Google's ADS [273] hit a bus when changing lanes, as the system did not accurately estimate the speed of the bus. As another example, Tesla's Autopilot [274] failed to detect a truck and it caused a fatal accident resulting in the death of the driver. There are many other such collisions with tragic ends [275], [276]. These unfortunate events show that there are still many issues that need to be addressed during the design process of such systems.
The purpose of this section is to discuss the main criteria that should be handled to realise the object detection algorithms designed for ADSs.
Design of ADSs is a trade-off decision process taking into account a number of requirements on different parts of the system, in terms of performance, predictability of performance, power consumption, storage, thermal constraints, among others [15]. The object detection part of ADSs, for example, must satisfy real-time implementation requirements. In this section, ADSs design constraints will be evaluated based on the open literature, after which a trade-off analysis between accuracy and model size will be discussed.
Four main parameters that directly affect the performance of the vehicle are considered when designing a new CNN for object classification or object detection, namely accuracy, model size, inference speed, and efficiency. As seen from Tables 5 and 7, increasing the accuracy level of CNNs for both classification and detection has been a mainstream research goal, apart from several works, such as [70], [96], [98], [101] on classification and [122], [116] on object detection. Moreover, another trend exemplified by Tables 5 and 7 is that the rises in accuracy have been accompanied by rises in model size as well. However, while the former is regarded as an added value, the latter is not a favourable result for real-time implementations.
Without a doubt, a novel CNN architecture featuring both low model size and a high accuracy level would bring many advantages to ADSs. Some of the benefits of a small model size are low power consumption (efficiency), low latency and high inference speed. It also facilitates over-the-air updates [96], online and distributed training [96], [277], FPGA-based implementation, use of low memory bandwidth [96] and it even leads to lower fuel consumption [15].
Some technical terms about latency and memory types of processors (CPUs) are clarified next. There are two types of CPU memory, internal (on-chip) and external (off-chip), where the former could be of two types: ROM (non-volatile) or RAM (volatile). When the processor needs to process data, it fetches it from an on-chip memory component that is embedded in the processor itself, as shown in Fig. 17. The on-chip RAM is of an SRAM type and the off-chip memory is of a DRAM type. When the processor needs to process data not available in the on-chip memory, it fetches it from an off-chip memory component that is not a part of the processor itself.
The on-chip SRAM memory allows reducing the latency (delay time), as generally, it supports a single-cycle access time. SRAM performs as a cache memory for the off-chip DRAM as well. The off-chip DRAM has a much greater latency, on the order of 1000× more cycles than the on-chip SRAM [278], as illustrated in Fig. 17. Due to its high cost, SRAM memory is preferred at low sizes. Although DRAMs are low-priced memories, they are quite power-hungry due to their internal design features [279].
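The link between parameter count and model size can be made explicit with a back-of-the-envelope estimate. The sketch below assumes that the stored model consists only of weights with a fixed bit width (32-bit floating point by default) and ignores file format overhead; the parameter count in the example is hypothetical.

```python
def model_size_mb(num_params, bits_per_weight=32):
    """Approximate model size in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6

# Hypothetical detector with 2 million parameters.
size_fp32 = model_size_mb(2_000_000)                     # 8.0 MB
size_int8 = model_size_mb(2_000_000, bits_per_weight=8)  # 2.0 MB
```

Under these assumptions, reducing either the number of parameters or the bit width of the weights shrinks the amount of data that has to be stored, transferred in over-the-air updates, and fetched from memory during inference.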
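To illustrate why keeping a model in on-chip memory matters, a simple average access time estimate is sketched below. The hit-rate model and the function name are assumptions introduced for illustration; only the single-cycle SRAM access and the roughly 1000× DRAM penalty follow the figures quoted in the text.

```python
def average_access_cycles(hit_rate, sram_cycles=1.0, dram_cycles=1000.0):
    """Expected cycles per memory access when a fraction `hit_rate` of the
    accesses is served by on-chip SRAM and the rest by off-chip DRAM."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must lie between 0 and 1")
    return hit_rate * sram_cycles + (1.0 - hit_rate) * dram_cycles

mostly_on_chip = average_access_cycles(0.99)   # about 11 cycles
mostly_off_chip = average_access_cycles(0.50)  # about 500 cycles
```

Even a small fraction of off-chip accesses dominates the average latency in this simple model, which is one reason why compact models that fit largely in on-chip memory are attractive.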

1) PERFORMANCE AND PREDICTABILITY
Being candidates for preventing car accidents, ADSs need the capability of understanding their surroundings and reacting to them in an acceptably fast way. Despite the promise for such a potential, the real-time performance requirements for ADSs are still largely undefined.
According to Mody [280], the reaction time of ADSs is defined by two criteria, frame rate and processing latency. While frame rate denotes how fast the data coming from real-time sensors can be fed into ADSs, the processing latency is defined by how fast the system reacts to every piece of fed data (frame). Following that, Lin et al. [15] state that ADSs should be able to react to or process current traffic events within a latency of 100 ms and support a frequency of at least 10 fps in object detection as well. These metrics are based on the actual real-time performances of the human driver [271], [281]- [284].
In [15], the reaction performance of human drivers is used to systematize the ADSs real-time performance. Even though the human being is a very complicated non-linear system, and its behaviour cannot be assessed easily [285], [286], it is shown in [287], [288] that human ability for image classification outperforms DNNs indisputably. Ideally, in operation, a 100% accuracy rate is expected from object detectors in finding objects of interest. However, such an accuracy, which is on par with human abilities, is yet to be achieved by ADSs, and it is a great challenge in the field. In summary, the only target agreed so far for ADSs is an accuracy rate of up to 100%, and the evaluation of human behaviours on detection accuracy is yet to be finalised.
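The two targets can be checked against measured per-frame processing times with a few lines of code. The sketch below assumes a sequential, non-pipelined detector, so that the achievable frame rate is the reciprocal of the mean per-frame latency; the function name and the example measurements are hypothetical.

```python
def meets_realtime_targets(latencies_ms, max_latency_ms=100.0, min_fps=10.0):
    """Check the 100 ms latency and 10 fps targets on measured latencies.

    Assumes frames are processed one after another, so the frame rate is
    1000 / (mean latency in ms).
    """
    worst_latency = max(latencies_ms)
    mean_latency = sum(latencies_ms) / len(latencies_ms)
    fps = 1000.0 / mean_latency
    return worst_latency <= max_latency_ms and fps >= min_fps

# Hypothetical measurements in milliseconds.
ok = meets_realtime_targets([40.0, 55.0, 60.0])        # True
too_slow = meets_realtime_targets([40.0, 120.0, 60.0])  # False
```

Checking the worst-case latency rather than only the mean reflects the requirement that every frame must be handled within the deadline.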
Three performance targets of ADSs were defined so far: processing latency, frame rate, and accuracy. The predictability constraint is determined by the reliability and quickness of the ADSs reactions to the real-time traffic conditions. In general, they are defined with both temporal aspects, like timing deadlines, and functional aspects, like making the correct operational decisions. Not being able to respond reliably within a specific deadline may endanger passengers and could result in fatal accidents. The performance of ADSs, therefore, needs to be exceedingly predictable to be adopted in real driving scenarios.

2) POWER
As mentioned previously in Section I.A., Intel [10] has stated that fully developed ADSs will be required to process approximately 1 GB of data per second of real-time operation. This data needs to be processed very fast so that vehicles can react to their surroundings while meeting real-time performance constraints. Consequently, this results in huge power consumption. In addition, as stated in [15], a power-hungry system could significantly degrade the fuel efficiency of the vehicle, i.e., by up to 11.5%.

3) STORAGE
In localization tasks of ADSs, tens of TBs of data are required to store the prior maps [15], [255]-[257]. For instance, a prior map of the USA occupies 41 TB [289], which needs to be stored while driving. Such prior maps are needed because of the lack of precision and accessibility in GPS technology [290], [291].
Apart from the above, there are also many other constraints, such as thermal [15], privacy [10], hardware reliability, etc.

4) TRADING-OFF ACCURACY AND SIZE
A low model size brings many advantages to CNN-based object detectors, but it usually comes at the cost of lower accuracy. In view of this, here we evaluate the trade-off between high accuracy and low model size.
The benefits of low model size CNN designs come from the feasibility of their implementation on embedded computing platforms, as previously pointed out with ablation experiments in Section III.D. and summarized in Table 11. The high performance of GPUs in terms of processing latency is mostly due to the use of large on-chip memories, which have a very short access time. Nevertheless, despite their quite low latency, they have a high power consumption. This is also confirmed by [15] with experimental results based on testing of a number of real-time implementations of CNN-based object detection, tracking, and extraction using CPUs, GPUs, and FPGAs. Moreover, [15] reveals that GPUs are outperformed by FPGAs in terms of energy consumption.
In conclusion, there are many benefits of reducing the model size of CNN-based detectors as well as making them capable of state-of-the-art accuracy rates as seen in Table 11. Therefore, there is a need to improve low model size CNN designs by increasing their accuracy to a favourable level. Such developments would expedite the use of CNNs in the ADSs design process by allowing CNN deployment in low-power FPGA-based implementations despite the low on-chip memory resources.

E. PROMISING DIRECTIONS
Here we discuss the emerging trends in the two main areas covered above: object detection and ADSs.

1) OBJECT DETECTION
Several prominent directions could be noted in the area of object detection.
• Firstly, as seen from the object detection designs in Section IV, while the two-stage detectors are focused only on accuracy, the one-stage detectors are concerned primarily with model size and efficiency. Thus, each of them sacrifices one design parameter. From this perspective, combining the strengths of one-stage and two-stage detectors, so as to achieve higher accuracy and lower latency together, is a big challenge.
• Another trend is troubleshooting video data streaming problems, such as video defocus, motion blur, intense target movements, etc. so that their detrimental effects on performance are removed.
• The third direction is related to the machine learning method used in CNN training. To avoid the long duration and inefficiency of the training process in supervised learning, unsupervised learning methods have started to be employed as an emerging research direction.
• Another non-CNN based trend is using GAN-based detectors [298]. The necessity of a large amount of data in deep learning has several drawbacks. To overcome this problem, GAN-based detection methods are expected to become a promising and ever-increasing area.
• Finally, decreasing the model size of detectors brings a large number of advantages. However, the design process has to be managed efficiently so that the accuracy level is improved as well.
Apart from these, there are also many other trends, such as multi-task learning [299], assistance with multi-source information, e.g., social media, big data, etc.

2) ADSs
Similarly to detectors, the area of ADSs has several promising directions, as follows.
• The first is 3D object detection, which is a challenging area due to the cost of the sensors. Although lidars are quite efficient in 3D object detection, their costs at present are too high to be deployed in the ADSs. However, the approach is highly innovative.
• Connected systems promise a huge potential in improving the level of autonomy, but they require communication among road users, which is a complex and challenging problem that is yet to be addressed.
• The third one is human-machine interaction, which is an attractive area, required in designs at automation level 3 or above. However, mutual understanding between cars and humans requires further research work.
• Lastly, establishing ADSs requirements for real-time applications is another important area. As discussed in Section V.D., although human behaviour is not easy to evaluate, parallels between humans and machines have already been investigated with a view to formulating commonly held ADSs regulations for real-time applications [15].
In addition to the above, other ADS trends are deep learning-based route planners [300], multi-task networks in ADSs [301], etc.

VI. CONCLUSION
This paper provides a comprehensive review of the literature on the contemporary state-of-the-art of Convolutional Neural Networks for image classification and detection as well as Autonomous Driving Systems. Layer-based details of CNNs along with parameter and floating-point operation number calculations are outlined.
Using an evolutionary approach, the latest CNN models for image classification are discussed covering the most up-to-date developments. A novel timing analysis aimed at assessing the impact of commonly used convolution types on CNN performance is presented running a reference CNN model on a desktop computer and two powerful GPUs. This extensive experimental study provides a new insight into the performance of each convolution type in terms of training time, inference time, and layer level decomposition.
Building blocks for CNN-based object detection are also discussed, such as backbone networks and baseline types, and then representative state-of-the-art designs are outlined. Experimental results from the literature are summarised for each of the reviewed models based on particular image datasets.
The most recent ADSs related works on CNN-based object classification and detection, current ADSs technologies and promising directions are outlined. In addition, a comprehensive trade-off analysis of ADSs from a human-machine perspective is presented. Thus, the paper deals with a broader study area compared to previous reviews, also covering new trends and improvements that are expected in the near future. In particular, it is highlighted that CNN models that feature low model sizes and achieve satisfactory accuracy have a promising future. It is our strong belief that, with the necessary real-time hardware capabilities and successful resolution of design constraints, CNNs would enable safer and more efficient self-driving cars and a rise in their popularity.

APPENDIX: DESIGN SPACE EXPLORATION
Sections III and IV above have highlighted that researchers have advanced the CNN state-of-the-art so effectively by innovating and perfecting the existing design approaches and techniques. The investigation of the literature reveals that the success of CNN models depends on the design strategies in regard to the CNN micro/macro architecture parameters as well as optimization algorithms, activation functions, compression techniques, learning types, normalization [302], regularization [303], loss functions [304], data augmentation [305], RoI feature extractors [122], region proposal algorithms [120], etc. Therefore, in this section, we review some of the most important areas of design space exploration for classification and object detection CNNs.

A. OPTIMIZATION
Optimization algorithms have played a significant role in the training of machine learning models in recent years. In particular, iterative or coordinate descent methods have been instrumental in increasing model performance. However, the optimization field has gradually faced more challenges with the growth in model complexity and data size, as reported in [19]. A summary of the algorithms employed in CNNs is presented below, dividing them into three main categories: first-order, high-order, and heuristic derivative-free.

1) FIRST-ORDER ALGORITHMS
Gradient descent algorithms are the most widely used first-order optimization methods. They depend on the first-order derivatives of the loss function, represented in Jacobian matrices. In its simplest form, the weight matrix θ is updated by $\theta = \theta - \mu \cdot \nabla J(\theta)$, where $\mu$ is the learning rate and $\nabla J(\theta)$ is the Jacobian matrix of the cost function $J(\theta)$. Three gradient descent variants are used, depending on the number of examples (m) from a dataset that the algorithm uses to compute the gradient of the objective in each iteration. The first variant, Batch gradient descent, uses all m examples in every iteration and is calculated with $\theta = \theta - \mu \cdot \nabla_{\theta} J(\theta; x^{(i:i+m)}; y^{(i:i+m)})$, where $x^{(i)}$ represents the input features of the i-th training example and $y^{(i)}$ is the real value for the i-th example. Its drawback is a possible slowness, as it calculates the gradients for the whole dataset. The second one is Stochastic gradient descent (SGD), which uses only one example in each iteration: $\theta = \theta - \mu \cdot \nabla_{\theta} J(\theta; x^{(i)}; y^{(i)})$. Its limitation is that it performs frequent updates with a high variance, which produces intense fluctuations in the objective function.
The last variant is Mini-batch gradient descent, which uses n = m/x examples in each iteration, where the divisor x is selected according to the application. The method, defined by $\theta = \theta - \mu \cdot \nabla_{\theta} J(\theta; x^{(i:i+n)}; y^{(i:i+n)})$, enhances performance, as it reduces the variance effects and the computational load.
Among these three algorithms, the SGD method [306], [307] and its variants have been widely used in recent years, as representatives of first-order optimization methods. It is important to note that the term SGD is also commonly used when mini-batches are employed. It is also important to consider the application scope and characteristics of these methods when selecting them for use in DNNs.
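The three variants differ only in how many examples contribute to each update, which the following minimal sketch makes explicit. It assumes a one-parameter linear model with a squared-error cost and noise-free toy data; the model, the data, the function names and the hyperparameter values are illustrative assumptions rather than settings taken from the reviewed works.

```python
import random


def gradient(theta, xs, ys):
    """Gradient of J(theta) = mean((theta * x - y)^2) / 2 over the given examples."""
    return sum((theta * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, batch_size, mu=0.05, epochs=200, seed=0):
    """Apply theta <- theta - mu * grad J on batches of the given size."""
    rng = random.Random(seed)
    theta = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            bx = [xs[i] for i in batch]
            by = [ys[i] for i in batch]
            theta -= mu * gradient(theta, bx, by)
    return theta


# Toy data generated from y = 3x, so the optimum is theta = 3.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [3.0 * x for x in xs]
m = len(xs)

theta_batch = train(xs, ys, batch_size=m)       # batch: all m examples per update
theta_sgd = train(xs, ys, batch_size=1)         # SGD: one example per update
theta_mini = train(xs, ys, batch_size=m // 4)   # mini-batch: n = m/4 examples
print(theta_batch, theta_sgd, theta_mini)
```

Because the toy data contain no noise, all three variants reach the same optimum here; with noisy data, the single-example updates would fluctuate around it, which is the variance effect described above.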

2) HIGH-ORDER ALGORITHMS WITH HESSIAN MATRIX
Compared to the first-order methods, the high-order optimization methods [315]-[317] converge more accurately to the local or global minimum, as the curvature information they use is more powerful. In spite of these attractive features, they face multiple challenges. Second-order algorithms present problems with the computation and storage, in each iteration of the training process, of the inverses of the Hessian matrices, which are used in high-order optimization to hold the second or higher-order derivatives.
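To illustrate how curvature information is used, the sketch below applies a single Newton step, $\theta \leftarrow \theta - H^{-1} \nabla J(\theta)$, to a two-parameter quadratic cost $J(\theta) = \frac{1}{2}\theta^{T} A \theta - b^{T}\theta$, whose Hessian is the constant matrix A. The cost function, the values of A and b, and the helper names are assumptions chosen for illustration; this is not the procedure of any specific method in [315]-[317].

```python
def grad(theta, A, b):
    """Gradient A * theta - b of the quadratic cost."""
    return [A[0][0] * theta[0] + A[0][1] * theta[1] - b[0],
            A[1][0] * theta[0] + A[1][1] * theta[1] - b[1]]


def solve_2x2(H, g):
    """Solve H d = g for a 2x2 matrix H, avoiding an explicit inverse."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [(H[1][1] * g[0] - H[0][1] * g[1]) / det,
            (-H[1][0] * g[0] + H[0][0] * g[1]) / det]


def newton_step(theta, A, b):
    """One Newton update; the Hessian of the quadratic cost is A itself."""
    d = solve_2x2(A, grad(theta, A, b))
    return [theta[0] - d[0], theta[1] - d[1]]


def gradient_step(theta, A, b, mu=0.1):
    """One first-order update, shown for comparison."""
    g = grad(theta, A, b)
    return [theta[0] - mu * g[0], theta[1] - mu * g[1]]


A = [[3.0, 1.0], [1.0, 2.0]]   # illustrative positive-definite Hessian
b = [1.0, 1.0]
start = [0.0, 0.0]

theta_newton = newton_step(start, A, b)   # lands on the minimiser in one step
theta_gd = gradient_step(start, A, b)     # still far from it after one step
print(theta_newton, theta_gd)
```

For a model with n parameters the Hessian has n × n entries, which is where the computation and storage problems mentioned above originate.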

3) HEURISTIC DERIVATIVE-FREE ALGORITHMS
The heuristic derivative-free algorithms in [322], [329] have been proposed to address the case when the derivative information of the target functions may not exist or be hard to calculate. They are based on two root ideas. While the first one is a heuristic investigation with empirical rules [330], the second one is trying to fit the function with samples [331].
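As a minimal sketch of the first idea, a heuristic search driven by a simple empirical rule, the code below uses a random local search that keeps a candidate only if it lowers the objective. It is a generic stand-in and not one of the algorithms in the cited works; the objective function, step size and iteration count are illustrative assumptions.

```python
import random


def random_search(f, x0, step=0.5, iters=2000, seed=0):
    """Derivative-free search: perturb the best point, keep improvements only."""
    rng = random.Random(seed)
    best_x, best_f = list(x0), f(x0)
    for _ in range(iters):
        candidate = [xi + rng.uniform(-step, step) for xi in best_x]
        f_candidate = f(candidate)
        if f_candidate < best_f:  # empirical rule: accept only if better
            best_x, best_f = candidate, f_candidate
    return best_x, best_f


def objective(x):
    """Cost that is not differentiable at its minimum (1, -2)."""
    return abs(x[0] - 1.0) + abs(x[1] + 2.0)


x_best, f_best = random_search(objective, [0.0, 0.0])
print(x_best, f_best)
```

Only function evaluations are needed, so the search still works where the derivative does not exist, at the price of many more evaluations than a gradient-based method would need.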

B. CNN MICRO/MACRO ARCHITECTURE
Microarchitecture parameters of CNNs are related to layer-based adjustments, such as filter dimensions, pooling operation types, etc. These parameters vary widely among CNN architectures. For instance, while VGGNet [75] employs 3 × 3 filters, Inception-v1 [29] makes use of different sizes of filters, such as 1 × 1, 3 × 3, and 5 × 5. In addition, as CNNs tend to go deeper, manual adjustments of kernel types and input/output dimensions in individual layers become impractical. To address the problem, higher-level predesigned blocks or modules, which involve different sizes and types of kernels, bypass connections, activation functions, etc., have been proposed to reduce designers' effort. An example of such modules is the inception module used in the design of the Inception-v1 CNN, also known as GoogLeNet, where many inception modules are combined with ad-hoc layers to construct a complete network.
Whereas the properties of individual layers and modules represent microarchitecture, CNN macro architecture concerns the system-level organization in its entirety or the end-to-end algorithm pipeline. The selection of depth, i.e., the number of layers or modules, and the choice of connection types across multiple layers or blocks can be listed as the two primary areas of the CNN macro architecture. In VGGNet, increasing the number of layers and modules gives a better accuracy performance at the system level. An example of connection types between the layers is the FCLs that are visualised in Fig. 1 for AlexNet [28]. In addition, several macro architecture strategies were proposed in ResNet-50 [17] that advanced the state-of-the-art in its publication year. Several other CNN designs [103], [116], and [122], which were introduced earlier, have gained success with innovative macro architecture-based strategies as well.
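The effect of the filter size on a layer can be made concrete with the standard parameter count of a convolutional layer, k × k × C_in × C_out weights plus C_out biases. The sketch below evaluates it for the three filter sizes mentioned above; the channel counts are hypothetical and are not taken from VGGNet or Inception-v1.

```python
def conv_params(kernel, c_in, c_out, bias=True):
    """Learnable parameters of a kernel x kernel convolutional layer."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


# Hypothetical channel counts, for illustration only.
c_in, c_out = 64, 64
counts = {k: conv_params(k, c_in, c_out) for k in (1, 3, 5)}
for k, n in counts.items():
    print(f"{k}x{k}: {n} parameters")
```

With these channel counts, a 5 × 5 layer carries roughly 2.8 times the parameters of a 3 × 3 layer, which is one reason why the filter size is a central microarchitecture decision.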

C. COMPRESSION
Although Deep Neural Networks (DNNs) have obtained great success in various computer vision tasks, many existing designs are computationally intensive and require a large amount of memory. This prevents their implementation on hardware with limited memory, like FPGAs, and their use in applications that run under strict latency constraints. For these reasons, compressing a pre-trained model into a lightweight one without sacrificing model performance is a frequently used technique to enable real-time deployment. Four approaches exist: pruning, tensor decomposition, quantization, and knowledge distillation.

1) PRUNING
Pruning removes redundant parameters that have little or no effect on the performance. Thus, weight matrices are turned into sparse ones. Parameter pruning is quite robust over various settings in the layer and can support training from scratch or from pre-trained models. Additionally, it helps with the overfitting problem. More information on pruning can be found in [332]-[334].
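A minimal sketch of how a weight matrix becomes sparse is given below. It assumes magnitude-based pruning, i.e., that the weights with the smallest absolute values are the redundant ones; this is only one possible criterion and is not necessarily the one used in [332]-[334]. The weight values and the function name are illustrative.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude.

    Ties at the threshold are pruned too, so slightly more weights
    than requested may be removed.
    """
    magnitudes = sorted(abs(w) for row in weights for w in row)
    n_prune = int(len(magnitudes) * sparsity)
    if n_prune == 0:
        return [row[:] for row in weights]
    threshold = magnitudes[n_prune - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


# Illustrative 2 x 3 weight matrix.
weights = [[0.8, -0.05, 0.3],
           [0.01, -0.6, 0.02]]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)
```

In practice, the remaining weights are usually fine-tuned after pruning so that the model can recover any accuracy lost by the removal.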

2) QUANTIZATION
Quantization adopts low-bit number representations instead of floating-point ones for every weight parameter. Currently, the main research trend in quantization is model binarization. For instance, Binarized Neural Networks [335] quantize the weight values to 1-bit representations, building on BinaryConnect, and also reduce the activation values to 1 bit. This reduces memory usage and replaces many multiply-add operations with bitwise XNOR-count operations. Another example is XNOR-Net [336], which achieves 32× compression and a 58× speed-up of convolutional operations. Several other techniques [337]-[342] have been proposed as well.
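The replacement of multiply-add operations by bitwise ones can be sketched as follows. The code binarizes two small vectors with the sign function, packs them into integers, and computes their dot product with an XNOR followed by a bit count. The vectors and helper names are illustrative assumptions, and the sketch omits the scaling factors and training procedures of the cited networks.

```python
def binarize(values):
    """Map real values to {-1, +1} with the sign function (0 maps to +1)."""
    return [1 if v >= 0 else -1 for v in values]


def pack_bits(signs):
    """Pack a {-1, +1} vector into an integer, one bit per element (+1 -> 1)."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two packed {-1, +1} vectors of length n."""
    mask = (1 << n) - 1
    agree = bin(~(a_bits ^ b_bits) & mask).count("1")  # XNOR, then bit count
    return 2 * agree - n  # agreements add +1, disagreements add -1


# Illustrative full-precision weights and activations.
weights = [0.7, -1.2, 0.1, -0.4]
activations = [0.5, 0.3, -0.9, -0.2]

w_bin, a_bin = binarize(weights), binarize(activations)
reference = sum(w * a for w, a in zip(w_bin, a_bin))  # multiply-add version
fast = xnor_popcount_dot(pack_bits(w_bin), pack_bits(a_bin), len(w_bin))
print(reference, fast)
```

Each binarized value occupies one bit instead of a 32-bit floating-point word, which is the source of the 32× compression figure quoted above.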

3) KNOWLEDGE DISTILLATION
The idea behind the knowledge distillation approach is related to a type of transfer learning, in which a complex architecture, called the teacher model, is used to train a compact model, called the student model, by distilling knowledge from the teacher. The knowledge distillation approach is used in both classification and detection algorithms. However, the technique is quite sensitive to the range of applications, and there is no pre-trained model option. An example of distillation is EfficientNet-L2 [80], in which it is applied quite effectively. Improved versions of the technique can be found in [343]-[351].
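One common way to distil knowledge is to train the student to match the teacher's temperature-softened class probabilities. The sketch below computes such a soft-target loss for a single example; it illustrates this generic formulation and not the specific training scheme of [80]. The logit values, the temperature and the function names are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperatures give softer distributions."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtracted for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))


# Illustrative logits for a three-class problem.
teacher = [4.0, 1.0, 0.2]
student_close = [3.5, 1.2, 0.1]   # agrees with the teacher
student_far = [0.2, 1.0, 4.0]     # disagrees with the teacher

loss_close = distillation_loss(student_close, teacher)
loss_far = distillation_loss(student_far, teacher)
print(loss_close, loss_far)
```

The loss is lower for the student whose predictions resemble those of the teacher, so minimising it pushes the compact model towards the behaviour of the complex one.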

4) TENSOR DECOMPOSITION
The aim of tensor decomposition is to exploit channel-wise or spatial-wise redundancies of convolution filters and to seek lower-rank approximations of them. Further details are reported in [352]-[356].
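The idea of a lower-rank approximation can be sketched on a small matrix, which stands in here for a filter tensor reshaped to two dimensions (an assumption made for brevity). The code finds a rank-1 factorisation $\sigma u v^{T}$ by power iteration and compares the number of stored values. The matrix, the iteration count and the helper names are illustrative; the methods in [352]-[356] use more elaborate decompositions.

```python
def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def norm(v):
    """Euclidean norm of a vector."""
    return sum(x * x for x in v) ** 0.5


def rank1_approx(W, iters=50):
    """Dominant singular triplet (sigma, u, v) of W via power iteration."""
    Wt = [list(col) for col in zip(*W)]
    v = [1.0] * len(W[0])
    sigma = 0.0
    for _ in range(iters):
        u = matvec(W, v)
        u_norm = norm(u)
        u = [x / u_norm for x in u]
        v = matvec(Wt, u)
        sigma = norm(v)
        v = [x / sigma for x in v]
    return sigma, u, v


# Illustrative 4 x 3 matrix with fully redundant rows (exactly rank 1).
a = [1.0, 2.0, 3.0, 4.0]
b = [1.0, 0.5, 2.0]
W = [[ai * bj for bj in b] for ai in a]

sigma, u, v = rank1_approx(W)
W_hat = [[sigma * ui * vj for vj in v] for ui in u]
max_err = max(abs(W[i][j] - W_hat[i][j]) for i in range(4) for j in range(3))

params_full = len(W) * len(W[0])       # values stored for the full matrix
params_low_rank = len(u) + len(v) + 1  # values stored for sigma, u and v
print(max_err, params_full, params_low_rank)
```

The saving is small for this toy matrix, but it grows with the matrix dimensions, since the full form stores m × n values and the rank-1 form only m + n + 1.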

D. MACHINE LEARNING METHODS
Here we briefly touch upon the current machine learning methods, as choosing the right machine learning approach in ADSs pipelines may allow improving their performance [80], [85].
Based upon the addressed problem and data type, machine learning approaches can be broadly divided into three groups: primary approaches, hybrid approaches, and other types. Primary approaches comprise supervised [357], unsupervised [358], and reinforcement learning [359]. Hybrid approaches, the use of which has become quite prevalent in the last few years [81], could be subdivided into three groups: semi-supervised learning [81], self-supervised learning [360], and self-taught learning [361].

ACKNOWLEDGMENT
This work used the ALICE High-Performance Computing Facility at the University of Leicester.