Improved Vision-Based Vehicle Detection and Classification by Optimized YOLOv4

Rapid and precise detection and classification of vehicles are vital for intelligent transportation systems (ITSs). However, due to the small gaps between vehicles on the road and the interference features of photos or video frames containing vehicle images, it is difficult to detect and identify vehicle types quickly and precisely. To solve this problem, a new vehicle detection and classification model, named YOLOv4_AF, is proposed in this paper, based on an optimization of the YOLOv4 model. In the proposed model, an attention mechanism is utilized to suppress the interference features of photos through both the channel dimension and spatial dimension. In addition, a modification of the Feature Pyramid Network (FPN) part of the Path Aggregation Network (PAN), utilized by YOLOv4, is applied in order to further enhance the effective features through down-sampling. This way, the objects can be steadily positioned in the 3D space and the object detection and classification performance of the model can be improved. The results, obtained through experiments conducted on two public data sets, demonstrate that the proposed YOLOv4_AF model outperforms both the original YOLOv4 model and two other state-of-the-art models, Faster R-CNN and EfficientDet, in terms of mean average precision (mAP) and F1 score, achieving respective values of 83.45% and 0.816 on the BIT-Vehicle data set, and 77.08% and 0.808 on the UA-DETRAC data set.


I. INTRODUCTION
Presently, object detection and classification are widely used in intelligent transportation systems (ITSs), and some industrial and military systems. ITSs, for instance, can perform vehicle detection and classification for comprehensive analysis of passing vehicles for the purposes of accomplishing efficient vehicle traffic management and control, and urban planning. The existing object detection methods can be divided into two groups [1]: (i) hardware-based; and (ii) vision-based methods. The latter try to localize an object in a photo/video frame by drawing bounding boxes (BBoxes) around the detected objects. If, in addition, object classification is performed, then the predicted class label is also shown on the image along with the confidence score associated with each bounding box (BBox) [2]. The vision-based object detection methods can be further categorized, as per [1], as: (i) dimension-based; (ii) logo-based; and (iii) feature-based methods. The traditional (before 2012) feature-based object detection methods, such as Haar [3], the histogram of oriented gradients (HOG) [4], the HOG-support vector machine (HOG-SVM) [5], etc., consist of three parts [2]: (i) informative region selection; (ii) feature extraction; and (iii) classification. Later, these methods were replaced by deep learning (DL) feature-based methods, due to the continuous growth of large volumes of data (Big Data) and the fast development of (multicore) processors and Graphics Processing Units (GPUs) [2]. Currently, DL feature-based methods are considered the state-of-the-art, thanks to their remarkable object detection accuracy and operational speed. Compared to traditional feature-based methods, which use manual extraction of features done by experts, DL methods can automatically learn feature characteristics from huge volumes of data over time [2].
Currently, many object detection models are based on convolutional neural networks (CNNs) due to their rich representation power [6]. The visual recognition performed by the CNN feature extraction is close to the visual mechanism of human beings [2]. A typical CNN contains different types of layers, e.g., convolutional, pooling, fully connected, etc., whereby each layer transforms the 3D input volume to a 3D output volume of neuron activations [1]. Different CNN architectures have been elaborated to date. Among these, the Region-based CNN (R-CNN) was the first one to successfully apply DL to object detection and other computer vision tasks by realizing automatic extraction of image features. Recent advances in object detection are indeed driven by the success of R-CNNs, whose cost has been significantly reduced thanks to sharing convolutions across object proposals [7]. The original R-CNN model was followed by other incarnations, such as Fast R-CNN [8], Faster R-CNN [7], Mask R-CNN [9], and Mesh R-CNN [10]. All these are representatives of the two-stage object detection models, which first generate a series of sparse candidate frames (i.e., region proposals extracted from a scene), followed (in the second stage) by candidate frame verification, classification, and regression to improve the scores and locations [2]. Among the pros of these models are the high accuracy and precise localization achieved in object recognition, whereas their main cons are the more complex training required and the lower operational speed achieved [2], especially if taking into account that real-time object detection is now playing an increasingly important role in practical applications. In this regard, members of the other group of single-stage object detection models, such as You Only Look Once (YOLO) [11] and Single Shot MultiBox Detector (SSD), perform better by directly adopting a regression method for object detection, resulting in higher operational speed.
However, SSD does not consider the relationship between different scales, so it has limitations in detecting small objects, whereas YOLO learns general features more easily and its operational speed is higher [12]. Both SSD and YOLO, however, cannot perfectly handle the graphic area, resulting in high detection error and miss rates.
In 2017, Google proposed a new idea to completely replace the traditional CNN and recurrent neural network (RNN) structures with an attention structure, which was subsequently widely adopted in different fields. With the attention mechanism utilized for object detection, new weights are added to images for marking, learning, and training, so as to achieve the effect of paying attention to more important areas and to solve the problem of information resource allocation.
Regarding the object classification, a model which is optimal for this task may not be optimal for object detection. As pointed out in [13], in contrast to a classifier, a detector requires: (i) a larger size of the input network (i.e., a higher resolution) for detecting multiple objects of small sizes; (ii) more network layers for increasing the receptive field and covering the larger size of the input network; and (iii) more parameters for improving the model performance in detecting multiple objects of different sizes in a single image.
The objective of this paper is to come up with a novel model, called YOLOv4_AF, based on a YOLOv4 optimization, so as to achieve a better balance between the two tasks (object detection and object classification) and get better performance results for vision-based vehicle detection and classification, compared to the state-of-the-art models used for the same purpose. For this, YOLOv4_AF utilizes an attention mechanism to suppress interference features of photos/video frames through both the channel dimension and spatial dimension. In addition, a modification of the Feature Pyramid Network (FPN) part of the Path Aggregation Network (PAN), used in YOLOv4, is proposed and implemented, followed by a maximum pooling for feature fusion. The novelty of the proposed model is in the combined use of these new ideas, incorporated into the original YOLOv4 model, allowing it to achieve better performance results for vehicle detection and classification, as presented further in the paper.
The rest of the paper is organized as follows. The next section presents the background information, followed by the related work considered in Section III. Section IV describes the proposed YOLOv4_AF model, followed by the experimental Section V. Finally, Section VI concludes the paper.

A. CONVOLUTION
The physical meaning of the convolution process, used in CNNs, can be generally summarized as follows. The output of a system at a given point of time is generated through a joint effect (superimposition) of multiple inputs. In LeNet-5, an early CNN proposed by LeCun et al. in [14], the original image can be gradually converted into a series of feature maps through alternately connected convolutional layers and down-sampling layers, which allows objects, presented in the image, to be easily classified according to these maps. Later, through gradual improvement and development, AlexNet [15], GoogLeNet [16] [17], ResNet [18], VGG-16 [19], etc. emerged, whereby various convolutional models can be selected independently. The purpose of the convolution process is to convert the input image into a matrix according to the corresponding pixel values. For instance, if a 6×6 pixels image is considered for convolution with a 3×3 convolution kernel and a step size of 1, then a 4×4 feature map is formed, as shown in Figure 1. In reality, the data volume is relatively large and the computing workload of convolution is also quite heavy, so some pre-processing is required before the convolution computation, e.g., compressing the output of the previous layer so as to lessen the workload without causing too many losses. For instance, with a Region Proposal Network (RPN), the Faster R-CNN model, described in Subsection III.A.2, inputs the convolutional features into the RPN to obtain candidate frame information, thus replacing the original R-CNN's selective search and, by this, improving the frame detection speed. Another approach, e.g., utilized by YOLO as described in Subsection III.B, uses just one convolution to achieve faster processing, which is suitable for real-time applications.
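The 6×6 input / 3×3 kernel / stride-1 example above can be verified with a few lines of NumPy. This is a minimal illustrative sketch (the averaging kernel and input values are arbitrary choices, not taken from the paper):

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Valid 2D convolution (cross-correlation, as used in CNNs)."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Multiply the kernel with the current image patch and sum
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)   # 6x6 input "image"
kernel = np.ones((3, 3)) / 9.0                     # 3x3 averaging kernel
feature_map = conv2d(image, kernel)                # stride 1
print(feature_map.shape)                           # (4, 4)
```

With a 6×6 input, a 3×3 kernel, and stride 1, the output is indeed a 4×4 feature map, matching the Figure 1 example.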

B. ATTENTION
The attention mechanism originates from the research conducted on human vision. In information processing, people selectively focus on part of the information and ignore the rest due to the effect of cognitive processing. Called "attention", this mechanism performs two main functions [20], i.e., deciding on: (i) which part(s) of the input the attention needs to be paid to; and (ii) which important parts of the input the (limited) information processing resources should be allocated to. According to the characteristics of the attention effect, the attention mechanisms could be divided into term-based and location-based ones. Based on the domain, the attention mechanisms could be classified into spatial, channel, and mixed domain attention mechanisms. Overall, attention mechanisms have shown significant potential in the area of object detection due to their intuitiveness, versatility, and interpretability [21].
The YOLOv4_AF model, proposed in this paper, utilizes the lightweight general-purpose convolutional block attention module (CBAM), proposed in [6], which enables the network to notice the informative parts of the images [22]. By combining channel-wise and spatial-wise attention, CBAM can achieve better detection results than one-type attention mechanisms, such as the channel-attention based SENet [23]. This is because the spatial-wise attention focuses on 'where' an informative part is, whereas the channel-wise attention focuses on 'what' is meaningful in the input image [6]. For this, CBAM utilizes two modules, a channel attention module (CAM) and a spatial attention module (SAM), which could be placed either in parallel or sequentially. However, the sequential arrangement outperforms the parallel one. For the former, the CAM-first order (shown in Figure 2) is slightly better than the SAM-first order, as proved in [6], and thus it was utilized in the proposed YOLOv4_AF model. CAM utilizes different feature information, whereby each channel of the features represents a special detector. After adding the features together, the activation function produces their weights and determines which features are meaningful. SAM concatenates the channel descriptions obtained through average pooling and maximum pooling, and processes them with a convolutional layer to obtain the weight coefficients.
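A minimal NumPy sketch of the sequential CAM-first arrangement may help. The MLP weights below are random stand-ins for learned parameters, and the 7×7 convolution of the original SAM is replaced by a scalar-weighted sum for brevity, so this is an illustration of the data flow rather than the actual CBAM implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """CAM: squeeze spatial dims by avg- and max-pooling, pass both
    descriptors through a shared two-layer MLP, and gate each channel."""
    avg = x.mean(axis=(1, 2))                       # (C,)
    mx = x.max(axis=(1, 2))                         # (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)    # shared MLP with ReLU
    scale = sigmoid(mlp(avg) + mlp(mx))             # (C,)
    return x * scale[:, None, None]

def spatial_attention(x):
    """SAM: pool along the channel axis; a scalar-weighted sum stands in
    for CBAM's 7x7 convolution (a simplification for this sketch)."""
    avg = x.mean(axis=0)                            # (H, W)
    mx = x.max(axis=0)                              # (H, W)
    scale = sigmoid(0.5 * avg + 0.5 * mx)           # (H, W)
    return x * scale[None, :, :]

def cbam(x, w1, w2):
    """Sequential CAM-first arrangement, as adopted in the paper."""
    return spatial_attention(channel_attention(x, w1, w2))

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))      # (C, H, W) feature map
w1 = rng.standard_normal((2, 8))        # reduction ratio r = 4: 8 -> 2
w2 = rng.standard_normal((8, 2))        # ... and back: 2 -> 8
print(cbam(x, w1, w2).shape)            # (8, 4, 4)
```

Note that both attention stages only rescale the input, so the feature map keeps its (C, H, W) shape and CBAM can be inserted between any two layers.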
In general, CBAM is easy to use, can be integrated into any CNN architecture in a seamless way, can be trained on an end-to-end basis, and its overall overhead is quite small both in terms of computation and parameters. In addition, it shows great potential for use on low-end devices [6].

C. FEATURE FUSION
In object detection, an important means for improving the segmentation performance is to fuse features of different scales. The resolution and information of different layers are different. The lower-layer features undergo less convolution, resulting in lower semantics but higher resolution, which can provide more location information. The opposite holds for upper-layer features, where the lower resolution leads to poor perception of detailed information. The object detection performance of a model could be improved through fusion of multi-layer features [24]. Depending on which one (fusion or prediction) takes place before the other, fusion is divided into two types, early fusion and late fusion. Early fusion includes classic feature fusion methods such as concatenation, addition, etc., whereas late fusion combines the detection results of different layers. For instance, FPN [25,26] uses both high-resolution and high-semantic information to achieve better results by fusing these different lower-layer and higher-layer features. For this, FPN first performs pyramid fusion, whereby each layer combines upper-layer feature information through up-sampling to obtain the prediction map, followed by prediction performed separately on each fused feature layer.
As the conventional top-down FPN is limited by the one-way information flow [27], in YOLOv4, described in Subsection III.B.4, after FPN, an extra bottom-up pathway aggregation is added, resulting in the so-called Path Aggregation Network (PAN) [28]. The FPN structure, utilized by YOLOv4, is shown in Figure 3, where C_i represents the i-th ResNet convolution group (i.e., the i-th feature level with resolution of 1/2^i of the input image) and M_i represents the i-th feature map obtained after a 1×1 convolution of C_i. Fusion with the up-sampled feature maps is used to obtain the new feature maps M4, M3, and M2 from the corresponding features of C4, C3, and C2, respectively. After the addition, a 3×3 convolution is used by the fusion to generate the final feature map.
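The top-down fusion pass of FPN can be sketched in NumPy as follows. The 1×1-convolution weights are random stand-ins for learned ones, the channel counts and spatial sizes are hypothetical, and the final 3×3 smoothing convolution is omitted for brevity:

```python
import numpy as np

def upsample2x(f):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return f.repeat(2, axis=1).repeat(2, axis=2)

def fpn_topdown(c_maps, d=4):
    """Top-down FPN pass: a 1x1 'conv' (channel mixing) on each C_i,
    then the up-sampled coarser map is added to produce the fused map."""
    rng = np.random.default_rng(0)
    # 1x1 convolutions reduce every level to d channels (random weights)
    laterals = [np.einsum('oc,chw->ohw',
                          rng.standard_normal((d, c.shape[0])), c)
                for c in c_maps]
    fused = [None] * len(laterals)
    fused[-1] = laterals[-1]                        # coarsest level as-is
    for i in range(len(laterals) - 2, -1, -1):      # walk to finer levels
        fused[i] = laterals[i] + upsample2x(fused[i + 1])
    return fused

# C2..C4 feature levels of a hypothetical 32x32 input (strides 4, 8, 16)
c2 = np.ones((16, 8, 8))
c3 = np.ones((32, 4, 4))
c4 = np.ones((64, 2, 2))
p2, p3, p4 = fpn_topdown([c2, c3, c4])
print(p2.shape, p3.shape, p4.shape)     # (4, 8, 8) (4, 4, 4) (4, 2, 2)
```

Each fused level keeps the spatial resolution of its lateral input while inheriting semantic information from the coarser levels, which is the point of the top-down pathway.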

III. RELATED WORK
In general, each object detector consists of five parts [29]: an input, a backbone, a neck, a head, and an output. The (pre-trained) backbone could be [13]: (i) VGG-Nets [30], ResNet [18], ResNeXt [31], DenseNet [32], etc., for the GPU-based detectors; and (ii) SqueezeNet [33], MobileNet [34] [35], ShuffleNet [36] [37], etc., for the central processing unit (CPU)-based detectors. Composed of several bottom-up and top-down paths, the neck is usually used to collect feature maps from different stages [13]. Examples of networks with such a mechanism include FPN [25,26], PAN [28], BiFPN [27], and NAS-FPN [38]. The head part, used to predict classes and BBoxes of objects [13], may comprise either one stage or two stages. Correspondingly, the object detection models that adopt CNNs are divided into two main groups [12]:
1) Two-stage models, mostly represented by the R-CNN series, which extract and reposition the candidate frame and generate detection output through independent CNN channels. Having demonstrated an exceptional precision, these models, however, have much lower operational speed compared to that of the one-stage models, and thus cannot meet the requirements for real-time operation [12].
2) One-stage models, mostly represented by SSD [39] and different versions of YOLO, which skip the process of generating the selected area through the candidate framework and directly generate the category probability and location coordinate value of the object to be detected, identified, and classified, which increases their operational speed despite a slight flaw in precision. In addition, these models are easier to optimize [2].
The main (anchor-based) representatives of these two groups are briefly described in the subsections below.

A. R-CNN
R-CNN achieves excellent object detection accuracy by using deep CNNs to classify object locations, a.k.a. "object proposals" [7]. For this, it trains CNNs end-to-end to classify the proposal regions into object categories or background. Its accuracy depends on the performance of the region proposal method (adopted as an external module performing the selective search [7]), used to extract regions of interest (RoIs). Generally, a complex, precisely tuned, and very slow pipeline is used, whereby first the selective search generates potential BBoxes, then a CNN extracts features, an SVM scores the BBoxes, a linear model adjusts the BBoxes, and finally a non-max suppression eliminates duplicate detections [11]. In addition, if images of different sizes are used, a scaling process must be applied.
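The final non-max suppression step of this pipeline can be sketched as a short greedy procedure. The boxes, scores, and 0.5 threshold below are illustrative values, not taken from R-CNN itself:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that
    overlap it by more than `thresh`, and repeat on the remainder."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        order = np.array([j for j in order[1:]
                          if iou(boxes[i], boxes[j]) < thresh])
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # [0, 2] -- the near-duplicate box 1 is dropped
```

The same greedy suppression step is reused essentially unchanged by all detectors discussed later in this section, including YOLO.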
In general, as pointed out in [7], R-CNN suffers from remarkable drawbacks, such as slow object detection (due to performing a CNN forward pass for each object location, without sharing computation) and expensive (both in space and time) multi-stage pipeline training. The emerged incarnations of R-CNN are described briefly in the following subsections.

A.1. FAST R-CNN
Fast R-CNN was proposed by Ross Girshick in 2015 for object detection [8]. Compared to the regular R-CNN, which independently computes the output features on each RoI, Fast R-CNN runs the CNN only once on the entire image. A single-stage training is utilized, using a multi-task loss, which can update all network layers and jointly learns to classify object proposals and refine their spatial locations. In addition, no disk storage is needed for feature caching.
Firstly, Fast R-CNN inputs the image and multiple RoIs into a fully convolutional network. Then, each RoI is pooled into a fixed-size feature map, which is mapped to a feature vector by fully connected layers. There are two output vectors per RoI: softmax probabilities and per-class BBox regression offsets (Figure 4).
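The RoI pooling step can be illustrated with a minimal NumPy sketch for a single-channel feature map. Real implementations operate on batched, multi-channel tensors and typically use a 7×7 output grid; the 2×2 grid and the feature values here are hypothetical:

```python
import numpy as np

def roi_pool(feat, roi, out_size=2):
    """Max-pool the RoI (x1, y1, x2, y2, in feature-map pixels) of a
    (H, W) feature map into a fixed out_size x out_size grid."""
    x1, y1, x2, y2 = roi
    region = feat[y1:y2, x1:x2]
    h, w = region.shape
    out = np.zeros((out_size, out_size))
    # Split the region into a coarse grid and take the max of each bin
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = region[ys[i]:ys[i+1], xs[j]:xs[j+1]].max()
    return out

feat = np.arange(64, dtype=float).reshape(8, 8)
pooled = roi_pool(feat, (0, 0, 6, 6))   # a 6x6 RoI -> fixed 2x2 output
print(pooled.shape)                     # (2, 2)
```

Whatever the RoI size, the output grid is fixed, which is what lets the subsequent fully connected layers accept RoIs of arbitrary shape.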
Compared to R-CNN, Fast R-CNN circumvents the redundant feature extraction operation and performs feature extraction only once for the whole image, using a RoI pooling layer and the introduced proposal (suggestion frame) information to extract the corresponding proposal features. Similarly to R-CNN, Fast R-CNN utilizes the very deep network VGG-16, but trains it 9 times faster. Moreover, it is 213 times faster than R-CNN in testing and achieves higher mean average precision (mAP) on PASCAL VOC 2012 [8]. Having reduced the running time of the detection network, Fast R-CNN has exposed the region proposal computation as the main speed bottleneck [7].

A.2. FASTER R-CNN
Faster R-CNN was proposed by Ren et al. in 2017 [7]. It replaces the selective search, used in R-CNN and Fast R-CNN, with a neural network to propose BBoxes [11]. By introducing a RPN, which shares full-image convolutional features with the detection network, it enables nearly cost-free region proposals. Working on an input image, RPN outputs a set of rectangular object proposals, each with an objectness score. RPN is trained end-to-end (by back-propagation and stochastic gradient descent) to generate high-quality region proposals (with a wide range of scales and aspect ratios), which are then used by Fast R-CNN for object detection. The main training scheme alternates between fine-tuning for the region proposal task and fine-tuning for object detection, while keeping the proposals fixed, which allows the scheme to converge quickly. In addition, RPN and Fast R-CNN are merged into a single network by sharing their convolutional features with attention mechanisms, whereby RPN tells the unified network where to look. More specifically, a CNN and the maximum pooling layer are firstly used to extract input image features and obtain the feature map. For each candidate region, the RoI pooling layer extracts feature vectors of fixed length and sends them to the multi-class classifier for classification. Finally, a BBox regression is performed, the offset value of the frame coordinates is obtained, and the sample frame is modified to classify the region, as shown in Figure 5. Recently, Wang et al. have optimized the Faster R-CNN structure and improved the model performance by adding a FPN [40]. The model performs well when trained and tested using single-scale images, which also benefits the running speed. Overall, Faster R-CNN is not only a cost-efficient model, but also presents an effective way for improving the accuracy of object detection [7]. It is the current leading model used in several benchmarks [9].
Thus, it was selected as the main representative of the R-CNN group for performance comparison with the YOLOv4_AF model proposed in this paper.
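The anchor grid underlying RPN's proposals can be sketched as follows. The scales, ratios, and stride below are hypothetical (the original RPN uses three scales and three ratios); RPN then scores each anchor and regresses offsets from it:

```python
import numpy as np

def make_anchors(fm_h, fm_w, stride, scales=(64, 128), ratios=(0.5, 1, 2)):
    """Centred (x1, y1, x2, y2) anchors at every feature-map cell,
    one per (scale, ratio) pair, in input-image coordinates."""
    anchors = []
    for y in range(fm_h):
        for x in range(fm_w):
            # Centre of this feature-map cell, projected onto the image
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # Keep the area ~ s^2 while varying the aspect ratio
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)

a = make_anchors(4, 4, stride=16)
print(a.shape)   # (96, 4) -- 4*4 cells x 2 scales x 3 ratios
```

Because the anchors are laid out densely over the shared feature map, proposal generation costs little beyond the convolutional features themselves, which is the source of the "nearly cost-free" claim.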

A.3. MASK R-CNN
Mask R-CNN [9] extends Faster R-CNN by adding a branch for predicting segmentation masks on each RoI, in parallel with the existing branch for classification and BBox regression. The mask branch is a small fully convolutional network applied to each RoI, predicting a segmentation mask in a pixel-to-pixel manner.
Basically, Mask R-CNN combines elements of object detection (i.e., identifying individual objects and localizing each of them using a BBox) and semantic segmentation (i.e., classifying each pixel into a fixed set of categories without differentiating object instances) [9]. Firstly, Mask R-CNN uses a CNN to extract the features of the input image through a RPN and gets several feature maps of different sizes (Figure 6). From these maps, it obtains the candidate frames (RoIs). Then, in order to solve the problem of misalignment between the mask and the object in the original image, it uses a quantization-free RoIAlign layer that faithfully preserves exact spatial locations, followed by a BBox regression, mask branching on the RoIs, and RoI generation. In general, Mask R-CNN is simple to train but adds a (small) overhead compared to Faster R-CNN. In addition, it is not optimized for speed [9].

A.4. MESH R-CNN
To solve the problem of the 2D target detection ignoring the 3D spatial information, Facebook researchers proposed the Mesh R-CNN model [10], which can detect multiple objects in varying lighting conditions and different contexts. The Mesh R-CNN augments Mask R-CNN with a mesh prediction branch, which outputs meshes with a varying topological structure by first predicting coarse voxel representations that are converted into meshes and refined with a graph CNN operating over the mesh's vertices and edges [10]. This allows the model to detect multiple objects in the input image and to achieve fine prediction of arbitrary geometric structures and the corresponding 3D triangular mesh for each object, successfully extending the capability of the 2D perception to 3D target detection and shape prediction.

B. YOLO
YOLO is a family of models started in 2016 by Joseph Redmon [11]. With its different versions, YOLO presents a new approach to object detection as it only needs to "look" once at an image to detect the objects and their locations on it. For this, instead of repurposing classifiers to perform detection, it frames object detection as a single regression problem to spatially separated BBoxes and associated class probabilities, which are predicted by a single CNN directly from the entire image in one step. YOLO trains on full images and directly optimizes its performance for object detection. Compared to traditional models, it provides the following advantages [11]: (i) higher operational speed, which makes it suitable for use in real-time applications, such as video surveillance and live web cameras (all R-CNN representatives fall short of real-time performance); (ii) global reasoning about the image when making predictions, resulting in fewer background errors (e.g., 50%+ fewer errors compared to Fast R-CNN); (iii) high generalizability, which makes it less likely to break down when applied to new domains or unexpected inputs (for instance, when trained on natural images and tested on artwork, it outperforms R-CNN by a wide margin). However, in terms of detection precision, it is outperformed by R-CNN due to struggling to precisely localize small objects [11].
YOLO versions, briefly presented in the following subsections, are the most balanced object detectors in terms of detection accuracy and operational speed [2]. However, a new set of object detection models, called EfficientDet [27], has been recently proposed, utilizing a weighted bi-directional FPN in trying to achieve better accuracy and efficiency [2]. That is why EfficientDet was also included in the performance comparison of models, presented in Section V.

B.1. YOLOv1
The first version of YOLO [11] achieves good detection speed by dividing a single image into multiple grids, and classifying and retrieving each grid [12]. It uses a CNN architecture, similar to GoogLeNet, consisting of 24 convolutional layers followed by two fully connected layers. But instead of the GoogLeNet inception modules, it uses 1×1 reduction layers followed by 3×3 convolutional layers. There is also a fast version of YOLOv1, called Fast YOLO, which uses just 9 instead of 24 convolutional layers and fewer filters in those layers, with the same training and testing parameters. Compared to other models used before for real-time object detection, Fast YOLO is more than twice as accurate [11]. However, YOLOv1 has limitations: the training data size is limited, the input image needs to be resized to 448×448×3 before being processed by the CNN, and finally the results need to be thresholded by the confidence of the model. In addition, [11] points out other limitations of YOLOv1, such as: (i) strong spatial constraints on BBox predictions, limiting the number of nearby objects that can be predicted, e.g., small objects appearing in groups, such as flocks of birds; (ii) struggling to generalize to objects in new or unusual aspect ratios and/or configurations; (iii) using relatively coarse features for predicting BBoxes; (iv) incorrect object localizations (e.g., compared to Fast R-CNN, YOLO makes a significant number of localization errors [41]) due to using a loss function that treats errors the same way in small and large BBoxes. Moreover, YOLO has relatively low recall, e.g., compared to R-CNN [41].
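YOLOv1's grid division can be illustrated with a short sketch that maps a normalised box centre to its responsible grid cell (S = 7 as in the paper; the example box centre is hypothetical):

```python
def assign_to_grid(box_center, S=7):
    """Map a normalised box centre (cx, cy, each in [0, 1]) to its
    YOLOv1 grid cell and the cell-relative offsets the network regresses."""
    cx, cy = box_center
    col, row = int(cx * S), int(cy * S)
    # Offsets of the centre inside its cell, each in [0, 1)
    tx, ty = cx * S - col, cy * S - row
    return (row, col), (tx, ty)

cell, offsets = assign_to_grid((0.5, 0.3), S=7)
print(cell)   # (2, 3) -- row 2, column 3 of the 7x7 grid
```

Since exactly one cell is responsible for each object's centre, two small objects whose centres fall into the same cell compete for the same predictions, which is the spatial-constraint limitation (i) noted above.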

B.2. YOLOv2
Compared to YOLOv1, YOLOv2 [41] is: (i) better, in terms of accuracy (outperforming also Faster R-CNN); (ii) faster (outperforming also Faster R-CNN); and (iii) stronger (more than 9000 different object categories can be detected in real time). In addition, it can run at varying sizes, offering an easy tradeoff between speed and accuracy. The YOLOv2 variation, YOLO9000, can jointly train on object detection and classification, which allows it to predict detections for object classes that do not have labelled detection data. The objects are classified using a hierarchical view, allowing distinct data sets to be combined, whereby a detection data set could be expanded with a huge amount of data from a classification data set. Compared to YOLOv1, the following improvements have been made in YOLOv2, as pointed out in [41]: (i) batch normalization on all convolutional layers (resulting in significant improvement in convergence and 2%+ improvement in mAP); (ii) a high-resolution classifier (leading to an increase of almost 4% of mAP); (iii) removing the fully connected layers and using convolutional layers with anchor boxes to predict BBoxes (resulting in a mAP decrease, but increasing the recall); (iv) dimension clusters; (v) direct location prediction (making the parametrization easier to learn and the network more stable); (vi) fine-grained features (resulting in 1% performance increase); (vii) multi-scale training (allowing the same network to predict detections at different resolutions), etc.
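The "dimension clusters" improvement (iv) runs k-means over the training-set box widths and heights, with 1 − IoU rather than Euclidean distance, to pick better anchor priors. The sketch below uses synthetic box data and a mean-based centroid update as a simplification:

```python
import numpy as np

def wh_iou(boxes, anchors):
    """IoU between (w, h) pairs, assuming boxes share a top-left corner."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0]) *
             np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def anchor_kmeans(boxes, k, iters=20, seed=0):
    """k-means on (w, h) with 1 - IoU as the distance metric."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = wh_iou(boxes, anchors).argmax(axis=1)  # nearest anchor
        for j in range(k):
            members = boxes[assign == j]
            if len(members):
                anchors[j] = members.mean(axis=0)  # update centroid (mean)
    return anchors

# Synthetic data: two size populations of boxes (w, h)
rng = np.random.default_rng(1)
boxes = np.abs(np.vstack([rng.normal(30, 2, (50, 2)),
                          rng.normal(100, 5, (50, 2))]))
anchors = anchor_kmeans(boxes, k=2)
print(anchors.shape)   # (2, 2) -- one (w, h) prior per cluster
```

Using 1 − IoU as the distance decouples the clustering from absolute box size, so large and small boxes are treated comparably when choosing the priors.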

B.3. YOLOv3
YOLOv3 [42] introduced different small changes in the design of YOLOv2 to improve it. Compared to YOLOv2, it is a little bigger but more accurate. While still predicting BBoxes using dimension clusters as anchor boxes, it predicts an objectness score for each BBox using logistic regression.
In addition, instead of softmax, YOLOv3 uses independent logistic classifiers. During training, it utilizes a binary cross-entropy loss for the class predictions. It predicts boxes at three different scales, whereby features from these scales are extracted using a concept similar to FPN. For feature extraction, it uses a new neural network, called DarkNet53, a hybrid between the YOLOv2 network, DarkNet19, and an innovative residual network, which better utilizes the GPU, making it more efficient and faster. Overall, YOLOv3 improves the object detection precision while maintaining the operational speed. At the same time, it has an enhanced ability to identify small objects, thanks to the utilized multi-scale prediction, but comparatively worse performance on medium- and large-size objects. In addition, the speed and precision can be traded off by changing the size of the model. However, as the intersection over union (IoU) threshold increases, the model's performance decreases and YOLOv3 does not fit well with the ground truth [12]. For sonar data sets with small effective samples and low signal-to-noise ratios, an improved YOLOv3 algorithm, named YOLOv3-DPFIN, was proposed in [12] for accurate real-time detection (e.g., in underwater applications) of noise-intensive multi-category sonar objects/targets with minimum time consumption, achieving higher precision and speed than the original YOLOv3 model.

B.4. YOLOv4
Even though the overall structure of YOLOv4 [13] is similar to that of YOLOv3 (both versions use the same head), its detection precision is significantly higher, as demonstrated by Yang et al. in [43]. Recently, Jang et al. [44] have conducted a new study on the utilization of GPU resources, yielding best performance for YOLOv4, having dealt also with the correlation between the GPU and CPU.
YOLOv4 is not only better than YOLOv3 in terms of operational speed and accuracy, but also easier to implement. In addition, it exhibits a strong real-time detection ability, and can operate on a single conventional GPU, which is sufficient also for its training [29]. The backbone of YOLOv4 adopts the CSPDarkNet53 neural network [45]. During sampling, 3×3 convolutional layers (29 in total) and a 725×725 receptive field are used. In the neck, a Spatial Pyramid Pooling (SPP) module [46] is used to increase the receptive field and separate out the most significant context features, with almost no speed reduction. In addition, for parameter aggregation from different backbone levels, the neck utilizes a PAN instead of the FPN used in YOLOv3. In fact, the PAN adds an extra bottom-up pathway aggregation on top of the FPN [27]. The loss function in YOLOv4 is also changed, namely to a complete IoU (CIoU) loss [47], which considers simultaneously and in a more comprehensive manner the three important geometric factors, i.e., the overlap area, the distance between central points, and the aspect ratio, thus allowing a better description of the BBox regression and achieving faster convergence speed and better regression accuracy. In addition, it also makes the detection model friendlier to small objects. The CIoU loss is formulated in [47] as:

L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v,

where v measures the consistency of the aspect ratio between the prediction box and the ground truth, and \alpha is a trade-off parameter used to balance the scale, defined respectively as:

v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2, \qquad \alpha = \frac{v}{(1 - IoU) + v},

where w^{gt} and h^{gt} denote the width and height of the real frame, w and h denote the width and height of the predicted frame, \rho^2(b, b^{gt}) denotes the squared Euclidean distance between the central points of the predicted frame and real frame, and c denotes the diagonal distance of the minimum closure area between the predicted frame and real frame. If the width and height of the real frame are similar to those of the predicted frame, then v = 0, and the penalty term produces no effect.
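To make the three CIoU terms (overlap, centre distance, aspect ratio) concrete, here is a minimal NumPy sketch of the loss for a single pair of corner-format boxes. It is a non-vectorized illustration, not the implementation used in YOLOv4:

```python
import numpy as np

def ciou_loss(pred, gt):
    """CIoU loss for two boxes given as (x1, y1, x2, y2), following [47]."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    # IoU term (overlap area)
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    union = (px2-px1)*(py2-py1) + (gx2-gx1)*(gy2-gy1) - inter
    iou = inter / union
    # Squared centre distance over squared enclosing-box diagonal
    pcx, pcy = (px1+px2)/2, (py1+py2)/2
    gcx, gcy = (gx1+gx2)/2, (gy1+gy2)/2
    rho2 = (pcx-gcx)**2 + (pcy-gcy)**2
    c2 = (max(px2, gx2)-min(px1, gx1))**2 + (max(py2, gy2)-min(py1, gy1))**2
    # Aspect-ratio consistency v and trade-off parameter alpha
    v = (4 / np.pi**2) * (np.arctan((gx2-gx1)/(gy2-gy1)) -
                          np.arctan((px2-px1)/(py2-py1)))**2
    alpha = v / (1 - iou + v) if v > 0 else 0.0
    return 1 - iou + rho2 / c2 + alpha * v

print(ciou_loss((0, 0, 10, 10), (0, 0, 10, 10)))   # 0.0 for identical boxes
```

For identical boxes, all three penalty terms vanish and the loss is zero; separating the centres or distorting the aspect ratio each adds its own penalty, which is what drives the faster, more accurate regression.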
The YOLOv4 structure is shown in Figure 7. Convolution (Conv), Batch normalization (Bn), and the Mish activation function constitute the smallest component, denoted as CBM. Conv, Bn, and the Leaky_ReLU activation function form another component, denoted as CBL. The Cross-Stage Partial connections (CSP) component consists of three convolutional layers and ResNet unit modules. All convolution kernels in front of the CSP are of size 3×3, equivalent to down-sampling [48]. SPP adopts a pooling operation over fixed blocks: maximum pooling with kernels of size 1×1, 5×5, 9×9, and 13×13 is applied, followed by a series of concat operations (tensor splicing and dimension expansion) to form the final output.

Figure 7. The YOLOv4 model.

B.5. YOLOv5
Shortly after the appearance of YOLOv4, Ultralytics released YOLOv5, which did not get the approval of the authors of the original YOLO model. To date, it remains controversial, even though some of the initially provided (incorrect) performance evaluations were later corrected. Even though YOLOv5 trains faster, YOLOv4 can be optimized to achieve higher processing speed. Thus, the Darknet-based YOLOv4 is still the most accurate YOLO version, especially if a computer-vision engineer is in pursuit of state-of-the-art results and is capable of performing additional customization on the model [49]. That is why YOLOv4 was selected as the main YOLO representative for the performance comparison of models presented in Section V.

IV. PROPOSED YOLOv4_AF MODEL
When the detection target's background is highly diversified and there are many kinds of objects in the image, fast identification of the target object is the first problem to be solved. The spatial attention mechanism adopted by YOLOv4 for extracting the feature map information mainly focuses on the most informative part of the image, ignoring the extraction of the whole information during the image input. To enrich this operation, we propose to add a CBAM module to YOLOv4 to realize bidirectional attention on the global and local feature information. The introduction of a CBAM module into the forward propagation (Figure 8) allows the receptive field to be expanded further through channel training during the neural network transmission process and, in combination with different weight coefficients, highlights the part to which the detection network should pay more attention.
In addition, we propose to select high-level features in the FPN part of the PAN utilized by YOLOv4 and to modify it accordingly, in order to improve the detection and classification performance of the proposed YOLOv4_AF model. To achieve this, we replace the residual connections in the network and insert additional neck layers so as to better extract fusion features.
So, the novelty of the proposed model is in improving further the object detection and classification performance of YOLOv4 without deepening the neural network, by applying an appropriate optimization to it, as detailed further in the next subsections.

A. Using a CBAM
For automatically learning the places needing attention in photos / video frames and improving the intensity of the feature expression of each channel, the proposed YOLOv4_AF model utilizes a CBAM module, as shown in Figure 8, which allows increasing the influencing factors of the channel features and balancing the interaction of the three dimensions of length, width, and height.
CBAM combines the channel and spatial dimensions to suppress interference characteristics, compensates for the information lost in the global average pooling, and then incorporates an attention mechanism. First, a channel attention mechanism directly average-pools the information in the channel, with a wide processing range but rough comparisons. Then, a spatial attention mechanism processes the feature maps in the channels during the original image feature extraction stage.
The main CBAM computation is performed as per [6] (cf. Figure 2):

F′ = M_c(F) ⨂ F,
F′′ = M_s(F′) ⨂ F′,

where ⨂ denotes element-wise multiplication, F denotes the input feature map, M_c(F) denotes the channel attention map, F′ denotes the channel-refined feature, M_s(F′) denotes the spatial attention map, and F′′ denotes the final refined output.
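The two refinement steps reduce to element-wise multiplications of the feature map by an attention map. The toy sketch below illustrates this composition with stand-in attention values (the real M_c and M_s are learned; everything here is illustrative only):

```python
def refine(feature, channel_att, spatial_att):
    """Apply CBAM-style refinement: F' = Mc(F) * F, then F'' = Ms(F') * F'.

    feature: dict channel -> 2-D list; channel_att: dict channel -> scalar
    weight; spatial_att: a 2-D map broadcast over channels. All toy stand-ins.
    """
    # channel refinement: scale every channel by its attention weight
    f1 = {c: [[channel_att[c] * v for v in row] for row in fmap]
          for c, fmap in feature.items()}
    # spatial refinement: scale every position by the spatial attention map
    return {c: [[spatial_att[i][j] * v for j, v in enumerate(row)]
                for i, row in enumerate(fmap)]
            for c, fmap in f1.items()}
```

The order matters: channel attention first rescales whole feature maps, and spatial attention then reweights individual positions within the already channel-refined maps.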
In CBAM, the attention operation is performed in the channel dimension first. Given a convolution kernel K = [k_1, k_2, ⋯, k_c, …], where k_c represents its c-th parameter, the global average pooling operation and the global maximum pooling operation can be expressed as follows:

g_avg(k_c) = (1/(W × H)) Σ_{j=1}^{H} Σ_{k=1}^{W} k_c(k, j),
g_max(k_c) = max_{k, j} k_c(k, j),

where W and H denote the length and width of the feature map, respectively, and k_c(k, j) denotes the point on the feature map of size W × H whose transverse and vertical coordinates are k and j, respectively. g_avg(k_c) and g_max(k_c) are inputted into the first fully connected layer for dimension reduction. Then, the result, obtained through a Leaky_ReLU function, is inputted into the next fully connected layer, and the final result is outputted by a sigmoid function. The final output is formed by summing the two corresponding outputs together:

M = M_avg + M_max. (8)

Next, the feature weighting operation is conducted through matrix multiplication and the channel feature F = [F_1, F_2, ⋯, F_c, …] is obtained as follows:

F_c = M_c × K_c. (9)

Further on, F is inputted into the SAM part of CBAM for spatial attention extraction by means of global maximum pooling and global average pooling performed on the feature maps of all channels. Then, the dimensions are reduced through convolution. Finally, the output and input feature dimensions are made consistent for the whole CBAM module.
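The channel-attention pipeline just described (global average and maximum pooling, a shared two-layer network, summation as in (8), a sigmoid, and channel weighting as in (9)) can be sketched in plain Python. The fully connected layers are passed in as plain functions, with the Leaky_ReLU step folded into them, so this is a simplified illustration rather than the trained module:

```python
import math

def channel_attention(feature_maps, fc1, fc2):
    """Toy channel-attention pass over a list of 2-D per-channel maps.

    fc1 and fc2 stand in for the two shared fully connected layers
    (the Leaky_ReLU between them is assumed folded into fc1).
    """
    weights = []
    for fmap in feature_maps:              # one 2-D map per channel
        flat = [v for row in fmap for v in row]
        g_avg = sum(flat) / len(flat)      # global average pooling
        g_max = max(flat)                  # global maximum pooling
        m = fc2(fc1(g_avg)) + fc2(fc1(g_max))   # Eq. (8): M = M_avg + M_max
        weights.append(1 / (1 + math.exp(-m)))  # sigmoid squashing
    # Eq. (9): weight every channel's feature map by its attention value
    return [[[w * v for v in row] for row in fmap]
            for w, fmap in zip(weights, feature_maps)]
```

With zero-output stand-in layers, the sigmoid yields a neutral weight of 0.5 for every channel, which makes the weighting step easy to verify by hand.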

B. Using an FPNi
The stronger the semantic information is, the more beneficial it is for target object recognition. Here, we propose a modification, named FPNi, to the FPN part of the PAN utilized by YOLOv4. In FPNi, the double route is expanded to a triple route with double side connections, so as to improve the performance of object detection and classification through combined operations.
In the FPN part of the PAN utilized by the original YOLOv4 model, the bottom-top CNN and the top-bottom part are connected by the side, as shown in Figure 9. A 1×1 convolution operation is performed on the feature maps obtained by different ResNet convolution groups, and the result is fused with the up-sampled feature map P_{i+1} to obtain a new feature map P_i, which has the same size as the lower-layer feature map (the ⊕ sign in Figure 9 represents the element-wise addition operation of feature vectors). In order to solve the problem of multi-scale variation in YOLOv4, we propose the FPNi structure, shown in Figure 10 (the red rectangles there have the same structure as the red rectangle in Figure 9), which is utilized by the YOLOv4_AF model presented in this paper. In FPNi, additional fusion operations are performed on P_3 and P_4. The flow of feature map information is strengthened by the side linking of high-level semantics, and the extraction of small target features is improved. At the final stage, after the regular 3×3 convolutional operations, an adaptive spatial fusion is performed before prediction. Thanks to the FPNi modification, the overlap effect during the up-sampling process is eliminated, and the outputted feature maps combine the features of different layers and thus carry richer information.
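The side-connection fusion described above (up-sample the higher-level map, then add it element-wise to the lateral map) can be illustrated with a minimal sketch; the 1×1 convolution is replaced by the identity here, so this shows only the data flow, not the learned operation:

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x up-sampling of a 2-D feature map."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in (0, 1)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                   # duplicate each row
    return out

def fuse(lateral, top_down):
    """FPN side connection: identity stand-in for the 1x1 convolution on
    the lateral map, added element-wise to the up-sampled higher-level map."""
    up = upsample2x(top_down)
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(lateral, up)]
```

In the real network the fused map would then pass through further convolutions; FPNi simply applies this kind of fusion at additional points of the pyramid.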
Maximum pooling is adopted during the down-sampling process. The input images are divided into several rectangular sub-areas, and the maximum value of each sub-area is outputted. The pooling layer imitates the human visual structure to reduce dimensions, abstract the data, and reduce the size of the image so as to match the size of the display area. This improves the robustness of the feature map, avoids overfitting, and retains more texture information. Through maximum pooling, the new image dimensions are calculated as follows:

W_2 = (W_1 − f_w)/s + 1,
H_2 = (H_1 − f_h)/s + 1,
D_2 = D_1,

where W_1, H_1, and D_1 denote the width, height, and number of channels of the input image, respectively, f_w and f_h denote the filter's width and height, and s denotes the stride.
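Under these definitions, the output size of a pooling layer follows directly; a small helper illustrating the formula (assuming integer strides and no padding):

```python
def pooled_size(w1, h1, d1, fw, fh, s):
    """Output dimensions after max pooling with an fw x fh window, stride s:
    W2 = (W1 - fw)/s + 1, H2 = (H1 - fh)/s + 1, D2 = D1 (channels kept)."""
    return (w1 - fw) // s + 1, (h1 - fh) // s + 1, d1
```

For example, a 2×2 window with stride 2 halves each spatial dimension while leaving the channel count unchanged.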
With the proposed FPNi design, the FPN operation is no longer limited to one-way information flow, and high-level information with strong semantics can flow down through side-to-side connections, thus achieving better results than the original FPN design utilized by YOLOv4. However, the FPNi design increases the computational effort of the proposed YOLOv4_AF model, compared to that of YOLOv4, which has a negative impact on the model's operational speed.

V. EXPERIMENTS

A. DATA SETS
Two public data sets, BIT-Vehicle [50] and UA-DETRAC [51], were used in the experiments. BIT-Vehicle consists of 9,850 high-quality photos containing frontal views of vehicles, collected under different conditions, such as bad weather, poor lighting, background confusion, etc. Six classes of vehicles are included in the data set, namely Bus, Microbus, Minivan, SUV, Sedan, and Truck. The spatial resolution of the photos is either 1920×1080 or 1600×1200 pixels. Each photo contains the image of one or more vehicles. In total, the photos contain images of 558 buses, 883 microbuses, 476 minivans, 1,392 SUVs, 5,919 sedans, and 823 trucks. Some sample photos are shown in Figure 11.
The UA-DETRAC data set includes 10 hours of road-monitoring videos, recorded at 24 different locations in Beijing and Tianjin, China, totaling more than 140,000 video frames with a resolution of 960×540 pixels and containing 8,250 (different) vehicles divided into four classes: Bus, Van, Car, and Others. As most of the consecutive video frames contain the same vehicles passing through the cameras, only 20,000 frames, containing as many different vehicles as possible, were used in the experiments.
A five-fold cross-validation was used. Accordingly, each data set was randomly split into five folds of equal size, each containing 1,970 photos for BIT-Vehicle and 4,000 video frames for UA-DETRAC. In each iteration, one of the five folds was used as a test set, whereas the remaining four folds were used for training the models. Five iterations were conducted in total on each data set to ensure that each fold was tested. The obtained (averaged) results are reported in Subsection V.C.
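The five-fold protocol can be sketched as follows (an illustrative implementation of the described splitting, not the authors' exact code):

```python
import random

def five_fold_splits(items, seed=0):
    """Shuffle the data set once, cut it into five equal folds, and yield
    (train, test) pairs in which each fold serves as the test set once."""
    rng = random.Random(seed)      # fixed seed for reproducible folds
    shuffled = items[:]
    rng.shuffle(shuffled)
    size = len(shuffled) // 5
    folds = [shuffled[i * size:(i + 1) * size] for i in range(5)]
    for i in range(5):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

Each of the five (train, test) pairs covers the whole data set, with an 80/20 split per iteration, and the per-fold metrics are averaged afterwards.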

B. METRICS
In the conducted experiments, the performance of the proposed YOLOv4_AF model was evaluated and compared to that of three state-of-the-art models, namely Faster R-CNN, YOLOv4, and EfficientDet, based on precision and recall, which are commonly used metrics for detection and classification problems. Precision indicates the proportion of true positive (TP) samples in the prediction results, whereas recall indicates the proportion of correct predictions among all positive samples, as follows:

Precision = TP / (TP + FP),
Recall = TP / (TP + FN),

where TP represents the number of samples that are actually positive and are classified as positive, FP (false positive) represents the number of samples that are actually negative but are classified as positive, and FN (false negative) represents the number of samples that are actually positive but are classified as negative.
For the performance evaluation of the compared models, F1 score and mean average precision were used as the main metrics because they take into account both precision and recall.
The F1 score is defined as follows:

F1 = (2 × Precision × Recall) / (Precision + Recall). (15)

The mean average precision (mAP) is a quantitative metric for evaluating the effectiveness of multi-class object detection [12]. It can be calculated as follows:

mAP = (1/N) Σ_{i=1}^{N} AP_i, (16)

where AP_i denotes the average precision of class i and N denotes the total number of classes. The average precision (AP) corresponds to the area under the precision-recall curve, i.e.:

AP = ∫_0^1 P(R) dR, (17)

where P(R) denotes the precision as a function of the recall.
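The metrics above can be illustrated in a few lines of Python (the counts and per-class AP values below are hypothetical inputs; the AP values themselves would come from the precision-recall curve integration of (17)):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and the F1 score of Eq. (15), from raw counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

def mean_average_precision(ap_per_class):
    """mAP as in Eq. (16): the mean of the per-class average precisions."""
    return sum(ap_per_class) / len(ap_per_class)
```

For instance, 8 true positives with 2 false positives and 2 false negatives give a precision, recall, and F1 of 0.8 each, and class-wise APs of 0.5, 0.7, and 0.9 give an mAP of 0.7.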

C. RESULTS
First, the BIT-Vehicle data set was used in the experiments. Precision-recall curves were created for each of the compared models and each of the six vehicle classes, based on the recall and precision values obtained in each fold. Then, these curves were used to calculate the average precision (AP) of each model for each class, separately for each fold, based on (17), as shown in Tables I-IV. Finally, in order to compare the overall performance of the models across all vehicle classes, the mean average precision (mAP) values were calculated, based on (16), separately for each fold, and then averaged to obtain the final mAP result for each model, presented in Table V. These results confirm that the proposed YOLOv4_AF model outperforms, in terms of mAP, all three state-of-the-art models used in the comparison: Faster R-CNN is outperformed the most (by 9.96 points), YOLOv4 is in the middle (outperformed by 5.65 points), and EfficientDet is outperformed the least (by only 0.25 points).

Then, the other metric, the F1 score, was used for the performance comparison of the models. The obtained results, presented in Table VI, confirm that the proposed YOLOv4_AF model outperforms the three state-of-the-art models on this metric too. More precisely, Faster R-CNN is outperformed by 0.102 points, YOLOv4 by 0.007 points, and EfficientDet by 0.050 points.

Next, the UA-DETRAC data set was used for the comparison of the models. First, the average precision (AP) of each model for each vehicle class was calculated, as presented in Tables VII-X. Then, the mean average precision (mAP) values were calculated, as shown in Table XI. The obtained results confirm that the proposed YOLOv4_AF model outperforms, in terms of mAP, all three state-of-the-art models on this data set too. More specifically, Faster R-CNN and YOLOv4 are outperformed by a similar margin of 4.30 and 3.87 points, respectively, whereas EfficientDet is outperformed by only 0.74 points.
Finally, the F1 score values were calculated, as presented in Table XII. These results also confirm that the proposed YOLOv4_AF model outperforms all three state-of-the-art models on this metric as well. More specifically, Faster R-CNN and YOLOv4 are outperformed by a similar margin of 0.059 and 0.057 points, respectively, whereas EfficientDet is outperformed by only 0.018 points.
Although the proposed YOLOv4_AF model outperforms all three state-of-the-art models on both metrics and both data sets, its classification of some vehicles was not completely accurate. For some photos (video frames) with high similarity, the classification result was slightly biased, resulting in lower confidence scores.

VI. CONCLUSION
In this paper, a more precise vehicle detection and classification model, based on YOLOv4 with additional optimization applied, has been presented. The main idea incorporated into it is to increase the receptive field in both the channel and spatial dimensions by introducing an attention mechanism in the form of a CBAM module. In addition, in the FPN part, the feature fusion is modified and an additional up-sampling operation is performed. Then, the output features are fused again, and the detection results of different layers are combined to improve the detection performance of the proposed model, called YOLOv4_AF. The performance of this model was experimentally evaluated and compared to that of the original YOLOv4 model and two other state-of-the-art object detection models, Faster R-CNN and EfficientDet, based on two public data sets: BIT-Vehicle and UA-DETRAC. The obtained results clearly demonstrate that the proposed YOLOv4_AF model outperforms all three state-of-the-art models used in the performance comparison, in terms of the mean average precision (mAP) and F1 score, on both data sets. The elaborated model could also be used for detecting other types of objects and can pave the way for improving regression algorithms in general. However, due to the introduction of the CBAM module, the calculation complexity and time increase compared to the original YOLOv4 model. In the future, we plan to carry out tracking of moving objects, conduct traffic statistics, and study the detection and classification ability of the proposed model on other objects.