T-YOLO: Tiny vehicle detection based on YOLO and multi-scale convolutional neural networks

To solve real-life problems for different smart city applications, such as parking occupancy detection, deep neural networks require fine-tuning. For large parking lots, it is desirable to use a cenital-plane (overhead) camera located at a high distance, so that the entire parking area can be monitored with only one camera. Today's most popular object detection models, such as YOLO, achieve good precision scores at real-time speed. However, when working with our own data, which differs from general-purpose datasets such as COCO and ImageNet, there is a large margin for improvement. In this paper, we propose a modified, yet lightweight, deep object detection model based on the YOLO-v5 architecture. The proposed model can detect large, small, and tiny objects. Specifically, we propose a multi-scale mechanism to learn deep discriminative feature representations at different scales and automatically determine the most suitable scales for detecting objects in a scene (in our case, vehicles). The proposed multi-scale module reduces the number of trainable parameters compared to the original YOLO-v5 architecture: as shown in the experiments, the parameter count drops slightly from 7.28 million in the YOLO-v5-S profile to 7.26 million in our model. The experimental results also demonstrate that precision improves by a large margin. In addition, our model infers at 30 fps, exceeding the speed of the YOLO-v5-L/X profiles, and tiny vehicle detection performance improves significantly, by 33%, compared to the YOLO-v5-X profile.


I. INTRODUCTION
The ever-increasing city population has reached a point where the management of city resources has become a critical problem for large cities. In fact, to address this management of resources, the concept of a smart city has been coined for city resource data exploitation [1]. One of the biggest challenges of large cities is the improvement of the driving experience [2]: traffic control, surveillance, or parking guidance, all of which can help improve mobility in these cities. Following this objective, one of the most time-consuming tasks for drivers is finding a parking spot. A driver travels extra kilometres per year to find an available slot, which has a direct impact not only on the driver's time but also on environmental pollution [3]. In addition, in large parking lots, where the most desirable spots [4] are usually concentrated, this search yields inefficient traffic, which again contributes to worsening the problem.
This problem has traditionally been managed using a sensor in each parking spot to detect occupancy. However, the battery life of magnetometer-based parking sensors decreases rapidly as accuracy requirements increase. In addition, modern vehicles often do not have ferromagnetic parts. Progress in computer vision and deep learning now makes it possible to use smart cameras to monitor several parking spots, providing much cheaper solutions to the parking occupancy problem. In fact, the problem of finding available parking spots given an image has been addressed in several studies [5]-[8]. However, these techniques cannot be generalized, and even adapting a specific solution to a different parking lot is not possible in many cases. Thus, vacant parking space detection based only on visual information remains a challenge for researchers. Most of these solutions take the simpler approach of determining the occupancy of each parking spot through spot classification [9]-[11], instead of localizing vehicles with object detection techniques and determining whether their positions lie over parking spots [5], [7], [8]. Although classification approaches can achieve good precision scores, they cannot extract additional information about the cars (e.g., road congestion, human interactions with cars, a single car occupying two spots, etc.). Only with a vehicle detection approach can we obtain information that a parking spot classification approach cannot provide.
To obtain more reliable vacant parking space detection, the proposed solutions should be adaptable to different vehicle types (e.g., cars, motorbikes, etc.). They should also be capable of determining the locations of these vehicles in the parking lot, as well as spot occupancy. Furthermore, these techniques should handle different vehicle sizes and different fields of view, and be robust to occlusions and lighting changes. All of this would facilitate the development of intelligent vehicle tracking systems.
Vehicle detection can be readily addressed using current state-of-the-art deep CNN models, and vehicle detection methods have been in development for several years in academia and industry. The main challenges for vehicle detection are large variations in lighting, dense occlusion, and large variations in object scale. Most object detection models are trained on general-purpose datasets, such as COCO [12], ImageNet [13], and VOC [14], where vehicle (i.e., car) images are usually taken from a lateral or frontal view instead of a cenital plane. This makes the models very good for general-purpose images, but they cannot be easily generalized to parking solutions.
An ideal solution for a parking occupancy system should be easily installed in the camera itself. Thus, lightweight models are preferable over large models because they can be easily integrated into embedded systems with limited memory and computation. Although lightweight models usually have lower precision, they offer good inference speed, which gives the model the real-time capability that is highly desirable for extracting more information from the cars and the parking lot. Object detection models are usually classified into two families: one-stage and two-stage. The two-stage family is that of regions with convolutional neural networks (the R-CNN family [15]); in turn, SSD [16] and YOLO [17]-[20] belong to the one-stage family. The R-CNN family is a region-based detector that includes two stages. First, the model suggests a set of regions of interest (ROIs) using a region proposal network; because the potential bounding-box candidates can be infinite, the proposed regions are sparse. Next, the candidate regions are processed by a classifier. In turn, the one-stage family skips the region proposal stage and directly runs detection over a dense sampling of possible locations. This yields a faster and simpler detection process but may slightly reduce performance. Thus, we propose the use of a one-stage detector as a baseline for the proposed vehicle detector. Consequently, this work aims to develop real-time and lightweight vehicle detection across different vehicle scales, based on advanced convolutional neural networks (CNNs), for large parking lots observed by a camera with a cenital-plane view covering many parking spots. This camera configuration is highly desirable because locating the camera at a high vantage point with a proper view of the parking lot allows monitoring the entire lot with just one camera (as shown in Fig. 1).
Moreover, using such a system will have the capability to extract extra information about the detected vehicles, such as color, brand, trajectory, in future upgrades, if desired. Such a system will feed more data into the smart city environment and allow more advanced data exploitation.
In this work, we present a deep learning model to improve object detection for small and tiny objects, specifically cars. Our contributions are:
• Introducing a new first layer into a state-of-the-art model, YOLO-v5, used to extract discriminative features at different scales. We propose replacing the Focus layer of YOLO-v5 with a multi-scale layer based on the Efficient Neural Network (ENet) [21]. The introduction of this new layer into YOLO-v5 significantly outperforms the baseline model.
• To avoid redundant low-level features in the YOLO backbone, we assess the impact of spatial and channel attention modules on tiny object detection performance. The proposed attention modules can capture meaningful "where" and "what" information for tiny object detection. Although their impact is not as large as that of the new multi-scale layer, the attention modules improve the performance of the proposed model by a small margin.
The rest of this paper is structured as follows: Section II presents the proposed methodology. In Section III, different experiments are performed and discussed. Finally, Section IV concludes the study.

II. METHOD
The YOLO network has the advantage of being much faster than other networks of the one-stage family. Moreover, it achieves results comparable to the state of the art while maintaining accuracy, and its predictions depend on the global context of the input image. Consequently, our proposed model uses the YOLO architecture as a baseline.
The architecture of the YOLO network contains many interconnected layers. According to the operations performed at each step, the YOLO-v5 network can be summarized into three sections. The first section, the backbone (called CSPDarknet [22]), is composed, in the case of YOLO-v5, of the most common CNN operations (e.g., convolutions, concatenations, max-pooling) and a simple forwarding mechanism constructed to extract multiple features for the next section. The backbone is a common and long-standing concept in deep learning object detection networks, where it is used as a simple base network; for instance, SSD [16] uses the common VGG-16 network [23] as its backbone. The CSPDarknet network gives the YOLO model sufficient capacity to learn the complex features of the input images. In addition, CSPDarknet copes with the problem of repeated gradient information in large-scale backbones. It also integrates the gradient changes into the feature map, yielding a significant reduction in trainable parameters and floating-point operations per second, which increases the inference speed and accuracy.
The next section, the neck, is responsible for aggregating and mixing the different features computed across all the backbone convolutions and preparing them to feed the head section. YOLO-v5 applies an improved PANet [24], named the bi-directional feature pyramid network (Bi-FPN), as its neck to allow easy and fast multi-scale feature fusion. Bi-FPN introduces learnable weights, enabling the network to learn the importance of different input features, and repeatedly applies top-down and bottom-up multi-scale feature fusion.
Finally, the head section is composed of convolutional layers for bounding-box and class predictions. YOLO-v5 integrates a compound scaling method that uniformly scales the resolution, depth, and width of the backbone, feature network, and box/class prediction networks at the same time, which ensures maximum accuracy and efficiency under limited computing resources.
In general, most detectors fail to detect tiny objects properly. The results in Section III show that the precision of the YOLO network on tiny objects is very low compared to the COCO-based benchmark. Thus, in this paper, we adapt the YOLO-v5 model to be more efficient with small and tiny objects (i.e., vehicles).
In the last released version of YOLO [20], the Focus layer is one of the important modules of YOLO-v5. As shown in Fig. 6-(a), the Focus layer takes the input image (e.g., 3 × 256 × 256) and slices it into four copies by sampling with a step size of 2 (i.e., 3 × 128 × 128 each). The four slices are then concatenated in depth into a 12 × 128 × 128 tensor and passed to a convolutional layer with 32 kernel filters, generating a 32 × 128 × 128 output; the result is fed to the next convolutional layer through batch normalization and a ReLU activation function.
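The slicing operation just described amounts to a space-to-depth rearrangement, which can be sketched in a few lines. The following is a minimal single-image NumPy illustration, not the actual YOLO-v5 implementation (which operates on batched tensors and is followed by the 32-filter convolution):

```python
import numpy as np

def focus_slice(x):
    """Space-to-depth slicing as performed by YOLO-v5's Focus layer.

    x: array of shape (C, H, W). Returns (4C, H/2, W/2) by stacking the
    four stride-2 samplings of the spatial grid along the channel axis.
    """
    return np.concatenate(
        [x[:, 0::2, 0::2],   # even rows, even columns
         x[:, 1::2, 0::2],   # odd rows, even columns
         x[:, 0::2, 1::2],   # even rows, odd columns
         x[:, 1::2, 1::2]],  # odd rows, odd columns
        axis=0,
    )

img = np.random.rand(3, 256, 256)
out = focus_slice(img)
print(out.shape)  # (12, 128, 128)
```

Note that the rearrangement is lossless: every input pixel survives, only relocated from the spatial grid into the channel dimension.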
Based on our experiments with the pre-trained YOLO-v5 models, we suspect that the Focus layer is not capable of properly extracting spatial information for tiny objects, which degrades model performance. Thus, we propose two different mechanisms to enhance spatial and channel information extraction for tiny object detection: 1) the first mechanism uses channel and position attention modules [25], as shown in Fig. 6(b and d), respectively; 2) the second mechanism substitutes the Focus layer with a multi-scale module (MSM), as shown in Fig. 6(c, d, and e).
To stack the RGB information, the Focus layer shown in Fig. 4 splits the image and translates spatial information into depth features. Although it is a quick operation that greatly reduces inference time by fully utilizing GPU operations, it was created to translate spatial information into depth information [26] by simply stacking the quarters of the image. However, the Focus layer does not properly represent small and tiny spatial features. A good way to enhance tiny spatial features is to use multi-scale convolutional operations, such as those proposed in [21], [27]. By up-sampling the input image rather than down-sampling it, the deep network can better adapt its anchors, even to very tiny objects. The MSM shown in Fig. 3 is composed of three branches, which up-sample the input image to multiple scales (in this work, x1, x2, and x4) using bilinear interpolation; each scaled image is then fed to an initial block of ENet [21] (Fig. 2). The ENet model can be run on embedded boards because it is a very light model, which makes it well suited to mobile robotics systems. ENet's initial block concatenates two parallel operations: the first is a 3 × 3 convolution with 13 filters and stride 2, and the second is a max-pooling operation on the input image. The concatenation of the two branches results in a tensor of 16 feature maps. Indeed, in most deep learning models, pooling is performed after a convolution to increase the feature-map depth; however, this is computationally expensive. Therefore, as proposed in [21], we chose to perform the pooling operation in parallel with the convolution and concatenate the resulting feature maps. This technique allowed us to speed up the inference time of the initial block by a factor of 10. To reduce the computational cost of 2-D convolution filters, we used factorized 1-D kernels, that is, a 1 × k convolution followed by a k × 1 convolution.
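The initial block just described can be sketched as follows. This is a minimal, untrained NumPy illustration of the parallel convolution/pooling design and the resulting 16 feature maps; it omits the batch normalization, activation, and the factorized 1-D kernels discussed above, and the random weights exist only to make the shape arithmetic concrete:

```python
import numpy as np

def initial_block(x, n_filters=13, seed=0):
    """ENet-style initial block sketch: a 3x3, stride-2 convolution with 13
    filters runs in parallel with 2x2 max-pooling, and the two outputs are
    concatenated in depth (13 + 3 = 16 maps for an RGB input)."""
    c, h, w = x.shape
    # random (untrained) convolution weights, for shape illustration only
    wgt = np.random.default_rng(seed).standard_normal((n_filters, c, 3, 3))
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))          # zero padding of 1
    conv = np.empty((n_filters, h // 2, w // 2))
    for i in range(h // 2):                            # stride-2 convolution
        for j in range(w // 2):
            patch = xp[:, 2 * i:2 * i + 3, 2 * j:2 * j + 3]
            conv[:, i, j] = np.tensordot(wgt, patch, axes=3)
    pool = x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))  # 2x2 max-pool
    return np.concatenate([conv, pool], axis=0)        # depth concatenation

out = initial_block(np.random.rand(3, 32, 32))
print(out.shape)  # (16, 16, 16)
```

Because the convolution and the pooling run on the input in parallel rather than sequentially, the block halves the spatial resolution while producing its 16 maps in a single pass, which is the source of the speed-up claimed above.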
The neuron's receptive field, that is, the patch of the total field of view (k = 3, 5, 7), is defined separately for each scale. Consequently, similar to the Focus layer, this adaptation prevents the trainable parameters of YOLO-v5 from growing [20], while giving the first layer the potential to learn features at multiple scales. The resulting tensor of each branch is down-sampled again to the x1 scale (i.e., the original size of the input image), and the three branches are then merged using a combination operation, which computes the average activation of the corresponding units in each branch. The motivation behind this choice is to up-sample the input with a trainable layer while keeping the layer parameters to a minimum. By scaling the image by several factors, the method enhances the spatial information of tiny objects while maintaining a relatively low number of parameters; in this way, the network should be able to locate features on the enlarged objects.
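The branch-scale-merge logic of the MSM can be sketched as follows. This is a shape-level illustration under our reading of the paper: nearest-neighbor resampling stands in for the bilinear interpolation, a toy stride-2 subsampling stands in for the per-branch initial block, and every branch is brought back to a common resolution before the average merge:

```python
import numpy as np

def upsample(x, s):
    """Nearest-neighbor up-sampling by integer factor s (the paper uses
    bilinear interpolation)."""
    return x.repeat(s, axis=1).repeat(s, axis=2)

def downsample_to(x, h, w):
    """Strided subsampling back to a target resolution (stand-in for the
    bilinear down-sampling to the x1 scale)."""
    _, H, W = x.shape
    return x[:, ::H // h, ::W // w]

def msm(x, branch, scales=(1, 2, 4)):
    """Multi-scale module sketch: up-sample the input to each scale, apply
    the same per-branch block, resample every branch to the x1 branch's
    output resolution, and average-merge the results."""
    outs, ref = [], None
    for s in scales:
        y = branch(upsample(x, s))
        if ref is None:
            ref = y.shape[1:]          # x1 branch defines the merge size
        outs.append(downsample_to(y, *ref))
    return np.mean(outs, axis=0)       # average activation across branches

toy_branch = lambda z: z[:, ::2, ::2]  # stand-in for the 16-map initial block
out = msm(np.random.rand(3, 8, 8), toy_branch)
print(out.shape)  # (3, 4, 4)
```

The average merge keeps the module parameter-free apart from the per-branch convolutions, which is what keeps the total parameter count at the level reported for the Focus layer it replaces.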
However, the multi-scale approach may tend to feed redundant information into the low-level backbone features. Furthermore, the contextual information of the MSM branches may differ, degrading the performance of pixel-wise recognition. To overcome these problems and refine the input features, we propose spatial-channel attention modules (SCAM) [25], shown in Fig. 5, a framework that addresses the weaknesses of the multi-scale approach in the detection task. On the one hand, the spatial attention module (SAM) exploits the inter-spatial relationship of features (in our case, vehicle or no vehicle). Unlike channel attention, spatial attention focuses on 'where' the informative parts are. To compute the spatial attention, the original feature map is reinterpreted with two different convolution matrices, and reshape and transpose operations are applied to the results so that the two matrices can be multiplied, extracting an N × N (HW × HW) spatial attention matrix that indicates the positions of the image where the information is high. This matrix is combined, via an element-wise sum, with another convolutional feature map derived from the original one to obtain a spatially weighted version of the original feature map. On the other hand, the channel attention module (CAM) exploits the inter-channel relationship of features. Since each channel of a feature map is considered a feature detector, channel attention focuses on 'what' is meaningful given an input image. To achieve this, CAM uses an approach similar to SAM's: instead of obtaining an HW × HW matrix, we obtain a C × C matrix, thus focusing on which elements are important rather than where the information is. The fused feature representation of the SCAM is obtained by adding the space-wise representation of the SAM and the channel-wise representation of the CAM.
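The channel attention computation can be made concrete with a minimal sketch. This is our reading of CAM reduced to its core affinity product: real modules add learned convolutional projections before the matrix multiplication, which we omit here, and a SAM is analogous but builds the HW × HW matrix over positions ('where') instead of the C × C matrix over channels ('what'):

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, numerically stabilized."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def channel_attention(x):
    """CAM sketch: a C x C affinity matrix over the flattened channels
    reweights the channel responses, and the result is summed back onto
    the input for residual-style refinement."""
    c, h, w = x.shape
    f = x.reshape(c, -1)             # C x HW flattened feature map
    attn = softmax(f @ f.T)          # C x C channel affinity ('what')
    return x + (attn @ f).reshape(c, h, w)

feat = np.random.rand(8, 16, 16)
print(channel_attention(feat).shape)  # (8, 16, 16)
```

Because the refinement is additive, the module can only emphasize or de-emphasize existing responses, which matches its role here as a corrective on top of the MSM features rather than a new feature extractor.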
Every spatial attention map is summed back to the channel-attention tuned feature maps for adaptive feature refinement. The attention mechanism can be directly adapted to any feature representation problem, and it encourages the network to capture rich contextual relationships for better feature representations. In this study, we take advantage of this finding to create an SCAM that collects context information from all pixels to adaptively recalibrate the spatial and channel responses of the objects (i.e., vehicles) in a resulting convolutional feature map.
For the YOLO-v5 adaptation, we evaluated different variations, which are summarized in Table 1.

III. EXPERIMENTS

1) Dataset
PKLot (https://web.inf.ufpr.br/vri/databases/parkinglot-database/) is the dataset used to train and validate the proposed model. The dataset is organized in a folder structure based on parking location, meteorological conditions, and specific days. From all the images, we only used the subset taken from the cenital plane (PUCPR). Its 4474 images contain almost 100 parking spots tagged with bounding-box locations and occupancy, for a total of 424269 tagged spots. These spots were randomly split 80%/20% into the training and testing sets, respectively. To prepare the dataset for training and validating the proposed model, we applied the following procedure:
• Since each occupied parking spot is used by a car and the car lies within the spot, we used the occupied-spot annotations as the localization (i.e., bounding box) and the class (i.e., vehicle or no vehicle).
• Since more cars are present in the images than those tagged, we applied a mask to include only the area where the tagged cars are found and exclude non-tagged cars. In short, we applied a region of interest covering our monitored/tagged area.
• We adapted and translated the PKLot annotation format to the COCO annotation format so that it could be used easily with the YOLO-v5 model.
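The annotation-translation step can be illustrated with a hypothetical helper that turns one tagged spot into a normalized box line of the kind the YOLO-v5 code base consumes. The function name and input shape are our own; the real conversion must first parse PKLot's XML files (e.g., with xml.etree), and the paper targets COCO-format annotations, but the box arithmetic is the same:

```python
def spot_to_yolo(points, img_w, img_h, cls=0):
    """Convert one PKLot spot contour (list of (x, y) corner points, in
    pixels) into a normalized YOLO-style line 'cls cx cy w h'.
    Hypothetical helper for illustration only."""
    xs, ys = zip(*points)
    x1, x2 = min(xs), max(xs)          # axis-aligned box around the contour
    y1, y2 = min(ys), max(ys)
    cx = (x1 + x2) / 2 / img_w         # box center, normalized to [0, 1]
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w              # box size, normalized to [0, 1]
    h = (y2 - y1) / img_h
    return f"{cls} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

line = spot_to_yolo([(0, 0), (100, 0), (100, 50), (0, 50)], 200, 100)
print(line)  # 0 0.250000 0.250000 0.500000 0.500000
```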

2) Evaluation Metrics
Several metrics have been used to assess the performance of deep-learning detection models. Precision (P) is the proportion of true positives among all positive detections:

P = TP / (TP + FP)

Recall (R) is the proportion of detected positives among all ground-truth positives:

R = TP / (TP + FN)

Here, mAP.5 and mAP.95 represent the mean Average Precision of all detections with an Intersection over Union (IoU) of 50% and 95%, respectively, where the IoU is the intersection of the two bounding boxes (detected and ground truth) normalized by their union. The Average Precision (AP) is then computed from the detections of a given class with an IoU greater than 50% or 95%. Finally, the mean Average Precision (mAP) is computed as the average over all classes.
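These definitions can be checked with a short sketch covering the IoU and per-threshold precision/recall only; a full mAP computation additionally ranks detections by confidence and integrates the precision-recall curve, which we omit here:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN), as defined above."""
    return tp / (tp + fp), tp / (tp + fn)

print(round(iou((0, 0, 2, 2), (1, 1, 3, 3)), 4))  # 0.1429 (= 1/7)
print(precision_recall(8, 2, 2))                  # (0.8, 0.8)
```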
In addition, other validation errors were calculated for the bounding boxes (Box), objectness (Obj), and classification (Cls), as shown in Table 2. The Box error is computed using the IoU, i.e., the intersection of the predicted and ground-truth boxes normalized by their union. The Obj error is the objectness score, which measures the likelihood that a specific bounding box contains an object. The Cls error corresponds to a multi-classification score. The Obj and Cls errors are computed using the Focal loss function, an extension of the cross-entropy loss that down-weights easy examples and focuses training on hard negatives.
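As an illustration of the Focal loss just mentioned, its binary form can be sketched as follows (γ = 2 and α = 0.25 are the defaults from the original Focal loss formulation; the exact values used by the YOLO-v5 code base may differ):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: cross-entropy down-weighted by (1 - p_t)^gamma.

    p: predicted probabilities of the positive class; y: labels in {0, 1}.
    With gamma = 0 and alpha = 0.5 this reduces to half the plain
    cross-entropy, so gamma directly controls how strongly easy
    (well-classified) examples are suppressed.
    """
    p_t = np.where(y == 1, p, 1 - p)            # prob. of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))

p, y = np.array([0.9]), np.array([1.0])         # a well-classified positive
print(focal_loss(p, y, gamma=0.0, alpha=1.0))   # plain cross-entropy term
print(focal_loss(p, y, gamma=2.0, alpha=1.0))   # strongly down-weighted
```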
Furthermore, other metrics capture the efficiency of the model, such as the inference speed, normally measured in frames per second (FPS), and the number of parameters, which is normally a good indicator of model complexity.

3) Data Augmentation
The dataset images were augmented using the YOLO-v5 standard augmentation (as configured in hyp.scratch.yaml), following different techniques: mosaic/mixup [28], letterbox, perspective, HSV color space, and flipping (up-down, right-left). No changes were made to the augmentation, in order to avoid any bias in the results and to assess the implications of the MSM and attention modules independently of the data-augmentation methods.

4) Implementation
The experiments were conducted on an Ubuntu 18.04 system with an Nvidia 2080-Ti GPU and the PyTorch library (https://pytorch.org/). We used the publicly available PyTorch-based implementation of YOLO-v5 (https://github.com/ultralytics/yolov5) as a baseline. This baseline also implies using the same hyperparameters and configurations as YOLO-v5 (as in hyp.scratch.yaml).

5) Results and Analysis
First, to show the effects of the different improvements over the baseline YOLO-v5 model, we performed an ablation study on our proposed model with different variations. In this ablation study, we analyzed the effect of adding the two mechanisms, multi-scaling and attention modules, on the performance of the baseline YOLO-v5 model. In Table 1, we present the variations of the proposed network: we changed the proposed MSM architecture by adding an attention network to the multi-scale branches or to the backbone, in addition to using different types of pooling. First, we assessed the baseline YOLO-v5 model by validating the PKLot dataset on both a COCO-trained YOLO-v5 model and a PKLot fine-tuned YOLO-v5 model. In these first two experiments, no modifications were made to the model, as shown in Fig. 6-a. We used four profiles of the YOLO-v5 network: small (YOLO-v5s), medium (YOLO-v5m), large (YOLO-v5l), and extreme (YOLO-v5x). Note that among the four profiles, the s profile is the fastest and smallest model, and the x profile is the largest and slowest.
As shown in Table 2, and as expected, the COCO-trained YOLO-v5s and YOLO-v5m models yielded the worst results in terms of R and mAP. However, the fine-tuned YOLO-v5 models achieved much better performance in terms of R, mAP.5, and mAP.95 compared to the COCO-trained baseline models. The R and mAP values improve drastically, from R = 0.261 and mAP.5 = 0.562 with the COCO-trained YOLO-v5m model to R = 0.995 and mAP.5 = 0.9938 with the fine-tuned YOLO-v5s model. However, the P value was significantly reduced, by 10%. We can explain this reduction in precision, P, against the significant improvement in the R and mAP values, by the COCO-trained model flooding the image with target candidates. Qualitatively, we can see the differences between the COCO-trained YOLO-v5 model in Fig. 7 and the fine-tuned YOLO-v5 model in Fig. 8. Not only does the COCO-trained model miss several cars in the input images, but it also misclassifies them, detecting cell phones (blue) instead of cars (orange). The fine-tuned YOLO-v5 model correctly classifies the objects in the input images, although we still observe some false positives (i.e., outside the region of interest of the monitored area).
Second, we added the SCAM blocks to the baseline YOLO-v5 model to assess the performance of this adapted variation. To minimize the parameters of the model, we used YOLO-v5s (s profile). In particular, we tested three different attention modules between the backbone and neck sections of YOLO-v5: the spatial attention module (Yolov5_SAM_backbone), the channel attention module (Yolov5_CAM_backbone), and the channel-spatial attention module (Yolov5_SCAM_backbone). As shown in Table 2, adding the SAM block gained only a marginal improvement (+0.1% to +0.4%) in Recall compared to the fine-tuned Yolov5s model, with a reduction of 1% in the precision values. Thus, adding SAM, CAM, or even SCAM does not yield a significant improvement in true-positive detection compared to the fine-tuned YOLO-v5 models. Third, we replaced the Focus layer of YOLO-v5 with the MSM layer (named Yolov5_MSM, Fig. 6-b). As shown in Table 2, there is a significant improvement in the precision values, outperforming the Focus layer from 0.6273 for Yolov5s to 0.9203 for Yolov5s_MSM. Although slight, there is also a small improvement in the other performance values of R, mAP, Box, and Obj, except for the Cls loss. Qualitatively, Fig. 9 shows the same example of the PKLot dataset shown in Fig. 7 with a much better detection rate: adding the MSM block enables YOLO-v5 to properly detect and classify the cars in the images. Fig. 9 also shows the ability of the Yolov5_MSM model to localize and classify the cars even in a crowded scenario. Indeed, substituting the Focus layer with an MSM in the YOLO-v5 model provides an efficient feature representation for the tiny cars (objects) present in PKLot.
After the replacement, we added the SAM, CAM, and SCAM blocks to the MSM blocks to observe whether we could benefit from the marginal improvement of SCAM. The Yolov5s_MSM model was modified in two different ways. First, we added attention blocks, as in the second experiment (Fig. 6-c), between the backbone and neck, using the SAM (Yolov5_MSM&SAM_backbone), CAM (Yolov5_MSM&CAM_backbone), and channel-spatial (Yolov5_MSM&SCAM_backbone) modules. In addition, we integrated the attention blocks (i.e., SAM, CAM, and SCAM) into each branch of the MSM (Fig. 6-d) for spatial (Yolov5_MSM&SAM), channel (Yolov5_MSM&CAM), and channel-spatial (Yolov5_MSM&SCAM) attention. The combination of the MSM and attention modules yields an improvement of +3% to +4% in the precision values. Among the six variations, the Yolov5_MSM&CAM_backbone model yielded the best precision value, with an improvement of more than 4.31% compared to Yolov5s_MSM. However, Yolov5_MSM&CAM yields results comparable to Yolov5_MSM&CAM_backbone, with an improvement of +4.0% in the precision values. Thus, we can note that the MSM helps YOLO-v5 find discriminative spatial features, and the CAM/SCAM attention module helps the network enrich the extracted features with discriminative multi-channel features. Again, the variations in the other values (R, mAP, and the different errors) are marginal and likely not significant. A comparison between the baseline YOLO-v5 model and its variations is shown in Fig. 10, which clarifies the effect of the proposed adaptation on the detection rate.
Besides, Fig. 10 shows that the integration of the MSM and CAM into the YOLO-v5 model helps maximize the overlap of the predicted versus actual bounding boxes at IoU > 0.95 for tiny car detection.
However, in order to fully exploit the dataset knowledge, adapt our model to it, and obtain the best model for this application, we suspect that a multi-scale module with three branches might not be the most efficient network for this dataset. The size of the cars in the dataset, although tiny, barely changes, so it might be better for this application to use an MSM with just one or two branches. For this, starting from the three MSM branches (x1, x2, and x4 being the scales used in the branches), we experimented with combinations of fewer branches (e.g., the single-branch Yolov5_SM_x2). Moreover, the use of a pooling layer in the MSM is debatable: we use only 13 convolution features to keep the layer light, plus three more features from pooling. With such a low number of convolution features to extract spatial information, one may think that pooling has no effect, or even a detrimental one. To determine whether pooling actually affects the performance of the model, and considering that Yolov5_MSM uses average (avg) pooling, we retrained the model without attention modules 1) without any pooling and with 16 convolution features (Yolov5_MSM_NoPool), 2) using max pooling instead of avg pooling (Yolov5_MSM_MPool), and 3) using 10 convolution features and six features from pooling (max and avg pooling) (Yolov5_MSM_MAPool). The results show a marginal improvement of 1-2% when no pooling is used. Finally, we checked whether this marginal improvement could further improve our best models by testing max and avg pooling (the best pooling performance) with the best models. As seen in Table 2, both models (Yolov5_MSM_MAPool_CAM_backbone and Yolov5_SM_x2_MAPool) worsen their performance by about 4%.
Because YOLO-v5 is a deep detection model designed for low parameter counts and high frames-per-second (fps) inference so that it fits low-end terminals, we must ensure that our changes do not significantly affect this aspect. Thus, Table 3 shows the evaluations of the different variations of the YOLO-v5 model. The Yolov5s_MSM and Yolov5s_MSM_CAM models have a number of trainable parameters comparable to, and even lower than, that of the Yolov5s model. The Yolov5s_MSM model does lose some fps, but the achieved value remains within the range of the larger YOLO-v5 profiles, very close to the fps of the Yolov5l profile. The Yolov5s_MSM model achieves almost 30 fps, a reduction of 5 fps compared to the Yolov5s model. In addition, adding the attention modules did not significantly affect the fps values, which remain at approximately 30 fps. In fact, the single-branch solution (Yolov5_SM_x2) outspeeds Yolov5m, from 35 fps to 42 fps, while achieving much greater precision. Thus, we recommend the adapted Yolov5s_MSM_CAM variant, with a high precision value of 96% at 30 fps, as an industrial detection model for large parking lots, or Yolov5_SM_x2 if speed is required.

IV. CONCLUSION
In this paper, we proposed a reliable modification of the YOLO-v5 model targeting tiny car objects seen from a cenital view. Using a multi-scale module and channel/spatial attention mechanisms, the modified version outperformed the original YOLO-v5 for this specific application, with a precision of up to 96.34% compared to 63.87% for the baseline YOLO-v5 model. The proposed model also slightly improved the Recall and mAP values while maintaining the same number of trainable parameters. The adapted variation is slower than the small and medium profiles of the YOLO-v5 network, although it exceeds the speed of the large and extreme profiles. On the other hand, the single-branch solution outspeeds the YOLO-v5 small profile and is almost as precise as the multi-branch solution. Ongoing work aims at developing a reliable tracker based on the developed detector. Future work aims to deploy the developed detector and tracker on low-end terminals, such as a Field-Programmable Gate Array (FPGA) or an NVIDIA Jetson Nano Developer Kit, for real-time parking monitoring.