An Improved Lightweight Yolo-Fastest V2 for Engineering Vehicle Recognition Fusing Location Enhancement and Adaptive Label Assignment

Engineering vehicle recognition based on video surveillance is one of the key technologies for monitoring illegal land use. At present, engineering vehicle recognition mainly relies on conventional deep learning models that require a large number of floating-point operations, so real-time recognition cannot be achieved on edge devices with limited computing power and storage. In addition, some lightweight models suffer from inaccurate bounding box localization, low recognition rates, and unreasonable selection of positive training samples for small objects. To solve these problems, this article proposes an improved lightweight Yolo-Fastest V2 for engineering vehicle recognition that fuses location enhancement and adaptive label assignment. The location-enhanced feature pyramid network structure combines deep and shallow feature maps to localize bounding boxes accurately. The grouping k-means clustering strategy and the adaptive label assignment algorithm select an appropriate anchor for each object based on its shape and Intersection over Union. The study was conducted on a Raspberry Pi 4B 2018 using two datasets and several models. Experiments show that our method achieves the best combination of speed and accuracy: mAP50 increases by 7.02% at 11.24 FPS on engineering vehicle data obtained by video surveillance in a rural area of China.


I. INTRODUCTION
WITH the development of the economy, urban population, and transportation, Chinese urbanization has entered a stage of rapid expansion. This leads to dramatic changes in Chinese urban land use and an increasingly prominent contradiction between people and land. In recent years, illegal land use has occurred frequently, especially illegal construction on cultivated land, forest land, and rivers. Combined with the serious problems of land pollution and soil erosion, land resources must be protected and the fight against illegal land use strengthened. Engineering vehicles (such as trucks and excavators) are important tools of illegal land use. Therefore, identifying engineering vehicles from video is a useful way to monitor illegal land use. The traditional approach has staff identify engineering vehicles manually from surveillance video [5], a time-consuming and labor-intensive process. With the rapid advancement of deep learning, automatic image recognition based on convolutional neural networks (CNNs) has become widespread in traffic management [6], [7], [8], [9], crime prevention [10], [11], [12], anomaly detection [13], [14], [15], [16], fire warning [17], human detection [18], [19], [20], [21], [22], and agricultural management [23], among other fields. This also enables the identification of engineering vehicles from video in real time. However, the real-time video captured by an online camera must be uploaded to a cloud or local server on which the deep learning recognition model is deployed [24], [25]. This process is limited by network bandwidth, preventing real-time data transmission. Not only is a significant amount of computation concentrated on the server, but the server's performance requirements are also relatively high.
Consequently, in video recognition, deploying computing power at the front end is a crucial way to alleviate the pressure of data transmission and central server computing. With the advent of edge computing, some researchers have proposed smart devices that move computing tasks to the network edge [26], [27]. The object recognition model is deployed directly on the edge device, whose computing resources perform a portion of the recognition workload [28], [29], thereby enhancing real-time recognition and relieving the burden on the central server.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

However, because current mainstream deep learning object recognition models require large numbers of floating-point operations (FLOPs) [30], [31], [32], real-time recognition cannot be achieved on edge devices with limited computing and storage resources. Therefore, developing lightweight models for object recognition in land and resource management has become a popular research direction [33], [34], [35].
Currently, some researchers are investigating lightweight object recognition models. Nikouei et al. [36] developed L-CNN, a lightweight human detection model using depthwise separable convolutions and only 23 layers. On a Raspberry Pi 3 Model B edge device, it achieves an average recognition speed of 1.79 frames per second (FPS) for 224 × 224 video images. Even though this model has few parameters, the computation required is still substantial, and real-time detection cannot be achieved on devices with limited computing resources. Steinmann et al. [37] replaced the SSD detector's backbone with the lightweight PELEE backbone and combined it with a filter pruning strategy to reduce the number of parameters. On a Jetson AGX, the recognition speed is 24 times faster than the original model, but recognition accuracy suffers. Using YOLO-Lite as the backbone, Zhao et al. [38] added a residual block, balanced the high- and low-resolution subnetworks, and proposed the Mixed YOLO V3-Lite lightweight model. On Jetson AGX Xavier hardware, the speed of recognizing 224 × 224 video images can reach 43 FPS, but the number of parameters is substantial, reaching 20.5 MB. Balamuralidhar et al. [28] introduced the MultEYE network for UAV traffic detection. The network prunes CSP Bottleneck and convolution blocks and adds a space-to-depth layer to decrease the spatial size and increase the number of channels. The speed of recognizing 512 × 320 video images on an NVIDIA Xavier NX can reach 29 FPS. RangiLyu [39] proposed the anchor-free NanoDet model, which lightens the head, neck, and backbone and can recognize 320 × 320 data on a mobile ARM CPU at 97 FPS. However, anchor-free models have significant issues with positive and negative sample imbalance and semantic ambiguity, leading to missed objects.
Amudhan and Sudheer [40] proposed PmA, a lightweight model that detects small objects in remote sensing images effectively and quickly by extracting more feature information from shallow features and transferring shallow features to deep features for detection. Compared with YOLO V3-tiny and YOLO V4-tiny, detection on a Jetson Nano is 32% faster. Liu et al. [41] addressed the issues of limited UAV computing resources and very small objects by enhancing the network structure and the target prediction box selection algorithm of YOLO V4, which improved recognition accuracy; the model achieves 23.8 FPS on an Nvidia Jetson TX2 and 9.6 FPS on a Raspberry Pi 4B. Dog-qiuqiu [42] proposed the Yolo-FastestV2 model for real-time detection on mobile devices. The model replaces the backbone of YOLO V5 with ShuffleNet V2 and lightens the feature pyramid network (FPN) structure; the model's parameters total just 237.55 kB. On a Mate 30 Kirin 990 CPU, the recognition speed for 352 × 352 images can exceed 300 FPS, and real-time detection is also achievable on embedded devices with limited computing resources. However, Yolo-Fastest V2 lacks rich feature fusion, which prevents it from making full use of the rich location information in shallow feature maps and from effectively selecting suitable positive training samples. This results in insufficient extraction of the location and semantic features of small objects, which leads to inaccurate positioning of the bounding boxes (bboxes) of small objects and a low recognition rate.
Yolo-FastestV2 is a widely used lightweight object recognition model. This article proposes a method fusing a location-enhanced FPN and adaptive label assignment to address the issues encountered by Yolo-FastestV2. The method improves the FPN structure by fusing feature maps of different scales to ensure that an object's location and feature information are fully extracted. A grouping k-means clustering strategy and an adaptive label assignment algorithm are proposed to assign appropriate anchor boxes to each object, combating the issues of low recognition rate and unreasonable selection of positive training samples. The algorithm assigns each object in the image an anchor with a similar shape and the largest Intersection over Union (IoU). In other words, a suitable positive training sample is selected to facilitate the regression of the bbox.
In summary, the contributions of this article are as follows.
1) Feature fusion is added to the original model's FPN. Shallow feature maps rich in location information are integrated with deep feature map channels rich in semantic information to enhance the localization ability of the model.
2) A threshold α is introduced and a grouping k-means clustering strategy is proposed. By adjusting the value of α, multiscale anchors are generated for small objects, so small-scale anchor boxes can be set reasonably and small bboxes become easier to learn.
3) An adaptive label assignment algorithm based on shape and IoU is proposed. On top of shape-based label assignment, an IoU-based label assignment is introduced. The shape and IoU thresholds are computed dynamically from the mean and variance of the IoU and of the shape ratio. Under the dual constraints of shape and IoU, an appropriate anchor box is assigned to each object.

II. PROPOSED METHOD
As shown in Fig. 1, the model consists of three components: the backbone network ShuffleNet V2, the FPN, and the prediction head. This article improves the Yolo-FastestV2 model in three aspects: FPN structure, anchor box settings, and label assignment.
In recent years, ShuffleNet V2 has been an excellent network for lightweight feature extraction. It consists of three stacked ShuffleV2Block convolution blocks and ensures optimal feature extraction performance by minimizing memory access. The location-enhanced FPN structure (red dotted line) integrates the location information of shallow features into the deep features to improve the model's ability to locate the bbox of a small object. In the label assignment of the prediction head (blue dotted line), the grouping k-means clustering strategy generates reasonable anchors for small objects, while the adaptive label assignment algorithm matches appropriate anchors to each object. Details are as follows.

A. Location-Enhanced FPN
In Fig. 2(a), the original FPN structure predicts small objects using the feature map produced by the third ShuffleV2Block convolution block in the ShuffleNet V2 network, followed by a 1 × 1 convolution. The feature map obtained by the 1 × 1 convolution is then upsampled and fused by concatenation with the feature map produced by the second ShuffleV2Block convolution block. Finally, a 1 × 1 convolution is applied to predict objects whose scale is larger than that of small objects. Thus, the FPN utilizes only two layers of shallow feature maps, lacks rich object location information, and cannot fully extract semantic information. This results in inaccurate center-coordinate positioning when the model predicts the bbox of a small object.
This article proposes a location-enhanced FPN structure [see Fig. 2(b)], whose primary contribution is feature fusion: the 1/8-scale shallow feature map produced by the first ShuffleV2Block convolution block is introduced into the deep feature maps.
For medium- and large-scale objects, the feature map produced by the first ShuffleV2Block convolution block is first downsampled by a factor of 4 to obtain C1. Second, C11 is obtained by concatenating C1 with the feature map produced by the third ShuffleV2Block convolution block. Finally, C11 is subjected to a 1 × 1 convolution to predict objects of medium and large scale.
For small-scale objects, the result of the C11 convolution is first upsampled by a factor of 2 to obtain C13. Second, the feature map produced by the first ShuffleV2Block convolution block is downsampled by a factor of 2 to obtain C2. C2 is fused by concatenation with the feature map produced by the second ShuffleV2Block convolution block to obtain C21, which incorporates deep semantic information. C22 is obtained after a 1 × 1 convolution. A 1 × 1 convolution follows each concatenation because, during training, it extracts the effective feature information of the preceding feature map and suppresses noise. Finally, C22 and C13 are combined by addition feature fusion to predict small objects. Through addition feature fusion, shallow and deep features are fully fused, producing a fused feature map rich in object location information that improves the original model's ability to locate bboxes. This structure allows the prediction of both large and small objects to benefit from shallow features rich in location information, thereby improving localization accuracy.
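The fusion described above can be sketched at the shape level as follows. This is a minimal NumPy sketch: the channel counts, the stride-based downsampling, and the einsum stand-in for a 1 × 1 convolution are illustrative assumptions, not the exact ShuffleNet V2 configuration.

```python
import numpy as np

def downsample(x, factor):
    """Stride-based downsampling stand-in for a strided conv/pooling (NCHW)."""
    return x[:, :, ::factor, ::factor]

def upsample(x, factor):
    """Nearest-neighbour upsampling along H and W (NCHW)."""
    return x.repeat(factor, axis=2).repeat(factor, axis=3)

def conv1x1(x, out_ch):
    """Shape-level stand-in for a 1x1 convolution: mixes channels only."""
    n, c, h, w = x.shape
    rng = np.random.default_rng(0)
    weight = rng.standard_normal((out_ch, c)) * 0.01
    return np.einsum("oc,nchw->nohw", weight, x)

# Backbone feature maps for a 352x352 input (channel counts are illustrative).
s1 = np.ones((1, 48, 44, 44))   # 1st ShuffleV2Block output, stride 8
s2 = np.ones((1, 96, 22, 22))   # 2nd ShuffleV2Block output, stride 16
s3 = np.ones((1, 192, 11, 11))  # 3rd ShuffleV2Block output, stride 32

# Medium/large branch: C1 = s1 downsampled x4, concatenated with s3.
c1 = downsample(s1, 4)
c11 = np.concatenate([c1, s3], axis=1)
p_large = conv1x1(c11, 96)           # predicts medium/large objects

# Small-object branch: upsample the deep branch (C13), fuse the x2-downsampled
# s1 (C2) with s2 by concatenation (C21), convolve (C22), then add C13 so
# shallow location cues reach the deep, semantically rich features.
c13 = upsample(conv1x1(c11, 96), 2)
c2 = downsample(s1, 2)
c21 = np.concatenate([c2, s2], axis=1)
c22 = conv1x1(c21, 96)
p_small = c22 + c13                  # predicts small objects
```

The key design point visible here is that concatenation merges channels from two scales, while the final addition fuses the two prediction paths element-wise, so the small-object head sees both shallow location information and deep semantics.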

B. Grouping k-Means Clustering Strategy
For anchor-based object recognition, appropriate anchors must be set before label assignment. However, object sizes in an image vary widely. If k-means clustering is applied directly to the bboxes of the training set, the bboxes of some very small objects are easily clustered into overly large anchors [see Fig. 3(a)], owing to the influence of the cluster count k in the k-means method, which then assigns such anchors to small objects [see Fig. 3(b)]. Ultimately, these anchors propagate harmful gradient values back through the network.
This article proposes a grouping k-means clustering strategy in response to this issue. As shown in Fig. 4, the strategy introduces a threshold α. The width and height of each object's ground-truth (GT) box are compared with the width and height of the image, and GT boxes whose minimum ratio is less than the threshold α are grouped into the small-object set, with the remaining GT boxes forming another set. The threshold α thus serves as the demarcation point between small and medium objects; it can be adjusted by analyzing the relative width and height distribution of the dataset's objects so that the small objects form a single set. k-means clustering is then performed on this set to obtain n anchors of different scales, allowing each small object to be assigned a more reasonable label. For the values of m and n, models in the Yolo series [26] usually cluster three anchors per object level; this article follows the Yolo convention and sets both n and m to 3.
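The grouping step can be sketched as follows. This is a minimal NumPy version; plain Euclidean k-means on (w, h) pairs is used for brevity, whereas YOLO-style pipelines often cluster on 1 − IoU instead, and α = 0.1 follows the paper's choice.

```python
import numpy as np

def kmeans_wh(boxes, k, iters=50, seed=0):
    """Plain k-means on (w, h) pairs; Euclidean distance is a simplification
    of the 1 - IoU distance commonly used for anchor clustering."""
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(boxes[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):  # skip empty clusters
                centers[j] = boxes[assign == j].mean(axis=0)
    return centers

def grouped_anchors(rel_wh, alpha=0.1, n=3, m=3):
    """Grouping k-means: split GT boxes by relative size at threshold alpha,
    then cluster each group separately so small objects get their own
    small-scale anchors instead of being absorbed into large clusters."""
    rel_wh = np.asarray(rel_wh, dtype=float)
    is_small = rel_wh.min(axis=1) < alpha         # min(w/W, h/H) < alpha
    small, rest = rel_wh[is_small], rel_wh[~is_small]
    small_anchors = kmeans_wh(small, n) if len(small) >= n else small
    other_anchors = kmeans_wh(rest, m) if len(rest) >= m else rest
    return small_anchors, other_anchors
```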

C. Adaptive Label Assignment
After acquiring appropriate anchors, shape label assignment is applied (see Fig. 5): each object is matched with suitable anchors, which are then assigned cls and reg labels. However, the original shape label assignment causes two issues. First, a fixed shape threshold cannot guarantee that every object matches an anchor; in other words, the selection of positive training samples is unreasonable — for example, objects larger than the shape threshold at different levels [see Fig. 6(a) and (b)]. Second, multiple anchors of similar shape may be matched, among which anchors with low IoU participate in backpropagation, increasing the complexity of model training.
This article proposes an adaptive label assignment strategy inspired by the ATSS algorithm [43]. The strategy combines shape similarity and maximum IoU. In terms of shape, at the same level, the mean and standard deviation of the width and height ratios (r_W, r_H) between each GT box and the anchors are first calculated. The sum of the mean and standard deviation yields the upper-boundary constraint value over each object's anchors. As shown in Fig. 7, this constraint guarantees that each object is assigned at least one similarly shaped anchor. The shape similarity threshold is calculated as follows:

d_W^{ij} = W_gt^i / W_anchor^j  (1)

d_H^{ij} = H_gt^i / H_anchor^j  (2)

r^{ij} = max(d_W^{ij}, d_H^{ij})  (3)

μ^i = (1/m) Σ_{j=1}^{m} r^{ij},  σ^i = sqrt((1/m) Σ_{j=1}^{m} (r^{ij} − μ^i)^2)  (4)

α_shape^i = μ^i + σ^i  (5)

In (1)-(5), H_gt^i and W_gt^i denote the height and width of the ith object's GT box; H_anchor^j and W_anchor^j denote the height and width of the jth anchor; d_W^{ij} and d_H^{ij} denote the width and height ratios between the ith GT box and the jth anchor; m is the number of anchors; r^{ij} denotes the maximum of the width and height ratios for the jth anchor of the ith GT box; μ^i and σ^i denote the mean and standard deviation of r for the ith GT box; and α_shape^i is the shape threshold of the ith GT box.
In terms of IoU, the mean and standard deviation of the IoU between each object's GT box and the anchors at the same level are computed. Their sum is used as the lower-boundary constraint value on each GT box's IoU. This constraint is combined with the shape similarity constraint to determine the positive sample anchors participating in training. If the anchors assigned by shape similarity do not meet the adaptive IoU threshold, the object is included in the label assignment for the next level, ensuring that each object has a corresponding anchor. The lower-bound threshold is calculated as follows:

μ_IoU^i = (1/m) Σ_{j=1}^{m} IoU^{ij}  (6)

σ_IoU^i = sqrt((1/m) Σ_{j=1}^{m} (IoU^{ij} − μ_IoU^i)^2)  (7)

α_IoU^i = μ_IoU^i + σ_IoU^i  (8)

In (6)-(8), IoU^{ij} denotes the IoU between the ith object's GT box and the jth anchor; μ_IoU^i and σ_IoU^i denote the mean and standard deviation of all IoUs for the ith GT box; and α_IoU^i is the IoU threshold for the ith GT box.
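The two adaptive thresholds can be sketched as follows. This is a minimal NumPy sketch of one level: the width/height-ratio form and the shared-center IoU are reasonable readings of the description above, not the paper's verbatim formulas, and unmatched objects would be passed to the next level as the text describes.

```python
import numpy as np

def iou_wh(gt, anchors):
    """IoU of (w, h) pairs assuming shared centers, the standard convention
    for anchor matching/clustering."""
    inter = np.minimum(gt[0], anchors[:, 0]) * np.minimum(gt[1], anchors[:, 1])
    union = gt[0] * gt[1] + anchors[:, 0] * anchors[:, 1] - inter
    return inter / union

def adaptive_assign(gt_wh, anchors_wh):
    """Adaptive label assignment sketch: the shape threshold is the mean plus
    the standard deviation of the max width/height ratio (upper bound), and
    the IoU threshold is the mean plus the standard deviation of the IoUs
    (lower bound). An anchor is a positive sample when it satisfies both."""
    gt = np.asarray(gt_wh, dtype=float)
    anchors = np.asarray(anchors_wh, dtype=float)

    d_w = gt[0] / anchors[:, 0]              # width ratios
    d_h = gt[1] / anchors[:, 1]              # height ratios
    r = np.maximum(d_w, d_h)                 # max ratio per anchor
    alpha_shape = r.mean() + r.std()         # adaptive upper bound on shape

    ious = iou_wh(gt, anchors)
    alpha_iou = ious.mean() + ious.std()     # adaptive lower bound on IoU

    positive = (r <= alpha_shape) & (ious >= alpha_iou)
    return positive, alpha_shape, alpha_iou
```

For example, a 10 × 10 GT box against anchors (10, 10), (20, 5), and (50, 50) keeps only the first anchor: the second fails the shape bound and the third fails the IoU bound.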
In a lightweight network structure, too many pyramid levels reduce recognition speed, which is not conducive to embedded deployment and operation. The proposed method adapts anchors of different scales to each object at the same feature-scale level; therefore, it is also applicable to network structures without an FPN. The method operates at the multiscale anchor level based on the relationship between each object's GT box and the anchors. The adaptive label assignment strategy dynamically adjusts the shape similarity threshold and the IoU threshold so that each object matches anchors in both shape similarity and overlap. This mitigates the problem of unreasonable positive training sample selection to some extent.

III. EXPERIMENTAL RESULTS AND ANALYSIS
This article uses engineering vehicle images collected by surveillance cameras as the data source. First, the model is trained. Then, ablation experiments are conducted to determine the effect of the location-enhanced FPN, the grouping k-means clustering strategy, and the adaptive label assignment on the model. Finally, the model is compared with other mainstream lightweight object detection models to validate the method's advancement.

A. Data Introduction and Experimental Environment
The dataset for this experiment contains 679 images of 1080 × 1920 pixels, divided into training and testing sets at a ratio of 8:2: 543 training images and 136 testing images. The recognized object types are car, truck, agricultural vehicle, and excavator. The training set contains 587, 389, 161, and 131 instances of each type, respectively; the testing set contains 166, 110, 36, and 37. During training, transfer learning and data augmentation (e.g., rotation and gamut transformation) are used.

1) Description of the Experimental Environment:
The model was trained on an NVIDIA GeForce RTX 2080 Ti GPU with 8 GB of memory (see Table I). As shown in Fig. 8, a Raspberry Pi 4B 2018 with an ARM v7l CPU is used as the resource-limited edge computing device for model inference (see Table II).

2) Model Training: Due to privacy restrictions on surveillance video, only a limited number of images could be obtained. Therefore, this article divides the data only into train and test sets, without a validation set, to ensure a sufficient amount of training data. The improved model is pretrained on the MS COCO 2017 train set to obtain pretrained weights; the other methods use official COCO pretrained weights. With no layers frozen in any model, each pretrained model is fine-tuned on the train set via transfer learning. Image-processing-based data augmentation is adopted because it helps avoid overfitting on a small dataset while having little impact on recognition accuracy.
3) Model Parameters: The minibatch size is set to 4 for all methods, and all other parameters of the compared methods retain their official configurations. The improved model is trained with an initial learning rate of 0.001 for 240 epochs; at the 130th, 160th, 178th, and 185th epochs, the learning rate decreases by a factor of 10, and the stochastic gradient descent algorithm is used. CrossEntropyLoss and CIoU Loss serve as the classification and regression loss functions, respectively. The foreground/background classification loss L_obj has a weighting factor of 34, the class classification loss L_cls a factor of 64, and the bbox regression loss L_reg a factor of 3.4. In the inference phase of all methods, the nonmaximum suppression threshold is 0.45, the classification (cls) threshold is 0.5, the objectness (obj) threshold is 0.5, and the confidence (conf) score of a bbox is the cls score multiplied by the obj score.

The threshold α is determined primarily by the data distribution, in two steps. First, the relative width and height distribution of the bounding boxes of all objects in the dataset is analyzed. Second, the value of α is chosen according to the distribution of the relative widths and heights of the small objects. In Fig. 9, the relative widths and heights of small objects in the training set fall within 0.1 at the smallest increment; therefore, the threshold α is set to 0.1 in the grouping k-means clustering.

In the real world, video surveillance recognition mostly relies on edge computing devices with limited computing power. Because such devices tend to use smaller image inputs and the mAP50 should be as high as possible, the experimental standard for image input on the Raspberry Pi is set to 352 × 352.
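The two-step choice of α can be sketched as follows. This is a minimal NumPy sketch; the example boxes and image size are hypothetical, and in practice one would inspect the histogram of these ratios (cf. Fig. 9) before fixing α.

```python
import numpy as np

def smallest_relative_extent(boxes_wh, img_w, img_h):
    """Per-box min(w/W, h/H): the quantity compared against the grouping
    threshold alpha. Its distribution over the training set (cf. Fig. 9)
    shows where the small-object cluster ends."""
    wh = np.asarray(boxes_wh, dtype=float)
    rel = np.stack([wh[:, 0] / img_w, wh[:, 1] / img_h], axis=1)
    return rel.min(axis=1)

# Hypothetical example: with 1920x1080 frames, a box whose smallest relative
# extent falls below 0.1 is treated as a small object (alpha = 0.1 as chosen
# in this article).
ratios = smallest_relative_extent([[120, 80], [400, 300]], img_w=1920, img_h=1080)
alpha = 0.1
num_small = int((ratios < alpha).sum())
```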
2) Impact of the Proposed Method: Ablation experiments were conducted on Yolo-Fastest V2 to verify the effect of each proposed method on mAP50. Table III shows that introducing the location-enhanced FPN and the grouping k-means clustering strategy increases mAP50 by 1.01% and 2.04%, respectively. The experimental results indicate that the feature fusion in the location-enhanced FPN improves the model's ability to learn features, and that the grouping k-means clustering strategy contributes positively to assigning an appropriate anchor box to each small-scale object. Replacing the original shape matching in Yolo-FastestV2 with the proposed adaptive label assignment yields a 1.53% improvement in mAP50, demonstrating that adaptive label assignment ensures that each object is assigned an anchor with similar shape and high overlap and that the model is trained with a large number of positive examples. Combining the location-enhanced FPN and the grouping k-means clustering strategy improves mAP50 by 4.36%. Together, the three proposed methods increase mAP50 by 7.02%, proving that the combined scheme is effective.
To demonstrate the robustness of the proposed method, it is implemented within the YOLOv5-n model and validated on the Pascal VOC (VOC2007+VOC2012) dataset. The experimental results are shown in Table V. Adding the location enhancement method to the FPN structure of YOLOv5-n increases mAP50 by 0.10%. The grouping k-means clustering strategy improves mAP50 by 0.50%, and adaptive label assignment increases mAP50 by 0.90%. Applying the grouping k-means clustering together with the location-enhanced structure improves mAP50 by 0.80%, and combining location enhancement with adaptive label assignment increases mAP50 by 1.10%. These results demonstrate the robustness of the proposed method.

C. Compared With State-of-the-Art Models
To evaluate the impact of the proposed method on the model's feature extraction ability, six current state-of-the-art lightweight detectors are compared with the model enhanced by the proposed method. Among the six, Yolo-Fastest and Yolo-FastestV2 are lightweight variants of YOLOV3 and YOLOV5, respectively; the others are official lightweight models.
Compared with the original Yolo-FastestV2, the proposed method enhances the model's feature extraction ability. Fig. 11 shows that when the input image size exceeds 352, the mAP50 curve of Yolo-FastestV2 flattens; at this point, its feature extraction capability reaches a bottleneck. With the improved model, the mAP50 curve maintains an increasing trend when the input size is greater than 352. When the input size is less than or equal to 352, the improved model achieves a higher mAP50 than models at the same parameter level. When the input size exceeds 352, the improved model's mAP50 growth trend is slightly lower than that of NanoDet-m, YOLOV3-Tiny, and Yolo-Fastest, because its small model size limits the extraction of object feature information.
Table VI compares the models quantitatively in terms of FLOPs, parameter count, recognition speed, and mAP50. The proposed method increases the number of parameters by 105.4K but does not increase FLOPs. Specifically, compared with YOLOV3-Tiny, YOLOX-Nano, NanoDet-m, Yolo-Fastest, and Yolo-FastestV2, the proposed method increases mAP50 by 9.42%, 9.55%, 3.09%, 11.42%, and 7.02%, respectively, and is 7.35, 8.33, 7.94, 4.92, and 1.28 FPS faster in recognition speed. The improved model's mAP50 is 17.38% lower than that of YOLOv5-n, but it is 8.28 FPS faster. Therefore, in the tradeoff between speed and accuracy, the enhanced model performs best (see also Table IV). To visually demonstrate the vehicle detection results on the Raspberry Pi, the experimental comparisons of the models are presented in Fig. 12. Comparing Fig. 12(f) and (g) shows that the proposed method makes fewer misidentifications and locates bboxes more accurately, demonstrating that our method fully learns the location and semantic features of small objects under video surveillance. In Fig. 12(a)-(f), the other detectors miss some detections, whereas Fig. 12(g) shows that the improved Yolo-Fastest V2 misses fewer objects. This demonstrates that our method chooses better positive training samples, transmitting effective gradients to the model during backpropagation so that the effective features of objects are fully learned. Recognition results of the improved Yolo-Fastest V2 in other scenes are shown in Fig. 13.

IV. CONCLUSION
To overcome the low recognition rate and inaccurate bbox localization that arise when lightweight object detection models on embedded devices identify small-scale engineering vehicles in surveillance videos, this article proposes a method combining location enhancement and adaptive label assignment. The method improves recognition performance without reducing recognition speed. Identifying engineering vehicles engaged in illegal land use from surveillance footage is crucial, as it can save government departments human and financial resources. The method comprises the location-enhanced FPN structure, the grouping k-means clustering strategy, and adaptive label assignment. The location-enhanced FPN structure uses concatenation and addition feature fusion to fuse shallow and deep feature maps, so the deep feature maps contain rich location information, improving the model's capacity to locate the bboxes of small objects. Second, the grouping k-means clustering strategy obtains multiscale anchors for small objects. Finally, the proposed adaptive label assignment algorithm selects an appropriate anchor for each object to be identified and assigns it a label.
Tested on the engineering vehicle monitoring dataset, the method improves the recognition accuracy of the original model from 42.90% mAP50 to 49.92% mAP50. The number of model parameters increases by 105.5K, with no significant increase in FLOPs, and the recognition speed on the Raspberry Pi still reaches 11.24 FPS. To demonstrate that the proposed method has a certain degree of robustness, it is applied to YOLOv5-n, where it provides a 1.10% mAP50 increase. This also demonstrates that the proposed method is applicable to both lightweight and large-scale models.
The proposed method still has some shortcomings. For instance, if the application scenario changes, k-means clustering must be performed again. Therefore, follow-up research will move in the direction of anchor-free models.