ALODAD: An Anchor-Free Lightweight Object Detector for Autonomous Driving

Vision-based object detection is an essential component of autonomous driving. Because vehicles typically have limited on-board computing resources, a small-sized detection model is required. Simultaneously, high object detection accuracy and real-time inference speeds are required to ensure safety while driving. In this paper, an anchor-free lightweight object detector for autonomous driving called ALODAD is proposed. ALODAD incorporates an attention scheme into the lightweight neural network GhostNet and builds an anchor-free detection framework to achieve low computational cost and a small parameter size while retaining high detection accuracy. Specifically, the lightweight backbone network integrates a convolutional block attention module that extracts valuable features from traffic scene images to generate accurate bounding boxes, and feature pyramids are then constructed for multi-scale object detection. The proposed method adds an intersection over union (IoU) branch to the decoupled detection head to rank the vast number of candidate detections accurately. To increase data diversity, data augmentation was used during training. Extensive experiments on benchmarks demonstrate that the proposed method offers improved performance compared with the baseline, achieving increased detection accuracy while meeting the real-time requirements of autonomous driving. Compared with the YOLOv5 and RetinaNet models, the proposed method obtained 98.7% and 94.5% for the average precision metrics AP50 and AP75, respectively, on the BCTSDB dataset.


I. INTRODUCTION
Autonomous driving will change the way we travel in the future and will be vital to the development of national and global economies. Commercial applications of autonomous driving have been realized for specific scenarios to date. However, because the current technology directions are mainly based on lidar, the system cost is high, and large-scale deployment cannot be realized in this way. Vision-based methods have become a research hotspot because of their low cost. Object detection is one of the most important aspects of the development of this technological approach. Object detection methods can help autonomous vehicles (AVs) detect and recognize traffic signs, signal lights, pedestrians, and vehicles in traffic scenes automatically and can then transmit the results to the vehicle's decision-making module to ensure that the vehicle is driven safely and in accordance with traffic rules.
In recent years, many algorithms have demonstrated good object detection performances based on deep learning. However, the detection of objects in real traffic scenes remains a challenge. Some researchers have used complex models to obtain a high traffic object detection performance. However, because the on-board computational resources of vehicles are limited, these complex models cannot be deployed in embedded devices, or they are unable to achieve real-time detection during autonomous driving. Improving the detection accuracy of such a model when deployed on an on-board computing unit remains challenging.
Large and complex models are difficult to apply to AVs because they have insufficient on-board memory and computing power. Scenarios in autonomous driving typically require low latency and fast response speeds. Thus, the aim of this paper is to propose an object detection model that can realize the high detection accuracy required while maintaining small parameter sizes for autonomous driving applications.
To solve the problems described above, a lightweight detection framework based on a single stage was proposed. The contributions of the proposed method are as follows: (1) It improves an existing lightweight backbone network based on GhostNet [1]. The method integrates an attentional mechanism into the GhostNet backbone network, which can improve the network and allow it to focus on the objects to be detected, and uses a feature pyramid network for multiscale detection. (2) A novel complete intersection over union (CIoU)-aware head based on an anchor-free detector and a new confidence calculation method are designed to enhance the correlation between object classification and localization.
(3) A data augmentation approach driven by complex traffic scenarios was used to provide a more diverse dataset for training.
The remainder of this paper is organized as follows. In Section 2, we introduce related works on object detection in recent years. The details of the proposed method are presented in Section 3. Section 4 focuses on the implementation of the proposed method and compares it with previous methods. Section 5 summarizes the conclusions of the work completed in this study and suggests future development directions for this work.

II. RELATED WORK

A. OBJECT DETECTION
Traditional object detection (using a histograms of oriented gradients plus support vector machine (HOG [2] + SVM [3]) approach) works by selecting candidate regions from a given image, extracting features from these regions, and then classifying these features using a trained classifier. In recent years, the rapid development of deep convolutional neural networks has led to performance improvements in the object detection field. In general, deep learning-based object detection methods can be divided into two types: single-stage methods, such as the you only look once series (e.g., YOLO v1 [4], YOLO v2 [5], and YOLO v3 [6]), the single shot multibox detector (SSD) [7], and RetinaNet [8], and multi-stage methods, such as the two-stage region convolutional neural network (R-CNN) series [9]-[11] and cascade R-CNN [12]. The detection speeds of multi-stage methods make it difficult to achieve real-time detection, whereas one-stage detection algorithms can greatly improve the operating speed while maintaining high accuracy.
In recent years, researchers have begun to focus on the application of object-detection methods to AVs. Wang et al. [13] proposed an improved faster R-CNN for traffic sign detection. Han et al. [14] used a revised faster R-CNN for small traffic sign detection. These studies achieved high detection accuracy, but the detection speeds on traffic scenes demonstrated the limitations of these methods. He et al. [15] used a one-stage detector called YOLO-MXANet to perform small object detection in traffic scenes to improve detection speed. Based on mask R-CNN, the mask scoring (MS) R-CNN approach [16] uses a mask IoU head to learn the predicted mask quality and then obtain a new network structure that combines the features of the instance with the corresponding predicted mask to regress the mask IoU. Jiang et al. proposed IoU-Net [17] to improve the interpretability of the regression by proposing an IoU-guided non-maximum suppression (NMS) that introduces localization confidence as an ordering index in the NMS, and proposed an optimization-based bounding box refinement to replace the traditional regression-based method. Fan et al. [18] used CornerNet [19] with foreground attention to detect traffic objects. Xu et al. [20] used the center-based detection algorithm, FCOS [21], to detect objects in mobile scenarios. Some of the detection methods described above have already been applied to AVs.
However, there is still a problem regarding the low correlation between the classification score and localization accuracy. Generally, the final scores of the predicted boxes used in NMS are taken from the classification scores alone, and the localization information is not considered. In Figure 1, C represents the classification score. A bounding box with a high classification score but low IoU (Fn) cannot accurately represent the location of an object, yet it suppresses accurate boxes with high IoU values (Tn) during NMS when only classification scores are used as the final scores. This mismatch between classification and localization causes boxes with high IoU values but low classification scores to be filtered out during NMS. In this study, we propose a novel traffic object detection model to solve this problem.

B. FEATURE EXTRACTION
Feature extraction is an essential step for object detection. Traditional feature extraction methods are based on the use of handcrafted features. Yao et al. [22] proposed a traffic sign feature extraction method based on HOG features. Pedro et al. [23] proposed a deformable part model (DPM) to extract object features for both vehicles and pedestrians. The performance of these traditional methods is limited because they lack the ability to acquire spatial and semantic information. Their slow extraction speeds and low representational abilities cannot meet the requirements of autonomous driving systems. In recent years, deep convolutional neural network (CNN) algorithms have been widely used for feature extraction applications because of their competitive performance.
Additional feature extraction networks based on deep learning have been proposed based on AlexNet [24], which has a relatively small receptive field because of its limited network depth and the number and size of its convolution kernels. The visual geometry group network (VGGNet) [25] simplifies the network design workflow by increasing the network depth and stacking small convolutions to expand the receptive field, reduce the number of network parameters, and stack the same types of network blocks repeatedly. The network in network [26] structure first uses a global average pooling layer to replace the fully connected layer and then uses a 1 × 1 convolution layer to learn a nonlinear combination of the feature graph channels, which has become the mainstream method for feature fusion. GoogLeNet [27] uses convolution kernels of different sizes to provide enhanced multi-scale detection capabilities. However, the large number of required parameters limits the computational power of this network. Residual networks (ResNets) [28] introduced a residual block to reduce the gradient disappearance problem in deeper neural networks, thereby allowing these networks to acquire deeper features. Jung et al. [29] used ResNet to perform vehicle classification and localization in traffic surveillance systems.
The networks described above are not designed for use in mobile or embedded devices. Improvements in calculation accuracy also cause excessive memory consumption and computing power mismatch. The aim of this study is to design a more efficient network by reducing the number of network parameters without compromising network performance.
MobileNet [30] is a convolutional neural network proposed by Google that is small in size and less computationally expensive, and is thus suitable for use in mobile devices. MobileNet uses depth-wise separable convolution and width multiplication to reduce the number of required network parameters. The depth-wise separable convolution method decomposes a standard convolution into depth-wise and point-wise convolutions. The number of floating-point operations (FLOPs) of a standard convolution is given by K² · M · H · W · N, whereas that of the depth-wise separable convolution is given by (K² + M) · H · W · N.
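As a quick sanity check on these formulas, the following sketch (illustrative layer sizes only) computes both FLOP counts and their ratio, which approaches K² when M is much larger than K²:

```python
def conv_flops(k, m, h, w, n):
    """FLOPs of a standard convolution: K^2 * M * H * W * N."""
    return k * k * m * h * w * n

def dw_separable_flops(k, m, h, w, n):
    """FLOPs of a depth-wise separable convolution: (K^2 + M) * H * W * N."""
    return (k * k + m) * h * w * n

# Example: 3x3 kernel, 64 output channels, 80x80 map, 32 input channels.
std = conv_flops(3, 64, 80, 80, 32)
sep = dw_separable_flops(3, 64, 80, 80, 32)
ratio = std / sep  # roughly K^2 = 9 when M >> K^2
print(std, sep, round(ratio, 2))  # 117964800 14950400 7.89
```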
In the general network architecture, M (the number of output feature channels) is much larger than K² (the squared convolution kernel size) (e.g., K = 3 and M ≥ 32), and H, W, and N are defined as the height, width, and number of input channels, respectively. Biswas et al. [31] used an SSD and MobileNet to perform automatic traffic density estimation. ShuffleNet [32] mainly uses channel shuffle methods, point-wise group convolutions, and depth-wise convolution to modify the original residual blocks, thus reducing the number of parameters and computations required. Chen et al. [33] proposed an efficient neural network to perform point cloud analysis by shuffling the feature channels to capture fine-grained features. Although the models above can achieve good performance under a lightweight network framework, there is considerable redundancy between feature maps, and much of the associated computation is therefore redundant. GhostNet was proposed as a new neural network architecture for edge devices that exploits the redundancy between feature maps to generate them at a lower cost, as illustrated in Fig. 2. It uses ''cheap operations'' to alleviate the computation caused by content redundancy between feature maps of the same layer, which reduces computation and improves the detection speed of the model while maintaining the same detection accuracy. Based on the original feature maps, the algorithm uses linear transformations to generate ghost feature maps that extract the required information from the original feature maps at lower computational cost.
The main limitation of these lightweight networks is that they are designed to perform classification tasks and lack the ability to identify local features. In this study, we focus on ways to improve the representation of local region features in lightweight networks.

III. PROPOSED METHOD
The method proposed in this study is based on an anchor-free approach that avoids the computation introduced by anchors, with the aim of achieving high detection accuracy in real time.
The proposed method is illustrated in Fig. 3. The network architecture can be divided into three parts: backbone, feature pyramid network, and prediction head. We integrated the convolutional block attention module (CBAM) [34] into GhostNet to generate the attention map sequentially along the channel-wise and spatial-wise dimensions, which can find the attention region and extract its features more effectively in a lightweight manner in autonomous driving scenarios. The prediction head is built on the feature pyramid network and consists of two branches: one branch is used for regression, including bounding box localization and IoU prediction, and the other is used to perform classification. We separate the classification and regression tasks into two independent sub-networks and then add a synchronized CIoU-aware head to the tail of the regression branch to solve the mismatch problem.

A. LIGHTWEIGHT BACKBONE NETWORK
Convolutional neural networks usually require a large number of parameters and floating-point operations (FLOPs) to achieve high accuracy. GhostNet can reduce the number of computational steps required to generate feature maps at lower computational costs. In addition, GhostNet eliminated some of the same feature maps in subsequent steps without losing any information, thus providing a more efficient way to generate feature maps.
GhostNet can solve the computational redundancy problem of traditional convolution operations; however, it ignores the need for effective feature extraction. In this study, we integrated CBAM into GhostNet to enhance the object area in the feature map. CBAM is an attention mechanism that combines channel attention M_c(F) and spatial attention M_s(F) information, which are defined as follows:

M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F)))

where σ denotes the sigmoid function, F denotes a feature map, MLP is a multilayer perceptron, and AvgPool and MaxPool represent average pooling and maximum pooling, respectively.

M_s(F) = σ(f^(7×7)([AvgPool(F); MaxPool(F)]))

where f^(7×7) represents a convolution operation with a filter size of 7 × 7.
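CBAM's channel and spatial attention maps can be sketched in PyTorch roughly as follows; layer sizes such as the channel-reduction ratio of 16 are common CBAM defaults rather than values taken from this paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """M_c(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared MLP implemented with 1x1 convolutions.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
    def forward(self, f):
        avg = self.mlp(F.adaptive_avg_pool2d(f, 1))
        mx = self.mlp(F.adaptive_max_pool2d(f, 1))
        return torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    """M_s(F) = sigmoid(f7x7([AvgPool(F); MaxPool(F)])) over channels."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)
    def forward(self, f):
        avg = f.mean(dim=1, keepdim=True)          # channel-wise average pool
        mx = f.max(dim=1, keepdim=True).values     # channel-wise max pool
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()
    def forward(self, f):
        f = f * self.ca(f)       # channel-wise refinement first
        return f * self.sa(f)    # then spatial refinement

x = torch.randn(1, 32, 40, 40)
y = CBAM(32)(x)
print(y.shape)  # torch.Size([1, 32, 40, 40]) - shape is preserved
```

The module multiplies the input by each attention map in turn, so it can be dropped into a Ghost bottleneck without changing any feature map shapes.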
The proposed module composed of CBAM is shown in Fig. 4. Although a small model has weaker representation ability and a lower ceiling on potential performance, the experimental results show that the Ghost module with CBAM provides stable performance improvements with only a small number of additional calculations. The architecture of the proposed backbone network is shown in Table 1. Here, Conv2d n×n represents a standard two-dimensional convolutional layer with an n×n kernel size. GhostA represents the Ghost attention bottleneck.

B. MULTI-SCALE DETECTION
The overlap between different ground truths may cause ambiguity that is difficult to handle during the training process. This ambiguity leads to a reduction in detector performance. In this study, we show that multi-scale prediction can effectively solve this problem. Following the approaches of the feature pyramid network (FPN) [35] and pyramid attention network (PAN) [36], the method in this study uses different levels of feature layers to detect objects of different sizes. We constructed a pyramid with five scales of feature maps {P3, P4, P5, P6, P7}, where the subscripts indicate the pyramid levels. P3, P4, and P5 were extracted from the backbone network layers {C3, C4, C5} using top-down convolutions to reduce the degradation that occurs as the depth of the convolutional layers increases. P6 and P7 were produced by applying a 3 × 3 convolution with a stride of 2 to P5 and P6, respectively. Multi-level detection shares information between the different feature layers, which makes the detector parameters more efficient and thus improves the detection performance.
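Assuming each pyramid level P_l corresponds to a stride of 2^l (the usual FPN convention, with the stride-2 convolutions halving P5 and P6 in turn), the spatial size of each level for a 640 × 640 input can be checked quickly:

```python
def pyramid_sizes(input_size, levels=range(3, 8)):
    """Spatial size of each pyramid level P_l, assuming stride 2**l."""
    return {f"P{l}": input_size // (2 ** l) for l in levels}

sizes = pyramid_sizes(640)
print(sizes)  # {'P3': 80, 'P4': 40, 'P5': 20, 'P6': 10, 'P7': 5}
```

Small objects are matched to the high-resolution P3 map, while the coarse P7 map covers the largest objects.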

C. CIOU-AWARE DECOUPLED HEAD
In object detection, the models that perform classification and regression tasks are relatively independent. A classification step is performed to divide the objects by category, and a regression step is performed to predict the locations of the objects. These two tasks differ in their functionality and complexity and should therefore be completed using different branches. However, although one-stage detection models such as the YOLO series have continued to evolve, their detector heads remain coupled. Inspired by the segmenting objects by locations (SOLO) [37] instance segmentation algorithm, we propose a new detection head for object detection. As shown in Fig. 3, this involves using a decoupled head to replace the coupled head.
The decoupled head contained a convolutional layer with a 1 × 1 kernel to reduce the number of channels, and this layer was followed by two parallel branches. The classification branch contained two convolutional layers, and the regression branch contained four convolutional layers. Experiments have verified that the use of the decoupled head for the single-stage model significantly improves the convergence speed, as illustrated in Fig. 5.
The low correlation observed between the classification score and localization accuracy reduces the dense object detection capability of the detector. Because only the classification confidence is used for bounding box sorting in NMS, anchors with high IoU but low classification scores are filtered out, producing a mismatch between classification and localization.
To solve this mismatch problem, the proposed method adds a synchronized subnetwork at the end of the regression branch. Specifically, the CIoU-aware head adds classification branch feature maps to increase the impact of classification on IoU predictions. This approach was used to assist in calculating the anchor score in the final NMS step. Therefore, the complete prediction head contains two subnetworks and three heads. The final score S_conf of the anchor used in the final NMS step is obtained by combining the classification confidence C with the CIoU predicted at inference:

S_conf = α · C + (1 − α) · CIoU

The parameter α here is a control coefficient in the range [0, 1] used to balance the classification result and the predicted CIoU [38]. S_conf considers the impact of both classification and localization on the inference results and reflects both the category and location information of the object, providing a more accurate detection confidence that meets the requirements of the object detection task. S_conf is used in NMS and can lower the ranking of detections with a high classification score but poor localization by altering the influence of classification and localization on the score value.
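Reading ''adding the classification confidence to the predicted CIoU, balanced by α'' as a weighted sum (the exact combination used in the paper may differ), a toy sketch of the re-scoring is:

```python
def s_conf(cls_score, ciou_pred, alpha=0.5):
    """Final NMS score: weighted sum of classification and predicted CIoU."""
    return alpha * cls_score + (1 - alpha) * ciou_pred

# Box A: confident classification but poor localization.
# Box B: modest classification but accurate localization.
a = s_conf(0.95, 0.40)
b = s_conf(0.70, 0.90)
print(a < b)  # True: B now outranks A, unlike classification-only ranking
```

With alpha = 1 the score degenerates to the classification confidence alone, matching the ablation discussed later.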
In the proposed method, we used CIoU for bounding box regression and a CIoU-aware head. CIoU solves the problem in which it is not possible to directly optimize the parts in which the bounding box and ground truth do not overlap. The distance between the two boxes, overlap rate, scale, and penalty terms are all considered, making the bounding box regression more stable as a result. This can also prevent divergence during training. The loss function of CIoU adds the impact term βv to the loss function of the distance-IoU (DIoU) [38], which considers the length-to-width ratio between the predicted and ground-truth boxes.
The CIoU is defined as:

CIoU = IoU − ρ²(b, b^gt)/c² − βv

v = (4/π²)(arctan(w^gt/h^gt) − arctan(w/h))², β = v/((1 − IoU) + v)

where β is a trade-off parameter and v is a parameter used to measure the consistency of the aspect ratio. Furthermore, ρ(·) is the distance between the central points of the two boxes, and c is the diagonal length of the smallest enclosing box that covers the two boxes. The loss functions of the proposed model are as follows:

L_total = L_cls + L_loc + L_iou (11)

The total loss function of the proposed model comprises the following three parts: the first is the classification loss L_cls, which uses the focal loss (FL); L_iou is part of the CIoU-aware head and uses the binary cross-entropy loss; and L_loc is part of the bounding box regression. L_loc and L_iou are only computed for positive examples.
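A minimal pure-Python sketch of the resulting CIoU loss, following the standard definition from [38] with corner-format boxes and a small epsilon added to stabilize β (an implementation detail not specified here), is:

```python
import math

def ciou_loss(box, gt):
    """L_CIoU = 1 - IoU + rho^2/c^2 + beta*v, boxes as (x1, y1, x2, y2)."""
    # Intersection over union.
    ix1, iy1 = max(box[0], gt[0]), max(box[1], gt[1])
    ix2, iy2 = min(box[2], gt[2]), min(box[3], gt[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    iou = inter / (area(box) + area(gt) - inter)
    # Squared centre distance over squared enclosing-box diagonal.
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    gx, gy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    rho2 = (cx - gx) ** 2 + (cy - gy) ** 2
    ex1, ey1 = min(box[0], gt[0]), min(box[1], gt[1])
    ex2, ey2 = max(box[2], gt[2]), max(box[3], gt[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    # Aspect-ratio consistency term v and its trade-off weight beta.
    w, h = box[2] - box[0], box[3] - box[1]
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(w / h)) ** 2
    beta = v / ((1 - iou) + v + 1e-9)  # epsilon avoids 0/0 at perfect overlap
    return 1 - iou + rho2 / c2 + beta * v

print(ciou_loss((0, 0, 10, 10), (0, 0, 10, 10)))  # 0.0 for a perfect match
```

Unlike plain IoU loss, a non-overlapping prediction still receives a useful gradient from the centre-distance term.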

D. DATA AUGMENTATION
Deep convolutional neural networks have been successfully applied in the field of computer vision. This type of network is data-driven and requires a large quantity of training data. As the depth of network architecture increases, an increasing number of parameters must be learned. In the proposed method, traditional global pixel augmentation methods (e.g., random scaling, cropping, translation, shearing, and rotation) were used to enhance data diversity. We also used data augmentation methods that have been proposed in recent years, as illustrated in Fig. 6; for example, MixUp [39] mixes two random samples proportionally, and the classification results are then distributed proportionally; Cutout [40] replaces the sampled regions at random with zero-pixel values, and the ground truth remains unchanged; CutMix [41] fills parts of other images from the training dataset into the sample, and the ground truth is then distributed with a certain proportionality. This can improve the robustness of the model without incurring additional cost.
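As an illustration of MixUp, the following sketch mixes two toy one-dimensional ''images'' and produces a proportionally weighted soft label; real detection pipelines also mix the bounding box annotations, which is omitted here, and the Beta(1.5, 1.5) sampling is a common choice rather than this paper's setting:

```python
import random

def mixup(img_a, img_b, label_a, label_b, lam=None):
    """Mix two samples pixel-wise; labels share the same proportion."""
    if lam is None:
        lam = random.betavariate(1.5, 1.5)  # common choice, not from the paper
    mixed = [lam * a + (1 - lam) * b for a, b in zip(img_a, img_b)]
    # Soft label: each class weighted by its sample's share of the pixels.
    label = {label_a: lam, label_b: 1 - lam}
    return mixed, label

img, lab = mixup([0.0, 1.0, 1.0], [1.0, 1.0, 0.0], "car", "sign", lam=0.75)
print(img, lab)  # [0.25, 1.0, 0.75] {'car': 0.75, 'sign': 0.25}
```

Cutout and CutMix follow the same pattern but operate on rectangular regions instead of whole-image blending.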

IV. EXPERIMENTS

A. DATASET AND EVALUATION METRICS
The common objects in context (COCO) [42] dataset was used to evaluate the generalization ability of the model, and the BCTSDB [43] and KITTI [44] datasets were used to test the model's detection ability in traffic scenarios. BCTSDB is a traffic sign dataset that includes 15,690 images and 25,243 annotations, with 14,121 training images and 1,569 test images, and has labels divided into three categories: prohibitory, mandatory, and warning. The KITTI dataset includes three categories, comprising vehicles, pedestrians, and cyclists, and consists of 7,481 training images and 7,518 test images, with 80,256 labeled objects in total. In this study, all of our experiments followed the COCO format. The training set was randomly selected from the dataset and the remainder was used as the test set. The training set was used to train the model and the test set was used to test the model performance. The final experimental results were obtained by repeating this operation three times and averaging the results.
The experiments used the average precision (AP) to compare the different models and their respective accuracies, including AP (averaged over IoU = 0.50 to 0.95), AP50 (IoU = 0.50), AP75 (IoU = 0.75), AP_L (large, area > 96²), AP_M (medium, 32² < area < 96²), and AP_S (small, area < 32²), following the COCO evaluation format. Both recall and precision are considered during the calculation of the AP, which takes the average value of the precision at each recall point over a range from 0 to 1. Precision is the proportion of detections that correctly match an object, and recall is the proportion of labeled objects in the image that are detected correctly. AP to AP75 consider accuracy from the perspective of IoU, and AP_S to AP_L evaluate model performance across the scale diversity of objects.
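The AP computation described above can be sketched as follows, using all-point interpolation of the precision-recall curve (COCO actually samples 101 recall points, but the idea is the same):

```python
def average_precision(scored, num_gt):
    """AP: area under the precision-recall curve (all-point interpolation).
    `scored` is a list of (confidence, is_true_positive) detections."""
    scored = sorted(scored, reverse=True)  # rank detections by confidence
    tp = 0
    points = []
    for i, (_, hit) in enumerate(scored, start=1):
        tp += hit
        points.append((tp / num_gt, tp / i))  # (recall, precision)
    ap, prev_r = 0.0, 0.0
    for r, p in points:
        # Interpolated precision: best precision at any recall >= r.
        best_p = max(pp for rr, pp in points if rr >= r)
        ap += (r - prev_r) * best_p
        prev_r = r
    return ap

dets = [(0.9, True), (0.8, False), (0.7, True), (0.6, True)]
print(round(average_precision(dets, num_gt=4), 3))  # 0.625
```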
When compared with the original convolution, the theoretical speed-up ratio of the Ghost module is given by

r_s = (n · h · w · c · k²) / ((n/s) · h · w · c · k² + (s − 1) · (n/s) · h · w · d²) ≈ (s · c)/(s + c − 1) ≈ s

where d × d has a similar magnitude to k × k, and s ≪ c. Here, k × k represents the convolutional kernel size, h and w give the height and width of the output data, respectively, d × d represents the linear operation kernel size, c is the number of input channels, and n is the channel number of the output feature maps. The Ghost module has one identity mapping and m · (s − 1) = (n/s) · (s − 1) linear operations. In this study, we used d = 3 and s = 2 in the following experiments for both effectiveness and efficiency.
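With d = k = 3 and s = 2, the theoretical speed-up of the Ghost module over a standard convolution is close to s · c/(s + c − 1), i.e., roughly s; the following sketch checks this numerically (illustrative layer sizes only):

```python
def ghost_speedup(n, c, h, w, k, d, s):
    """Ratio of standard-conv FLOPs to Ghost-module FLOPs."""
    standard = n * h * w * c * k * k
    primary = (n // s) * h * w * c * k * k       # ordinary conv for n/s maps
    cheap = (s - 1) * (n // s) * h * w * d * d   # linear ops for the rest
    return standard / (primary + cheap)

r = ghost_speedup(n=64, c=64, h=80, w=80, k=3, d=3, s=2)
print(round(r, 2))  # 1.97, i.e. s*c/(s + c - 1) = 128/65, roughly s = 2
```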

B. UNIT VALIDATION EXPERIMENT
The experimental parameters are presented in this section. All experiments in this study used the same computational hardware to demonstrate the performance of the proposed method. The computer was configured with two NVIDIA TITAN V graphics cards, with a total of 24 GB of video random access memory (VRAM). The network structures were implemented using PyTorch. The default hyper-parameters used in the proposed method and other SOTA object detection methods are the same as those used in MMDetection [45]. The input images were resized to a maximum of 640 × 640 pixels without changing the aspect ratio. The backbone networks of the different methods were pre-trained using the ImageNet dataset. Other settings for all experiments were consistent with MMDetection unless otherwise specified. For the proposed method, the initial learning rate was set at 2.5 × 10^-2, and the warm-up ratio was set at 0.1.
To test the effectiveness of GhostNet, we replaced Darknet in YOLOv3 with GhostNet and compared the results obtained with those from the original YOLOv3 and YOLOv3-SPP. As shown in Table 2, the experimental results proved that GhostNet can provide significant improvements in terms of the number of parameters required, computational complexity, and accuracy.
The results of the comparison of the different models on the COCO-val dataset are shown in Table 3. These results show that the CIoU-aware approach with a decoupled head can improve the correlation between localization and classification, and can thus effectively improve the detection accuracy.
The score for each anchor was calculated using S conf . The hyper-parameter α is used to balance the effects of classification and regression. We evaluated different α values on the COCO-val dataset. Experiments showed that the best performance was obtained at α = 0.5. In particular, when α = 1, this means that the classification alone is used for the confidence calculation, whereas the influence of the bounding boxes is not considered. The experimental results in Table 4 show that the performance at α = 1 is not as high as that when α takes other values, indicating the effectiveness of the CIoU-aware method.
The results in Table 5 show that the proposed Ghost attention bottleneck can achieve a better performance than the other models in the ImageNet dataset. The visualization results for our proposed model (Ghost attention bottleneck) with a baseline (GhostNet) are illustrated in Fig. 7. The first row shows the original images of traffic signs in the BCTSDB dataset. The second row shows the visualization results for the baseline, and the third row shows the visualization results for the proposed model. The figure clearly shows that the Ghost attention bottleneck can cover the object region to be detected and provide a better performance than the baseline model.
Ablation experiments were performed to verify the effectiveness of the proposed modules. The CIoU-aware decoupled head, data augmentation, and anchor-free model were gradually added to the YOLOv3-GhostNet 1.1x baseline. The same parameters and training schemes were used in each ablation experiment. The ablation results for the COCO dataset are listed in Table 6. Owing to factors such as density and small objects in single images in the dataset, most detection algorithms achieve low accuracy on the COCO dataset. In the table, AR is the average recall because we used the COCO format to evaluate the different methods. AR_max=10 means the AR given 10 detections per image, and AR_medium means the AR for medium objects (32² < area < 96²). As can be seen from the results listed in the table, the CIoU-aware decoupled head, anchor-free model, and data augmentation components improved the AP by 2.1%, 0.6%, and 2.4%; AR_max=10 by 3.9%, 4.5%, and 1.8%; and AR_medium by 2.9%, 6.9%, and 1.8%, respectively.

C. OVERALL VERIFICATION EXPERIMENT
To verify the generalization of the model, we compared the performance of the proposed model with those of other models on the COCO-val dataset, as presented in Table 7, which lists the detection results based on the YOLO series. The experiments demonstrated that our model remains competitive on common datasets. Compared with YOLOv3-MobileNetV2, our method improves the detection accuracy by 3.1% while increasing the detection speed because of fewer computations. Although our method is 3.5% lower than YOLOv5-s in terms of accuracy (mean AP or mAP), it offers advantages in terms of both parameters and computation. Specifically, the computational cost of our method is one-third of that of YOLOv5-s, which makes the proposed method more competitive in lightweight object detection applications.
The results of the comparisons between the performances of the different methods are presented in Table 8, which lists the detection results for the multi-stage algorithms (faster R-CNN, cascade R-CNN) and the single-stage algorithms (RetinaNet, FCOS, YOLOv5, YOLOv3, YOLOX). Our method achieved high detection accuracy with fewer parameters, which yielded more competitive results, with AP, AP50, and AP75 values of 78.6%, 98.7%, and 94.5%, respectively. A comparison with generic multi-stage networks and single-stage networks shows that our method has the advantages of low computational requirements and a low number of parameters, which allow it to overcome the memory limitations in the autonomous driving field. Compared with some lightweight networks, including the original YOLOX-s, YOLOv3, and YOLOX with the MobileNetV2 lightweight backbone network, our method obtains higher detection accuracy. It is evident from the results that our method can achieve a performance comparable to that of YOLOv5 with a low number of parameters and computations.
The mAP curves for the different methods when applied to the BCTSDB dataset are shown in Fig. 8, demonstrating that the proposed model converges more rapidly and has a higher AP value.
We also evaluated the proposed method using the KITTI dataset. As shown in Table 9, the detection accuracy of the proposed method is significantly improved compared to that of the existing algorithms. The size of the images used in this part was the same as that of the original dataset. Our method also demonstrated the shortest processing time while maintaining high detection accuracy. Figs. 9 and 10 show the detection results obtained for the KITTI and BCTSDB datasets, respectively. The results clearly show that our method can effectively detect objects in traffic scenes.

D. DISCUSSION
In AVs, real-time performance and accuracy are two important performance indicators. They ensure that AVs detect objects quickly and accurately, making autonomous driving safe. The main purpose of the proposed method is to improve the model in these two aspects. In the unit validation section, we verified the effectiveness of the proposed modules: GhostNet with attention, the CIoU-aware decoupled head, the anchor-free design, and data augmentation. The overall verification section compared the proposed method with lightweight detection methods on public datasets. The results show that with only a slight change in the number of parameters, our proposed method improves the two important indicators of detection accuracy and real-time performance, which can promote the reliability of vision-based object detection algorithms in autonomous driving systems. Although the proposed method improves upon other existing methods, it still faces many problems. For example, the verification of the current algorithm is based on public datasets, whereas in actual traffic scenarios, the influence of weather, illumination, and other factors reduces the generalization ability of the detection model. Moreover, a fully automated environment will take years to arrive; in the current mixed traffic scenes of AVs and human-driven vehicles, the relationship between visual perception objects among multiple vehicles remains a challenging problem.

V. CONCLUSION
In this study, we propose an anchor-free lightweight object detector for autonomous driving applications. The detector can achieve a high detection accuracy and trade-off with a small-sized model. The approach incorporates an attention scheme in a lightweight neural network called GhostNet and adds an IoU branch to the anchor-free decoupled detector to rank the large number of candidate detections accurately. Data augmentation is used to enhance the robustness of the detection model in real-world scenarios. Extensive experiments on COCO, KITTI, and BCTSDB datasets verified the effectiveness of the proposed algorithm.
Furthermore, the proposed method achieved high detection accuracy using a small-sized detection model; however, when applied to real traffic scenarios, the interference present in complex real-world scenes is not yet considered. In future work, ALODAD will be improved through specific data augmentation methods or domain adaptation techniques.