SWDet: Anchor-Based Object Detector for Solid Waste Detection in Aerial Images

Waste pollution is one of the most serious environmental issues in the world. Efficient detection of solid waste (SW) in aerial images can improve subsequent waste classification and automatic sorting on the ground. However, traditional methods suffer from poor generalization and limited detection performance. This article presents an anchor-based object detector for solid waste in aerial images (SWDet). Specifically, we construct an asymmetric deep aggregation (ADA) network with structurally reparameterized asymmetric blocks to extract waste features with inconspicuous appearance. In addition, considering that waste boundaries are often blurred by the limited resolution of aerial images, we construct an efficient attention fusion pyramid network (EAFPN) that obtains contextual information and multiscale geospatial information via attention fusion, allowing the model to capture the scattering features of irregularly shaped waste. Furthermore, we construct a dataset for solid waste aerial detection (SWAD) by collecting aerial images of SW in Henan Province, China, to validate the effectiveness of our method. Experimental results show that SWDet outperforms most existing methods for SW detection in aerial images.

I. INTRODUCTION

According to statistics, every 10 000 square meters of new construction generates 500 to 600 tons of waste, while every 10 000 square meters of demolition generates 7000 to 13 000 tons of waste. SW causes serious environmental pollution, health issues, and social impacts [1], [2]. Therefore, it is necessary to detect SW and implement appropriate waste treatment to improve quality of life.
In traditional methods, manual feature design and data-driven methods are commonly used to obtain SW location information. For example, Huang et al. [3] used a threshold segmentation algorithm based on hue, saturation, value (HSV) and a k-means clustering algorithm to collect and process construction waste images in real time. However, due to the weak expression capability of the extracted color features, detection performance is poor. Martin et al. [4] extracted waste features using histograms of oriented gradients and interpreted waste in images using a random forest algorithm. Guo et al. [5] used an adaptive SURF feature extraction algorithm and a geometric hash model matching algorithm to achieve fast municipal solid waste matching. However, machine learning methods are influenced by manually selected feature variables, so they are only suitable for specific scenarios, and manually designed features are usually only effective for a specific type of waste. It is extremely difficult to design features with high robustness and generalization.
On the other hand, with the advancement of convolutional neural networks (CNNs) [6], [7], the application of CNNs to aerial images has become a hot research topic. For example, Tirandaz et al. [8] proposed a segmentation method for polarimetric synthetic aperture radar (PolSAR) images, using unsupervised spectral regression for feature learning. He et al. [9] improved a deep learning-based rotated ship detection method in SAR images by learning a polar coordinate encoding to solve the boundary discontinuity problem in rotated box regression. To address the issue of low intersection over union (IoU) between horizontal and ground truth boxes, Cheng et al. [10] proposed an anchor-free oriented proposal generator to generate high-quality oriented boxes. Wang et al. [11] extracted object features using deep layer aggregation (DLA) [12] with deformable convolution as the backbone, then embedded dynamic gradient adjustment in the loss function to balance the number of positive and negative samples in remote sensing images. Cheng et al. [13] proposed the dual-aligned oriented detector (DODet) to solve the feature misalignment between classification and localization in remote sensing images, thereby improving detection performance on oriented dense objects. Ming et al. [14] proposed sparse label assignment to improve training on densely arranged samples, and a feature pyramid network (FPN) with coordinate attention [15] was used for accurate positioning. Cheng et al. [16] proposed discriminative CNNs (D-CNNs) to solve the problems of interclass similarity and intraclass diversity in remote sensing image scene classification.
However, different from common remote sensing objects, SW detection difficulties in remote sensing images are typically as follows: 1) SW is often in complex environments with similar feature information, making it difficult for the network to extract inconspicuous waste features; 2) waste often has an irregular shape, fragmented and scattering features, making it difficult for the network to capture its location information.
Deep learning can automatically learn features from large amounts of images, and it has superior feature extraction and representation capabilities compared with traditional methods. To address the aforementioned difficulties, researchers are gradually replacing traditional methods with deep learning algorithms [17], [18], [19]. For example, Chen et al. [20] proposed a low-altitude remote sensing video real-time detection method for scattered waste areas, then used YOLOv4 to conduct unmanned aerial vehicle (UAV) remote sensing spatial positioning of those areas. Youme et al. [21] proposed a method for the automatic detection of waste dumping sites, using a single-shot multibox detector (SSD) to extract features from images captured by UAV; the images were taken in a small coastal region of Senegal. Abdukhamet et al. [22] improved RetinaNet with DenseNet and applied it to satellite images of Qingpu District, Shanghai, China, achieving an accuracy of 84.7% in landfill detection. Wang et al. [23] proposed SRAF-Net, which consists of feature extraction, multitask detection, and postprocessing to obtain feature information of garbage dumps with inconspicuous appearance. To validate the effectiveness of the method, a new public garbage dumps dataset was constructed using field survey results from the Ministry of Housing and Urban-Rural Development of China. Anjum et al. [24] proposed a CNN-based garbage detection and localization system, which uses unsupervised learning methods to train on images labeled as waste and non-waste; its performance was evaluated with pixel-level accuracy and qualitatively assessed by human experts. Peter et al. [25] designed a CNN to classify construction waste images and identified seven types of SW with 94% accuracy. However, detecting SW in aerial images still remains a challenging task.
Therefore, the solid waste detector (SWDet) is introduced in this article, formulating SW detection as remote sensing image object detection so that waste detection can be completed more efficiently. First, we build ADA using asymmetric blocks (ABs). DLA iteratively and hierarchically aggregates previously aggregated features, so that the backbone can better integrate semantic and spatial information across layers. AB improves the representation capability of standard convolution kernels by fusing convolution branches in various directions, and then uses different convolution kernels to obtain different receptive fields. Embedding ABs allows DLA to better distinguish inconspicuous waste from feature information in complex backgrounds. Second, we construct EAFPN based on PANet, using attention fusion to address the irregular shapes and blurred boundaries of waste. Attention fusion first fuses multiscale features containing context information, then uses enhanced receptive field attention (ERFA) to accurately locate waste with fragmented and scattering features.
In summary, the contribution of this work can be summarized as follows.
1) Asymmetric deep aggregation (ADA) is used as the backbone to extract waste features with inconspicuous appearance, increasing the accuracy and efficiency of waste detection in high-resolution aerial images.
2) SWDet uses the efficient attention fusion pyramid network (EAFPN) based on attention fusion to capture the scattering features and multiscale geospatial information of irregular waste, thereby addressing the object boundary ambiguity caused by limited resolution.
3) Different from most waste datasets based on short-distance shooting, we construct a dataset for solid waste aerial detection (SWAD) that contains a large number of images taken from a bird's-eye view. Experimental results demonstrate that SWDet is effective for detecting SW in aerial images.

The rest of this article is organized as follows: Section II reviews related methods. Section III describes the proposed method in detail. Section IV shows experiments and results. Section V discusses relevant issues. Finally, Section VI concludes this article.

II. RELATED WORK

A. CNN-Based Object Detection
In recent years, CNNs have shown increasingly strong representation ability, which underpins their performance on object detection. Most CNN-based detection methods can be divided into two-stage and one-stage. In general, two-stage methods first extract regions of interest (RoIs) from the input image before performing bounding box classification and regression on the RoIs. In 2013, R-CNN [26] used selective search to extract candidate boxes and finally used an SVM for classification. In 2015, Fast R-CNN [27] used specific rectangular boxes to segment feature maps in order to obtain features from various regions. In 2017, Faster R-CNN [28] combined a region proposal network (RPN) with Fast R-CNN, allowing the network to detect objects end-to-end. On the other hand, one-stage methods regard object detection as regression and perform classification and localization at the same time. In 2016, Redmon proposed YOLO [29], which discards the step of generating candidate boxes and predicts on the image directly. Although its accuracy is not as high as Faster R-CNN, computation and time are significantly reduced. In the same year, SSD [30] removed the RPN, combined the regression of YOLO with the candidate box mechanism of Faster R-CNN, and used VGG-16 as the feature extraction network to directly predict classification scores and bounding box regression offsets. In 2017, YOLOv2 [31] achieved fast detection speed and high accuracy through multiscale training and hierarchical classification. In 2018, YOLOv3 [32] used Darknet-53 for feature extraction and an FPN to detect on feature maps of different sizes. In 2020, YOLOv4 [33] adopted CSPDarknet-53 for feature extraction and combined multiple optimization methods to improve accuracy. In the same year, YOLOv5 [34] adopted hyperparameters to balance model size and detection speed, spanning various models such as YOLOv5n, YOLOv5s, and YOLOv5m. Among them, YOLOv5s maintains high detection accuracy while remaining fast.
In this article, we use YOLOv5s as the baseline for subsequent work. Table I lists several existing CNN-based waste datasets. It can be seen that most datasets are based on short-distance shooting. The dataset for SW detection in aerial images, which can be used for detection or classification and contains annotated files, is an almost unexplored area. In this article, we construct the dataset for solid waste aerial detection (SWAD) (illustrated in Section IV-A2).

III. PROPOSED METHOD

A. Network Structure
The proposed SWDet for SW detection in aerial images is shown in Fig. 1, which is mainly composed of three parts: ADA, EAFPN, and the detection head. After ADA, feature maps C3 ∈ R 128×32×32, C4 ∈ R 256×16×16, and C5 ∈ R 512×8×8 are output. Then, since feature maps C1 and C2 lack sufficient deep semantic information, we select feature maps C3, C4, and C5 to send to EAFPN for feature fusion. Finally, the output feature map sizes are 128 × 32 × 32, 256 × 16 × 16, and 512 × 8 × 8 to predict targets of different sizes.
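Assuming a 256 × 256 input and the usual pyramid strides of 8, 16, and 32 (an assumption; the text only quotes the C3-C5 shapes), the channel counts and spatial sizes above follow directly. A quick sanity-check sketch:

```python
# Sketch of the SWDet feature-map shape flow. The 256x256 input size and the
# strides (8, 16, 32) are assumptions inferred from the C3/C4/C5 shapes
# quoted in the text.

def pyramid_shapes(in_hw=256, channels=(128, 256, 512), strides=(8, 16, 32)):
    """Return (C, H, W) for each pyramid level C3..C5."""
    return [(c, in_hw // s, in_hw // s) for c, s in zip(channels, strides)]

print(pyramid_shapes())  # [(128, 32, 32), (256, 16, 16), (512, 8, 8)]
```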

B. Asymmetric Deep Aggregation
SW in aerial images usually has an inconspicuous appearance, making it difficult for the network to distinguish detected objects from complex backgrounds and resulting in poor detection performance. Many researchers have focused on designing deeper or wider networks to improve accuracy, such as ResNet [47] with increased depth and ResNeXt [48] with increased width. Although most of these methods improve accuracy, the capability to extract SW with inconspicuous appearance still needs to be strengthened. As illustrated in Fig. 1, we use ADA as the backbone to efficiently aggregate feature information. ADA divides the network into multiple stages according to feature resolution, and each stage continuously performs iterative fusion from shallow to deep layers to obtain rich feature combinations at different levels. ADA has two architectures: iterative deep aggregation (IDA) and hierarchical deep aggregation (HDA).

1) Iterative Deep Aggregation:
The deeper the feature map, the more semantic information it contains, but the more easily detailed information is lost. Different from traditional skip connections, IDA focuses on fusing current feature maps to improve the capability of the model to judge what an object is at the semantic level, as shown by the blue arrows in Figs. 1 and 2. IDA iteratively aggregates shallow feature information to form deeper feature maps with richer semantic information. The aggregation function of IDA can be expressed as

I(x_1, ..., x_n) = N(I(x_1, ..., x_{n-1}), x_n), N(x_1, ..., x_n) = δ(BN(Σ_i W_i x_i + b))

where I is the IDA function, x_1, ..., x_n is a series of feature layers, N is the aggregation node, δ is the nonlinear activation function, BN is batch normalization, and W_i and b are the convolution weights. Specifically, when n = 3, I(x_1, x_2, x_3) = N(N(x_1, x_2), x_3); that is, the result of IDA is equal to the result of the aggregation node aggregating x_1 with x_2, then with x_3.

2) Hierarchical Deep Aggregation:
Although IDA has some aggregation effects, feature aggregation between modules is still insufficient. HDA is committed to merging feature information between blocks to improve the capability of the model to judge where an object is at the spatial level, as shown by the yellow dashed box in Fig. 2. HDA takes the output of the current aggregation node as the input of the next subtree, then propagates previous gradient information to the next layer, allowing HDA to learn feature combinations across multilayer structures. The aggregation function of HDA can be expressed as

H_n(x) = N(R^n_{n-1}(x), ..., R^n_1(x), L^n_1(x), L^n_2(x))

where H_n is the HDA function, n is the depth, N is the aggregation node, R and L denote subtree and leaf outputs, and AB is the asymmetric block (illustrated in Section III-B3). Specifically, when n = 3, H_3(x) = N(R^3_2(x), R^3_1(x), L^3_1(x), L^3_2(x)); that is, when the depth is 3, the result of HDA is equal to the result of the aggregation node aggregating R^3_2(x), R^3_1(x), L^3_1(x), and L^3_2(x).
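A much-simplified recursive sketch of the depth-n tree is given below; the real HDA also reroutes intermediate aggregations into later subtrees, which we omit, and `block` and `node` are placeholders for the AB and the learned aggregation node:

```python
def hda(n, x, block, node):
    """Simplified depth-n hierarchical aggregation sketch:
    at depth 1, two chained blocks (leaves) are aggregated;
    at depth n, the left subtree's output feeds the right subtree,
    and the node aggregates both subtree outputs."""
    if n == 1:
        l1 = block(x)          # leaf L1
        l2 = block(l1)         # leaf L2 fed by L1
        return node([l1, l2])
    left = hda(n - 1, x, block, node)
    right = hda(n - 1, left, block, node)  # left output feeds next subtree
    return node([left, right])

# toy demo: block adds 1, node sums its inputs
assert hda(1, 0, lambda v: v + 1, sum) == 3   # node([1, 2])
```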

3) Asymmetric Block:
We improve the representation and feature aggregation capability of the backbone with ABs, which are realized by the structurally reparameterized AC block [49] and the RepVGG training module [50], as illustrated in Fig. 3. The AC block improves the representation of standard convolution kernels by fusing a set of asymmetric convolution branches with square, horizontal, and vertical kernels. The RepVGG training module includes an identity branch, a 3 × 3 branch, and a 1 × 1 branch. Different branches obtain different receptive fields by using different convolution kernels. Specifically, AB can be expressed as

F' = δ(BN(F * W^(3,3)) + BN(F * W^(1,3)) + BN(F * W^(3,1)))
F'' = δ(BN(F' * W^(3,3)) + BN(F' * W^(1,1)) + BN(F'))

where F, F', and F'' are the initial input, intermediate output, and final output, respectively; * is the convolution operation; BN is batch normalization; δ is the nonlinear activation function; W^(i,j) is the kernel of the i × j convolution; and the accumulated mean, standard deviation, learned scaling factor, and bias of the BN layer after each i × j convolution are the quantities folded into that branch at inference time.
Finally, a skip connection is added to effectively preserve the coarse-grained features of the initial input feature map.
The multibranch structure increases the number of gradient flow paths and alleviates the vanishing gradient problem in deep networks. After stacking convolution and normalization, multibranch fusion provides a more robust feature representation and makes the network easier to converge.
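At inference time, the linearity of convolution is what allows the parallel asymmetric branches to collapse into a single 3 × 3 kernel. The single-channel sketch below (BN folding omitted; the helpers `conv2d` and `pad_to_3x3` are ours) checks this equivalence numerically:

```python
import numpy as np

def conv2d(x, k):
    # single-channel 'same'-padded cross-correlation
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def pad_to_3x3(k):
    # embed a 1x3 / 3x1 / 1x1 kernel at the centre of a 3x3 kernel
    out = np.zeros((3, 3))
    kh, kw = k.shape
    r, c = (3 - kh) // 2, (3 - kw) // 2
    out[r:r + kh, c:c + kw] = k
    return out

rng = np.random.default_rng(0)
k33 = rng.normal(size=(3, 3))
k13 = rng.normal(size=(1, 3))
k31 = rng.normal(size=(3, 1))
x = rng.normal(size=(5, 5))

# running three branches separately ...
multi = conv2d(x, k33) + conv2d(x, k13) + conv2d(x, k31)
# ... equals one conv with the branch kernels summed into a single 3x3 kernel
fused = conv2d(x, k33 + pad_to_3x3(k13) + pad_to_3x3(k31))
assert np.allclose(multi, fused)
```

The same argument extends to the BN parameters: because BN at inference is an affine map, each branch's mean, variance, scale, and bias can be folded into its kernel and bias before the kernels are summed.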

C. Efficient Attention Fusion Pyramid Network
In remote sensing images, object boundaries are usually blurred due to limited resolution, making it difficult for detection layers to locate the scattering features of waste. Many researchers build complex paths or fusions by adding and connecting features from different layers. However, most of these only apply a fixed linear aggregation of feature maps without contextual information. We design EAFPN to fuse feature information containing semantic and contextual information, and then strengthen the capability of the detection layer to capture waste scattering features and multiscale geospatial information. EAFPN is shown in Fig. 4, which shows that EAFPN can obtain contextual information by fusing semantically and scale-inconsistent features. However, feature maps of different sizes contribute differently to the output. Therefore, the model assigns a learnable weighted fusion coefficient to each fused feature map. Let F_i ∈ R C×H×W be the feature maps participating in fusion, where C, H, and W are the number of channels, height, and width of the feature map, respectively; then the output F can be expressed as

F = ERFA( Σ_{i=1}^{N} W_i · F_i )

where N is the number of fused feature maps, F_i are the feature maps participating in fusion, W_i is the learnable fusion coefficient, which obtains the best fusion result by adjusting the contributions of F_1 and F_2, and ERFA is the enhanced receptive field attention (illustrated in Section III-C1).
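A minimal sketch of the learnable weighted fusion; the non-negativity clamp and the normalization by the weight sum are our assumptions (BiFPN-style fast normalized fusion), since the text only states that each W_i is learnable:

```python
import numpy as np

def weighted_fuse(feats, w, eps=1e-4):
    """Normalized learnable fusion: sum_i (w_i / (eps + sum_j w_j)) * F_i.
    feats: list of same-shaped arrays; w: list of learnable scalars."""
    w = np.maximum(np.asarray(w, dtype=float), 0.0)  # keep weights non-negative
    coeff = w / (eps + w.sum())
    return sum(ci * f for ci, f in zip(coeff, feats))
```

In the full module, the fused map would then be passed through ERFA, i.e. F = ERFA(weighted_fuse(...)).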

1) Enhanced Receptive Field Attention:
The attention mechanism is designed to extract valuable information from large amounts of input data while ignoring and suppressing irrelevant information. SE [51] is a classic channel attention mechanism whose innovation is to learn the weight of each channel in the feature layers. CBAM [52] adds a spatial attention mechanism on top of SE and aggregates features with global average pooling (GAP) and global max pooling (GMP). However, GAP and GMP are insufficient to capture the multiscale geospatial information of waste with fragmented and scattering features. Therefore, we design ERFA, as illustrated in Fig. 5.
Given an input feature map F ∈ R C×H×W, the direction-aware feature maps F_x ∈ R C×H×1 and F_y ∈ R C×1×W are generated by encoding the channels along the horizontal and vertical directions with two average pooling kernels of size H × 1 and 1 × W, respectively. In addition, since different receptive fields have different effects, two multibranch structures with different receptive fields are also used to introduce spatial information coding, which can be expressed as

F_r = conv_3×3(F_GAP(F)) + conv_5×5(F_GAP(F))

where F_GAP is the GAP, and conv_3×3 and conv_5×5 are two convolution layers with kernel sizes of 3 × 3 and 5 × 5, respectively. Then, F_x and F_y are concatenated and compressed to reduce parameters, resulting in an intermediate feature map f ∈ R (C/r)×1×(H+W), where r is the scaling ratio, set to 32 in this article. After that, f is split into two separate tensors f_x ∈ R (C/r)×H×1 and f_y ∈ R (C/r)×1×W. After 1 × 1 convolutions, the number of channels is restored to that of the input F, generating attention weights in the two directions. Then, the four branches are fused to obtain the fine feature F* ∈ R C×H×W. Finally, a skip connection is added to preserve the coarse-grained features of the initial input, giving the output Y ∈ R C×H×W, which can be expressed as

f = δ(BN(conv_1×1([F_x, F_y])))
F* = F ⊙ σ(conv_1×1(f_x)) ⊙ σ(conv_1×1(f_y)) ⊙ σ(F_r)
Y = F* + F

where [·, ·] is the concatenation operation, conv_1×1 is the 1 × 1 convolution used to reduce (and restore) the number of channels, BN is batch normalization, δ is the nonlinear activation function, and σ is the sigmoid function. ERFA can capture location information and channel relationships to locate targets of interest.
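The directional pooling at the core of ERFA can be sketched as follows; the conv/BN layers, the channel squeeze, and the multibranch receptive field term are omitted, so `erfa_attention` is only a shape-level illustration of the gating and skip connection, not the full module:

```python
import numpy as np

def directional_pool(F):
    # F: (C, H, W) -> F_x: (C, H, 1) via 1 x W average pooling,
    #                 F_y: (C, 1, W) via H x 1 average pooling
    return F.mean(axis=2, keepdims=True), F.mean(axis=1, keepdims=True)

def erfa_attention(F):
    """Minimal sketch: sigmoid-gate the two directional encodings,
    broadcast the product back over (C, H, W), and keep the skip path."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    Fx, Fy = directional_pool(F)
    A = sig(Fx) * sig(Fy)   # broadcasts to (C, H, W)
    return F * A + F        # attention-refined output with skip connection
```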

D. Loss Function
In a waste detection task with detection difficulties such as inconspicuous appearance, irregular shapes, and blurred boundaries, we use a multitask loss function that reduces or enhances the influence of each task in the backpropagation update by assigning different weights, so that the network can learn the parameters with the best performance. The SWDet loss is calculated in three parts [34], which can be expressed as

L = λ_loc Σ_{i=0}^{S²} Σ_{j=0}^{B} I_ij^obj [ (x_i − x̂_i)² + (y_i − ŷ_i)² + (w_i − ŵ_i)² + (h_i − ĥ_i)² ]
  + λ_obj Σ_{i=0}^{S²} Σ_{j=0}^{B} I_ij^obj (C_i − Ĉ_i)² + λ_noobj Σ_{i=0}^{S²} Σ_{j=0}^{B} I_ij^noobj (C_i − Ĉ_i)²
  + λ_cls Σ_{i=0}^{S²} I_ij^obj Σ_{c} (p_i(c) − p̂_i(c))²

where L_loc, L_conf, and L_cls are the location, confidence, and classification losses, respectively; λ_loc, λ_obj, λ_noobj, and λ_cls are the weight coefficients of the different losses, set to 1, 1, 0.5, and 1, respectively; S² is the number of grids of the input image; B is the number of boxes generated by each grid; I_ij^obj, I_ij^noobj ∈ {0, 1} indicate whether an object falls in the j-th box of the i-th grid; x_i, y_i, w_i, h_i, C_i, and p_i(c) are the center coordinates, width, height, confidence, and probability that the i-th ground truth box belongs to class c; the hatted quantities are the corresponding values for the i-th predicted box; and c is the predicted class.

Fig. 6 depicts the overall detection flowchart of SWDet, which is divided into three stages: dataset preparation, training, and inference. First, we collect and annotate the dataset after determining the research area, then use data augmentation to obtain a more robust dataset for training and validation. Second, SWDet is trained and optimized on the TACO and SWAD datasets until the loss converges, generating weights with improved model performance so that it can more effectively extract and express waste feature information. Finally, we use the trained weights to detect and locate SW in aerial images. Non-maximum suppression (NMS) is used to filter redundant bounding boxes to obtain the final detection results.
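The NMS step in the inference stage can be sketched in plain Python; boxes are assumed here to be corner-format (x1, y1, x2, y2) tuples rather than the center format used for annotation:

```python
def iou(a, b):
    # intersection over union of two corner-format boxes (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def nms(boxes, scores, thr=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap
    it by more than thr, repeat. Returns kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= thr]
    return keep
```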

IV. EXPERIMENTS
A. Datasets

1) TACO Dataset: The TACO dataset [41] is a publicly available dataset for detecting and segmenting waste in the wild. The dataset contains 1500 waste images covering 60 categories. We select cigarette and plastic film, which are easily confused with sand and stones and have inconspicuous appearance; clear plastic bottle and drink can, which have irregular shapes; and common household waste such as plastic bottle cap. These five classes, covering 813 images, are tested and named TACO-5. Random data augmentation is applied to the dataset to obtain 1626 images.
2) SWAD Dataset: However, the TACO dataset only contains targets from short-distance shooting. Waste captured from long distances should also be considered to guide pollution prevention on the ground, and remote sensing image datasets of SW are currently lacking. This study takes Henan Province, China, as the research area and collects remote sensing images from Google Earth [53], naming the resulting dataset SWAD. The dataset contains 998 jpg images from the WorldView-2 satellite (Maxar Technologies) and the SPOT satellite (CNES/Airbus), covering urban, village, and mountain scenes and including SW such as gravel, muck, industrial waste, and household waste. The spatial resolution is 1.8 m. Fig. 7 depicts the research area. The dataset is available at www.kaggle.com/shenhaibb/swaddataset.
Furthermore, we invited several experts to manually label SW in the aerial images. Specifically, the Make Sense online image annotation tool [54] is used, and XML annotation files corresponding to the image files are generated. The dataset covers one category. Labels consist of horizontal bounding boxes, and target position coordinates are given by (x, y, w, h), where (x, y) is the coordinate of the target center point, and w and h are the width and height of the target, respectively. We also use a heat map to visualize the sizes and locations of waste in the SWAD dataset, as shown in Fig. 8. It can be seen that the dataset contains targets of various sizes and that the location distribution is uniform. Due to the limited number of acquired images, we perform random data augmentation (illustrated in Fig. 6) on the original dataset to obtain 1996 images, overcoming the shortage of training samples and improving the generalization capability of the model. Table II summarizes the dataset.

B. Experimental Settings

1) Implementation Details: The deep learning framework is PyTorch 1.9.1, the operating system is Ubuntu 20.04.4 LTS, and the CUDA version is 11.4. The hardware platform is an Intel(R) Core(TM) i7-10700K CPU @ 3.80 GHz (single CPU) with an NVIDIA GeForce RTX 3070 graphics card (8 GB memory). Initialization parameter settings are shown in Table III.
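The (x, y, w, h) center-format labels described above are commonly converted to corner format for IoU computation and NMS; a minimal sketch (the helper name is ours):

```python
def center_to_corner(x, y, w, h):
    """Convert a centre-format box (x, y, w, h) to corner format
    (x1, y1, x2, y2), where (x, y) is the box centre."""
    return x - w / 2, y - h / 2, x + w / 2, y + h / 2

# a 4-wide, 2-tall box centred at (5, 5)
assert center_to_corner(5, 5, 4, 2) == (3.0, 4.0, 7.0, 6.0)
```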
2) Evaluation Criteria: We choose precision (P), recall (R), average precision (AP), mean average precision (mAP), and F1 to evaluate the capability of the model to detect SW. The P and R formulas are as follows:

P = TP / (TP + FP), R = TP / (TP + FN)

where TP is true positive, TN is true negative, FP is false positive, and FN is false negative. To jointly consider the influence of P and R, we take the area enclosed by the P-R curve as AP; mAP is the average value of AP across all categories; and F1 balances precision and recall:

AP_n = ∫ P_n(R_n) dR_n, mAP = (1/K) Σ_{n=1}^{K} AP_n, F1 = 2PR / (P + R)

where K is the total number of classes, R_n is the recall of a given category n, and P_n(R_n) is the precision when the recall of class n is R_n.
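These metrics can be sketched directly from their definitions; `ap_from_pr` assumes a monotonically increasing recall list and uses a simple rectangle rule for the area under the P-R curve:

```python
def precision_recall(tp, fp, fn):
    # P = TP / (TP + FP), R = TP / (TP + FN)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def f1(p, r):
    # harmonic mean of precision and recall
    return 2 * p * r / (p + r) if p + r else 0.0

def ap_from_pr(recalls, precisions):
    """Area under the P-R curve by a rectangle rule (recalls must be
    sorted in increasing order)."""
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap
```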

C. Experimental Results
1) Experiments on TACO: We compare SWDet with other object detection algorithms, such as ATSS [55], AutoAssign [56], Fovea [57], PAA [58], VFNet [59], YOLOF [60], and YOLOv5s [34], and conduct experiments on the TACO and SWAD datasets, respectively. Table IV shows that SWDet achieves state-of-the-art results. Specifically, SWDet outperforms ATSS, AutoAssign, Fovea, PAA, VFNet, YOLOF, and YOLOv5s by 12.87%, 5.08%, 6.59%, 8.91%, 4.65%, 8.50%, and 2.45%, respectively. Among them, ATSS performs poorly on classes such as clear plastic bottle and plastic film, resulting in a low mAP of 50.25%: on the one hand, it lacks the capability to extract waste features in complex backgrounds, and on the other hand, it lacks the capability to position waste accurately. In contrast, SWDet can effectively extract inconspicuous waste features, better capture the scattering features of irregularly shaped waste, highlight the foreground, suppress background noise, and improve detection performance. Fig. 9 shows the visualization results on the TACO dataset. It can be seen that waste often interferes with detection due to its varied directions and shapes. Other algorithms, such as AutoAssign and VFNet, produce some false and missed detections. SWDet outperforms them and reduces false and missed detections, for example, on waste with interclass similarity [Fig. 9(a)], waste with scattering features and blurred boundaries [Fig. 9(b)-(c)], and waste that is hard to distinguish in complex backgrounds [Fig. 9(d)-(e)]. This indicates that SWDet detects waste more efficiently and accurately than other methods.
2) Experiments on SWAD: We conduct experiments on the SWAD dataset to validate the detection performance of SWDet on aerial images based on long-distance shooting, as illustrated in Table V. It can be seen that SWDet outperforms the other detection algorithms on AP50 and AP75 and achieves the best detection performance. Other algorithms, such as AutoAssign and Fovea, have only slightly lower detection performance than SWDet, but their inference time is more than four times longer. Compared with YOLOv5s, SWDet improves AP50 and AP75 by 3.25% and 5.12%, respectively. Although the inference time of SWDet increases, it still meets the requirements of real-time detection. SWDet also retains a clear advantage in model parameters, achieving a better balance of accuracy, speed, and parameters; thus, the increased inference time and model parameters are acceptable. Fig. 10 shows the visualization on the SWAD dataset. It can be seen that waste often lies in backgrounds with complex texture and interference factors, as shown in Fig. 10(a). Furthermore, most SW features are fragmented and the boundaries are blurred, making it difficult for the model to capture target location information and encode multiscale spatial information, as shown in Fig. 10(b)-(c). However, SWDet can accurately detect the waste. This is due to the capability of ADA to extract as many inconspicuous waste features as possible; furthermore, EAFPN, with its feature fusion and localization capability, can better capture the scattering features and multiscale geospatial information of targets. We also run detection on images after data augmentation to validate the robustness of the model, as shown in Fig. 10(d)-(e). The results show that SWDet retains its superiority on these augmented images.

D. Ablation Experiments
In ablation experiments, we use YOLOv5s as the baseline to build three models to validate the detection performance of each module, as illustrated in Table VI. The results show that each module can improve overall detection performance.

1) Asymmetric Deep Aggregation:
It is difficult for the original backbone to obtain a good detection effect when extracting inconspicuous waste. However, Table VI shows that model A outperforms the baseline by 1.83% in F1 and 1.92% in mAP, because ADA can efficiently extract insignificant waste feature information.
2) Efficient Attention Fusion Pyramid Network: To validate the effectiveness of EAFPN, we add EAFPN to the baseline. Table VI shows that model B increases mAP by 2.18% and F1 by 1.95%. The main reason is that EAFPN can pay more attention to and locate waste with blurred boundaries, which improves the feature localization capability of the model.

E. Qualitative Analysis
To better explain the detection performance and investigate potential factors influencing the results, we use gradient-weighted class activation mapping (Grad-CAM) to visualize which regions the model attends to.

1) True Positive Analysis: Fig. 11 shows an example of true positive samples. In the first image, the waste appearance features are obscured and the targets are surrounded by chaotic, complex backgrounds; nevertheless, the heat map highlights the detected area. The second image shows waste with blurred boundaries and irregular shape, whose extracted features contain more noise, making detection difficult. However, SWDet still detects the waste boundaries accurately, indicating that SWDet can not only distinguish the feature information of detected objects from complex backgrounds, but also efficiently locate the scattering features of irregularly shaped waste, solving the problem of blurred target boundaries.
2) False Positive Analysis: Fig. 12 shows an example of false positive samples. In the first image, the model marks one false detection; we suspect this is because the image is blurred and its color features resemble white gravel. In the second image, the model marks two false detections; we find that the confidence at these two locations is low, and Grad-CAM does not highlight the region, indicating that the model does not extract much feature information from it. In the future, we will consider improving the spatial resolution of the images and adding more samples whose color features are similar to waste to enhance the robustness of the model.
3) False Negative Analysis: Fig. 13 shows an example of false negative samples. As can be seen from the figure, Grad-CAM does not highlight the missed area. We suspect that the missed detections are caused by the inconspicuous appearance features of the waste and that the extraction capability of ADA is still insufficient, causing the targets to be labeled as negative samples by the model. In future work, we will continue to improve the backbone according to the features of this type of waste.

V. DISCUSSION

A. Feature Map Visualization
We visualize the decision-making process to better explain how the models make decisions, as shown in Fig. 14. The heat map highlights the extraction area, and brighter colors indicate higher activation values in the visualization results. In Fig. 14(a), the target boundaries in the feature map of YOLOv5s are blurred, and the extracted feature information is not obvious. In contrast, the target contours and location information in the feature map of SWDet are brighter and more accurate than those of YOLOv5s [see Fig. 14(b)]. The visualization results show that SWDet can improve waste detection performance in high-resolution aerial images by enriching target location information and weakening background noise.

B. Model Method
Table VII shows the experimental results of our proposed method on YOLOv5 models with different sizes, including YOLOv5n (nano), YOLOv5s (small), and YOLOv5m (medium). It shows that the proposed method improves performance on YOLOv5 models with different sizes. Specifically, mAP has been improved by 4.30%, 3.25%, and 3.41%, respectively, indicating the effectiveness of our proposed method. We also plot the AP-IoU curve and different-metric curve, as illustrated in Fig. 15. It can be seen that after using our method, the models improve in all indicators, indicating that SWDet can effectively improve detection performance.

C. Attention Mechanism
As illustrated in Table VIII, we compare ERFA with other attention mechanisms to demonstrate its superiority. It can be seen that ERFA outperforms SE, CBAM, and CA by 1.24%, 1.83%, and 1.38%, respectively. This is because ERFA decomposes the two-dimensional global pooling operation into two one-dimensional encoding processes, allowing the network to better capture the location information and channel relationships of the waste. Meanwhile, it introduces a multibranch structure with different receptive fields to encode the spatial information of waste, achieving the best performance.

VI. CONCLUSION
In this article, an anchor-based object detector for solid waste in aerial images (SWDet) is proposed to address the shortcomings of traditional methods for SW detection.
First, ADA is used to extract inconspicuous waste features. Second, we construct EAFPN, which captures scattering features to solve the problem of blurred waste boundaries. On the public waste dataset TACO, mAP reaches 63.12%, which is 2.45% higher than the baseline. We also collect and construct SWAD based on long-distance shooting, achieving an mAP of 77.58%, which is 3.25% higher than the baseline. Experimental results show that SWDet can improve waste detection performance in high-resolution aerial images. However, SWDet still exhibits missed and false detections. In the future, we will consider refining categories and designing a network with stronger feature extraction capability. Furthermore, experiments on waste datasets for scene classification and image segmentation are also planned to further investigate the feasibility of CNNs and optimize the detection performance of SW.