Solid Waste Detection in Cities Using Remote Sensing Imagery Based on a Location-Guided Key Point Network With Multiple Enhancements

Solid waste is a widespread problem that is having a negative effect on the global environment. Owing to its capacity for macroscopic observation, remote sensing could be an effective way to detect and monitor solid waste. Solid waste is usually a mixture of various materials, with a randomly scattered distribution, which makes precise detection very difficult. In this article, we propose a deep learning network for solid waste detection in urban areas, aiming to realize the fast and automatic extraction of solid waste from the complicated and large-scale urban background. A novel dataset for solid waste detection (the DSWD dataset) was constructed by collecting 3192 images from Google Earth (with a resolution from 0.13 to 0.52 m), and a location-guided key point network with multiple enhancements (LKN-ME) is then proposed to perform the urban solid waste detection task. The LKN-ME method uses corner pooling and central convolution to capture the key points of an object. The location guidance is realized by constraining the key point locations to lie within the annotated bounding box of an object. Multiple enhancements, including data mosaicing, an attention enhancement, and path aggregation, are integrated to improve the detection accuracy. The results show that the LKN-ME method can achieve a state-of-the-art AR$_{100}$ (the average recall computed over 100 detections per image) of 71.8% and an average precision of 44.0% on the DSWD dataset, outperforming the classic object detection methods in solving the solid waste detection problem.


I. INTRODUCTION
WITH the development of urbanization, solid waste pollution has become a significant environmental problem, and has even been considered as an "icon of the Anthropocene" that cannot be ignored. Solid waste refers to those things that have lost their use value or have been abandoned by humans during the process of industrial production, daily life, or other activities. Unordered and exposed piles of solid waste can lead to material deterioration, methane and carbon dioxide emissions, and leachate land contamination, threatening the living environment of humans [1], [2]. Therefore, the monitoring and management of solid waste is essential for cities, where more than half of the world's population live [3]. In Europe, the extremely rapid growth of industrial activity in the twentieth century has resulted in a dramatic increase in solid waste volume, and has led to the creation of numerous landfill sites [4], [5].
Along with the progress of human society, many countries and organizations have tried to establish sustainable systems to prevent or reduce the adverse effects of waste processing and disposal on the environment [1]. The United Nations made 17 sustainable development goals to promote prosperity while also protecting the planet. The achievements of many goals, such as clean water and sanitation (goal 6), sustainable cities and communities (goal 11), and responsible consumption and production (goal 12), require the support of rational and tight management of solid waste. Nowadays, both developed and developing countries are experiencing rapid changes in every aspect, including industrial production and social and economic development, during which various amounts of solid waste are being constantly produced. There is therefore an urgent need to search for and locate urban solid waste, to realize appropriate environmentally friendly management [1], [6]. However, the distribution of solid waste is usually random and scattered in cities, causing great difficulties for its detection and location. An efficient and accurate solid waste detection method will be very important for the management of the urban environment. Remote sensing data are widely used for the monitoring and planning of the urban environment, which can be attributed to their advantages of the large scale and the continuous periodic observation. Remote sensing data have also been used in waste monitoring and management since the 1990s, when aerial remote sensing technology was developed and large-scale photographs of land surfaces were able to be captured. It has been verified that the use of remote sensing data is a very effective and economical way to detect urban waste, as long as the spatial resolution of the data is high enough. A number of studies have been conducted to identify solid waste landfills and the areas around them using aerial remote sensing images [1], [4], [7]. 
In the beginning, simple visual interpretation was used to identify the solid waste landfills. Subsequently, pattern recognition methods based on spectral signatures were proposed to identify the polluted vegetation areas around the solid waste landfills and evaluate the environmental effects [1], [4], [7]. It can be seen that these previous works have mainly focused on the identification of waste landfills with large areas by the use of visual or simple machine interpretation methods, limited by the data resolution and interpretation ability. Scattered solid waste is rarely discussed, even though this waste occupies available land and poses a threat to the environment. Currently, with the support of high-resolution remote sensing imagery and advanced pattern recognition technology, it should be possible to carry out automatic and intelligent solid waste detection in cities. Therefore, in this article, our aim is to develop an automatic object detection network for solid waste in cities based on deep learning with high-resolution remote sensing images.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Deep learning [8] is a new branch of machine learning, which has shown strong information processing capabilities in computer vision, speech recognition, artificial intelligence and many other fields [9]. It has also been widely used to solve the problems, such as classification [10], [11], [12], segmentation [13], [14], object detection [15], [16], [17], [18], change detection [19], etc., obtaining state-of-the-art results. Solid waste can be considered as a kind of ensemble object in urban areas, with typical spectral and spatial characteristics in remote sensing images, so that it is possible to realize detection by constructing an object detection network. We discovered two distinguishable characteristics of solid waste after interpreting many remote sensing images: solid waste is usually a mixture of multiple substances with varying spectral features; and solid waste is usually dumped in piles with blurry boundaries. In order to fully understand the features of solid waste, we built a novel dataset for solid waste detection (the DSWD dataset) by collecting 3192 high-resolution remote sensing images from Google Earth. Multiple types of solid waste with varying spatial and spectral features are included in the DSWD dataset, based on which a network can be trained to learn the features of solid waste in both shallow and deep layers. A location-guided key point network with multiple enhancements (LKN-ME) is also proposed to perform the task of urban solid waste detection. Multiple enhancements, including data enhancement, feature enhancement, and path enhancement, are also integrated to ensure the detection accuracy of the network.
The main contributions of this article can be summarized as follows.
1) A remote sensing image dataset comprising two categories of solid waste (i.e., black waste and white waste) is constructed, in which the scenes are more complex and more targets are included than in the existing dataset.
2) A new key point network with multiple enhancements, named LKN-ME, is proposed for solid waste detection, and it achieves a higher detection accuracy than six state-of-the-art one-stage and two-stage networks.
The rest of this article is organized as follows. Section II reviews the deep learning-based object detection networks and datasets. The details of the DSWD dataset are provided in Section III, and the proposed LKN-ME method is described in Section IV. Section V reports the details of the experiments, the comparison with the existing methods, and the results of the ablation experiments. Finally, the article is concluded in Section VI.

A. Deep Learning Based Object Detection Networks
Deep learning was first used by Girshick et al. [20] to solve the object detection problem, and ever since then, many state-of-the-art deep learning object detection networks have been proposed. These networks can be divided into two categories, according to the requirement of the region proposal: one-stage methods and two-stage methods. The one-stage methods can obtain the locations and the categories of the objects simultaneously, without region proposal. Typical examples of the one-stage methods are the single shot multibox detector (SSD) [21], you only look once (YOLO) [22], [23], [24], [25], CornerNet [26], CenterNet [27], and PolarMask [28]. The two-stage methods first obtain numerous region proposals and then identify the precise object locations and categories through fine-tuning and classification. Such methods include the region based convolutional neural network (R-CNN) [20], fast R-CNN [29], faster R-CNN [30], the feature pyramid network (FPN) [31], and mask R-CNN [32]. For both the one-stage and two-stage approaches, the networks for predicting the target boundaries involve two main strategies: anchor-based methods and anchor-free methods. The networks based on anchors, such as faster R-CNN [30], have guidance about the size of the bounding boxes, to achieve better detection effects. The anchor-free methods [20], [25], [26], [27], [28], [29], [30], such as CornerNet [26], omit the design of the size of the bounding boxes. As a result, they are much faster than the anchor-based methods, and are suitable for objects with large scale variations and no definite aspect ratio. In view of these advantages, the anchor-free approach was used in the latest YOLOX [33] method, achieving a first-class detection effect.

B. Object Detection in Remote Sensing Images
Most of the current object detection networks were originally proposed for detecting individual objects in front-viewed natural images covering simple scenes. However, the detection precision is often not satisfactory if these networks are directly used in remote sensing images, as remote sensing images have significantly different characteristics, compared with natural images. Four significant characteristics of objects in remote sensing images can be summarized as follows: various scales; various length-width ratios; blurred boundaries; and complex backgrounds. Solid waste often has a fragmentary appearance, and there are also some objects with characteristics that are similar to those of solid waste, such as groves, parking lots with lots of cars, waves on water surfaces, algae, and windows of tall buildings. These similar objects make the background more complex.
Faster R-CNN, SSD, and YOLO have been shown to be effective in aerial image object detection. However, these methods cannot effectively solve the problems of remote sensing images. Some networks have been proposed to solve the problems of the multiple scales and oriented bounding boxes in aerial images. For example, the rotation-invariant convolutional neural network [34] adds a rotation-invariant layer to the R-CNN architecture to deal with the problem of object rotation variations; the rotation-invariant and Fisher discriminative CNN [35] uses an oriented region proposal network to replace the region proposal network in faster R-CNN; and Fu et al. [36] introduced an oriented region proposal network and orientation region of interest to replace the FPN region proposal network and region of interest. Besides, a shape robust anchor-free network [18] has been proposed for the detection of garbage dumps by generating the targets' bounding boxes, indicating the effectiveness of anchor-free networks.

C. Datasets for Object Detection
The deep learning object detection task must be driven by datasets. Image datasets composed of large amounts of natural images captured by cameras or cell phones have been constructed for the general object detection tasks in daily life, including the Pascal visual object classes dataset [37] and the MS COCO dataset [38]. Similarly, many datasets composed of remote sensing images have also been constructed to support remote sensing object detection, such as NWPU VHR-10 [39], the vehicle detection in aerial imagery dataset [40], DOTA [41], DIOR [42], and the large-scale dataset for instance segmentation in aerial images [43]. These remote sensing datasets are mainly used for the detection of the typical objects and landscapes in cities, including vehicles, squares, airports, parks, etc. Compared with natural image datasets, remote sensing image datasets can support large object detection in a city from the top view, but the object labels can be very difficult to assign, as the background is complicated and the shapes and scales of the objects vary a lot. For a specific object detection task, a specific dataset is required to drive the learning process. A garbage dumps dataset has been built for the solid waste detection [18], but the data volume is not big enough and most of the scenes are too simple to reflect the true spatial distributions of solid wastes in cities. Therefore, in this article, a new image dataset containing complex scenes and multi-types of garbage for the solid waste detection task was built.

III. IMAGE DATASET FOR SOLID WASTE DETECTION
Solid waste shows very complex characteristics in many aspects, including color, texture, and shape, and the distribution of solid waste in remote sensing images is often very sparse. These characteristics make solid waste datasets very difficult to produce. As a result, there are currently no public remote sensing image datasets for solid waste detection, which limits the application of deep learning in this field. In order to solve these problems, we built the novel DSWD dataset.

A. Construction of the DSWD Dataset
The complex features of solid waste brought a lot of problems to the annotation of the DSWD dataset. First, solid waste is usually a mixture of multiple substances which have different colors, textures, and shapes, such as metals, plastics, and rubble. In remote sensing images, solid waste piles are often irregular, and the color of solid waste varies a lot, which leads to huge differences between examples of solid waste. It is therefore a big challenge to make full use of the common features of solid waste and annotate solid waste appropriately. Second, in remote sensing images, solid waste is often dumped in piles with blurry boundaries. There can also be many scattered solid waste objects around the main solid waste pile. It is therefore difficult to judge whether solid waste is one object or more. Fig. 1 shows some examples of solid waste in remote sensing images, where it can be seen that the solid waste is irregularly piled and fuzzy, with uncertain shape and color. There are also many scattered solid waste objects around the main solid waste piles. To deal with the issues mentioned above, we developed corresponding annotation standards.
First, the solid waste piles are usually circular or oval, for which a horizontal bounding box is appropriate. A horizontal bounding box can be described as $(x_c, y_c, w, h)$, where $(x_c, y_c)$ denotes the center of the bounding box and $w$ and $h$ denote its width and height. Second, irregular stacking is used as the main basis for judgment, and the surrounding environment is used as an auxiliary basis for judgment, to determine the locations of solid waste. In order to guide the detection results, we classify solid waste into two main classes, white and black, according to its main color and brightness. Third, when annotating the solid waste, we ignore the small scattered solid waste objects and the small intervals between the solid waste, because these can affect the judgment of the bounding boxes. Fig. 1 displays some examples of solid waste annotation. Irregular stacking is the main criterion for identifying solid waste, as shown in Fig. 1(a) and (b). When the features of adjacent solid waste objects are different, we judge that the solid waste is made up of multiple independent objects, as shown in Fig. 1(c). When the features of adjacent solid waste objects are similar, we judge that the solid waste is a single object, as shown in Fig. 1(d) and (e). When solid waste has multiple colors, the solid waste is classified according to its main color, as shown in Fig. 1(f). Overall, the first row in Fig. 1 shows images containing black solid waste, and the second row shows images containing white solid waste.
According to the above standards, we built the novel DSWD dataset by collecting high-resolution remote sensing images, for which the details are given in Table I. In total, 3192 images with the size of 512 × 512 are included in the DSWD dataset, in which 2590 images contain solid waste objects and 502 contain objects that are similar to solid waste. We added objects that are similar to solid waste into the DSWD dataset for the reason that some objects in remote sensing images have similar features to solid waste, such as groves, parking lots with lots of cars, waves on water surfaces, algae, and windows of tall buildings. All the images were collected from Google Earth, with resolutions from 0.13 to 0.52 m. The data quality in Google Earth varies, so we mainly chose areas with a good quality to collect the solid waste data. To increase the generalization of the DSWD dataset, we collected images from 32 major cities in China. Fig. 1 shows some examples from the DSWD dataset.

B. Dataset Splits
We randomly selected 70% of the images as the training set, 10% as the validation set, and 20% as the test set. We therefore obtained a training set with 2233 images, a validation set with 320 images, and a test set with 639 images. The training set was used to train the network. The validation set was used to verify the network performance in training. The test set was used to test the network effect after training and to perform unbiased evaluation of a trained model.
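The split can be illustrated with a minimal sketch. The text does not state how the fractional image counts were rounded; the rule used here (round the validation and test sizes up, give the remainder to training) is an assumption chosen only because it reproduces the reported counts. The function name `split_dataset` and the seed are hypothetical.

```python
import random


def split_dataset(items, val_pct=10, test_pct=20, seed=0):
    """Shuffle items and split them into train/val/test lists.

    Validation and test sizes are rounded up and the remainder goes to
    training. This rounding rule is an assumption; it is not stated in
    the text, but it matches the reported 2233/320/639 split.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_val = -(-n * val_pct // 100)    # ceiling division with integers
    n_test = -(-n * test_pct // 100)
    n_train = n - n_val - n_test
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# Image indices stand in for the 3192 DSWD images.
train, val, test = split_dataset(range(3192))
print(len(train), len(val), len(test))  # 2233 320 639
```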

IV. METHOD
In this article, we developed the LKN-ME method for solid waste detection. The LKN-ME method uses corner pooling and central convolution to capture the key points of the object bounding boxes. The location guidance is designed to give the network a rough location by taking advantage of the cover area of the annotated bounding boxes of the objects in the dataset. Multiple enhancements, including data mosaicing, an attention enhancement, and path aggregation, are integrated to enhance the data and features in multiple layers and scales. Fig. 2(a) shows the overall architecture of the proposed LKN-ME method.

A. Network Architecture
A flexible bounding box is required to detect solid waste precisely because of the irregular shape of solid waste. The key point network is anchor-free and can provide candidate bounding boxes efficiently and with a high quality. Thus, a key point network is taken as the baseline of the proposed network. Both corner points and center points are detected by the proposed network. The corner pooling module is used to calculate the top-left corners and the bottom-right corners of the bounding boxes, and the central convolution module, composed of two convolutional layers and a sigmoid activation function, is used to calculate the center points of the bounding boxes. The corner pooling module is expected to locate the extreme points of the objects in the top, left, bottom, and right directions, which define corners that actually lie outside the objects. This key point searching mechanism makes the detector pay more attention to the extent and outer shape of the object rather than the complicated features of the object itself, which benefits solid waste detection.
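As an illustration of the corner pooling idea, the following is a minimal sketch that assumes the module follows the top-left corner pooling defined in CornerNet [26]: for each location, the maximum response found by looking downward in one feature map is added to the maximum found by looking rightward in another. Plain nested lists stand in for feature maps, and the function name is hypothetical.

```python
def top_left_corner_pool(f_t, f_l):
    """Top-left corner pooling in the style of CornerNet (assumed here).

    f_t, f_l: 2-D lists of equal shape (H x W).
    For each (i, j), returns max(f_t[i:, j]) + max(f_l[i, j:]).
    """
    h, w = len(f_t), len(f_t[0])
    # Running maximum from the bottom row upward (looks downward).
    vert = [row[:] for row in f_t]
    for i in range(h - 2, -1, -1):
        for j in range(w):
            vert[i][j] = max(vert[i][j], vert[i + 1][j])
    # Running maximum from the right column leftward (looks rightward).
    horiz = [row[:] for row in f_l]
    for i in range(h):
        for j in range(w - 2, -1, -1):
            horiz[i][j] = max(horiz[i][j], horiz[i][j + 1])
    return [[vert[i][j] + horiz[i][j] for j in range(w)] for i in range(h)]


# Toy map: one response on the top edge (row 1) and one on the left edge
# (column 1) of an object whose top-left corner is at (1, 1).
feature = [[0, 0, 0],
           [0, 0, 1],
           [0, 1, 0]]
pooled = top_left_corner_pool(feature, feature)
print(pooled)  # [[0, 1, 1], [1, 2, 2], [1, 2, 0]]
```

The corner location (1, 1) has no response of its own in the toy map, yet it receives the combined evidence from both edges, which is why corner pooling can localize corners that lie outside the object.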
An hourglass network [44] is taken as the backbone of the proposed network, from which three heatmaps are produced, i.e., the top-left corner heatmap, the center point heatmap, and the bottom-right corner heatmap. The pixel values of the heatmaps are confidence scores representing the possibilities of the key points of the objects. In addition, the embeddings for the corners, the offsets for the corners, and the center points are also predicted. The embeddings are used to identify whether a top-left corner and a bottom-right corner are from the same object. The offsets are applied to remap the heatmaps to the size of the input image. In order to generate the object bounding boxes from the heatmaps, the top $k$ key points are selected according to their scores. Three standards are then used to match the top-left corners and the corresponding bottom-right corners: the $x$ and $y$ coordinates of the bottom-right corner should be larger than those of the top-left corner; the center point exists around the midpoint of the top-left corner and the bottom-right corner; and the distance of the embeddings of the top-left corner and the bottom-right corner is less than a threshold. If the top-left corner and the bottom-right corner meet these three standards, they are matched as a predicted bounding box of an object.
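The three matching standards can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: scalar embeddings (as in CornerNet), the embedding threshold `embed_thresh`, and the size of the central region `center_ratio` are all assumptions, and the box score is taken as the average of the three key point scores, as described in the experiments section.

```python
def match_corners(top_lefts, bottom_rights, centers,
                  embed_thresh=0.5, center_ratio=1 / 3):
    """Pair corners into boxes using the three standards in the text.

    top_lefts / bottom_rights: lists of (x, y, score, embedding).
    centers: list of (x, y, score).
    embed_thresh and center_ratio are hypothetical values.
    """
    boxes = []
    for tlx, tly, tl_score, tl_emb in top_lefts:
        for brx, bry, br_score, br_emb in bottom_rights:
            # Standard 1: bottom-right lies below and right of top-left.
            if brx <= tlx or bry <= tly:
                continue
            # Standard 3: the two corner embeddings must be close.
            if abs(tl_emb - br_emb) >= embed_thresh:
                continue
            # Standard 2: a center point falls in a central region
            # around the midpoint of the two corners.
            mx, my = (tlx + brx) / 2, (tly + bry) / 2
            half_w = (brx - tlx) * center_ratio / 2
            half_h = (bry - tly) * center_ratio / 2
            hits = [c for c in centers
                    if abs(c[0] - mx) <= half_w and abs(c[1] - my) <= half_h]
            if not hits:
                continue
            # Box score: average of the three key point scores.
            c_score = max(c[2] for c in hits)
            score = (tl_score + br_score + c_score) / 3
            boxes.append((tlx, tly, brx, bry, score))
    return boxes


top_lefts = [(10, 10, 0.9, 0.1), (50, 50, 0.8, 2.0)]
bottom_rights = [(40, 30, 0.7, 0.15), (90, 80, 0.6, 2.1)]
centers = [(25, 20, 0.8)]
boxes = match_corners(top_lefts, bottom_rights, centers)
```

In this toy input only the first corner pair survives: the second pair has matching embeddings and valid geometry, but no center point near its midpoint.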

B. Location Guidance
Bounding boxes are used to annotate the location and category of the objects in the object detection task. However, the location information is not fully exploited in most networks. We propose to use the coverage area of the labeled bounding box to achieve the location guidance in the network.
According to the labels of a bounding box in the dataset, a binary mask is generated as the guidance, in which the pixels within the bounding box are annotated as 1, while those outside of the bounding box are annotated as 0. In this way, a coarse localization criterion is provided by the binary mask guidance, to guide the training of the LKN-ME method. Fig. 2(b) shows the structure of the location guidance, which is an independent branch in the proposed network. Three convolutional layers and a sigmoid function are connected in the location guidance to obtain the coarse localization heatmap of the objects. The kernel sizes of all convolution layers are 3 × 3. Focal loss is used to calculate the difference between the heatmap and the groundtruth map. Since the heatmap depicts the coarse localization of the objects, element-wise addition is used to add the heatmap back to the LKN-ME network, to make full use of the location information.
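A minimal sketch of how such a binary guidance mask could be generated from $(x_c, y_c, w, h)$ annotations is given below. The rounding of box edges to pixel indices and the half-open pixel ranges are assumptions, and the function name is hypothetical.

```python
def boxes_to_mask(boxes, height, width):
    """Build a binary mask: 1 inside any annotated box, 0 elsewhere.

    boxes: iterable of (x_c, y_c, w, h) in pixel units.
    Edge rounding and half-open ranges are assumptions of this sketch.
    """
    mask = [[0] * width for _ in range(height)]
    for xc, yc, w, h in boxes:
        # Convert center/size to clipped corner indices.
        x0 = max(0, int(round(xc - w / 2)))
        y0 = max(0, int(round(yc - h / 2)))
        x1 = min(width, int(round(xc + w / 2)))
        y1 = min(height, int(round(yc + h / 2)))
        for y in range(y0, y1):
            for x in range(x0, x1):
                mask[y][x] = 1
    return mask


mask = boxes_to_mask([(3, 2, 4, 2)], height=5, width=8)
for row in mask:
    print(row)
```

In training, a mask of this kind would serve as the ground-truth map against which the coarse localization heatmap is compared.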

C. Multiple Enhancements
1) Mosaic Data Augmentation: It is difficult to manually select a large number of images containing solid waste from high-resolution urban remote sensing images, which greatly limits the amount of data in the DSWD dataset and leads to low robustness of the LKN-ME method. Therefore, mosaic data augmentation [25] was used to expand the solid waste dataset.
Data mosaicing is a method that can be used to mix four images to generate an image that contains the scenes of all four images. Before an $n \times n$ image is input into the network, three images are randomly selected from the training set. The four images can then be spliced into a $2n \times 2n$ image. We then randomly pick a point from the central $n \times n$ area of the composite image and take this point as the center point to crop an $n \times n$ image, as illustrated in Fig. 3(a). The $n \times n$ image is then input into the network for training. The robustness of the LKN-ME method can be significantly improved because of the data mosaicing. There are two reasons for this: there are different combinations of scenes in an image, which enriches the scenarios of the training data; and since an object can be sliced by the mosaicing, a part of the object rather than the whole object is input into the network.
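The splicing and cropping steps can be sketched as below, using small nested lists in place of image arrays. This is a simplified illustration under stated assumptions: the function name `mosaic` is hypothetical, the placement order of the four images is arbitrary, and the corresponding shift and clipping of the bounding box labels is omitted.

```python
import random


def mosaic(images, n, rng=random):
    """Combine four n x n images (2-D lists) into one n x n image."""
    assert len(images) == 4
    a, b, c, d = images
    # Splice the four images into a 2n x 2n composite.
    top = [a[r] + b[r] for r in range(n)]
    bottom = [c[r] + d[r] for r in range(n)]
    canvas = top + bottom
    # Pick the crop center inside the central n x n area of the composite,
    # so that the n x n crop always stays within the canvas.
    cy = rng.randint(n // 2, n // 2 + n - 1)
    cx = rng.randint(n // 2, n // 2 + n - 1)
    y0, x0 = cy - n // 2, cx - n // 2
    return [row[x0:x0 + n] for row in canvas[y0:y0 + n]]


# Four constant toy "images" with pixel values 0..3.
images = [[[k] * 4 for _ in range(4)] for k in range(4)]
out = mosaic(images, 4, random.Random(0))
for row in out:
    print(row)
```

Because the crop center is restricted to the central area, the crop usually straddles the seams between the source images, which is what produces the mixed scenes and partially sliced objects described above.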
2) Path Aggregation: The hourglass network can obtain features $[P_1, P_2, P_3, P_4, P_5]$ at five scales. However, only feature $P_5$ is used to obtain the final result, while the features at the other scales are not fully used. Remote sensing images are often of multiple resolutions and the object scales are usually varied. Thus, a network for solid waste detection should be capable of dealing with multiresolution remote sensing data and mining multiscale features. Therefore, a path aggregation module composed of a top-down path and a bottom-up path is proposed in the LKN-ME method, to explore more multiscale features and support solid waste detection. Fig. 3(b) displays the structure of the path aggregation. For the top-down path, each feature $P_i$ $(i = 1, 2, 3, 4, 5)$ goes through a 3 × 3 convolutional layer to obtain feature $T_i$ $(i = 1, 2, 3, 4, 5)$. Each feature $T_j$ $(j = 1, 2, 3, 4, 5)$ then goes through a 3 × 3 convolutional layer with stride 2 to reduce the spatial scale. Finally, each feature map $T_{j+1}$ and the down-sampled map are added using element-wise addition to obtain the features $[P_1, P_2, P_3, P_4, P_5]$, where $P_i$ denotes the feature generated by the top-down path. For the bottom-up path, each feature $P_i$ $(i = 1, 2, 3, 4, 5)$ goes through a 3 × 3 convolutional layer to obtain feature $T_i$ $(i = 1, 2, 3, 4, 5)$. Each feature $T_j$ $(j = 1, 2, 3, 4, 5)$ then goes through an up-sampling layer to increase the spatial scale. Finally, each feature map $T_{j+1}$ and the up-sampled map are added using element-wise addition to obtain the features $[P_1, P_2, P_3, P_4, P_5]$, where $P_i$ denotes the feature generated by the bottom-up path. $P_1$ is the output of the path aggregation.
3) Attention Enhancement: An attention enhancement is used to make the network focus on the important features of the objects of interest and suppress unnecessary ones. In the proposed approach, the attention enhancement is divided into spatial attention and channel attention enhancement. The structure of the attention enhancement is shown in Fig. 3(c).
To compute the channel attention efficiently, we first use global average pooling to obtain a $1 \times 1 \times C$ feature map. We then apply two fully connected layers and a sigmoid function after the last fully connected layer to produce a channel attention map $M_c(F) \in \mathbb{R}^{1 \times 1 \times C}$. The channel attention enhancement focuses on which channel contains meaningful information about the objects. The spatial attention enhancement is used to obtain the inter-spatial relationship of the feature map, and it focuses on the location of the objects. To compute the spatial attention, two convolutional layers and a sigmoid function after the final convolutional layer are used to generate a two-dimensional spatial attention map $M_s(F) \in \mathbb{R}^{W \times H}$.

D. Loss Function
According to the above description, the heatmaps of the key points are produced by the proposed network, as well as the offsets of the key points, the embeddings of the top-left corners and bottom-right corners, and the heatmaps of the coarse localization. Thus, the loss function of the proposed network consists of five parts, expressed as
$$L = L_{det} + a L_{pull} + b L_{push} + c L_{off} + d L_{localization}$$
where $L$ is the loss function of LKN-ME. $L_{det}$ and $L_{localization}$ are both a variant of focal loss, and $L_{det}$ was introduced in CornerNet [26]. $L_{pull}$ is used to minimize the distance between the top-left and bottom-right corner embeddings which belong to the same objects, and $L_{push}$ is used to maximize the distance between the top-left and bottom-right corner embeddings which belong to different objects. $L_{off}$ is the smooth L1 loss, which is applied at the ground-truth corner locations. $a$, $b$, $c$, and $d$ denote the weights of the corresponding losses, which are set to 0.1, 0.1, 0.1, and 0.1 in the proposed network.
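The structure of this loss can be sketched in a few lines. The focal loss below follows the variant defined in CornerNet [26]; its exponents `alpha=2` and `beta=4` are the CornerNet defaults and are assumed here, since the values used for LKN-ME are not stated. Flat lists stand in for heatmaps, and both function names are hypothetical.

```python
import math


def focal_loss(pred, target, alpha=2, beta=4, eps=1e-12):
    """CornerNet-style focal loss over flattened heatmaps.

    pred: predicted scores in (0, 1).
    target: ground truth in [0, 1]; 1 marks a key point (or, for the
    location guidance, a pixel inside an annotated box).
    alpha and beta are CornerNet defaults, assumed for this sketch.
    """
    total, n_pos = 0.0, 0
    for p, y in zip(pred, target):
        if y == 1:
            total += (1 - p) ** alpha * math.log(p + eps)
            n_pos += 1
        else:
            total += (1 - y) ** beta * p ** alpha * math.log(1 - p + eps)
    # Normalize by the number of positive locations.
    return -total / max(n_pos, 1)


def total_loss(l_det, l_pull, l_push, l_off, l_loc,
               a=0.1, b=0.1, c=0.1, d=0.1):
    """Weighted sum of the five loss terms with the reported weights."""
    return l_det + a * l_pull + b * l_push + c * l_off + d * l_loc


confident = focal_loss([0.9, 0.1], [1, 0])
uncertain = focal_loss([0.5, 0.5], [1, 0])
print(confident < uncertain)  # True
```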

V. EXPERIMENTS
The experiments were conducted in PyTorch. An RTX 2080Ti (11 GB) GPU was used to accelerate the calculation. In addition to adopting standard data augmentation techniques, including random horizontal flipping, random cropping, random coloring, and random scaling, mosaicing was used for further data augmentation. The Adam optimizer [45] was used to optimize the full training loss, and a rectified linear unit was used as the activation function. The batch size of the network was set to 5 and the maximum number of epochs was set to 180. The learning rate was $2.5 \times 10^{-4}$ for the first 150 epochs and then $2.5 \times 10^{-5}$ for the last 30 epochs.
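The learning rate schedule amounts to a single step decay, which can be written as a small function. Zero-based epoch indexing is an assumption of this sketch, and the function name is hypothetical.

```python
def learning_rate(epoch, high=2.5e-4, low=2.5e-5, switch_epoch=150):
    """Step schedule: `high` for epochs 0..149, `low` for epochs 150..179.

    Epochs are assumed to be zero-indexed.
    """
    return high if epoch < switch_epoch else low


schedule = [learning_rate(e) for e in range(180)]
print(schedule[0], schedule[149], schedule[150], schedule[179])
```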
In the testing, we selected the top 70 top-left corners, center points, and bottom-right corners from the heatmaps to detect the bounding boxes. The score of each bounding box was the average of the key points of this bounding box. Soft-NMS [46] was used to remove the redundant bounding boxes. The threshold of soft-NMS was set to 0.5. We finally selected the top 100 bounding boxes, according to the scores of these boxes, as the final detection results. The annotation of each bounding box was solid waste rather than white solid waste or black solid waste. Multiscale testing with 0.6, 1, 1.2, 1.5, and 1.8 times the resolution of the input image was used to detect the objects.
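The post-processing step can be illustrated with a minimal Soft-NMS sketch. The text gives only the threshold (0.5) and the final cap of 100 boxes; the linear decay rule used here (one of the two variants in the Soft-NMS paper [46]) and the score floor `score_floor` are assumptions.

```python
def iou(a, b):
    """IoU of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, iou_thresh=0.5, score_floor=0.001, top_k=100):
    """Linear Soft-NMS: decay, rather than delete, overlapping boxes.

    The linear variant and score_floor are assumptions of this sketch.
    Returns up to top_k (box, score) pairs in descending score order.
    """
    remaining = list(zip(boxes, scores))
    kept = []
    while remaining and len(kept) < top_k:
        best = max(range(len(remaining)), key=lambda i: remaining[i][1])
        box, score = remaining.pop(best)
        kept.append((box, score))
        decayed = []
        for other, s in remaining:
            overlap = iou(box, other)
            if overlap > iou_thresh:
                s *= 1 - overlap          # down-weight heavy overlaps
            if s > score_floor:
                decayed.append((other, s))
        remaining = decayed
    return kept


boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = soft_nms(boxes, scores)
```

In this toy input the second box overlaps the first with an IoU of about 0.82, so its score is decayed and it drops below the isolated third box instead of being discarded outright.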

A. Evaluation Metrics
In this article, the average recall (AR) computed over 100 detections per image (AR$_{100}$), the average precision (AP), and AP$_{50}$ are used as the metrics for the solid waste detection, where AP is computed over the maximum 100 detections per image and the average of 10 different intersection over union (IoU) thresholds from 0.5 to 0.95 in 0.05 intervals. AP$_{50}$ is computed over the maximum 100 detections per image with an IoU threshold of 0.5. AR$_{100}$ is the AR computed over the maximum 100 detections per image and the 10 different IoU thresholds. As the purpose of solid waste detection is to identify as much solid waste as possible, AR$_{100}$ is used as the main evaluation metric. In addition, different object scales of small (area < $32^2$), medium ($32^2$ < area < $96^2$), and large (area > $96^2$) are used to calculate the AP and AR, to evaluate the performance of the networks. All the AP and AR metrics are computed over the maximum 100 detections per image and the 10 different IoU thresholds.
TABLE II. PERFORMANCE OF THE PROPOSED LKN-ME METHOD AND THE OTHER STATE-OF-THE-ART METHODS
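The threshold sweep and the size buckets used by these metrics can be sketched as follows. This is a simplification: recall is computed from the best IoU of each ground-truth box with any of the top-100 detections, whereas a full evaluator also enforces one-to-one matching. The handling of areas exactly equal to $32^2$ or $96^2$ is an assumption, and all names are hypothetical.

```python
# Ten IoU thresholds from 0.5 to 0.95 in steps of 0.05.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def size_bucket(width, height):
    """Assign an object to the small / medium / large scale group.

    Areas exactly on a boundary go to the larger group (an assumption).
    """
    area = width * height
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


def average_recall(best_ious):
    """Mean recall over the ten IoU thresholds.

    best_ious: for each ground-truth box, its highest IoU with any of
    the top-100 detections (simplified matching, see the lead-in).
    """
    if not best_ious:
        return 0.0
    recalls = [sum(v >= t for v in best_ious) / len(best_ious)
               for t in IOU_THRESHOLDS]
    return sum(recalls) / len(recalls)


ar = average_recall([0.97, 0.72, 0.40])
print(IOU_THRESHOLDS)
print(size_bucket(20, 20), size_bucket(50, 50), size_bucket(100, 100))
```

For the three toy boxes, two are recalled at the five thresholds up to 0.70 and one at the five thresholds from 0.75 upward, which gives an average recall of 0.5.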

B. Detection Results
The last row in Table II lists the results of the LKN-ME method obtained with the DSWD dataset. The LKN-ME method achieves 71.8% AR$_{100}$, 44.0% AP, and 70.3% AP$_{50}$ on the DSWD test dataset, which is the best result among all seven detectors. The AP and AR values for the large solid waste objects are better than those for the medium and small solid waste objects. The reason is that there is more information in the large solid waste objects, so it is easier to obtain the centers of the large solid waste objects. Fig. 4 shows some results of the LKN-ME method for the DSWD test dataset. The first row shows the detection of single solid waste objects, where the location of the solid waste detection is very accurate and the candidate bounding boxes contain the solid waste completely. In the second and third rows, there are multiple solid waste objects, and the solid waste in the images is basically all detected. However, the solid waste in the fourth-row scenes is fuzzy and mixed with various substances. The detection effect declines in these scenes, and some solid waste is not detected in the fourth-row images. The fifth row displays objects with similar properties to solid waste. The objects from left to right are algae, waves on the surface of water, a grove, a parking lot with lots of cars, and windows of tall buildings. None of these objects are mistakenly detected as solid waste. Generally speaking, the overall test results for the LKN-ME method are good for both the solid waste objects and the objects similar to solid waste, but the detection effect does decrease when the features of the solid waste are complex.
Fig. 4. Results of the proposed LKN-ME method with the DSWD test dataset. The first row shows images with single solid waste objects. The second to fourth rows show images containing multiple solid waste objects. The fifth row shows images with objects that are similar to solid waste.
From Table IV, it can be seen that the AP and AP50 obtained with the two-category dataset are almost the same as those obtained with the one-category dataset. However, the AR100 is improved by 5.1%, from 66.7% to 71.8%, in the results based on two categories. This indicates that the two-category dataset improves the performance of deep learning networks for solid waste detection.

C. Comparison Experiments
1) Accuracy Comparison With the State-of-the-Art Detectors: Table II compares the proposed network with six state-of-the-art detectors on the DSWD test dataset, i.e., Faster R-CNN, FPN, PANet, YOLOv4, CornerNet, and CenterNet, where Faster R-CNN, FPN, and PANet are two-stage networks and the other three detectors are one-stage networks. To ensure a fair comparison between the different networks, the backbones of all the networks were set to around 50 layers. Nine metrics were calculated for each method. LKN-ME obtains the best results in eight of the nine metrics, with an AP of 44.0% and an AR100 of 71.8%, outperforming the other detectors. Faster R-CNN, FPN, and PANet use anchors with three scales and three sizes to detect objects. These two-stage methods perform poorly on small targets, resulting in a low overall detection accuracy. YOLOv4 has advantages in small-object detection and obtains the best APsmall score of 38.8%, because the multi-scale fusion adopted in YOLOv4 raises the accuracy of the small-scale predictions. CornerNet obtains a high AR but a low AP on the DSWD dataset. The reason for this is that the complex boundaries of the solid waste objects mislead CornerNet into detecting many false bounding boxes. CenterNet adds center point detection to CornerNet to strengthen the constraints on the key point matching. Compared with CornerNet, CenterNet achieves remarkable improvements, from 26.5% to 40.7% in AP, from 43.6% to 65.7% in AP50, and from 65.1% to 70.2% in AR100, which shows that using deep learning to detect the key points of bounding boxes is an effective way to detect solid waste. This approach pays more attention to the boundaries of the objects, which weakens the feature differences inside the objects and makes it easier to extract the unified features of solid waste.
Compared with CenterNet, the proposed LKN-ME method improves eight of the nine metrics, the exception being APsmall. The AR100 is increased from 70.2% to 71.8%, the AP is increased from 40.7% to 44.0%, and the AP50 is increased from 65.7% to 70.3%. In addition, ARsmall is increased by 6%, ARmedium is increased by 1.3%, ARlarge is increased by 1.7%, APmedium is increased by 6.2%, and APlarge is increased by 2.3%. However, APsmall is decreased by 9.6%, i.e., the detection accuracy of the LKN-ME method is reduced on small objects. This may be because the path aggregation fuses low-scale features, which blurs the features of the small objects. As a result, LKN-ME detects too many small objects, resulting in a decrease in APsmall.
2) Training Speed Comparisons: In the key point network, the corner pooling module and the central convolution module are used to calculate the corners and the center points of the bounding boxes. Table III gives a comparison between the proposed key point network and other key point networks. For solid waste detection, CornerNet trains quickly but its detection accuracy is poor. Solid waste has a lot of edge information, which causes CornerNet to detect many false bounding boxes. CenterNet is better able to detect solid waste, but its training speed is slow. CenterNet detects the center points of the bounding boxes to reduce the number of false bounding boxes, which results in a large improvement in detection accuracy. However, CenterNet uses corner pooling repeatedly, and corner pooling is slow to compute during backpropagation. As a result, the training speed of CenterNet is slow.
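To make the corner pooling operation more concrete, the following Python sketch implements top-left corner pooling as it is commonly described for CornerNet: at each location, the maximum of one feature map is taken from that location down to the bottom edge, the maximum of a second feature map is taken from that location across to the right edge, and the two are added. This is a minimal sketch on nested lists and is not the module used in the proposed network. The function name and the two-input form are assumptions made for illustration.

```python
def top_left_corner_pool(f_vertical, f_horizontal):
    """Top-left corner pooling on two H x W feature maps (lists of lists).

    f_vertical is max-pooled from each location down to the bottom edge,
    f_horizontal from each location across to the right edge, and the two
    pooled maps are summed element-wise.
    """
    h, w = len(f_vertical), len(f_vertical[0])

    # Vertical pass: running maximum from the bottom row upward.
    t = [row[:] for row in f_vertical]
    for i in range(h - 2, -1, -1):
        for j in range(w):
            t[i][j] = max(t[i][j], t[i + 1][j])

    # Horizontal pass: running maximum from the rightmost column leftward.
    l = [row[:] for row in f_horizontal]
    for i in range(h):
        for j in range(w - 2, -1, -1):
            l[i][j] = max(l[i][j], l[i][j + 1])

    return [[t[i][j] + l[i][j] for j in range(w)] for i in range(h)]


# A small 3 x 3 example, using the same map for both inputs.
feature = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 2],
]
pooled = top_left_corner_pool(feature, feature)
```

Each output value depends on a whole row segment and a whole column segment of the input, and both passes are sequential scans. This gives an intuition for why repeated corner pooling adds to the training cost.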
The proposed key point network detects the center points of the bounding boxes to improve the detection accuracy, and reduces the use of corner pooling to improve the training speed. It thus combines a detection accuracy comparable to that of CenterNet with a training speed comparable to that of CornerNet, which results in an improved overall performance.
The LKN-ME method adds the mosaic data augmentation, the attention enhancement, path aggregation, and location guidance to the key point network. As a result, the training time increases by 0.2 s/iter, from 1.5 s/iter to 1.7 s/iter, while the AR100 is improved by 1.9%, from 69.9% to 71.8%, the AP is improved by 3.1%, from 40.9% to 44.0%, and the AP50 is improved by 3.6%, from 66.7% to 70.3%. Therefore, the LKN-ME method obtains a superior detection accuracy, with only a slight increase in the training cost.

D. Ablation Study
The proposed LKN-ME method consists of four components, i.e., mosaic data augmentation, the attention enhancement, path aggregation, and location guidance. An ablation study was conducted to analyze the contribution of each individual component. The backbone in each experiment was Hourglass-52. We conducted the ablation study with a variety of combinations of the four components. The results are listed in Table V, where the first row is the result of the proposed key point network.
To demonstrate the importance of the mosaic data augmentation, the results of the network with and without mosaicing were compared. As shown in the first two rows in Table V, the mosaicing results in an improvement in AR100 of 0.9%, from 69.9% to 70.8%, an improvement in AP of 2.0%, from 40.9% to 42.9%, and an improvement in AP50 of 1.1%, from 66.7% to 67.8%. These results demonstrate that the mosaic data augmentation is an effective way to improve the detection performance.
The third row shows the result of adding location guidance to the key point network. The location guidance results in an improvement in AR100 of 1.2%, from 70.8% to 72.0%, an improvement in AP of 0.2%, from 42.9% to 43.1%, and an improvement in AP50 of 1.2%, from 67.8% to 69.0%. This confirms that the location guidance is an effective way to improve the detection performance for solid waste, in both AP and AR.
To verify the effectiveness of the attention enhancement, the attention enhancement was added to the key point network. The fourth row in Table V shows that the attention enhancement improves the AR100 considerably, from 69.9% to 71.5%, but it decreases the AP, from 40.9% to 40.3%, and the AP50, from 66.7% to 65.6%. Because the task is to detect as much solid waste as possible, the attention enhancement is a suitable way to improve the AR of the proposed network.
The results listed in the fifth row show that using both mosaic data augmentation and the attention enhancement improves all three evaluation metrics compared with the second row, although the AR100 decreases slightly, from 71.5% to 71.1%, compared with the fourth row. The improvement arises because the attention enhancement strengthens the detection of less obvious objects, such as small objects, and the mosaic data augmentation generates more small objects for training. The combination of mosaic data augmentation and the attention enhancement therefore improves the solid waste detection performance. However, the mosaic data augmentation also splits up bounding boxes, so that some large objects are not input into the network in their entirety. This reduces the detection performance for large objects, which leads to the decrease in AR100.
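The box-splitting effect described above can be illustrated with a simplified Python sketch of how bounding boxes are transformed when four images are stitched into a 2 x 2 mosaic. It handles only the box coordinates, not the pixels. The fixed tile size, the rule that each source image contributes its top-left tile-sized crop, and the function names are assumptions made for illustration. Practical mosaic implementations typically also randomize the split point and the image scale, and the exact procedure used for LKN-ME is not reproduced here.

```python
def clip_box(box, x_min, y_min, x_max, y_max):
    """Clip an (x1, y1, x2, y2) box to a window; return None if nothing is left."""
    x1, y1, x2, y2 = box
    cx1, cy1 = max(x1, x_min), max(y1, y_min)
    cx2, cy2 = min(x2, x_max), min(y2, y_max)
    if cx2 <= cx1 or cy2 <= cy1:
        return None
    return (cx1, cy1, cx2, cy2)


def mosaic_boxes(boxes_per_image, tile):
    """Map the boxes of four source images into one 2 x 2 mosaic canvas.

    boxes_per_image: four lists of boxes, each in its own image's coordinates.
    Each image keeps only its top-left tile x tile crop (an assumption of
    this sketch) and is pasted into one quadrant of the canvas.
    """
    # Quadrant offsets: top-left, top-right, bottom-left, bottom-right.
    offsets = [(0, 0), (tile, 0), (0, tile), (tile, tile)]
    mosaic = []
    for boxes, (ox, oy) in zip(boxes_per_image, offsets):
        for box in boxes:
            clipped = clip_box(box, 0, 0, tile, tile)
            if clipped is None:
                continue  # the object lies entirely outside the crop
            x1, y1, x2, y2 = clipped
            mosaic.append((x1 + ox, y1 + oy, x2 + ox, y2 + oy))
    return mosaic


# A 120 x 120 object in the second image is truncated to 40 x 40 by the crop,
# and the object in the third image falls outside the crop and is dropped.
boxes = mosaic_boxes(
    [[(10, 10, 50, 50)], [(60, 60, 180, 180)], [(120, 120, 150, 150)], []],
    tile=100,
)
```

In this toy example, the large object survives only as a smaller fragment, which is consistent with the explanation that mosaicing supplies more small objects for training while reducing the number of complete large objects.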
The sixth row in Table V shows the effect of adding path aggregation to the network. Compared with the fifth row, the path aggregation results in an improvement in AR100 of 0.6%, from 71.1% to 71.7%, and an improvement in AP of 0.3%, from 43.2% to 43.5%. This suggests that the path aggregation, which merges features at different scales, improves the detection performance of the network only slightly. The reason for the limited gain is that the path aggregation merges small-scale features, which blurs the location information.
The seventh row in Table V shows that, compared with the fifth row, the location guidance results in an increase in AR100 of 0.3%, an increase in AP of 0.1%, and an increase in AP50 of 0.1%. We believe that the effect of the location guidance is similar to that of the attention enhancement, the difference being that the location guidance is a supervised method whereas the attention enhancement is an unsupervised method. Because of this similarity, adding the location guidance directly brings little improvement.
However, when using both path aggregation and location guidance, as shown in the eighth row in Table V, the AR100 increases by 0.1%, from 71.7% to 71.8%, the AP increases by 0.5%, from 43.5% to 44.0%, and the AP50 increases by 1.3%, from 69.0% to 70.3%, compared with the sixth row. These results indicate that using both location guidance and path aggregation increases the AR slightly while increasing the AP significantly. This is not surprising, because the path aggregation merges features at different scales but blurs the location information, whereas the location guidance provides more accurate location information to the network. Therefore, the location guidance can compensate for the problem caused by the path aggregation.

VI. CONCLUSION
In this article, we have described how we built the DSWD dataset from remote sensing images, which contains both a large number of solid waste scenes and some negative samples. A location-guided key point network with multiple enhancements (LKN-ME) was proposed for the urban solid waste detection task in remote sensing imagery. The LKN-ME method is a key point network integrating location guidance and multiple enhancements, including mosaic data augmentation, path aggregation, and an attention enhancement. The experimental results showed that the LKN-ME method can achieve state-of-the-art results of 71.8% in AR100 and 44.0% in AP for the DSWD dataset, and that it outperformed six other classical detectors. The ablation study verified the effect of each module of the proposed network. The mosaic data augmentation has the most obvious effect in enhancing the network performance. The attention enhancement allows the network to focus more on the regions of interest. The use of the path aggregation module and the location guidance can further improve the performance of the proposed network. Although data with different resolutions are included in the DSWD dataset, whether a network trained on the DSWD dataset is appropriate for other data will need further verification in the future.