Superpixel-Based Convolutional Neural Network for Georeferencing the Drone Images

Information extracted from aerial photographs has been used for many practical applications such as urban planning, forest management, disaster relief, and climate modeling. In many cases, labeling of the information in the photos is still performed by human experts, making the process slow, costly, and error-prone. This article shows how a convolutional neural network can be used to determine the locations of ground control points (GCPs) in aerial photos, which significantly reduces the amount of human labor needed to identify GCP locations. Two convolutional neural network (CNN) methods, a sliding-window CNN with superpixel-level majority voting and a superpixel-based CNN, are evaluated and analyzed. The classification and segmentation results show that both methods can quickly extract the locations of objects from aerial photographs, but only the superpixel-based CNN can unambiguously locate the GCPs.


I. INTRODUCTION
Aerial image interpretation finds applications in many diverse areas including urban planning, forest management, vegetation monitoring, and climate modeling [1], [2]. The standard GPS accuracy used to geotag images is usually on the order of a few meters. This error causes a systematic shift in the processed outputs far beyond the acceptable bound, which typically ranges between 1 and 5 cm. To improve the accuracy of the results, ground control points (GCPs) are evenly distributed across the area of interest and measured using high-accuracy methods such as real-time kinematic (RTK) GPS. A fundamental step in aerial image georeferencing consists of marking the locations of the GCPs on the ground in the aerial photos, as shown in Fig. 1. However, much of the work in identifying the locations of these GCPs in the photographs is still performed by human experts, which can be labor intensive given the explosive growth in data volumes.
Recently, several deep learning algorithms have emerged to automate the process of information retrieval from aerial images. These algorithms have been used successfully for object detection [3]-[5] and scene classification [6]. The impressive accuracy produced by these algorithms suggests that automated aerial image interpretation systems may be within reach [7], [8].
A convolutional neural network (CNN) is a special kind of multilayer neural network developed to classify high-dimensional patterns [9], [10]. It does not require handcrafted feature extraction, thus resulting in higher generalization capability. To improve the performance and accuracy of CNNs, different architectures have been developed. AlexNet has two parallel CNN pipelines trained on two GPUs with cross-connections and replaces the sigmoid activation with a rectified linear unit (ReLU) activation [4]. Other architectures such as GoogLeNet introduce inception modules and increase the depth of the network [11]. To avoid vanishing gradients, ResNet presents a residual learning framework that eases the training of networks substantially deeper than those used previously [12].
The goal of the CNN method in this article is to detect the locations of GCPs in drone photos. These GCPs are only about 0.5 × 0.5 m² in size and are difficult to detect manually. The drone takes thousands of images per flight, so it takes days of manual labor to locate the GCPs in the photographs. These GCPs georeference the model at cm-level accuracy, allowing geomatics engineers to take accurate measurements of roads, buildings, and any other visible features. If the GCPs can be automatically detected and labeled, this would result in significant savings in human labor.
Aerial image interpretation is usually formulated as a pixel labeling problem. Given an aerial image, pixel labeling approaches classify each pixel to obtain either a binary classification of the image for a single object class [13], [14] or a multiclass classification that yields a complete semantic segmentation of the image into classes such as roads, grass, and cars [15]. In this article, both cases are studied for detecting objects in aerial images.
The pixel labeling classification method alone is known to produce salt-and-pepper errors since it includes only sparse spatial information from the neighborhood of each pixel in the classification [2]. To overcome this problem, we propose a sliding-window CNN with superpixel-level majority voting, which applies simple linear iterative clustering (SLIC) and density-based spatial clustering of applications with noise (DBSCAN) for image segmentation after classification.
The other method we propose is superpixel-based CNN. The GCP has a unique hourglass shape and its color is dark (see Fig. 1 as an example). To take advantage of these features, a superpixel-based classification method is applied to the aerial images. Compared with the pixel labeling classification, it can locate the GCP accurately. However, it does not perform well in classifying other objects.
This article is organized as follows. After the introduction, the second section presents the theory of image classification and segmentation, describing the CNN architecture and the SLIC method. The third section introduces the CNN methods used in the classification. The fourth section presents numerical test results for classifying objects in aerial images; the images are taken by a drone flying at an average elevation of 80 m and equipped with a high-resolution camera. The conclusions are given in the last section.

II. THEORY OF IMAGE CLASSIFICATION AND SEGMENTATION

A. CNN Architecture
The aerial image classification problem is an image classification task in which objects are classified in image patches used as input data. An image patch that includes a GCP is labeled as a positive patch; otherwise, it is labeled as a negative patch. We extract image patches containing a GCP with dimension 65 × 65 and downsample them to 32 × 32 patches.
The network takes these 32 × 32 patches as input and outputs the classification label of the center point. The network (see Fig. 2) is composed of five convolution (Conv) layers and two fully connected (FC) layers followed by a softmax classifier. There are eight convolution filters of size 2 × 2 in the first Conv layer, and the number of channels is doubled in each subsequent Conv layer. Rectified linear activation (ReLU) and max-pooling with size 2 × 2 are applied after each Conv layer. There are 128 feature maps in the first FC layer, and we apply dropout with a rate of 50% in the second FC layer. The softmax classifier gives the probability of the center point belonging to a GCP or not. The probabilities of the two classes are compared, and the class with the higher probability is selected as the prediction label.
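As a quick sanity check on the architecture described above, the spatial and channel dimensions can be traced layer by layer (a sketch assuming "same" padding for the 2 × 2 convolutions, so that only the pooling changes the spatial size):

```python
# Trace the feature-map sizes through the network of Fig. 2: five Conv
# layers (8 filters of size 2x2 in the first, channels doubling), each
# followed by ReLU and 2x2 max-pooling. 'Same' padding is assumed so
# that only the pooling halves the spatial size.
size, channels = 32, 8          # 32x32 input patch, 8 filters in Conv1
shapes = []
for _ in range(5):
    size //= 2                  # 2x2 max-pooling halves each dimension
    shapes.append((size, channels))
    channels *= 2               # channel count doubles for the next Conv layer
print(shapes)
```

The spatial size shrinks from 32 × 32 to 1 × 1 while the channel count grows from 8 to 128, consistent with the 128 feature maps of the first FC layer.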
All convolutional filter kernel elements are trained from the data in a supervised fashion by learning from a labeled set of training examples. A training set N = {(x_i, y_i)} is given, which contains n image patches, where x_i is the ith image patch and y_i ∈ {0, 1} is the corresponding class label. The corresponding cross-entropy cost function is given by

C = -(1/n) Σ_{i=1}^{n} log p(Y = y_i | x_i)

where p(Y = y_i | x_i) is the probability that the label of x_i is y_i. After training, the network is tested with a validation set to check the success of training. If the network is trained properly, it will recognize and correctly classify cases similar to those included in the training set; inputs that differ substantially from the training set may be misclassified or not recognized.
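The cross-entropy cost can be sketched in a few lines of NumPy for the binary case (function and variable names are illustrative, not from the article):

```python
import numpy as np

def cross_entropy(p_pos, labels):
    """Cross-entropy cost over n patches; p_pos[i] is the predicted
    probability that patch x_i is positive, labels[i] is y_i in {0, 1}."""
    p_pos = np.asarray(p_pos, dtype=float)
    labels = np.asarray(labels)
    # p(Y = y_i | x_i): the probability the network assigns to the true label
    p_true = np.where(labels == 1, p_pos, 1.0 - p_pos)
    return -np.mean(np.log(p_true))

loss = cross_entropy([0.9, 0.2, 0.8], [1, 0, 1])   # confident, mostly correct
```

Minimizing this cost pushes the predicted probability of each true label toward 1.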

B. Simple Linear Iterative Clustering
The SLIC algorithm is used to reduce the classification noise within segments. A superpixel is a group of connected pixels with similar colors and gray levels; grouping pixels into superpixels expedites the classification of GCPs. SLIC adapts a k-means clustering approach to efficiently generate superpixels in the CIELAB color space [16].
CIELAB is a three-dimensional integer space (L*, a*, b*) for the digital representation of colors, shown in Fig. 3, where L* represents the lightness component. The lightness value L* ranges from 0 to 100, with the darkest black at L* = 0 and the brightest white at L* = 100. a* represents the green-red component, with green in the negative direction and red in the positive direction. b* represents the blue-yellow component, with blue in the negative direction and yellow in the positive direction. The a* and b* axes both range from -128 to 127. The clustering procedure of SLIC requires the following steps, shown in Fig. 4.
Fig. 3. CIELAB color space (adapted from [17]). L* for the lightness from black to white, a* from green to red, and b* from blue to yellow.

1) In the first step, the input image is divided into K approximately equally sized superpixels. For an input image with N pixels, the approximate size of each superpixel is N/K pixels; hence, there is a superpixel center at every grid interval S = sqrt(N/K), as shown in Fig. 4(a), and the initial superpixel centers are placed at the regular grid points.
2) If the initial center of a superpixel is placed on an edge or a noisy pixel, no pixels in the neighborhood belong to the same cluster as the center. To prevent this situation, the magnitude of the image gradient is computed as

G(x, y) = ||I_{x+1,y} - I_{x-1,y}||^2 + ||I_{x,y+1} - I_{x,y-1}||^2

where I_{x,y} is the CIELAB vector [L_{x,y}, a_{x,y}, b_{x,y}, x, y]^T corresponding to the pixel at position (x, y), and ||·|| is the L2 norm. The calculation of the gradient takes into account both color and intensity information. If the center is located on an edge or a noisy pixel, the magnitude of the image gradient will be large. To move the center point away from edges and noisy pixels, the magnitudes in the 3 × 3 neighborhood around each initial superpixel center are calculated as shown in Fig. 4(b), and the centers are moved to the locations with the lowest magnitude.
3) The search region for each superpixel center has the area 2S × 2S shown in Fig. 4(c), where S is the grid interval. Each pixel i would be in the search regions of its surrounding superpixel centers and it is associated with the nearest one. The dark green area in Fig. 4(c) is composed of the pixels associated with the yellow superpixel center. Compared with the conventional k-means clustering that compares each pixel with all the superpixel centers, this method greatly speeds up the algorithm by limiting the size of the search region.
The distance measure D_s between pixel i and superpixel center k is defined as

d_lab = sqrt((L_k - L_i)^2 + (a_k - a_i)^2 + (b_k - b_i)^2)
d_xy = sqrt((x_k - x_i)^2 + (y_k - y_i)^2)
D_s = d_lab + (m/S) d_xy                                    (3)

where D_s is the sum of the CIELAB distance d_lab and the xy-plane distance d_xy normalized by the grid interval S. A variable m is introduced to control the compactness of a superpixel.
4) Once each pixel has been associated with the nearest superpixel center, an update step adjusts each superpixel center to be the mean [L, a, b, x, y]^T vector of all the pixels belonging to that superpixel, as shown in Fig. 4(d). The L1 norm between the previous and the new superpixel center locations is used to compute a residual E.
5) Steps 3 and 4 are repeated until the residual E is less than a specified threshold.
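The clustering procedure above can be condensed into a short NumPy sketch (steps 1 and 3-5; the gradient-based center perturbation of step 2 is omitted for brevity, and all names are illustrative, not from the article):

```python
import numpy as np

def slic_sketch(img_lab, K, m=10.0, n_iter=10, tol=1e-3):
    """Minimal SLIC-style clustering on a CIELAB image of shape (H, W, 3)."""
    H, W, _ = img_lab.shape
    N = H * W
    S = int(np.sqrt(N / K))                       # grid interval, step 1
    centers = np.array([[*img_lab[y, x], x, y]
                        for y in range(S // 2, H, S)
                        for x in range(S // 2, W, S)], dtype=float)
    labels = np.full((H, W), -1)
    for _ in range(n_iter):
        dist = np.full((H, W), np.inf)
        for k, (L, a, b, cx, cy) in enumerate(centers):
            # step 3: search only a 2S x 2S region around each center
            x0, x1 = max(int(cx) - S, 0), min(int(cx) + S + 1, W)
            y0, y1 = max(int(cy) - S, 0), min(int(cy) + S + 1, H)
            yy, xx = np.mgrid[y0:y1, x0:x1]
            d_lab = np.sqrt(((img_lab[y0:y1, x0:x1] - [L, a, b]) ** 2).sum(-1))
            d_xy = np.sqrt((xx - cx) ** 2 + (yy - cy) ** 2)
            D = d_lab + (m / S) * d_xy            # distance measure (3)
            better = D < dist[y0:y1, x0:x1]
            dist[y0:y1, x0:x1][better] = D[better]
            labels[y0:y1, x0:x1][better] = k
        # step 4: move each center to the mean vector of its pixels
        new_centers = centers.copy()
        for k in range(len(centers)):
            yy, xx = np.nonzero(labels == k)
            if len(yy):
                new_centers[k] = [*img_lab[yy, xx].mean(axis=0),
                                  xx.mean(), yy.mean()]
        E = np.abs(new_centers - centers).sum()   # L1 residual
        centers = new_centers
        if E < tol:                               # step 5: stop when converged
            break
    return labels, centers
```

A production implementation would also enforce superpixel connectivity in a final postprocessing step.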

III. CNN METHODS
Two approaches are used to solve the georeferencing problem. In the first method, SLIC is used as postprocessing with majority voting on CNN labels. In the second method, SLIC is used as preprocessing to extract the superpixels from the images for CNN classification.

A. Sliding-Window CNN With Superpixel-Level Majority Voting
The sliding-window CNN is a traditional neural network approach for solving certain classification problems [18]. It takes a pixel together with a border padding of some size around it as an input patch and classifies the center pixel by running the patch through a neural network. The next pixel is then labeled by shifting the window by one pixel and classifying the neighbor of the first pixel.
The main disadvantage of this approach is its impractical runtime. A 3500 × 5000 pixel aerial image requires 1.75 × 10^7 classifications, which can take hours of computation. In order to reduce the computational time, a classification stride s is used: a stride of s pixels results in the center pixels being a distance of s from the next patch's center pixel. This increases the computation speed by a factor of s^2. However, it reduces the resolution of the classification results and amplifies salt-and-pepper errors.
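The effect of the stride on the classification count is simple arithmetic (a sketch; border handling is ignored, and the function name is illustrative):

```python
# Number of window positions for an H x W image classified at stride s.
def n_classifications(height, width, stride):
    return (height // stride) * (width // stride)

dense = n_classifications(3500, 5000, 1)     # one classification per pixel
strided = n_classifications(3500, 5000, 8)   # stride of 8 pixels
speedup = dense / strided                    # close to 8**2 = 64
```

The speedup is quadratic in the stride, but so is the loss of label resolution.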
To mitigate the salt-and-pepper problem, a region merging process is performed as a postprocessing step by segmenting the aerial image into superpixels using the SLIC algorithm. These superpixels are then processed using the DBSCAN algorithm to form clusters of superpixels and generate the final segmentation shown in Fig. 5; the details are given in Appendix A. The CNN labels belonging to the same superpixel should be identical, so the label of each superpixel is decided by majority voting over the pixel labels that belong to it, as shown in Fig. 6. For every superpixel, the number of pixels that belong to each class is counted; the class with the highest count is the dominant class in the superpixel. The labels of all the pixels within the superpixel are then changed to the dominant class.
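A minimal version of the voting step, given a per-pixel CNN label map and a superpixel (or DBSCAN-cluster) map, might look as follows (names are illustrative):

```python
import numpy as np

def superpixel_vote(cnn_labels, superpixels, n_classes):
    """Replace every pixel's CNN label by the dominant label of its superpixel."""
    out = np.empty_like(cnn_labels)
    for sp in np.unique(superpixels):
        mask = superpixels == sp
        votes = np.bincount(cnn_labels[mask], minlength=n_classes)
        out[mask] = votes.argmax()        # dominant class wins the vote
    return out
```

Isolated mislabeled pixels inside a superpixel are overwritten by the majority, which is exactly how the salt-and-pepper errors are suppressed.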

B. Superpixel-Based CNN
The GCP has a unique hourglass shape and its color is dark. To take advantage of these features in the classification, a superpixel-based method is proposed. The workflow of the superpixel-based CNN is as follows.
1) The input image is first partitioned into superpixels using SLIC as shown in Fig. 7. The initial size of the superpixel should be larger than the GCP so that the pixels of the GCP belong to the same superpixel.
2) The superpixels are interpolated to be the same size as one another and placed on a homogeneous background.
The color of the background should be different from the color of the target; otherwise, the superpixel will merge into the background and the shape of the superpixel cannot be preserved in the patch. In this article, white is chosen as the color of the background.
3) The patches with superpixels are used as the input to the CNN, and the label of the superpixel is the output. A 3500 × 5000 pixel aerial image can be partitioned into 30 000 superpixels, so only 30 000 classifications are required for the whole image. The computational performance of this method is 10^4 times faster than the sliding-window method without losing any resolution.
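Step 2 of the workflow, cutting a superpixel out of the image and pasting it on a white background before resizing, can be sketched as follows (nearest-neighbour resizing stands in for whatever interpolation is actually used; names are illustrative):

```python
import numpy as np

def superpixel_patch(img, superpixels, sp_id, out_size=32, bg=255):
    """Extract one superpixel, paste it on a white background, and
    resize its bounding box to out_size x out_size (nearest neighbour)."""
    ys, xs = np.nonzero(superpixels == sp_id)
    y0, x0 = ys.min(), xs.min()
    h, w = ys.max() - y0 + 1, xs.max() - x0 + 1
    crop = np.full((h, w, img.shape[2]), bg, dtype=img.dtype)  # white canvas
    crop[ys - y0, xs - x0] = img[ys, xs]      # keep only this superpixel
    ry = np.arange(out_size) * h // out_size  # nearest-neighbour row indices
    rx = np.arange(out_size) * w // out_size
    return crop[ry][:, rx]
```

The white canvas preserves the superpixel's silhouette, which is the shape cue the CNN relies on.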
Different from the sliding-window CNN, DBSCAN is not used in this method, although it could decrease the size of the input data. After applying SLIC to the image, most of the superpixels from non-GCP objects have a hexagon-like shape, as shown in Fig. 7, which is useful for distinguishing them from the hourglass shapes of GCPs. If DBSCAN were applied to the superpixels, the complexity of the superpixel shapes would greatly increase, and much more data would be needed to ensure that the training data contain all the different shapes of the superpixels.

IV. NUMERICAL TESTS
The CNN methods discussed above are applied to a 2-D aerial image shown in Fig. 8 to identify the GCPs. The full image size is 3500 × 5000 pixels, and the size of a GCP is about 25 × 30 pixels. The number of image patches with and without GCPs in this image is extremely unbalanced.

A. Binary Classification
Detecting the locations of GCPs in the drone photos can be interpreted as a binary classification problem. The architecture of the CNN is illustrated in Fig. 2, and the aerial image patch is used as the input layer. The CNN parameters for classification are listed in Table I.

1) Sliding-Window CNN With Superpixel-Level Voting:
Sliding-window CNN with superpixel-level voting is first applied to the image. To build a balanced training set, 18 000 negative patches are randomly picked from this image and 18 000 positive patches are collected from other aerial images. Eighty percent of the patches are used as the training set, and the rest are used as the validation set.
After 50 epochs, the accuracies of both the training set and the validation set are around 99%. The CNN labels in the aerial image are shown in Fig. 9(a). The points in yellow and in blue denote the labels for the GCPs and the non-GCPs, respectively. There are about 2.5 × 10^6 misclassifications in the result, especially among the pixels of sand, cars, and houses. After postprocessing by superpixel-level voting, the number of false positives is greatly reduced to 3.6 × 10^4, as shown in Fig. 9(b). The labels of the GCP are preserved after the voting, as shown in Fig. 10. However, it is still impossible to find the location of the GCP among so many false positives.
There are many types of objects in the aerial image, such as cars, houses, sand, roads, and so on. A proper training set should include most types of objects in the image. Although the training set we used contains patches with the GCP, it does not include enough samples with other objects. To improve the performance of the sliding-window CNN, more data should be included in the training set while the number of data examples for each class should be balanced.
2) Superpixel-Based CNN: Different from the sliding-window CNN, the superpixel-based CNN segments all the objects into superpixels so that the superpixels from all the non-GCP objects have similar shapes and colors, while the GCP has a unique hourglass shape and dark color. The dataset for the non-GCPs and GCPs can then be easily balanced.
The whole image is segmented into approximately 30 000 superpixels using the SLIC algorithm, and the superpixel-based CNN is applied to the SLIC-processed data. Since the shape and the color of the GCP are known, there is no need to extract positive patches from the aerial images. Three thousand synthetic positive training examples are easily built by rotating, shrinking, and expanding a dark-colored hourglass-shaped superpixel, as shown in Fig. 11. Three thousand negative superpixels are picked randomly from the superpixels in the image. Only 10 epochs are needed to achieve better than 99% accuracy on the training and validation examples. The trained CNN is then applied to all of the superpixels in the image. Only five superpixels are classified as GCPs, as shown in Fig. 12, and all of them have similar hourglass shapes and dark colors. The number of false positives is only four, which is much less than the 3.6 × 10^4 in the sliding-window CNN.
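The synthetic positive examples can be sketched with simple array operations (only right-angle rotations and flips are shown here; the article's arbitrary rotations and rescalings would need an interpolating image library, and the function name is illustrative):

```python
import numpy as np

def augment_hourglass(template):
    """Generate synthetic positive patches from one hourglass-shaped
    superpixel template: four right-angle rotations, each also flipped."""
    variants = []
    for k in range(4):
        rot = np.rot90(template, k)   # rotate by k * 90 degrees
        variants.append(rot)
        variants.append(np.fliplr(rot))
    return variants
```

Because the GCP's appearance is known a priori, positives can be manufactured instead of harvested, which is what keeps the training set balanced.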
To remove the false positives, the pretrained sliding-window CNN can be applied to these five positive patches. All the false positive patches are then removed and the GCPs are correctly located. The same workflow is applied to the two new images shown in Fig. 12(b) and (c). The correct locations of the GCPs in these two images can be easily obtained from the final positive patches.

B. Multiclass Classification
To test whether these methods can be extended to the detection of other objects, multiclass classification is performed with both methods. The architecture of the multiclass CNN is similar to the binary CNN, except that the output layer is multiclass instead of having only two labels. The loss function for multiclass classification is defined as

L = -Σ_{c=1}^{M} b_{o,c} log(p_{o,c})

where M is the number of classes, p_{o,c} is the predicted probability that the observation o is of class c, and b_{o,c} is the binary indicator (0 or 1) of whether class label c is the correct classification for observation o. The probability p is computed by applying a softmax operation to the output at each node of the CNN. To obtain the training set for the CNN, the aerial image is labeled manually as shown in Fig. 13. There are six classes in the aerial image: background, road, car, sand, house, and GCP. The CNN parameters for multiclass classification are the same as in Table I.
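For a single observation, the multiclass loss with a one-hot indicator reduces to the negative log-probability of the true class; a NumPy sketch (names are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))          # shift for numerical stability
    return e / e.sum()

def multiclass_loss(logits, true_class):
    """-sum_c b_{o,c} log(p_{o,c}) with b one-hot:
    just -log p of the true class."""
    p = softmax(np.asarray(logits, dtype=float))
    return -np.log(p[true_class])
```

A confident, correct prediction drives the loss toward zero; a uniform prediction over M classes yields log M.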

1) Sliding-Window CNN With Superpixel-Level Voting:
In the sliding-window CNN with superpixel-level voting, 6000 samples are picked randomly from each class in the manually labeled image. However, only 750 pixels belong to the GCP in this image, so the training data for the GCP class are collected from other aerial images. Eighty percent of the patches are used for the training set, and the rest are used for the validation set.
After 200 epochs, the accuracies of the training set and the validation set are 98.1% and 98.0%, respectively. The CNN labels for multiple classes are shown in Fig. 14(a).
In Fig. 14(a), there are many salt-and-pepper errors, such as small cars labeled as houses and small houses labeled as background. To remove these misclassifications, the superpixel-level majority voting is performed by segmenting the image using the SLIC and DBSCAN algorithms. The effect of this voting process on the entire image can be seen in Fig. 14. Fig. 15(a) and (d) shows the selected regions with CNN labels. Fig. 15(b) and (e) shows the segmentation of the selected regions using SLIC. It can be seen that individual objects, such as cars and trees, are segmented into multiple superpixels. The CNN labels belonging to the same superpixel should be identical, so the labels in each superpixel are merged into the label of the majority class in that superpixel. Fig. 15(c) and (f) shows the CNN labels after merging, where it can be seen that the labels after voting clearly represent the shapes of the objects. The confusion matrices before and after voting for the entire image are shown in Tables II and III. The salt-and-pepper errors have been largely removed and the accuracy for each class is improved. For example, the accuracy of the house labeling without voting was 72.05%, compared to 85.09% with voting.
2) Superpixel-Based CNN: The superpixel-based CNN is also used for the multiclass classification. Three thousand GCP superpixel training examples are created in the same way as for the binary classification, and 3000 superpixels are picked randomly for each of the other classes from the superpixels in the image. Two hundred epochs are performed to achieve an accuracy greater than 98% on the training and validation sets. The CNN labels and the confusion matrix are shown in Fig. 16(a) and Table IV. Many background superpixels are labeled as sand since both the sand and background superpixels have a similar dark yellow color and hexagon-like shape. The sliding-window CNN can recognize the differences between the sand and the background since it takes the surrounding information into account.
To refine the superpixel-based CNN classification result, the majority voting used in the sliding-window CNN can also be applied, since DBSCAN includes some surrounding information. The majority voting removes many misclassifications, especially the background pixels labeled as sand, as shown in Fig. 16(b) and Table V.

V. DISCUSSION

Both the sliding-window CNN with majority voting and the superpixel-based CNN depend heavily on the successful application of SLIC: all the pixels in the GCP must be partitioned into the same superpixel. However, several factors can make the performance of SLIC unstable. The first factor is shadow; the color contrast of the GCP is not obvious when the GCP is located in a shadow, as shown in Fig. 17(a) and (b). The compactness variable m in (3) should be small so that SLIC can detect small color contrasts. The second factor is the number K of superpixels in SLIC. If K is too small, the GCP may be partitioned into a large superpixel that does not have an hourglass shape. If K is too big, the GCP may be separated into several small superpixels, as shown in Fig. 17(c) and (d). To mitigate this problem, DBSCAN can merge these superpixels into an hourglass-shaped superpixel. However, this increases the diversity of the superpixel shapes and decreases the accuracy of the CNN classification.
Another limitation of this method is the high computational cost of SLIC for large images. To mitigate this problem, selective search [19] or a region proposal network [20] can be used to replace SLIC.

VI. CONCLUSION
We used both a sliding-window CNN and a superpixel-based CNN to automatically extract the locations of objects from aerial photographs.
In the sliding-window CNN, the SLIC and majority voting are applied in the postprocessing to remove the misclassifications from the CNN results. This method accurately detects the locations of various objects and clearly delineates their boundaries. However, it cannot unambiguously locate the GCP due to the creation of many false positives.
In the superpixel-based CNN, SLIC is applied in the preprocessing to extract the unique shape and color of the GCP. This method can quickly narrow the scope to fewer than ten superpixels labeled as GCPs. An additional sliding-window CNN can further eliminate false positives. The numerical tests indicate the efficiency of the method in locating the GCP, but it does not perform well in locating other objects since the superpixels do not contain the background information.

APPENDIX A DENSITY-BASED SPATIAL CLUSTERING OF APPLICATIONS WITH NOISE
DBSCAN is a density-based data clustering algorithm [21], [22]. Given a set of points in some space, it groups together points that are closely packed (points with many nearby neighbors) and marks as outliers the points that lie alone in low-density regions (whose nearest neighbors are too far away). Two parameters need to be specified in DBSCAN: a distance threshold ε and a minimum number of points MinPts. DBSCAN starts with an arbitrary seed point that has at least MinPts points within the distance ε. Each of these nearby points is then searched. For a given nearby point, if there are fewer than MinPts points within its radius ε, the point is called a leaf point, and the search stops there. If there are at least MinPts points within its radius ε, the point is called a branch point, and the search continues to the points around the branch point until all the nearby points are leaf points. The first round of the search is finished when no new branch points appear. All the points that have been searched in the first round belong to the same cluster and are never revisited. Then a new arbitrary point is picked and another round of searching starts. This continues until all of the points in the image are assigned. If a point has fewer than MinPts points within its radius ε and is not a leaf point of another cluster, it is labeled as a noise point. An example of DBSCAN is shown in Fig. 18.
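The search procedure above can be condensed into a short implementation (a sketch: points reachable from a branch point join its cluster, and everything left unreached ends up as noise; names are illustrative):

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: returns one label per point, -1 for noise."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    labels = np.full(n, -1)
    # precompute eps-neighbourhoods (each point counts itself)
    dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    neighbors = [np.nonzero(dist[i] <= eps)[0] for i in range(n)]
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue                      # already assigned, or not a branch point
        labels[i] = cluster               # start a new round from this seed
        stack = [i]
        while stack:
            j = stack.pop()
            for q in neighbors[j]:
                if labels[q] == -1:
                    labels[q] = cluster   # leaf or branch point joins the cluster
                    if len(neighbors[q]) >= min_pts:
                        stack.append(q)   # branch point: keep expanding
        cluster += 1
    return labels
```

Points that are never reached from any branch point keep the label -1 and are reported as noise.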
In this article, we apply DBSCAN to all the updated superpixel centers [L_k, a_k, b_k, x_k, y_k]^T. The distance between two centers is calculated using (3). If two superpixel centers belong to the same cluster, the superpixels associated with these centers are merged in the final segmentation.