KeepEdge: A Knowledge Distillation Empowered Edge Intelligence Framework for Visual Assisted Positioning in UAV Delivery

Unmanned Aerial Vehicle (UAV) delivery is increasingly used in the field of logistics. However, it is highly challenging for a UAV to precisely identify the parcel delivery position if it is aided only by GPS, especially in complex environments with weak signals and high interference. To address this issue, we present a knowledge distillation empowered edge intelligence framework, KeepEdge, to achieve visual-information-assisted positioning for last-mile UAV delivery services. In our approach, we integrate deep neural networks (DNNs) into an edge computing framework to enable edge intelligence, which empowers the UAVs to autonomously identify the expected delivery position using visual information. Deploying the DNN model and conducting model inference on UAVs, however, requires high computing performance. To manage the trade-off between the limited resources onboard the UAVs and the high performance requirements, we employ knowledge distillation to produce a lightweight, high-accuracy model based on the full model trained in the Cloud. The lightweight model, with significantly lower complexity and lower inference latency, is used onboard the UAVs for accurate positioning. Comprehensive experiments show that the proposed framework achieves satisfactory performance for assisted positioning. A real-world case study is also presented to demonstrate the effectiveness of the proposed edge intelligence solution for UAV delivery services.


INTRODUCTION
Unmanned Aerial Vehicles (UAVs), generally known as drones, have become much-favored vehicles in many fields, such as transportation and surveillance. UAV delivery services are one of the emerging technologies poised to become a global norm, enabling new benefits and opportunities for the logistics sector [1].
Ideally, a UAV is expected to deliver a parcel straight to the hand of its recipient. To complete a delivery, the UAV navigates to the delivery address by setting a flight course based on Global Positioning System (GPS) coordinates. When the UAV arrives within visual range, it locates and authenticates the recipient, then delivers the parcel. For instance, Cambridge Consultants implements a drone delivery system with a two-stage routing process. The UAV first reaches the recipient within visual range by GPS-based navigation. It then locates and authenticates the person receiving the package using precision optical tracking and a 3D imaging and ranging system. Once the UAV reaches the delivery area, the recipient points their mobile phone's flash at the drone, blinking a coded pattern. The drone hovers at a safe distance above the flashing LED, verifies the code, releases the package, and returns to the dispatch center. Therefore, to complete an efficient delivery, it is essential to ensure that the UAV reaches the recipient within visual range. In other words, a UAV is expected to reach the appointed position where the recipient prefers to receive the parcel. Nevertheless, in complex environments such as dense urban areas, GPS-based navigation and positioning become less reliable, mainly due to weak signal reception and high interference. Hence, the UAV might not be able to arrive at the expected position for at least the following two reasons.
First, it is difficult for a UAV to get to the exact location of the target building (i.e., the order address) using only the GPS, which is not always reliable. The official U.S. government information about GPS confirms that many factors degrade GPS positioning accuracy, such as satellite signal blockage by buildings and bridges, signals reflected off buildings, and radio interference (see https://www.gps.gov/systems/gps/performance/accuracy/). In Fig. 1, the impact of high-rise buildings is illustrated: they can reflect the signal and obstruct line-of-sight visibility of the satellites. Second, the UAV may not be able to recognize the precise position on the building where the recipient is ready to take the parcel based on the GPS alone. As illustrated in Fig. 2, for buildings such as dorms, shopping malls, and hospitals, there may be many optional delivery positions around the building that share the same address, namely the same GPS location. In such cases, the UAV must be able to efficiently identify the accurate position specified by the recipient. The visual information acquired from the UAV-mounted cameras is considered an effective complement to the GPS signal. In particular, deep-learning-based UAV assistive technologies have been extensively studied to provide high-accuracy visual-aided navigation and positioning [5], [6].
Recently, several works have investigated the use of deep learning models on resource-constrained UAVs. A straightforward approach is to reduce the capacity of the neural network so that the trained model can be deployed onboard while achieving a satisfactory inference time. However, such an approach inevitably results in lower accuracy [7]. Other approaches have been proposed to reduce the onboard computation for training deep neural networks (DNNs). For example, transfer learning has been employed to reduce the resources required to train a DNN: the network is first trained on the cloud or a workstation, and only the domain knowledge, i.e., the last few fully connected layers, is trained on the UAV [8]. Onboard training and inference may, however, introduce long latency unless the UAV is equipped with high-performance modules such as GPUs. In general, Cloud-Device-oriented deep learning solutions are inadequate due to the intrinsic trade-off among the key aspects of the system: computational resources, accuracy, and latency (associated with the inference time). To deal with this issue, deep learning models should be further customized to fit resource-constrained edge devices (e.g., UAVs in our scenario), while carefully handling the mentioned trade-off through a new hierarchical architecture.

This paper presents a novel assisted positioning solution for UAV delivery services based on the edge computing architecture. Our proposed method utilizes visual information to autonomously assist in identifying the expected delivery location. In our solution, the user is required to attach a photo to the order. This photo depicts the position on the building where the recipient is ready to receive the parcel (i.e., the exact delivery position).
The proposed solution involves two main steps. The first step is to identify the target building based on the order address. The UAV reaches the approximate delivery location using the GPS, and then identifies the target building amongst the neighboring buildings. The second step is to identify the accurate location on the target building where the parcel is expected to be delivered. The UAV arrives within visual range of the recipient and delivers the parcel after locating and authenticating the recipient. To identify the exact delivery position, the UAV takes photos of the target building as it circles around it, matching each photo captured by the UAV-mounted camera against the photo uploaded by the user. Recipient localization and authentication are beyond the scope of this paper and are therefore excluded.
Different from the existing approaches with a Cloud-Device two-tier architecture, we adopt the edge computing framework to accommodate the computational demands of the DNN on resource-limited UAVs. In this method, the deep learning model is trained offline on a dataset (including ground-view images and UAV-view images) on the cloud server. To manage the mentioned trade-off, knowledge distillation (KD) is used to create a lightweight DNN model on the edge server (e.g., the base stations in our scenario), which approaches the accuracy of the full model trained in the cloud with considerably lower inference latency. The lightweight model is then pushed onboard the UAV when it approaches the base station. Model inference is conducted onboard to identify the target building and the exact delivery position.
To demonstrate the effectiveness of the proposed solution, we conduct comprehensive experiments on a dataset collected on the campus of Anhui University. Experimental results confirm that the proposed framework can ensure satisfactory positioning performance. Using knowledge distillation, the obtained lightweight DNN model is deployed on the UAV, and its accuracy is comparable with that of the full model trained in the cloud.
The main contributions of this paper are the following.
1) We propose a visual assisted positioning solution for last-mile UAV delivery services, which enables the UAV to automatically identify the expected delivery position in scenarios where the GPS signal is weak or unstable. This solution acts as a complement to the GPS-based positioning system.
2) To implement the visual assisted positioning solution, we propose an edge intelligence framework, namely KeepEdge, that implements the DNN on the edge computing framework to enable onboard positioning for autonomous delivery. We use knowledge distillation between the cloud and the edge servers to strike a balance between the onboard resource constraints and the model performance requirements. This enables DNN model inference to be conducted on the UAV with satisfactory performance in both accuracy and inference delay.
3) Comprehensive experiments based on a real-world dataset show that knowledge distillation effectively improves the model's performance when the number of training samples on the edge server is small. The model deployed on the UAV achieves high performance, with over 93% precision for a training set of 200+ images.

The remainder of this paper is organized as follows. Section 2 analyzes the UAV delivery scenario and the research problem. Section 3 provides an overview of our framework and presents the edge intelligence framework KeepEdge. Section 4 presents the experimental results, and Section 5 reviews the related work. Section 6 concludes this paper.

SCENARIO AND PROBLEM ANALYSIS
In urban areas, there are normally many distribution centers whose locations are optimized based on cost, environmental, and social criteria and constraints [9]. Each distribution center defines a last-mile segment for parcel delivery. From the perspective of logistics, packages are first transferred by trucks from the warehouses to the urban-wise distribution centers. They then cascade down to the region-wise dispatch centers based on their delivery addresses. Finally, the parcels are delivered to the consumers from the respective region-wise dispatch centers. Since UAVs are not limited by traffic and terrain, UAV delivery offers enormous potential to overcome the last-mile delivery hurdles in urban environments.
In this paper, we focus on the region-wise UAV delivery services in urban areas. To start a delivery, the system converts the delivery address to a GPS location. A UAV is then dispatched to the corresponding region for the final delivery. Once the UAV arrives over the approximate delivery location based on the GPS signal, it switches to low-speed mode and takes photos at a certain frequency with its camera to identify the target building amongst the neighboring buildings, then determines the exact delivery position on the building.
According to the above scenario, we can summarize the key requirements for a successful delivery.
1) High accuracy. The matching algorithm should be highly accurate to ensure an efficient delivery. In addition, the UAV is expected to accurately identify the target position with limited a priori knowledge of new locations.
2) Real-time decision. The UAV takes photos at a certain frequency as it circles around the buildings. After each photo, a real-time decision should be made on whether the current view of the UAV matches the photo provided by the user; otherwise, the decision becomes stale due to the delay.

DNNs have been widely applied in image recognition and matching. However, given the above two performance requirements and the limited computational resources, applying a DNN in our scenario is challenging. To ensure high accuracy, the DNN model needs to be trained with sufficient computational resources on sufficiently large training sets. One may therefore consider a Cloud-Device-architecture-based approach, i.e., sending images to the well-trained DNN model on the cloud server for inference. Nevertheless, the interaction between the remote cloud server and the UAV raises the following two issues, which make the Cloud-Device architecture inapplicable in the considered scenario.
1) The interaction takes significant time and most likely results in invalid inference results due to the transmission delay. The real-time requirement may not be satisfied.
2) The data transmission between the remote cloud server and the UAV may suffer interference (or even interruption) from many factors, considering the continuous movement of the UAV. Therefore, the reliability of the interaction process cannot be guaranteed.

Considering the above issues, model inference should ideally be carried out on the UAV. For this purpose, the deep learning model should be customized to fit the resource-constrained UAV while meeting the above performance requirements. This is the key challenge addressed in the following sections.

KEEPEDGE: KNOWLEDGE DISTILLATION EMPOWERED EDGE INTELLIGENCE FRAMEWORK
To address the above key challenge, we propose to integrate the DNN model into the edge computing architecture to achieve edge intelligence for the UAV delivery service. In this section, we first provide an overview of the proposed framework, KeepEdge. Then, we present the implementation details of the edge intelligence for autonomous positioning.

Fig. 3 illustrates the proposed edge intelligence framework KeepEdge. In the UAV delivery scenario, urban-wise distribution centers are regarded as the cloud servers, which provide sufficient computational resources for training the DNN models. Base stations located in the delivery area of each region-wise dispatch center are regarded as the edge servers, with finite computational and storage resources. There are normally several base stations in each delivery area; the buildings (i.e., delivery addresses) within the delivery area of a dispatch center are therefore partitioned by the coverage of the base stations. The UAVs are the end devices in the KeepEdge framework. In Fig. 3, the DNN model is trained on the cloud server offline. After a parcel arrives at a region-wise dispatch center, it is assigned to a UAV for final delivery. In the meantime, the corresponding edge server (i.e., base station) is determined based on the delivery address. The lightweight DNN model on the edge server is trained by knowledge distillation with local images. When the UAV approaches the range of the corresponding base station, the lightweight DNN model is downloaded from the base station. Different from the large model on the cloud servers, the lightweight DNN model is a compressed model that can be deployed on the UAV. When the UAV arrives at the approximate location of the delivery address using the GPS signal, it first identifies the target building, then circles around this building taking photos and matching them against the image attached to the order. This determines the appointed position where the recipient receives the parcel. The obtained photos are transmitted to the edge server after the delivery and further uploaded to the cloud server for iterative training of the full DNN model; a small portion of the photos is stored locally on the edge server. The updated full model, in turn, helps to train a better lightweight model on the edge server with the new knowledge and the local images.
This interactive training process continuously improves the performance of the lightweight model.
The edge server, as a connection point between the cloud server and the UAVs, plays an important role in the KeepEdge framework. On the one hand, the edge server acts as a relay node that stores the local images collected by the UAVs and forwards them to the cloud server over fiber-optic cable. On the other hand, the distributed edge servers receive general knowledge from the cloud server. The lightweight model can then be trained on each edge server with the general knowledge and local images. In this way, the localized lightweight model can perform well in the specific scenario covered by the current edge server.
After a UAV finishes a delivery, it transmits the photos to the edge server over a wireless link. Today, 4G LTE and 5G offer the most plausible wide-area connectivity from a UAV in flight to its associated edge server. In addition, more specialized wireless technologies for UAVs, such as Lightbridge 2, can also be used. Since this paper mainly focuses on the visual-assisted positioning solution and the edge intelligence framework, the specific wireless technology and communication model are not discussed in detail. Some existing communication models, such as the one presented in [3], are applicable in our scenario.
Please note that in this paper, the conversion mechanism from delivery address to GPS location is not discussed, as it is an essential function that should be included in any existing UAV delivery system. In addition, KeepEdge is an improvement over GPS-based positioning for accurately locating the delivery position, but GPS-based positioning is still required for UAV navigation during the flight.
Dataset for DNN Model Training on the Cloud Server

To learn discriminative image representations, the DNN model is trained on a dataset that contains UAV-view images and street-view images. The combination of the two different viewpoints provides a comprehensive view that captures rich information about the target buildings.
In general, the UAV is the primary platform for collecting UAV-view images. At the initial stage of the UAV delivery system, UAVs in a region-wise dispatch center are assigned to collect pictures of the buildings in the current region. To guarantee the availability of sufficient images, the UAV first records video during the flight; the video frames are then extracted as separate images. These images are shared with the corresponding edge servers to train the initial localized student model. As the delivery system continues to operate, the UAVs constantly collect images during deliveries, which are uploaded to the cloud server. The street-view pictures can be obtained from publicly available tools such as Google Street View and image search engines [10].
Next, we introduce the training process of the lightweight DNN model on the edge server and the process of student model inference for identifying the target delivery location on the UAV. The corresponding algorithms of the two processes are presented in Section 3.2.

Knowledge Distillation for Assisted Positioning
Distillation is a technique that extracts and transfers the implicit knowledge from a well-trained but cumbersome model (the teacher model, model_T) to a lightweight model (the student model, model_S) [11]. Considering the performance requirements and the limited computational resources of the UAV, deploying the full DNN model on the UAV is impractical. To achieve real-time decision-making with satisfactory accuracy for model inference on the UAV, we employ knowledge distillation to train a significantly compressed student model with near-teacher accuracy on the edge server, and then deploy it on the UAV. The training process of model_T includes three steps. (1) We use the pre-trained ResNet-152 model to accelerate the training of the network [25]. To achieve better performance, we initialize the teacher network with the model pre-trained on ImageNet, which contains over 14 million images. (2) We adopt domain adaptation to transfer the knowledge learned by the pre-trained ResNet-152 to the teacher model, which enables the teacher model to achieve high performance. Specifically, we tailor the ResNet-152 model by removing the original classifier and inserting a fully connected layer and a classification layer after the average pooling layer. LeakyReLU is selected as the activation function. (3) We jointly fine-tune the parameters of ResNet-152 and the classifier by maximizing the probability of the correct labels with the cross-entropy loss. Given an input image, the teacher network computes a feature vector h_c ∈ R^d, upon which we build a classifier for image classification with l_c labels, where d is the feature dimension of the image. The predicted probabilities are y_c = softmax(W_c h_c), where W_c ∈ R^{l_c×d} in the teacher network. The output of the teacher model on the cloud server is the soft target (i.e., the knowledge).

Knowledge Distillation From Cloud Server to Edge Servers
For the student network, model_S is formed by a K-layer CNN with max-pooling, together with several fully connected layers and a non-linear classifier. The structure of model_S is shown in Fig. 5. After the images are fed into the K-layer CNN, the hidden states of the last layer h_e ∈ R^d are fed to a fully connected layer with LeakyReLU. The output is then passed to a Softmax layer for classification with l_e labels. The predicted probabilities are y_e = softmax(W_e h_e), where W_e ∈ R^{l_e×d}.
In Fig. 5, the value of K is determined by the capacity of the edge server. It is worth noting that we restrict the architectural changes to a minimum to revisit the embedding power of the CNN itself and avoid any complex neural layers, thereby reducing the computational complexity. For the same reason, the scale of the Softmax layer in the student network is smaller than that of the teacher network. When the UAV flies into the coverage of an edge server using the GPS signal, the well-trained student model is automatically downloaded to the UAV for model inference.
In addition to optimizing the DNN model and the KD process, the precision of image matching can also be improved through system settings. For example, the UAV can be set to take two or more consecutive photos at approximately the same angle of the target building for model inference. The first photo is used to identify the delivery position, and the rest are used for verification. Such a setting can increase the precision, but the trade-off between efficiency and performance needs to be considered based on the requirements of the specific application scenario.
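The consecutive-photo verification rule described above might be sketched as follows; `verified_match` and its parameters are hypothetical names introduced purely for illustration:

```python
def verified_match(predicted_labels, target_label, n_confirm: int = 2) -> bool:
    """Commit to a delivery position only once n_confirm consecutive
    photos agree on the target label (illustrative rule)."""
    run = 0
    for label in predicted_labels:
        # extend the run on agreement, reset it on any mismatch
        run = run + 1 if label == target_label else 0
        if run >= n_confirm:
            return True
    return False
```

Raising `n_confirm` trades extra hovering time (more photos) for fewer false positives, which is exactly the efficiency/performance trade-off mentioned in the text.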

Loss Function
The student network can learn to mimic the teacher network's behavior with two different distillation objectives: logits [9] and SoftTarget [11]. In the proposed framework, the SoftTarget learning objective is chosen. The loss of the student network is defined as

Loss_student = α H(y_e, P_e) + (1 − α) L(P_c(T), P_e(T))    (1)

where α controls the balance between the two tasks of fitting the hard targets on the student network and approximating the softened teacher network, and P_c(T) and P_e(T) are the outputs of the teacher network and the student network, respectively, after temperature scaling with temperature T. In the loss function, the prediction of the student network is a function of the activations in the penultimate layer, y_e = softmax(W_e h_e). Thus, the left part of the loss function minimizes the expected cross-entropy between the true targets y_e and the network outputs P_e, i.e., H(y_e, P_e) = −Σ_{l_c=1}^{L} y_{l_c} log(P_{l_c}), where y_{l_c} = 1 for the correct class and y_{l_c} = 0 for the rest. In the proposed framework, the student model learns to classify the delivery location with the limited images and the knowledge from the teacher model. Algorithm 1 describes the training process of the student model.
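Equation (1) can be sketched in PyTorch using the standard temperature-scaled form of SoftTarget distillation. The T and α values are illustrative, and the T² gradient rescaling follows common KD practice (Hinton et al.) rather than anything stated explicitly in the text:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      hard_labels: torch.Tensor,
                      T: float = 4.0, alpha: float = 0.5) -> torch.Tensor:
    """alpha * H(hard targets, student) +
    (1 - alpha) * L(P_c(T), P_e(T)), cf. Eq. (1)."""
    # Hard-target term: cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, hard_labels)
    # Soft-target term: KL divergence between temperature-scaled
    # teacher and student distributions.
    p_e = F.log_softmax(student_logits / T, dim=1)  # P_e(T), log-space
    p_c = F.softmax(teacher_logits / T, dim=1)      # P_c(T)
    soft = F.kl_div(p_e, p_c, reduction="batchmean") * (T * T)
    return alpha * hard + (1.0 - alpha) * soft
```

During training, the teacher logits are computed once per batch with gradients disabled, so only the student's parameters are updated.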

Selection of Teacher Model and Student Model
Despite the recent interest in variants of KD, there is still a lack of adequate theoretical backing for the efficacy of KD and its dependence on the teacher-student architecture [17]. For the teacher model, we select the ResNet-152 pre-trained on ImageNet as the backbone model, and adopt domain adaptation for better performance. Note that ResNet-152 is a representative deep neural network that achieves high performance for image classification and other recognition tasks; other networks can also be considered as the teacher model.
We design the student model based on the following suggestion: student-teacher similarity is highly related to how well the student can mimic the teacher; if the student is similar to the teacher, it will produce outputs similar to the teacher's [2], [11]. To maintain high student-teacher similarity, the student model model_S_i is formed by a K-layer CNN with max-pooling, together with several fully connected layers and a non-linear classifier. The K-layer CNN in the student network can be treated as a miniature of the teacher network, where the ResNet-152 used in the teacher network consists of a 152-layer CNN with residual learning. In our scenario, the student network is trained with the loss of the teacher network, where K is much smaller than 152. The teacher network and the student network share the same classifier layer to predict the building label. The model complexity O(m) of model_S_i is determined by the layers in the student model. We define the complexity of a convolutional layer of the CNN as O(o), where o denotes the dimension of the output layer. The total complexity of model_S_i can be formalized as O(m) = O(K·o) + O(h_e) + O(l_e). Since the number of labels l_e is far smaller than l_c, and K is far smaller than 152, the complexity of model_S_i is much smaller than that of the teacher model model_T.

Time Complexity Analysis of Model Inference
Algorithm 2 describes the process by which a UAV identifies the delivery position. First, when the UAV flies to the approximate location of the delivery address using the GPS, it takes photos of the surrounding buildings (line 8) and identifies whether the current building is the target building by model inference (lines 9-10). After the target building is determined, the UAV flies around the building and takes photos to identify the exact delivery position (line 17). The key to identifying the position is to calculate the similarity between the feature vectors of the photo provided by the user and the photos taken by the UAV.
Algorithm 2. The inference process of the UAV to identify the target delivery location
1: INPUT: GPS, the model_S_j on the target edge server e_j, and the target photo p_t provided by the user.
2: OUTPUT: the exact delivery position.
3: The UAV flies to e_j based on the GPS signal.
4: Define S_{e_j} to save the photos collected during the delivery.
5: Download model_S_j from e_j to the UAV.
6: Run inference with model_S_j on p_t to obtain the label l_t of the target photo.
7: repeat
8:   The UAV takes a photo p_uav on the flight route;
9:   Run inference with model_S_j on p_uav to obtain the label l_uav;
10:  if l_t != l_uav then
11:    Continue;
12:  else
13:    Save p_uav and l_t in S_{e_j};
14:  end if
15: until l_t = l_uav.
16: repeat
17:   The UAV hovers around the target building and takes a photo p_uav_l;
18:   Save p_uav_l and l_t in S_{e_j};
19:   Calculate d = cosine(p_uav_l, p_t).
20: until d reaches convergence
21: Output the position label and finish the delivery.
We use the cosine similarity to determine whether the location in the current photo p_uav_l is the exact delivery position where the recipient prefers to take the parcel (line 19). All the photos obtained during the delivery are saved in a set S_{e_j} for the interactive training of the full DNN model on the cloud server.
Based on the description above, the time complexity of Algorithm 2 can be formulated as

T = O(m)·log(l) + O(n)·k    (4)

where O(m)·log(l) and O(n)·k are the time complexities of identifying the target building and the exact delivery position on the building, respectively. In Equation (4), O(m) is the complexity of model inference on the UAV; l is the number of buildings within the coverage of the base station, which is normally small in real-world scenarios; O(n) is the complexity of the cosine operation, which depends on the dimensions of the features of p_uav_l and p_t; and k is a constant equal to the number of photos taken before the delivery location is identified while the UAV hovers around the building. Note that O(m) is determined by the scale of the student network, which is much smaller than that of the teacher network. The time complexity of Algorithm 2 is thus polynomial, and the algorithm can achieve real-time performance when identifying the target building and the exact delivery position onboard.
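The cosine-similarity check of line 19 can be sketched as follows; the threshold value and function names are our own illustrative choices, with the feature vectors assumed to come from the student model's penultimate layer:

```python
import torch
import torch.nn.functional as F

def is_delivery_position(feat_uav: torch.Tensor,
                         feat_target: torch.Tensor,
                         threshold: float = 0.9) -> bool:
    """Compare the feature vector of a UAV photo (p_uav_l) against that
    of the user-provided photo (p_t) via cosine similarity d."""
    d = F.cosine_similarity(feat_uav.flatten(), feat_target.flatten(), dim=0)
    return d.item() >= threshold
```

The complexity of this check is linear in the feature dimension, consistent with the O(n) term in Equation (4).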

EXPERIMENTS
In this section, we conduct experiments to evaluate the performance of KeepEdge. We implement KeepEdge using the open-source deep learning library PyTorch. The three-tier Cloud-Edge-Device experimental environment is simulated using different platforms, as shown in Table 1.

Dataset
The dataset contains UAV-view images and street-view images. We use a DJI Mavic Air drone with a 4K Full-HD video camera to collect the UAV-view images of 21 buildings (labeled 000 to 020) on the campus of Anhui University in the city of Hefei, China. To capture scale changes and obtain comprehensive viewpoints, the height of the UAV gradually increases from 30 meters to 60 meters along a spiral curve. The UAV records the flight video, from which we extract video frames and crop them to 512×512 pixels, yielding 18368 UAV-view images. For the street-view images, we manually record videos of the buildings with a phone camera and crop the frames into 17514 images of the same size. The dataset therefore contains 35882 images in total.
To train the full model on the cloud server, we split the dataset into training and test sets, which include 29342 images (15023 UAV-view and 14319 street-view) and 6540 images (3345 UAV-view and 3195 street-view), respectively. In addition, we assume that the selected base station covers 8 adjacent buildings. Hence, we select the UAV-view images of these 8 buildings (labels 002, 003, 007, 009, 014, 016, 017, and 020) as the sub-dataset to train the student model on the edge server. We set four training sets including 112, 224, 336, and 448 images.

Performance of the Teacher Model on the Cloud Server
The network structure of the teacher model Model_T in our experiment was introduced in Section 3. LeakyReLU (α = 0.1) is selected as the activation function. We compare Model_T with four representative pre-trained networks, including two VGGNets (VGG-16 and VGG-19) [33] and two residual networks (ResNet-101 and ResNet-152) [25]. Table 2 shows the comparison results of the different network models on the same test set (6540 images). The results show that Model_T outperforms the other four baseline models: its accuracy, precision, recall, and F1 score are all over 99%. However, the parameter sizes and model sizes of these models are large; in particular, the size of VGG-19 reaches 575.9 MB. Hence it is difficult to deploy them on resource-constrained UAVs, and a lightweight network model is necessary to achieve onboard real-time decision-making.

Performance of the Model on the Edge Server
We train 4 DNN models EdgeModel_K (K = 1, 2, 3, 10) on the edge server based on the training set that involves 224 images of the 8 buildings within the range of the base station. EdgeModel_K shares the same structure as the model in Fig. 5, where K is the number of convolutional layers in the model. For example, EdgeModel_(K = 1) includes 1 CNN layer. LeakyReLU (α = 0.1) is selected as the activation function.
As seen in Fig. 6, the performance of the model generally improves as the model complexity increases, namely with larger parameter size and model size (Table 3). The models with smaller parameter and model sizes (e.g., EdgeModel_(K = 1) and EdgeModel_(K = 2)) can be deployed on resource-constrained UAVs, but their key performance indicators (including accuracy, precision, recall rate, and F1 score) are below 80%, which may not meet the requirements of real-world applications.
To evaluate the impact of the training set size on the performance of the model trained on the edge server, we train EdgeModel_(K = 2) on the training sets with 112, 224, 336, and 448 images, respectively. The test set is the same as in the experiments above. As shown in Table 4, the model performance improves by at least 14% as the training set size increases. In general, however, the performance of EdgeModel_(K = 2) trained on any of the four datasets cannot meet the demands.
Next, we introduce KD to train the student model Model_S (K = 2) on the training sets of different sizes. Soft Target is used to transfer knowledge to the distilled model. Fig. 7 compares the performance of the student model with KD against the original model without KD. The results demonstrate that knowledge distillation can remarkably improve the performance of the DNN model on the edge server: the student model achieves over 93% on all the indicators despite a small training set of only 224 images. Therefore, KeepEdge is effective even when few training examples are available on the edge server. A UAV can efficiently dispatch a parcel to an area that is rarely visited, and it performs better as the number of deliveries to similar locations increases.
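The Soft Target mechanism can be sketched in plain Python: the teacher's logits are softened with a temperature T, and the student is trained to match the softened distribution via cross-entropy. The temperature value below is hypothetical, as the paper does not report the one it uses.

```python
import math

def softmax_t(logits, temperature):
    """Temperature-softened softmax; larger temperatures give softer targets."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(teacher_logits, student_logits, temperature=4.0):
    """Cross-entropy between teacher soft targets and student soft predictions."""
    p = softmax_t(teacher_logits, temperature)
    q = softmax_t(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))
```

Because the softened teacher distribution spreads probability mass over the wrong classes as well, each training image carries more information than a one-hot label, which is why the student can reach high accuracy from only a few hundred edge images.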

Performance of Student Model Using Different KD Methods
In the KeepEdge framework, we use Soft Target to transfer knowledge to the distilled model. In our experiments, Logits-based knowledge distillation [9] is selected as the baseline for comparison. Table 5 records the performance of the student model under the different KD methods, with the performance of the teacher model presented as a reference. We select the 2-layer CNN model as Model_S, and the training set includes 448 images. As shown in Table 5, the teacher model achieves over 99% accuracy, outperforming all the student models. The comparison of the three different student models demonstrates that KD helps improve their performance.
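For contrast with Soft Target, Logits-based distillation in the sense of [9] matches the raw (pre-softmax) logits of teacher and student directly, typically with a squared-error objective; a minimal sketch:

```python
def logits_loss(teacher_logits, student_logits):
    """Logits-matching KD baseline: mean squared error between raw logits."""
    assert len(teacher_logits) == len(student_logits)
    n = len(teacher_logits)
    return sum((t - s) ** 2 for t, s in zip(teacher_logits, student_logits)) / n
```

Matching raw logits skips the temperature-controlled softening step, which is one source of the performance gap between the two KD variants reported in Table 5.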

Comparison Result of the Proposed KeepEdge and Existing Approach
In this section, we compare KeepEdge with a representative drone-based geo-localization method, Univ-1652 in short [10]. Univ-1652 is a strong CNN baseline based on ResNet-50 pre-trained on ImageNet. In our approach, we select the 2-layer CNN model as Model_S and use Soft Target to transfer knowledge to the student model. As shown in Table 6, the two approaches achieve nearly the same performance on the training sets of different sizes. The accuracy and precision of KeepEdge are marginally higher than those of Univ-1652 as the size of the training set increases, and both approaches exceed 95% accuracy and precision with 448 training images. However, the parameter size and model size of Univ-1652 are two orders of magnitude larger than those of our approach. The model size of Univ-1652 reaches 106.8 MB, which makes it difficult to deploy on common UAVs with weak computational resources.

Case Study
In this section, we present a case study to demonstrate the effectiveness of the edge intelligence framework KeepEdge. After the UAV arrives at the approximate position of the delivery address based on the GPS signal, it identifies the appointed delivery position in two steps.

(1) Step 1: identifying the target delivery building among the surrounding buildings. We collect two groups of test datasets, each of which includes 1 street-view photo from the user and 10 UAV-view photos (8 photos of the 8 buildings within the coverage of the base station and 2 extra photos of other buildings). As shown in Table 7, our solution identifies the target delivery building through onboard student model inference. The target delivery addresses are in the buildings labeled 002 and 020 in case 1 and case 2, respectively.

(2) Step 2: identifying the delivery position on the target building where the recipient is ready to receive the parcel. Once the UAV has identified the target delivery building, it hovers around the building and determines the exact delivery position based on the photo provided by the user. In case 1, we take 10 photos of building 002 from different orientations to simulate the process of taking photos while hovering around the building. As shown in Table 7 (Case 1, Step 2), our solution determines the main entrance of building 002 as the expected location. After the delivery location is identified, the UAV stops hovering and flies toward the position for further localization and authentication of the recipient; in our previous work, we proposed techniques for localizing and authenticating the recipient through interaction between the UAV and the recipient [24]. The building in case 2 has a more complex architecture, but our method can still determine the exact delivery position. We also record the time consumed by onboard assisted positioning on hardware with different computing power.
Table 8 shows the time required for the two cases running on CPUs and a GPU. For step 1 in case 1, it takes about 22 seconds to complete 10 model inferences on an Intel Pentium 4415Y CPU, i.e., 2.2 seconds per inference on average, and an average of 0.429 seconds per inference on an Intel i7-9700 CPU. The time can be reduced even further, to the microsecond level, when the inference is performed on the GPU. For step 2, a single photo-matching on the Intel i7-9700 CPU takes about 0.422 seconds on average. Therefore, our solution applies to real-world scenarios, where UAVs on the market can be customized with different computing power according to the performance requirements.
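The per-inference averages above can be reproduced with simple wall-clock averaging; a sketch, where `infer` is a hypothetical stand-in for the deployed student model:

```python
import time

def mean_inference_time(infer, inputs):
    """Average wall-clock seconds per call of `infer` over the given inputs."""
    start = time.perf_counter()
    for x in inputs:
        infer(x)
    return (time.perf_counter() - start) / len(inputs)

# Hypothetical usage: average over 10 inputs, as with the 10 inferences in Table 8.
avg_seconds = mean_inference_time(lambda x: sum(range(1000)), list(range(10)))
```

`time.perf_counter` is preferred over `time.time` here because it is a monotonic, high-resolution clock intended for interval measurement.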
Besides, as shown in Table 7, the UAV identifies the appointed delivery position with two rounds of photo-matching (one to identify the target building and one to identify the delivery position) in Case 1, while Case 2 takes five rounds. Therefore, after a UAV arrives at the approximate delivery location via the GPS signal, the time needed to identify the appointed delivery position varies across cases. The efficiency depends on many factors, such as the starting point and flying direction of the UAV as well as the frequency of photo shooting.

RELATED WORK
The convergence of edge computing and artificial intelligence, namely edge intelligence, has been extensively investigated in recent years. Nevertheless, in contrast to various deep learning-based intelligent services on centralized computational platforms, the number of intelligent services on the edge is still limited. This is mainly due to the cost, latency, and privacy concerns [12].
The major challenge behind the cost and latency concerns is to push deep learning computations from the Cloud to the edge, where much fewer computational resources are available [13]. Recently, several efforts have been made to reduce the computational complexity of DNNs to enable their use on the edge with satisfactory latency. Some studies focus on designing compact network architectures for resource-constrained end devices. For example, Google Inc. presents a class of lightweight DNN models called MobileNets for mobile and embedded vision applications [14]. SqueezeNet [15] is also a small CNN architecture that achieves AlexNet-level accuracy on ImageNet with 50× fewer parameters. Chen et al. present Eyeriss v2 [16], a DNN accelerator architecture for compact and sparse DNN processing, which improves both processing speed and energy efficiency and hence supports running DNNs on mobile devices.
Model compression has been commonly adopted to alleviate the tension between resource-hungry DNNs and resource-constrained edge devices [18]. The mainstream techniques for model compression include weight pruning [19], data quantization [20], and so on. Besides, knowledge distillation has become an important branch in the field of DNN compression and acceleration [13], [21]. It extracts implicit knowledge from a well-trained model (the teacher) and "transfers" the knowledge to a small DL model (the student). This knowledge transfer from teacher to student naturally corresponds to the hierarchical edge computing architecture, and network knowledge transfer has developed into one of the most promising technologies for realizing intelligent edge systems [22]. In the proposed solution, knowledge distillation is likewise employed to produce a lightweight DL model that fits resource-constrained UAVs. Specifically, we use local images of each dispatch area to train a customized DL model, so the localized model performs well in the specific scenario of the corresponding dispatch area.
Deep Learning-Based UAV Navigation and Positioning

Autonomous navigation and positioning are key desirable features of UAVs but are not easy to realize in varying environments and weather conditions. To assist navigation in cases where the Global Navigation Satellite System is not available, visual sensors are often used: visual information is utilized to train deep learning models for assisted navigation and positioning [23]. Various deep learning models such as deep reinforcement learning, convolutional neural networks (CNN), and deep CNNs have been employed to achieve autonomous navigation and positioning in recent years [7], [26].
However, most existing approaches are based on the cloud-end architecture, where model training and inference are carried out in the Cloud [27]. The interaction between the remote cloud server and the UAVs inevitably introduces transmission delay and potential reliability concerns. Liu et al. converge deep learning and a three-tier edge computing architecture to implement a smart autonomous Unmanned Aerial System [28]. However, their UAV requires a high-performance AI computing device, the NVIDIA TX2, for onboard data processing and decision making, which differs fundamentally from the scenario targeted by our proposed solution, where the UAV has constrained computing resources.
Existing UAV Delivery Solutions

The idea of using UAVs, or drones, for last-mile delivery has drawn increasing interest from academia and industry. Khosravi et al. design efficient drone delivery systems that simultaneously provide fairly uniform coverage and efficient package delivery in realistic scenarios [29]. Several research works also focus on optimizing the performance of UAV delivery services from other perspectives, such as service area zoning [30], energy optimization [31], and path planning [32]. On the commercial side, technology and logistics giants such as Google (Project Wing), Amazon (Prime Air), and DHL are investing in UAV delivery and lobbying to make this service a reality. Several startups and emerging aerial delivery companies such as Flirtey, Vaye, and Ehang are also racing against time to launch drone delivery solutions for medical supplies. In China, companies such as JD.com and Alibaba have been developing drone delivery capabilities. However, most existing UAV delivery systems remain in trial stages and still fall short of practical demand. In recent years, edge computing has emerged as a promising solution for UAV delivery systems [4], [8].

CONCLUSION
In this paper, we proposed a visual information-based edge intelligence framework, KeepEdge, as a complement to GPS-based positioning for locating the delivery position. To enable UAV intelligence for parcel delivery, we integrated deep learning into an edge computing framework. Considering the constrained computational resources of the UAV, we employed knowledge distillation to produce a significantly compressed student model on the edge server whose accuracy approximates that of the teacher model trained in the Cloud. Thus, model inference can be conducted onboard the UAV, ensuring real-time decision making with satisfactory accuracy. In addition, with the assistance of visual information, our edge intelligence framework remains applicable even in complex environments where the GPS signal is weak or unstable.
For future work, we intend to integrate the proposed framework with the localization and authentication of the recipient for a complete smart delivery. Besides, privacy is a major public concern for UAV delivery applications; privacy protection measures, such as privacy-aware flight path planning, will also be considered in future work.
Haoyu Luo (Member, IEEE) received the bachelor's degree from the East China University of Technology, in 2011, the master's degree from Northwest Normal University, in 2014, and the PhD degree from Wuhan University, China, in 2018. He is currently a full researcher with the School of Computer Science, South China Normal University. His research interests include workflow systems, cloud and edge computing.

Chong Zhang received the BTech degree in programming and the BTech (Honours) degree from Deakin University, Australia, in 2019 and 2020, respectively. He is currently working towards the PhD degree with Deakin University. His research interests include mobile edge computing and federated learning.
Gansen Zhao (Member, IEEE) received the PhD degree from the University of Kent at Canterbury, U.K. He is currently a full professor in computer science with the School of Computer Science, South China Normal University. His main research interests include blockchain, trust management, and cloud computing.

He is currently an associate professor with the School of Information Technology, Deakin University, Australia. His research interests include workflow systems, cloud and edge computing, and human-centric software engineering.