Cooperative Scheduling Schemes for Explainable DNN Acceleration in Satellite Image Analysis and Retraining

Deep learning-based satellite image analysis and retraining systems are emerging technologies for the sophisticated analysis of terrestrial objects. To apply an explainable DNN model to the satellite image analysis and retraining process, we consider a new acceleration scheduling mechanism. Conventional DNN acceleration schemes suffer serious performance degradation due to the computational complexity and cost of satellite image analysis and retraining. In this article, to overcome this degradation, we propose cooperative scheduling schemes for explainable DNN acceleration in the analysis and retraining process. To this end, we define latency and energy cost models to derive the optimized processing time and cost required for explainable DNN acceleration. In particular, we show that the proposed scheduling achieves a minimum processing cost via layer-level management of the explainable DNN on an FPGA-GPU acceleration system. In addition, we evaluate the performance of an adaptive unlabeled data selection scheme with a confidence threshold and a semi-supervised learning driven data parallelism scheme for accelerating the retraining process. The experimental results demonstrate that the proposed schemes reduce the energy cost of conventional DNN acceleration systems by up to about 40% while guaranteeing the latency constraints.


INTRODUCTION
For automating reliable remote sensing and improving the analysis speed of human supervisors, it is necessary to design a satellite image analysis and retraining system based on an explainable DNN. The explainable DNN generates a description of its prediction, and the human supervisors return feedback, such as corrections or new label annotations, for retraining [1], [2]. However, the retraining system still has several bottlenecks.
First, an explainable DNN that achieves high accuracy for reliable satellite image analysis requires high computational complexity. In general, higher inference accuracy can be achieved with a deeper and wider network containing more layers and channels [3]. These features significantly increase the computing and memory access complexity that sophisticated hardware accelerators are required to address. Furthermore, DNN tasks carry a high computational workload with massive input data (e.g., large high-definition images). Second, the labeling task performed by supervisors is expensive and slow. In particular, the labeling speed is far slower than the input data generation and explainable DNN based analysis speed. Since the image data is generated in real time and delivered to human supervisors, part of the data is discarded without labeling. This causes overfitting during explainable DNN retraining due to the scarcity of human annotations [2].
For these reasons, the analysis and model retraining process suffers from drastically long processing time and slow convergence [2]. To relieve these bottlenecks, conventional DNN acceleration systems schedule the DNN process for acceleration on a large-scale accelerator cluster. However, their scheduling schemes have several problems that prevent successful DNN acceleration in the analysis and model retraining process.
Heterogeneity of Accelerator Environment. Heterogeneous accelerators, described in Table 1, should be considered for accelerating DNN processing tasks. Both GPUs and FPGAs have been deployed and utilized at a reasonable scale in datacenter infrastructure [8], [9], [10] to process a given DNN workload quickly and energy-efficiently. Current-generation DNNs depend heavily on dense floating-point matrix multiplication, which maps well to GPUs [4]. For this reason, GPUs are widely used for DNN acceleration [6]. Meanwhile, several recent studies have configured HPC environments with FPGAs to reduce the energy cost of large-scale DNN processing by exploiting their high processing efficiency per watt [5]. Unlike other processors that operate with a combination of predefined sets of operations, an FPGA can specify functionality at the gate level. Depending on the design method, an FPGA can implement compressed neural networks with weight quantization to accelerate certain operations.
Computational Complexity in Explainable DNN. Fig. 1 shows a typical scheduling flow with an explainable DNN in satellite image analysis and retraining. The explainable DNN delivers object detection or classification results and their visual explanation to the human supervisor, who then returns feedback for retraining [2]. Since labeling satellite images is time-consuming and expensive, many previous works adopt Active Learning (AL) to select the data samples to be labeled from an unlabeled data pool so as to achieve the highest accuracy within a fixed labeling budget [17]. In general, an explainable DNN is a large-scale DNN designed for high reliability, so its network layers have diverse computing complexity and memory demands. An explainable DNN model is composed of several components such as convolutional layers (CL), fully connected layers (FC), and a region proposal network (RPN) [21]. These components exhibit different processing time and energy cost depending on the type of accelerator allocated to them.
In this paper, to overcome these problems, we design new explainable DNN acceleration scheduling schemes. We propose cooperative scheduling schemes that utilize layer-level management of the explainable DNN in image analysis as well as confidence level criteria and data parallelism in the retraining process. Our work makes the following contributions.
First, we define latency and energy cost models to derive the optimized processing time and cost required for explainable DNN acceleration in cooperative satellite image analysis and retraining. In particular, we propose a cooperative scheduling scheme via layer-level management of the explainable DNN on FPGA-GPU to accelerate the analysis process and achieve a minimum processing cost. In addition, we propose a confidence threshold based adaptive unlabeled data selection scheme and a semi-supervised learning driven data parallelism scheme for accelerating the retraining process. Last, we evaluate the proposed system using large-scale aerial image datasets for object detection and classification, namely DOTA and AID. The experimental results demonstrate that the proposed schemes effectively reduce the retraining cost compared to conventional DNN acceleration systems while guaranteeing the latency constraints.

A MODEL DESCRIPTION ON COOPERATIVE SCHEDULING FOR EXPLAINABLE DNN ACCELERATION IN SATELLITE IMAGE ANALYSIS AND RETRAINING
In this section, we present the limitations of applying conventional DNN acceleration schemes to explainable DNN acceleration in satellite image analysis and retraining. To overcome these limitations, we then discuss new explainable DNN acceleration scheduling schemes.

Design of Cooperative Scheduling for Explainable DNN Acceleration
In satellite image analysis and retraining, the workload of human supervisors is still a major bottleneck (i.e., high labeling cost and slow labeling speed). To address this issue, we adopt AL on unlabeled data together with semi-supervised learning-based retraining. In AL, most unlabeled samples are typically ignored [23]: AL selects only a few of the most informative samples (e.g., samples with very low predictive confidence) for labeling at each training stage. It is difficult to fine-tune the DNN with only these few samples and still obtain appropriate feature representations. In semi-supervised learning, unlabeled data (often much cheaper to obtain) is also used to train DNNs [24]. Unlabeled data can be used to model the underlying data distribution, which helps create more sophisticated and effective regularization. To overcome the bottleneck of the human supervisor's workload, we apply AL on unlabeled data and semi-supervised learning, which considers labeled and unlabeled data together, to cooperative scheduling in satellite image analysis and retraining. Based on these, we organize a data selection scheme on unlabeled data and a semi-supervised learning based retraining scheduling scheme.

Limitation of Conventional DNN Acceleration Schemes for Explainable DNN Acceleration
Several systems have been proposed to achieve optimal performance and cost-effective scheduling for DNN acceleration in an HPC environment. To perform task scheduling, execution, and visualization, these systems set up a highly tuned computing pool of distributed resources with a common interface to an auto-run environment that can typically be applied to various types of DNN processes. Their goal is to coordinate DNN tasks on a distributed set of resources while minimizing energy cost and ensuring processing time constraints. S3DNN [12] is a supervised streaming and scheduling framework that optimizes the execution of DNN workloads on GPUs in a real-time multitasking environment, simultaneously targeting two conflicting objectives: real-time accuracy and throughput. Fang, Zhou, et al. [13] propose QoS-aware, effective heuristic scheduling of heterogeneous GPU clusters for DNN inference. In particular, Nexus [28] is a GPU cluster engine that achieves high DNN inference throughput under latency constraints, as shown in Fig. 2. It balances DNN workloads and maximizes GPU utilization by performing DNN model placement, profile-based bin packing, and batching-aware scheduling of DNN inference execution.
However, Nexus [28] only considers GPUs as accelerators. It assumes the accelerators in the HPC environment are homogeneous and have identical performance characteristics in terms of throughput, latency, and energy consumption. Also, it processes a DNN model as a single task unit without considering the variation of computational complexity within a DNN.
When an FPGA and a GPU process Resnet [25], which consists of CL and FC, their throughput/watt performance varies on each layer. The FPGA offers better throughput/watt than the GPU for processing CL, since the power of a GPU is usually higher than that of an FPGA while the throughput of the FPGA is comparable to that of the GPU [5]. However, in the case of FC, the GPU offers better throughput/watt than the FPGA, because the FPGA memory is insufficient to process FC; this causes a bottleneck and results in a sharp drop in computing performance.
In a heterogeneous accelerator environment where FPGAs and GPUs coexist, cooperative scheduling on FPGA-GPU with layer-level management, exploiting the varying complexity of explainable DNN inference, can reduce not only the energy consumption but also the processing time. The reason is that such scheduling can maximally utilize the energy efficiency of the FPGA by allocating more CL tasks to it, and avoid inefficiency by allocating fewer FC tasks to it. Conventional DNN acceleration makes the inefficient decision of allocating more FC tasks to the FPGA because DNN tasks are assigned to accelerators in units of a whole DNN model. By merely using or extending the functions of Nexus, this optimal performance cannot be achieved in the target heterogeneous HPC environment. It is necessary to design a new DNN acceleration system with cooperative scheduling schemes based on layer-level management on FPGA-GPU. Besides, Nexus focuses only on DNN inference, not the DNN training required after inference in the retraining system.
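As an illustration of why layer-level management helps, the following minimal sketch compares the energy of assigning a whole model to a single accelerator type against splitting CL onto the FPGA and FC onto the GPU. All throughput and power numbers are hypothetical placeholders, not measurements from our testbed.

```python
# Toy comparison of model-level vs. layer-level allocation on FPGA-GPU.
# All throughput (images/s) and power (W) values are hypothetical examples.
WORKLOAD = 1000  # number of images to process

# (throughput, active power) per component on each accelerator type
PROFILE = {
    "FPGA": {"CL": (200.0, 60.0), "FC": (20.0, 60.0)},   # FC is slow on the FPGA
    "GPU":  {"CL": (250.0, 250.0), "FC": (400.0, 250.0)},
}

def energy(accelerator: str, component: str, images: int) -> float:
    """Energy (J) = processing time (s) * active power (W)."""
    throughput, power = PROFILE[accelerator][component]
    return images / throughput * power

# Model-level allocation: the whole model (CL + FC) runs on a single accelerator type.
model_level_fpga = energy("FPGA", "CL", WORKLOAD) + energy("FPGA", "FC", WORKLOAD)
model_level_gpu  = energy("GPU", "CL", WORKLOAD) + energy("GPU", "FC", WORKLOAD)

# Layer-level allocation: CL on the FPGA, FC on the GPU.
layer_level = energy("FPGA", "CL", WORKLOAD) + energy("GPU", "FC", WORKLOAD)

print(f"model-level (FPGA only): {model_level_fpga:.0f} J")
print(f"model-level (GPU only):  {model_level_gpu:.0f} J")
print(f"layer-level (FPGA+GPU):  {layer_level:.0f} J")
```

With these example numbers, the layer-level split consumes less energy than either single-accelerator assignment, which is the effect the proposed scheduling exploits.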

A PROPOSED COOPERATIVE SCHEDULING FOR EXPLAINABLE DNN ACCELERATION IN SATELLITE IMAGE ANALYSIS AND RETRAINING SYSTEM
In this section, we describe cooperative scheduling for explainable DNN acceleration in satellite image analysis and retraining. Then, we define the problems for the analysis and retraining process and resolve them with cooperative scheduling schemes. Fig. 3 shows our target HPC environment and explainable DNN based retraining framework in a satellite image acquisition scenario. A high-resolution remote sensing image (e.g., 30k × 30k pixels) is generated and cropped into image patches in the satellite on-board system. In order to analyze the image patches, the satellite on-board system periodically transmits them to the HPC ground station. We assume that the image patches, whose total size is D (bytes), are transmitted and already queued at the beginning of each period. The framework processes the given image patches within a period in order to keep the queue stable. Our cooperative scheduling for explainable DNN acceleration performs the following three steps.

System Model and Key Features
Step 1) Explainable DNN Based Analysis Process. This step allocates the analysis inference tasks to the heterogeneous accelerators (FPGA/GPU) with respect to energy cost minimization. The images are applied as input to the target explainable DNN model, which performs object recognition and has an explainability functionality such as Grad-CAM [11]. In order to assist human supervisors in satellite image analysis, the HPC ground station system performs the target model inference (i.e., DNN model + Grad-CAM), which acts as an AI supervisor and delivers classification results and visual explanations for satellite images to human supervisors. This assists the human supervisors' detection ability and improves their reading speed. Faster processing of the target model inference can speed up the reading of human supervisors with more visual explanations.
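For reference, the following is a minimal Grad-CAM sketch in PyTorch, written with generic hooks rather than a particular Grad-CAM library. The choice of backbone, the layer name layer4, and the untrained weights are illustrative assumptions; they only show how the visual explanation in Step 1 can be produced alongside the classification result.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Minimal Grad-CAM sketch: capture activations/gradients of the last conv block
# via hooks, then weight the activation maps by the pooled gradients.
model = models.resnet152(weights=None).eval()    # target backbone (illustrative)
target_layer = model.layer4                      # last conv block of ResNet

activations, gradients = {}, {}

def fwd_hook(module, inputs, output):
    activations["value"] = output.detach()

def bwd_hook(module, grad_input, grad_output):
    gradients["value"] = grad_output[0].detach()

target_layer.register_forward_hook(fwd_hook)
target_layer.register_full_backward_hook(bwd_hook)

def grad_cam(image: torch.Tensor):
    """image: (1, 3, H, W) tensor. Returns (predicted class, CAM heatmap in [0, 1])."""
    logits = model(image)
    cls = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, cls].backward()

    # Channel weights = global-average-pooled gradients (Grad-CAM).
    weights = gradients["value"].mean(dim=(2, 3), keepdim=True)   # (1, C, 1, 1)
    cam = F.relu((weights * activations["value"]).sum(dim=1))     # (1, h, w)
    cam = F.interpolate(cam.unsqueeze(1), size=image.shape[-2:],
                        mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cls, cam.squeeze()

# Example: pred, heatmap = grad_cam(torch.randn(1, 3, 224, 224))
```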
Step 2) Confidence Threshold Based Adaptive Data Selection. This step selects the training samples (including unlabeled samples) that satisfy the confidence threshold and sends them to the human supervisors. Training samples selected by the data selection process are labeled by human supervisors. From the total input data pool, it automatically and progressively selects the most informative data that human supervisors need to label. It organizes the training data $D_{tr}$, including labeled images $D^{L}_{tr}$ as well as unlabeled images $D^{U}_{tr}$. Starting from some of the initially labeled data, the model begins incremental learning with data that has not yet been labeled but is informative to learn. The model is gradually upgraded through repeated cycles of this learning process.
Step 3) Semi-Supervised Learning Based Retraining Process. This step schedules the available accelerators to meet the service deadline for retraining with minimal energy cost. With the training data $D_{tr}$, composed of the newly labeled samples $D^{L}_{tr}$ and unlabeled training samples $D^{U}_{tr}$, semi-supervised learning is performed.
Heterogeneous HPC Environment With FPGA-GPU. The HPC environment is a cluster composed of $M$ accelerator nodes, and each accelerator node is composed of a CPU and several heterogeneous accelerators such as GPUs and FPGAs. Data is exchanged between the host main memory and the accelerator global memory over a PCIe link, whose bandwidth is denoted by $BW_{pci}$ (bytes/sec). We assume that the PCIe bandwidth is the same in all nodes, since the PCIe bandwidth is considerably higher than the throughput of each accelerator. The $i$th accelerator node is denoted by $s_i$ and consists of $N_i$ heterogeneous accelerators.

Acceleration Scheduling With Layer-Level Management of Explainable DNN (ASLM) on FPGA-GPU
In step 1 of the HPC ground station system, the ASLM scheme schedules the total input images $D$ so as to invoke the target explainable DNN model inference tasks efficiently. Its objective is to achieve the minimum energy cost while satisfying the latency constraint, denoted as $L_{Inf}$. To do this, it determines how to allocate the accelerator nodes in the cluster and how to distribute the input images $D$ to them for invoking and processing the target model inference tasks.

Structure of Explainable DNN. A target explainable DNN model inference task consists of one pre-processing component and $K$ model components, to be processed sequentially. The pre-processing component, denoted as $DL_0$, represents the pre-processing tasks, such as image I/O, decoding, and batching, to be performed in the CPU before executing the explainable DNN model in the accelerators [14]. The model components, denoted as $\{DL_1, \ldots, DL_k, \ldots, DL_K\}$, represent the individual layers of the target model to be processed in the accelerators such as FPGA or GPU. Due to the nature of the explainable DNN model, there are dependencies between the model components, so the next component can proceed only after the current component has been processed.
For example, Resnet [25] is composed of the pre-processing component $DL_0$, the model component $DL_1$ (CL for feature extraction), and the model component $DL_2$ (FC for classification).
Acceleration Node Allocation. The ASLM scheme determines the acceleration node allocation strategy, denoted as $X = [x_1, \ldots, x_i, \ldots, x_M]$, to process the input data $D$, where $x_i \in \{0, 1\}$ indicates whether the $i$th accelerator node is used. It then determines the data assignment to the nodes, $D = [D_1, \ldots, D_i, \ldots, D_M]$. The data assignment to each accelerator node is the maximum amount of data that the node can process within the given latency service-level objective $L_{Inf}$.
For the data assignment $D_i$, the accelerator node $s_i$ processes each component, $\{DL_0, DL_1, \ldots, DL_k, \ldots, DL_K\}$, sequentially. In this process on $D_i$, the input and output data of the component $DL_k$ are denoted as $D^i_k$ and $\bar{D}^i_k$, respectively, and $D_i$ is the input to the pre-processing component, i.e., $D^i_0 = D_i$. For the input data $D^i_0$, the pre-processing component $DL_0$ is processed in the CPU. In the accelerator node $s_i$, the CPU throughput for the component $DL_0$ is denoted by $Th^i_0$. The idle and active power of the CPU for the component $DL_0$ in node $s_i$ are denoted by $P^{i,0}_{idl}$ and $P^{i,0}_{act}$, respectively.
For the model component $DL_k$, the input data $D^i_k$ is distributed to the $N_i$ accelerators via PCIe and processed in them. The data assignment to the $N_i$ accelerators is denoted as $[D^{i,1}_k, \ldots, D^{i,j}_k, \ldots, D^{i,N_i}_k]$, and the accelerator throughput for the component $DL_k$ over the $N_i$ accelerators in node $s_i$ is denoted by $[Th^{i,1}_k, \ldots, Th^{i,j}_k, \ldots, Th^{i,N_i}_k]$. Since we only consider two types of accelerator, GPU and FPGA, each accelerator is one of the two. The throughput of each accelerator on $DL_k$ is defined as the maximum throughput with the optimal batch size, referring to [28]. Therefore, the throughput of a certain GPU is $Th^{i,GPU}_k = b^{*} \cdot D_{img} / l^{i,GPU}_k(b^{*})$, where $b^{*}$ is the optimal batch size, $l^{i,GPU}_k(b^{*})$ is the latency of the GPU on $DL_k$ with batch size $b^{*}$, and $D_{img}$ is the data size of a single input image. The throughput of a certain FPGA is $Th^{i,FPGA}_k = D_{img} / l^{i,FPGA}_k(1)$, where $l^{i,FPGA}_k(1)$ is the latency of the FPGA on $DL_k$ with a single input image; an FPGA usually processes a single input image without batching, so its batch size is 1. Besides, the active powers of a GPU and an FPGA in the node $s_i$ are denoted by $P^{i,GPU}_{k,act}$ and $P^{i,FPGA}_{k,act}$, respectively. For simplicity, the idle powers of the accelerators are ignored since they are negligible compared to their active powers. It is generally known that the power of a GPU, $P^{i,GPU}_{k,act}$, is higher than that of an FPGA, $P^{i,FPGA}_{k,act}$, and in particular that the throughput/watt of the FPGA is higher than that of the GPU [5] when the model component $DL_k$ is a convolutional layer. To simplify the notation, hereinafter we omit FPGA or GPU from the throughput and power notation and replace it with the index of the accelerator.
The processing time of the $N_i$ accelerators is denoted as $[cl^{i,1}_k, \ldots, cl^{i,j}_k, \ldots, cl^{i,N_i}_k]$.

Latency Modeling of an Accelerator Node. The processing time $pt^i$ for data $D_i$ in the accelerator node $s_i$ is defined as

$pt^i = pt^i_0 + \sum_{k=1}^{K} pt^i_k$, (1)

where $pt^i_0$ is the pre-processing time for the explainable DNN model inference task on $D^i_0$ in the accelerator node $s_i$, and $pt^i_k$ is the processing time for the explainable DNN model inference task on $D^i_k$ in the accelerator node $s_i$.

Definition 1. Processing Time of a Model Component in an Accelerator Node. The processing time of the model component $DL_k$ for the input data size $D^i_k$ in the accelerator node $s_i$ is defined as the sum of the input data transmission time $ti^i_k$, the processing time $cl^i_k$, and the output data transmission time $to^i_k$, i.e., $pt^i_k = ti^i_k + cl^i_k + to^i_k$. Let $BW_{pci}$ be the PCI express bandwidth of $s_i$. The latency model of $s_i$ is described in Fig. 4. The minimum processing time is denoted as $pt^{*i}_k$ and its optimal data distribution is denoted as $[D^{*i,1}_k, \ldots, D^{*i,j}_k, \ldots, D^{*i,N_i}_k]$. $cl^i_k$ is the maximum processing time among the accelerators, i.e., $cl^i_k = \max_j cl^{i,j}_k$.

Lemma 1. The minimum processing time $pt^{*i}_k$ is obtained with the optimal data distribution $[D^{*i,1}_k, \ldots, D^{*i,j}_k, \ldots, D^{*i,N_i}_k]$ that minimizes the maximum latency among the accelerators, which is achieved when the processing times of all accelerators are equal.

Proof. Suppose the optimal distribution yields unequal processing times among the accelerators. Then moving a small amount of data from the slowest accelerator to a faster one strictly reduces the maximum processing time, which contradicts optimality. Hence the optimal distribution equalizes the processing times of the accelerators. Based on Lemma 1, the minimum processing time $pt^{*i}_k$ is achieved with the optimal data distribution in which the processing time of each accelerator becomes equal.
Theorem 1. For the input data $D_i$ and the target explainable DNN model, the minimum processing time $pt^{*i}$ of the accelerator node $s_i$ exists and is derived as follows.

Proof. Based on Lemma 1, the optimal input data $D^{*i,j}_k$ for the model component $k$ on accelerator $j$ is

$D^{*i,j}_k = D^i_k \cdot \frac{Th^{i,j}_k}{\sum_{j' \in [N_i]} Th^{i,j'}_k}$,

so that each $cl^i_k$, the largest processing time among the accelerators, becomes

$cl^i_k = \frac{D^i_k}{\sum_{j \in [N_i]} Th^{i,j}_k}$.

As a result, the total minimal processing time of the accelerator node $s_i$ on the input data $D_i$ is derived as

$pt^{*i} = pt^i_0 + \sum_{k=1}^{K} \left( ti^i_k + cl^i_k + to^i_k \right)$.

Since the target model structure is fixed, $pt^{*i}$ is determined by the data size $D_i$. Based on Theorem 1, we model the minimum processing time of the accelerator node $s_i$ on the target model and the maximum input data size that can be processed within the latency constraint $L_{Inf}$.
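The following is a minimal sketch of the per-node latency model of Theorem 1, under the assumptions that per-accelerator processing time is data size divided by throughput and that PCIe transfer time is data size divided by $BW_{pci}$. The component data sizes, throughputs, and bandwidth values are hypothetical.

```python
from typing import List

def optimal_distribution(d_k: float, throughputs: List[float]) -> List[float]:
    """Split component input data d_k across accelerators proportionally to
    throughput so that all accelerators finish at the same time (Lemma 1)."""
    total = sum(throughputs)
    return [d_k * th / total for th in throughputs]

def node_processing_time(d_in: List[float], d_out: List[float],
                         throughputs: List[List[float]],
                         cpu_throughput: float, bw_pcie: float) -> float:
    """pt_i = pre-processing time + sum_k (transfer-in + compute + transfer-out)."""
    pt = d_in[0] / cpu_throughput                       # DL_0 on the CPU
    for k, th_k in enumerate(throughputs, start=1):     # model components DL_1..DL_K
        ti = d_in[k] / bw_pcie                          # input transfer over PCIe
        cl = d_in[k] / sum(th_k)                        # equalized compute time (Theorem 1)
        to = d_out[k] / bw_pcie                         # output transfer over PCIe
        pt += ti + cl + to
    return pt

# Example with hypothetical sizes (bytes) and rates (bytes/sec):
d_in  = [8e9, 8e9, 1e8]                 # inputs of DL_0 (CPU), DL_1 (CL), DL_2 (FC)
d_out = [8e9, 1e8, 1e6]                 # outputs of DL_0, DL_1, DL_2
th    = [[2e9, 1.5e9], [4e9, 0.2e9]]    # [FPGA, GPU] throughput per model component
print(optimal_distribution(d_in[1], th[0]))
print(node_processing_time(d_in, d_out, th, cpu_throughput=3e9, bw_pcie=12e9))
```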
Energy Cost Modeling of an Accelerator Node. For the components $\{DL_0, \ldots, DL_K\}$ on the data assignment $D_i$, the energy consumption in the accelerator node $s_i$ is defined as

$E^i = P^{i,0}_{act} \cdot pt^i_0 + P^{i,0}_{idl} \cdot \sum_{k=1}^{K} pt^{*i}_k + \sum_{k=1}^{K} \sum_{j=1}^{N_i} P^{i,j}_{k,act} \cdot cl^{i,j}_k$, (7)

where $P^{i,0}_{act}$ is the active power of the CPU for pre-processing in node $s_i$ [15], [16], $P^{i,0}_{idl}$ is the idle power of the CPU for pre-processing in node $s_i$, $P^{i,j \neq 0}_{k,act}$ is the active power of the $j$th accelerator (FPGA or GPU) for inference in node $s_i$, $pt^i_0$ is the pre-processing time on the CPU in node $s_i$, $pt^{*i}_k$ is the processing time on the accelerators in node $s_i$, and $cl^{i,j}_k$ is the main processing time on the $j$th accelerator in node $s_i$.
Hence, the total energy consumption over all nodes is defined as $E_{total} = \sum_{i \in [M]} E^i \cdot x_i$, where $x_i$ is an element of $X$.

ASLM Scheme on FPGA-GPU. In the explainable DNN model inference process for the total input image size $D$, the goal of the adaptive scheduler is to achieve the minimum energy cost while satisfying the latency constraint $L_{Inf}$.
The decision variable is $X = [x_1, \ldots, x_i, \ldots, x_M]$ and the objective function is defined as follows:

$\min_{X} \sum_{i \in [M]} E^i \cdot x_i$, subject to $C_1: pt^i \leq L_{Inf}, \forall i \in [M]$.

The constraint $C_1$ means that the processing time $pt^i$ for data $D_i$ in the accelerator node $s_i$ does not exceed the latency constraint $L_{Inf}$.
This problem is a mixed-integer linear programming (MILP) problem with integer variables, the allocation strategy $X = [x_1, \ldots, x_i, \ldots, x_M]$, and real variables, the data assignment $D = [D_1, \ldots, D_i, \ldots, D_M]$. It can be solved with optimization methods such as branch-and-cut and the simplex algorithm provided by CPLEX [22].
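For illustration, the sketch below formulates the node allocation as an MILP with the open-source PuLP package and its default CBC solver rather than the CPLEX setup used in this paper. It assumes the per-node energy $E^i$ and maximum data size $D_i$ have already been precomputed from the latency and energy models; all numbers are hypothetical.

```python
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary, value

# Hypothetical per-node profiles: E[i] = energy (J) to process D_cap[i] bytes
# within L_Inf; both are assumed precomputed from the latency/energy models.
E     = [120.0, 150.0, 90.0, 200.0]     # energy cost per node
D_cap = [2.0e9, 2.5e9, 1.2e9, 3.0e9]    # max data each node can finish by L_Inf
D_total = 5.0e9                         # total queued image data (bytes)

prob = LpProblem("aslm_node_allocation", LpMinimize)
x = [LpVariable(f"x_{i}", cat=LpBinary) for i in range(len(E))]

# Objective: minimize the total energy of the allocated nodes.
prob += lpSum(E[i] * x[i] for i in range(len(E)))
# Constraint: the allocated nodes must be able to absorb all queued input data.
prob += lpSum(D_cap[i] * x[i] for i in range(len(E))) >= D_total

prob.solve()
print("allocation:", [int(value(v)) for v in x], "energy:", value(prob.objective))
```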
However, this MILP problem is NP-hard, so we propose a simple and intuitive heuristic algorithm that finds a near-optimal solution. In the performance evaluation (Section 4), we show that our heuristic algorithm achieves almost near-optimal solutions.
First, $C_1$ can be removed by assuming that the data assignment to each accelerator node is the maximum data size the node can handle within the latency constraint $L_{Inf}$, i.e., $pt^{*i} = L_{Inf}, \forall i$.
Based on Eq. (11), the data assignment to the nodes, $[D_1, \ldots, D_i, \ldots, D_M]$, is determined. Second, the scheme constructs the ranked ratio of energy cost per maximum data size, denoted as $S$, for every accelerator node. Last, accelerator nodes are allocated one by one in ascending order of this ratio while the allocated nodes cannot yet process the given input $D$ within the latency constraint $L_{Inf}$, i.e., while $\sum_{i \in [M]} D_i \cdot x_i < D$. The entire process of the ASLM scheme is described in Algorithm 1.

Algorithm 1. Acceleration Scheduling Scheme With Layer-Level Management of Explainable DNN on FPGA-GPU
Input: $E^i$: energy consumption in accelerator node $s_i$; $D_i$: data assignment to accelerator node $s_i$
Output: $S$: ranked ratio of "energy cost per maximum data size"
Function ConstructRatio($E^i$, $D_i$):
  $T \leftarrow$ set of all accelerator nodes; $S \leftarrow \emptyset$
  Repeat
    $u \leftarrow$ node in $T$ with the smallest ratio $E^u / D_u$
    $S \leftarrow (S \cup \{u\})$; $T \leftarrow T - \{u\}$
  Until $T = \emptyset$
End Function
Input: $S$: ranked ratio of "energy cost per maximum data size"
Output: $X$: acceleration node allocation strategy
Function NodeAssignment($S$):
  $X \leftarrow [0, \ldots, 0]$
  For each node $i$ in $S$ (in ranked order):
    If $\sum_{i' \in [M]} D_{i'} \cdot x_{i'} < D$ then $x_i \leftarrow 1$
End Function
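A minimal Python sketch of the two functions in Algorithm 1 follows, assuming the per-node energy $E^i$ and maximum data assignment $D_i$ within $L_{Inf}$ are given; all variable names and numbers are illustrative.

```python
from typing import Dict, List

def construct_ratio(E: Dict[int, float], D: Dict[int, float]) -> List[int]:
    """Rank accelerator nodes by energy cost per maximum data size (ascending)."""
    return sorted(E, key=lambda i: E[i] / D[i])

def node_assignment(ranked: List[int], D: Dict[int, float],
                    D_total: float) -> Dict[int, int]:
    """Greedily allocate nodes following the ranking until the allocated
    capacity covers the total queued input data D_total."""
    x = {i: 0 for i in ranked}
    covered = 0.0
    for i in ranked:
        if covered >= D_total:          # enough nodes already allocated
            break
        x[i] = 1
        covered += D[i]
    return x

# Example with hypothetical per-node profiles (energy in J, data in bytes):
E = {0: 120.0, 1: 150.0, 2: 90.0, 3: 200.0}
D = {0: 2.0e9, 1: 2.5e9, 2: 1.2e9, 3: 3.0e9}
S = construct_ratio(E, D)
print(S, node_assignment(S, D, D_total=5.0e9))
```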

Adaptive Unlabeled Data Selection (AUDS) With Confidence Threshold for DNN Retraining Cost Reduction
In step 2 of the HPC ground station system, we adopt the algorithm proposed in [23] for the AUDS scheme, which selects the most informative unlabeled data for retraining.
The selected data is passed to human supervisors for labeling. The total input data pool of $m$ classes and $n$ samples is denoted by $I = \{x_i \mid \forall i \in [n]\}$. The label of $x_i$ is denoted as $y_i$; if $x_i$ is in the $j$th class, $y_i = j$, $j \in [m]$. While the data scale continues to grow, almost all data cannot be labeled due to the high labeling cost and slow labeling speed, so utilizing informative unlabeled data is important to accelerate the retraining. Therefore, the AUDS scheme organizes the training data $D_{tr}$ composed of unlabeled images $D^U_{tr}$ and labeled images $D^L_{tr}$. The DNN retraining problem is defined as

$\min_{W} \sum_{x_i \in D_{tr}} \sum_{j=1}^{m} -\mathbf{1}(y_i = j) \log p(y_i = j \mid x_i; W)$,

where $\mathbf{1}(\cdot)$ is the indicator function with $\mathbf{1}(\mathrm{true}) = 1$ and $\mathbf{1}(\mathrm{false}) = 0$, and $W$ denotes the model parameters. $p(y_i = j \mid x_i; W)$ is the softmax output of the model $W$ on $x_i$ for the $j$th class.
First, the AUDS scheme sorts the total input data pool according to the AL criterion that uncertain samples are often the most informative for model updates. The most uncertain samples are manually labeled by human supervisors and added to the labeled training data $D^L_{tr}$, while the most certain samples are pseudo-labeled and added to the unlabeled training data $D^U_{tr}$. For $x_i$, the model predicts its label as a probability vector $p(y_i = j \mid x_i; W) \in [0,1]^{m}$. Let $s(x_i)$ be the confidence of $x_i$, measured as the maximum confidence the model has in any class, $s(x_i) = \max_j p(y_i = j \mid x_i; W)$.

Definition 2. Confidence Threshold. Let $\theta$ be the confidence threshold on $s(x_i)$, described in Fig. 5. Samples with high confidence on $p(y_i = j \mid x_i; W)$ are often the most beneficial for model updating. The high-confidence samples, whose $s(x_i)$ is higher than the threshold $\theta$, are selected from the total input data pool and added to the unlabeled training data $D^U_{tr}$ [23]. The AUDS scheme predicts their pseudo-labels as $\hat{y}_i = \arg\max_j p(y_i = j \mid x_i; W)$. The initial threshold $\theta$ is set to the best empirical value that assigns high reliability to a pseudo-label.
In the progressive learning process, the high-confidence samples are selected with improved model accuracy, which decreases incorrect pseudo-labeling. To ensure reliable sample selection, at each iteration $t$ the AUDS scheme adaptively updates the threshold $\theta$ according to the threshold growth rate $\delta$ [23].
Referring to [23], we fix the growth rate $\delta$ to 0.0033 and the initial threshold $\theta$ to 0.05. Wang, Keze, et al. [23] show that these parameters do not substantially affect the overall system performance. The entire process of the AUDS scheme is described in Fig. 6.
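The following is a minimal sketch of the confidence-based split in Step 2, assuming softmax outputs from the current model are available; the least-confidence uncertainty criterion and the additive threshold update are illustrative simplifications of the scheme adopted from [23], not its exact implementation.

```python
import torch

def auds_select(probs: torch.Tensor, theta: float, n_label: int):
    """Split an unlabeled pool into (a) uncertain samples sent to human
    supervisors and (b) high-confidence samples that receive pseudo-labels.

    probs: (N, m) softmax outputs of the current model on the unlabeled pool.
    theta: confidence threshold on the maximum class probability.
    n_label: labeling budget per cycle (samples sent to supervisors).
    """
    confidence, pseudo_labels = probs.max(dim=1)         # s(x) and argmax class

    # Most uncertain samples (lowest confidence) go to human supervisors.
    to_label = torch.argsort(confidence)[:n_label]

    # High-confidence samples keep their pseudo-labels for semi-supervised use.
    high_conf = (confidence > theta).nonzero(as_tuple=True)[0]

    return to_label, high_conf, pseudo_labels[high_conf]

# One selection cycle with a hypothetical pool, using the parameters from [23]:
probs = torch.softmax(torch.randn(1000, 15), dim=1)      # 15 classes, 1000 samples
theta, growth_rate = 0.05, 0.0033
to_label, pseudo_idx, pseudo_y = auds_select(probs, theta, n_label=32)
theta = theta + growth_rate   # stricter selection next cycle (assumed additive update)
```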

Retraining Acceleration Scheduling (RAS) Based on Semi-Supervised Learning With Data Parallelism
In step 3 of the HPC ground station system, the RAS scheme schedules the retraining with the training data $D_{tr}$ composed of unlabeled images $D^U_{tr}$ and labeled images $D^L_{tr}$. Its objective is to achieve the minimum energy cost while satisfying the latency constraint, denoted as $L_{tr}$. To do this, it determines how to allocate the accelerators in the cluster and how to distribute the training data $D_{tr}$ to them for invoking the target explainable DNN model retraining tasks.

Structure of Semi-Supervised Learning Based Retraining. The retraining task is based on synchronous SGD, which is one of the data parallelism (DP) methods [18]. DP reduces the time of one iteration by dividing the mini-batch among the allocated accelerators [19]. The mini-batch size of the $i$th GPU is denoted as $b^i_{tr}$, and the total batch size in an iteration is $B = \sum_i b^i_{tr}$. The batch $B$ is randomly sampled from the training data $D_{tr}$. The number of iterations in an epoch is $Iter = D_{tr} / B$ and the number of epochs $ep$ is determined by system operators. We cannot use synchronous SGD directly because of the presence of unlabeled data. As described in Section 3.3, we utilize the informative unlabeled data to accelerate the retraining and to overcome the bottleneck caused by the high labeling cost and slow labeling speed. Therefore, we adopt the semi-supervised learning method Label Guessing [24], which generates pseudo-labels by invoking the target model inference on the unlabeled data $D^U_{tr}$.

GPU Resource Allocation. For retraining, we use GPUs. The GPUs still available after the GPU allocation for inference in step 1 are represented by the allocation vector $X_{tr} = [x^1_{tr}, \ldots, x^i_{tr}, \ldots, x^C_{tr}]$, where $C$ is the number of available GPUs. $x^i_{tr} = 1$ means that GPU $i$ is allocated for retraining, and $x^i_{tr} = 0$ otherwise.
Latency Modeling in Retraining Process. For batch size $B$, the retraining latency with allocation strategy $X_{tr}$ is defined as

$l_{tr} = l_{fb}(X_{tr}) \cdot Iter \cdot ep + l_{lg}$,

where $l_{fb}$ is the feed-forward and backpropagation (FB) latency of one iteration and $l_{lg}$ is the label guessing latency. For the unlabeled training images $D^U_{tr}$, the label guessing latency with allocation strategy $X_{tr}$ is given by

$l_{lg} = \frac{D^U_{tr}}{\sum_{i \in [C]} Th^i_{tr} \cdot x^i_{tr}}$, (22)

where $Th^i_{tr}$ is the throughput of the $i$th GPU; this follows the same argument as Theorem 1.
Based on the mechanism of synchronous SGD as described in Fig. 7, an iteration executes one feed-forward and backpropagation (FB) pass with the batch $B$ of training data $D_{tr}$ on the $C$ GPUs. The RAS scheme determines the optimal mini-batch $b^i_{tr}$ to be processed by each GPU in each FB pass. The FB latency at the $i$th GPU consists of the processing time $l^i_p(b^i_{tr})$ for the mini-batch $b^i_{tr}$ and the transfer time $l^i_c$ to send the gradients to the parameter server and receive the updated model from it [20]:

$l^i_{fb}(b^i_{tr}) = l^i_p(b^i_{tr}) + l^i_c$.

The processing time $l^i_p(b^i_{tr})$ is modeled with parameters $a_i$ and $b_i$ capturing the performance of the $i$th GPU [20]. In the heterogeneous accelerator environment, each GPU has a different processing speed per batch, so if the same mini-batch size is assigned to every GPU, stragglers occur and the FB operation slows down [20]. $b^i_{tr}$ is therefore adjusted to minimize the FB latency of the accelerator with the maximum FB latency, which yields the minimized FB time $l_{fb}$ and the mini-batch size of each accelerator, as formulated in Eq. (26).
If $b^i_{tr}$ is treated as a continuous variable, the optimal solution of Eq. (26) on the FB latency is obtained by equalizing the processing times $l^i_p(b^i_{tr})$ of the allocated GPUs under $\sum_i b^i_{tr} = B$, where we assume that the communication latencies of $X_{tr}$ are identical and equal to $l_c$.

Energy Cost Modeling in Retraining Process. During the retraining process, the active processing time and power of GPU $i$ are $l^i_{fb} \cdot Iter \cdot ep + l_{lg}$ and $P^i_{act}$, and the idle time and power spent on transmission are $l_c \cdot Iter \cdot ep$ and $P^i_{idl}$. Pipelining between processing and transmission cannot be applied because training can continue with the next batch $B$ only after synchronization has been completed.
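A minimal sketch of this straggler-free mini-batch split follows, under the assumption that the per-GPU processing time is linear in the mini-batch size, $l^i_p(b) = a_i \cdot b + \beta_i$; equalizing these times under $\sum_i b_i = B$ gives the allocation below. The coefficients in the example are hypothetical.

```python
from typing import List, Tuple

def split_minibatch(B: int, a: List[float], beta: List[float]) -> Tuple[List[float], float]:
    """Split total batch B across heterogeneous GPUs so that every GPU's
    processing time a_i * b_i + beta_i is (approximately) equal.

    Equal times t imply b_i = (t - beta_i) / a_i and sum_i b_i = B, hence
    t = (B + sum_i beta_i / a_i) / (sum_i 1 / a_i).
    """
    inv_a = [1.0 / ai for ai in a]
    t = (B + sum(bi / ai for ai, bi in zip(a, beta))) / sum(inv_a)
    b = [(t - bi) / ai for ai, bi in zip(a, beta)]
    return b, t  # per-GPU mini-batch sizes and the common processing time

# Example: three GPUs with hypothetical per-sample times and fixed overheads.
a    = [0.0010, 0.0015, 0.0008]   # seconds per sample
beta = [0.010, 0.012, 0.008]      # fixed per-iteration overhead (seconds)
b, t = split_minibatch(256, a, beta)
print([round(x) for x in b], round(t, 4))
```

The faster GPUs receive proportionally larger mini-batches, so no GPU becomes a straggler during the synchronous update.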
For batch size $B$ and training data $D_{tr}$, the retraining energy with allocation strategy $X_{tr}$ is defined as

$E_{tr} = \sum_{i \in [C]} x^i_{tr} \left( P^i_{act} \cdot (l^i_{fb} \cdot Iter \cdot ep + l_{lg}) + P^i_{idl} \cdot l_c \cdot Iter \cdot ep \right)$ [15], [16].

RAS Scheme Based on Semi-Supervised Learning With Data Parallelism. In the retraining process for the training data $D_{tr}$, the goal of the RAS scheme is to achieve the minimum energy cost while satisfying the latency constraint $L_{tr}$.
The decision variables are the accelerator allocation strategy $X_{tr} = [x^1_{tr}, \ldots, x^i_{tr}, \ldots, x^C_{tr}]$ and the confidence threshold $\theta$. First, the RAS scheme determines the confidence threshold $\theta$, which adjusts the number of iterations $Iter$ through the size of the training data $D_{tr}$. The initial confidence threshold is set to the best empirical value. If the training data $D_{tr}$ selected by $\theta$ cannot be processed within $L_{tr}$ even with all available GPUs, i.e.,

$l_{tr} = l_{fb}(X_{tr}) \cdot Iter(\theta) \cdot ep + l_{lg} > L_{tr}$, where $x^i_{tr} = 1, \forall i \in [C]$,

the RAS scheme reduces $D_{tr}$ and $Iter$: it increases the initial confidence threshold $\theta$ to shrink the training data size $D_{tr}$ selected in the AL process (step 2) until $L_{tr}$ is satisfied (step (a), Increase Threshold of Uncertainty).
Conversely, if the training data $D_{tr}$ selected by $\theta$ can be processed within $L_{tr}$ with all available GPUs, the RAS scheme additionally reduces the retraining cost $E_{tr}$ under the latency constraint $L_{tr}$ by controlling the accelerator allocation strategy $X_{tr} = [x^1_{tr}, \ldots, x^i_{tr}, \ldots, x^C_{tr}]$, and the objective function is defined as follows:

$P_2: \min_{X_{tr}} E_{tr}$, subject to $C_1: l_{tr} < L_{tr}$.

However, since this problem is hard to solve exactly (it is NP-hard), we propose a heuristic algorithm to reduce the computational complexity. The relaxed objective function $P'_2$ for finding the optimal retraining cost can be expressed as the product of the GPU active power ($\sum_{i \in [C]} P^i_{act} \cdot x^i_{tr}$) and the one-iteration latency ($l_{tr}$). For simplicity, the GPU idle power is ignored. We propose a heuristic for the accelerator allocation $X_{tr}$.
First, it starts by setting all elements of $X_{tr}$ to 1, which means all available GPUs are considered for use.
Second, it obtains the energy cost change relative to the processing time increase when one GPU $j$ among $X_{tr}$ is excluded from use:

$\frac{\Delta E_{tr}}{\Delta l_{tr}} = \frac{E^{\neq j}_{tr} - E_{tr}}{l^{\neq j}_{tr} - l_{tr}}$.

Third, it excludes the single least efficient GPU, i.e., the one with the highest value of $\Delta E_{tr} / \Delta l_{tr}$; if GPU $i$ is chosen, it sets $x^i_{tr} = 0$. This process is repeated until just before the latency constraint $L_{tr}$ would be exceeded. The algorithm then terminates and uses the resulting $X_{tr}$ as the solution to problem $P_2$. The entire process of the RAS scheme is described in Algorithm 2.
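A minimal sketch of this greedy exclusion is given below. The ratio is computed as energy reduction per unit of added latency (a sign-adjusted reading of the $\Delta E_{tr}/\Delta l_{tr}$ criterion above), and the latency/energy callables are hypothetical placeholders rather than the paper's exact models.

```python
from typing import Callable, List

def ras_exclude_gpus(n_gpus: int,
                     latency: Callable[[List[int]], float],
                     energy: Callable[[List[int]], float],
                     L_tr: float) -> List[int]:
    """Greedy RAS heuristic: start with all GPUs allocated and repeatedly drop
    the GPU whose removal yields the largest energy reduction per unit of added
    latency, stopping before the latency constraint L_tr would be violated."""
    x = [1] * n_gpus
    while sum(x) > 1:
        base_l, base_e = latency(x), energy(x)
        best_j, best_ratio = None, None
        for j in range(n_gpus):
            if x[j] == 0:
                continue
            trial = list(x)
            trial[j] = 0
            l_j, e_j = latency(trial), energy(trial)
            if l_j > L_tr or l_j <= base_l:        # violates L_tr or adds no latency
                continue
            ratio = (base_e - e_j) / (l_j - base_l)  # energy saved per second added
            if best_ratio is None or ratio > best_ratio:
                best_j, best_ratio = j, ratio
        if best_j is None:                           # no GPU can be excluded any further
            break
        x[best_j] = 0
    return x

# Toy usage with placeholder latency/energy models (illustrative only):
P_act = [250.0, 250.0, 180.0, 180.0]                  # per-GPU active power (W)
lat = lambda x: 0.3 * 4 / max(sum(x), 1)              # more GPUs -> shorter iteration
ene = lambda x: lat(x) * sum(p for p, xi in zip(P_act, x) if xi)
print(ras_exclude_gpus(4, lat, ene, L_tr=0.5))
```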

PERFORMANCE EVALUATION AND DISCUSSION
In this section, we evaluate the processing time and energy cost of the proposed cooperative scheduling schemes for explainable DNN acceleration in satellite image analysis and retraining, comparing them with the conventional DNN acceleration schemes. The proposed schemes also optimize the pre-processing component, reflecting the heterogeneity of CPUs, and the evaluation results of the proposed schemes include the effect of the pre-processing tasks. However, we focus on the processing time and energy consumption of the main model components processed in the accelerators, such as CL or FC, since the heterogeneity of CPUs in our experimental environment is negligible compared to the heterogeneity of accelerators. In this evaluation, the scheduling scheme of Nexus [28] is used as the conventional scheme for satellite image analysis, and Homogeneous DP [18] and Heterogeneous DP [20] are used as the conventional schemes for satellite image retraining. The inference scheduling scheme of Nexus, which assumes a homogeneous HPC environment, packs the workload onto each accelerator as much as possible and minimizes resource usage under the latency constraints. Homogeneous DP is the training scheme that distributes data evenly across all available resources [18]. Heterogeneous DP is the training scheme that distributes data optimally across all available resources for straggler mitigation [20].
In addition, we also compare with the optimal solution of our problems, derived from the MILP optimization algorithm provided by CPLEX [22], and show how close the proposed schemes are to the optimal result.
For the testbed, we implement the proposed cooperative scheduling schemes as shown in Fig. 8. We build the heterogeneous HPC environment with 7 accelerator nodes, each of which contains multiple heterogeneous CPUs, memory, and accelerators (FPGA-GPU): GeForce GTX 1080/GTX 1080 Ti/RTX 2080 Super GPUs and Arria 10 GX FPGAs, as shown in Table 2. The ASLM, AUDS, and RAS schemes determine the best accelerator node allocation, data selection, and GPU allocation, respectively, for a given workload with input images D in a certain period. After the images D are analyzed with the ASLM scheme, the delay introduced by the AUDS scheme and the supervisor's labeling is ignored in this experiment. We assume that the training data $D_{tr}$, composed of unlabeled images $D^U_{tr}$ and labeled images $D^L_{tr}$, is generated without delay and is directly applied to the RAS scheme for retraining of the target explainable DNN model. Besides, we build a monitoring module for each accelerator node using nvidia-smi [31] and the Xilinx xbutil tool [32] to monitor GPU and FPGA power usage periodically in real time.
For evaluation, we use two target explainable DNN models: Resnet-152 (classification) [25] and Faster-RCNN (object detection) [21] (to evaluate the RAS scheme, we only use Resnet-152). We attach Grad-CAM to the models for visual explanation. We use the large-scale image datasets DOTA [26] and AID [27], as shown in Fig. 9. We use PyTorch as the deep learning framework [29], with CUDA 9.0 and cuDNN 7.0 [30] to accelerate DNN inference on GPUs. We measure two metrics: processing time (sec) and energy consumption (J). The processing time is the total time to complete the given workload, and the energy consumption is the total energy consumed by the accelerator nodes during processing.

Experimental Results and Analysis
Figs. 10 and 11 show the results of the ASLM scheme on Resnet-152 and Faster-RCNN, respectively, in terms of processing time (sec) and energy consumption (J) with various workloads. The workloads are set as [3200, 4800, 6400, 8000] (number of requests). A request consists of an input image that invokes the target model inference. The latency constraint $L_{Inf}$ is set to 20 s for Resnet-152 and 70 s for Faster-RCNN.
In Figs. 10 and 11, the ASLM scheme shows better energy consumption than the conventional scheme. Moreover, the ASLM scheme nearly achieves the optimal result on processing time and energy consumption derived from the MILP optimization algorithm provided by CPLEX [22].
The results show that the ASLM scheme effectively optimizes the analysis performance with the latency and energy cost models, Eqs. (1) and (7), which reflect the heterogeneity of the accelerator environment. In particular, its layer-level management of the explainable DNN with the latency and energy cost models removes the inefficiency of Nexus. Nexus packs the workload into each accelerator as much as possible within the given latency constraints in order to minimize the number of accelerators used. However, the modeling of Nexus, which does not consider the energy consumption of accelerators and assumes the accelerators are homogeneous, makes inefficient decisions and uses accelerators with poor throughput/watt. In addition, since it processes a DNN model as a single task unit, it cannot avoid allocating FC to the FPGA, which causes performance degradation.

Fig. 12 shows the results of the RAS scheme on retraining Resnet-152, in terms of processing time (sec) and energy consumption (J) with various batch sizes. For simplicity, we show the average result of one iteration. Also, in this evaluation, we do not consider the label guessing latency because it is negligible in the retraining latency. The batch sizes are set as [64, 128, 256]. The latency constraint $L_{tr}$, the training data $D_{tr}$, and the number of epochs $ep$ are set as 300 s, 6400, and 40, respectively. Then, the latency constraint on an iteration is set as 0.3 s.
As in the ASLM evaluation, the RAS scheme shows better energy consumption in retraining than the conventional schemes while guaranteeing the latency constraint of 0.3 s. In Fig. 12, the RAS scheme reduces energy consumption by {41.54, 15.83, 2.8}% over the batch sizes compared to Heterogeneous DP. Homogeneous DP and Heterogeneous DP simply use all available accelerators and decide the data distribution to minimize the processing time; they do not exploit the remaining time within the latency constraint, which can be used to save energy. Meanwhile, the RAS scheme additionally reduces the accelerator usage to minimize the energy consumption, utilizing the remaining time within the latency constraint. It effectively optimizes the retraining performance with the latency and energy cost models of the retraining process, which consider the heterogeneity of the accelerator environment.

Fig. 13 shows the results of the AUDS scheme in terms of accuracy (%), number of training samples, and elapsed time (sec). In this experiment, we fix the number of labeled data at 1000, and the unlabeled data is input to the AUDS scheme incrementally. We evaluate it against random sampling based on the fixed threshold. The AUDS scheme achieves competitive accuracy with a smaller number of training samples compared to the fixed threshold approach, and it provides higher accuracy with the same number of training samples.

CONCLUSION
In this paper, we addressed the limitations of conventional DNN acceleration systems, which cause serious energy cost degradation in satellite image analysis and retraining. To overcome these problems, we discussed new explainable DNN acceleration scheduling schemes. Utilizing latency and energy cost models that reflect the layer-level management of the explainable DNN in analysis and the confidence level criteria and data parallelism in retraining, we proposed cooperative scheduling schemes that minimize the analysis or retraining cost while guaranteeing the latency constraints. We implemented the cooperative scheduling for explainable DNN acceleration in a heterogeneous HPC environment based on FPGA-GPU and conducted real satellite image analysis and retraining experiments with large-scale aerial image datasets, DOTA and AID. In these experiments, the results showed that the ASLM and RAS schemes provide optimized processing time and cost performance for explainable DNN acceleration, utilizing the latency and energy cost models that reflect the heterogeneity of the accelerator environment. In the case of Resnet-152, the ASLM and RAS schemes reduced the energy cost of the conventional schemes by up to about 40% while guaranteeing the latency constraints. Furthermore, the results showed that the AUDS scheme achieved competitive accuracy with a smaller number of training samples compared to the fixed threshold approach. We showed that the AUDS scheme alleviates the bottleneck of the supervisors' workload and realizes fast processing and convergence of the explainable DNN in satellite image analysis and retraining.