An Efficient Task Assignment Framework to Accelerate DPU-Based Convolutional Neural Network Inference on FPGAs

Field Programmable Gate Arrays (FPGAs) have become efficient accelerators for convolutional neural network (CNN) inference due to their high performance and flexibility. To further improve the performance of CNN inference on FPGAs, Xilinx has released an Intellectual Property core (IP core) called the Deep Learning Processor Unit (DPU). Unlike previous FPGA-based hardware designs focusing on specific functions or CNNs, the DPU IP supports ample basic functions of deep learning, and developers can take advantage of DPUs to accelerate CNN inference conveniently. In the DPU-based CNN acceleration platform, an encapsulated scheduler plays a crucial role in task scheduling between the heterogeneous ARM and multiple DPUs. However, the current scheduler is unsatisfactory because of its low schedule efficiency. This paper thus presents a high-performance task assignment framework built upon Xilinx hybrid CPU-FPGA MPSoC devices. We first evaluate the main causes of the low schedule efficiency problem. Then, we explore the scheduler's rules and improve schedule efficiency through purposeful observations and analysis. Finally, we integrate our optimizations and propose an efficient task assignment framework to maximize performance on the DPU-based CNN acceleration platform. Experimental results on the Xilinx Zynq UltraScale+ MPSoC zcu104 show that our efficient task assignment framework significantly boosts schedule efficiency for small-scale CNNs (from 36% to 70%), medium-scale CNNs (from 65% to 95%), and large-scale CNNs (from 77% to 99%) compared with the original schedule strategy.


I. INTRODUCTION
A. BACKGROUND
Convolutional neural networks (CNNs) have gradually replaced traditional machine vision methods in image recognition, object detection, image segmentation and many other machine vision applications because of their excellent performance [1]-[6]. Generally, each CNN consists of multiple layers with different functions, which requires suitable hardware to accelerate its inference process. Meanwhile, many emerging fields, such as intelligent robots, unmanned aerial vehicles, autopilot cars and space probes, impose strict restrictions on the power, delay and physical size of hardware accelerators, and traditional GPUs can hardly satisfy their requirements [7], [8]. To satisfy these strict requirements, the Field Programmable Gate Array (FPGA) has become a high-performance and flexible accelerator of CNN inference in many emerging fields [9]-[13].
In actual applications, the developers tend to run multiple CNNs in parallel to complete different tasks because those CNNs have different network structures, model sizes and their own advantages. In response to increasingly complex application scenarios, the Intellectual Property core (IP core) called Deep Learning Processor Unit (DPU) is released by Xilinx Inc. Different from previous FPGA-based hardware accelerators focusing on specific functions or CNNs [2], [5], [6], [13]- [16], the DPU IP supports ample basic functions of deep learning, which allows for the efficient implementation of many CNNs on FPGAs.
To strike a balance between flexibility (of being able to handle different functional layers of CNNs) and performance, a more subtle acceleration platform should be designed as a heterogeneous architecture that can take advantage of both CPU and FPGA, like the Xilinx Zynq SoC, a so-called hybrid CPU-FPGA device. This type of hybrid device has been employed in accelerating a wide range of application-specific algorithms, including CNN inference. In our study, the DPU IP is implemented in the programmable logic (FPGA) of the selected Xilinx Zynq UltraScale+ MPSoC zcu104 and integrated into the processing system (ARM) through an AXI interconnect to perform CNN inference, such as image classification, object detection, and semantic segmentation. Fig. 1 shows the DPU-based CNN acceleration platform, where the DPU IP is instantiated as DPU0 and DPU1 in zcu104. The process of CNN inference in this platform is as follows. Firstly, the quantizer (Xilinx DECENT tool) is used to optimize the system performance by reducing the model size, and the compiler (Xilinx DNNC tool) is designed to implement CNN inference and memory access by generating the DPU-related instructions. Then, the CNN inference tasks are initialized in ARM and assigned to multiple DPUs by using an encapsulated scheduler. Finally, the results are obtained. Table 1 shows the CNN features and parameters supported by the DPU-based acceleration platform; the parameter channel_parallel = 16 is determined by the DPU configuration. In addition, functions such as average pooling, ReLU and softmax are optional and determined during DPU configuration.
Therefore, it is more convenient to accelerate CNN inference by using DPU-based CNN acceleration platform, especially for developers without FPGA knowledge, because it integrates many excellent hardware functions on FPGAs compared with the FPGA hardware accelerators for specific functions or CNNs proposed in the previous studies. To further demonstrate the advantages of the DPU-based acceleration platform, we introduce a representative CNN accelerator (NullHop [17]) released recently as a comparison in Table 2.

B. MOTIVATION
The DPU-based CNN acceleration platform affords ample basic functions of deep learning, supports a fast development flow and provides convenient operation for developers to accelerate different CNN inferences in parallel. In this platform, an encapsulated scheduler assigns tasks to multiple DPUs, and the concept of schedule efficiency is proposed to reflect the utilization of DPUs over a period of time, which can be detected by using the Xilinx performance monitoring tool Dsight [18], [19]. In fact, the differences in structures and models of CNNs result in different inference times, and a flexible schedule strategy is required to deal with the complex and changeable situations. However, the low schedule efficiency problem is exposed as application scenarios become increasingly complex, which means that the inference tasks take longer to complete their execution. The core ideas of the original schedule strategy (S original) are as follows: (1) there is a priority between DPUs, that is, inference tasks are assigned to DPU0 if DPU0 and DPU1 are idle (tasks are assigned to DPU1 only when DPU0 is busy); (2) inference tasks simply run on DPUs using up to two concurrent threads in the order of initialization completion, rather than a more efficient method [18], [19]. Fig. 2 to Fig. 4 reveal how the S original leads to low schedule efficiency. Note that an inference in this paper represents the whole calculation process of a feature map (single-frame) from the first layer to the last layer of a CNN. Fig. 2 shows the unbalanced task assignment problem by using one thread and running a single network named A. In Fig. 2, network A consists of two parts, where the initialization part runs on ARM and the inference part runs on DPU, and the number of inferences is 3.
Obviously, the S original cannot evenly assign those tasks to multiple DPUs according to the actual number of inferences, which results in DPU1 being idle during the above process. An ideal schedule strategy (our optimization goal) should assign those tasks to different DPUs, thereby improving schedule efficiency. Fig. 3 exposes the problem of unused interval time between two inferences by using two threads and running network A. In Fig. 3, although DPU1 is utilized for inferences, the schedule efficiency is still low because the interval time between two inferences is wasted. An ideal schedule strategy (our optimization goal) should take advantage of the interval time on each DPU. Fig. 4 reveals the scheduling confusion problem by using two threads and running multiple networks named A and B. In Fig. 4, network A and network B have different inference times. Therefore, it is uncertain whether tasks are assigned to DPU0 or DPU1, and schedule efficiency is hard to guarantee in the S original. An ideal schedule strategy (our optimization goal) should ensure that the inference tasks of different networks are controllable.
In summary, the S original has two problems: unbalanced task assignment and unused interval time, which lead to low schedule efficiency.

C. CONTRIBUTIONS
In this study, we aim to improve schedule efficiency on the DPU-based CNN acceleration platform. Although the principles of the encapsulated scheduler cannot be modified directly, the execution sequence and the trigger time of each task are controllable by exploring the rules of the scheduler. Therefore, schedule efficiency has the potential to be improved. Our contributions are as follows: (1) We evaluate the main causes of the low schedule efficiency problem, and explore the rules of the encapsulated scheduler.
(2) We improve schedule efficiency through observations and analysis, and propose an efficient task assignment framework.
To the best of our knowledge, this is the first work that supports an efficient task assignment framework on DPU-based CNN acceleration platform, which can take advantage of both CPU's generality and FPGA's flexibility. Furthermore, the proposed efficient task assignment framework also presents a remarkable improvement in terms of schedule efficiency in comparison with S original .
The rest of this paper is organized as follows. Section II shows the related work. Section III evaluates the main causes of the low schedule efficiency problem, and explores the rules of the encapsulated scheduler. Section IV proposes an efficient task assignment framework to improve schedule efficiency. Section V verifies the performance (schedule efficiency) of our efficient framework and presents the experimental results. Section VI concludes this study.

II. RELATED WORK
Considering that this study focuses on improving schedule efficiency to accelerate CNN inference on the DPU-based platform, this section mainly reviews research on optimizations related to CNN inference on FPGAs.
Many studies have accelerated CNN inference by optimizing the network structure and model. In [20], the authors proposed a novel layer-based structured design method for full scalability in constructing CNNs, in which all the CNN layers are optimized and deployed separately and independently. In [21], the authors reduced the parameters and model sizes of CNNs. In [22], all the CNN layers were optimized and deployed to accelerate face feature extraction.
Meanwhile, many excellent algorithms have been designed to accelerate the basic functions of CNN inference. In [23], two dedicated computing engines named Conv Engine and Dwcv Engine were designed for pointwise convolution and depthwise convolution to improve the efficiency. In [15], the authors aimed to accelerate sparse CNNs. In [24], an energy-efficient and high-throughput FPGA-based CNN accelerator was proposed. In [25], the fast Winograd algorithm was employed to reduce the arithmetic complexity and improve the performance of CNNs on FPGAs.
To further improve the performance of CNN inference, the researchers have focused on the FPGA-based SoC architecture. In [26], a new idea was explored and implemented for basic processing element of CNN. In [27], the authors analyzed the on-chip and off-chip memory resources of FPGA, and proposed a memory optimization method. In [28], the accelerator has adopted full parallel pipeline structure to improve the computational efficiency for CNN inference. In [29], the authors proposed a framework to solve boundary problem and connect CNN accelerator with ARM processors and DDR4 memory through dual AXI bus. In [30], the authors proposed a new approach utilizing bit-width partitioning of FPGA DSP resources to improve the performance and resource utilization efficiency of CNN accelerator. In [9], the authors proposed a framework to enable fine-grained customization for CNNs with dynamic reconfiguration on FPGAs. In [17], the authors exploited the sparsity of neuron activations in CNNs to accelerate the computation and reduce memory requirements, and proposed NullHop accelerator.
Because of many excellent optimizations and designs, FPGAs have been widely used in actual applications. In [31], an ADAS system was designed and an FPGA was used to accelerate its CNN inference. In [3], [32] and [33], FPGA-based real-time reliable target detectors were implemented. In [34], optimizations for a CNN cascade face detection algorithm were proposed. However, until recently FPGAs were mostly programmed in hardware description languages (HDLs), which implies a longer time-to-market and requires designers to understand hardware details in order to optimize performance, resulting in FPGAs being less used than other solutions. Therefore, it is not easy to design and optimize an FPGA-based accelerator of CNN inference in actual projects, especially for users without FPGA development experience. To solve the above problem, the DPU-based CNN acceleration platform is used to deal with increasingly complex application scenarios. It is worth mentioning that the DPU-based acceleration platform supports ample basic functions of deep learning compared with previous optimizations and designs. Developers can take advantage of DPUs to accelerate CNN inference efficiently. Therefore, the problem of low schedule efficiency on the DPU-based CNN acceleration platform should be studied.

III. THE EVALUATIONS AND EXPLORATIONS FOR SCHEDULE EFFICIENCY PROBLEMS
In this section, we evaluate the main causes of the low schedule efficiency problem by running the typical lightweight MobileNet-v2, small-scale Inception-v1, medium-scale ResNet-50, and large-scale Yolo-v3 and VGG-16, which are widely used in machine vision. We also explore the rules of the encapsulated scheduler, and propose an observation-based task assignment framework (F observation). Because of the variety of CNNs, we select the above representative CNNs with different scales for the following evaluations and explorations.

A. THE PROBLEM OF UNBALANCED TASK ASSIGNMENT
Observation 1: When running inferences with one or multiple threads, the S original assigns too many tasks to DPU0 rather than DPU1, which exposes the problem of unbalanced task assignment.
In the DPU-based CNN acceleration platform, schedule efficiency, a metric proposed by Xilinx Inc., reflects the utilization of DPUs over a period of time, and low schedule efficiency means that the inference tasks take longer to complete their execution [18], [19]. Specifically, for each CNN, the schedule efficiency (SE) of each DPU is expressed as:

SE = T_inference / (T_inference + T_interval) × 100%,  (1)

where T_inference and T_interval represent the total inference time and total interval time for each DPU, respectively. For example, suppose one inference of network A takes 3ms and the interval time between two inferences of network A is 8ms. Assume that the number of inferences is 2 for each DPU, that is, T_inference = 6ms, T_interval = 8ms and SE = 6/(6+8) × 100% ≈ 42.9% for each DPU. Table 3 shows the schedule efficiency of 40 inferences on different CNNs under 1, 2, 4 and 8 threads. In Table 3, we observe that the schedule efficiency of DPU1 is lower than that of DPU0, especially for lightweight MobileNet-v2 and small-scale Inception-v1. Moreover, the improvement of schedule efficiency of lightweight and small-scale CNNs by increasing the number of threads is not satisfactory. That is, although multithreading is an effective method to improve schedule efficiency, the degree of improvement varies with different CNNs. For example, the maximum schedule efficiency of DPU0 is only 48.3% for MobileNet-v2 (4 threads), which means that DPU0 is idle for more than half of the time during 40 inferences. Meanwhile, multithreading can improve schedule efficiency when the number of threads changes from 1 to 4. However, schedule efficiency declines when the number of threads increases from 4 to 8. In other words, the ideal performance cannot always be achieved by increasing the number of threads without limitation.
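As a minimal illustration of equation (1), the per-DPU schedule efficiency and the worked example for network A can be sketched in Python; the 3 ms and 8 ms figures are the example values from the text, not measurements, and the function name is our own:

```python
def schedule_efficiency(t_inference_total_ms, t_interval_total_ms):
    """Equation (1): fraction of time a DPU spends on inference (in %)."""
    return t_inference_total_ms / (t_inference_total_ms + t_interval_total_ms) * 100.0

# Network A example: 3 ms per inference, 8 ms interval, 2 inferences per DPU.
t_inference = 3 * 2   # T_inference = 6 ms
t_interval = 8        # T_interval  = 8 ms
se = schedule_efficiency(t_inference, t_interval)   # ≈ 42.9 %
```

Raising SE therefore requires either shrinking the interval time or filling it with useful work, which is exactly what the strategies in Section IV do.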
The analysis of the above observations is as follows: (1) Firstly, the problem of unbalanced task assignment is ultimately caused by the DPU priority rule of the S original, as mentioned in Section I-B. (2) Secondly, a complete inference process consists of two parts: ARM is responsible for the task's initialization, and then the initialized task is sent to a DPU for execution. For small-scale CNNs, the initialization of the next inference may not be completed at the end of the previous inference, and the opposite is true for large-scale CNNs. This rationalizes the difference in schedule efficiency for CNNs with various scales. (3) Thirdly, the upper bound on threads is usually related to the number of CPU cores (the ARM Cortex-A53 in zcu104 is a quad-core processor), and the concurrency (parallelism) of four threads is perfectly supported in zcu104. As the number of threads continues to increase, it may aggravate the competition for the CPU between threads, resulting in performance degradation.

B. THE PROBLEM OF UNUSED INTERVAL TIME BETWEEN TWO INFERENCES
Observation 2: In the DPU-based CNN acceleration platform, the interval time between two inferences on DPUs is relatively fixed for CNNs (the initialization time on ARM). The interval time, caused by the communication and data transmission between FPGA and ARM, affects schedule efficiency according to equation (1), especially for small-scale CNNs. In small-scale CNNs, schedule efficiency is low and the interval time even exceeds the inference time. Fig. 5 shows the proportion of average inference time and average interval time (between two inferences) on DPUs for different CNNs. We find that the interval time between two inferences of different networks is about the same (5-9ms). Furthermore, the interval time is caused by the communication and data transmission time between FPGA and ARM before the next inference. Meanwhile, the inference time of small-scale CNNs is less than the corresponding interval time, and multithreading cannot effectively improve schedule efficiency. However, the inference time of large-scale CNNs is much longer than the corresponding interval time. If small-scale CNNs can be inserted into the interval time of large-scale CNNs, schedule efficiency can be improved. Table 4 shows the performance of inserting small-scale CNNs (MobileNet-v2 and Inception-v1) into the interval times of a large-scale CNN (Yolo-v3). The number of threads is 2 for each CNN and the number of inferences is 20 for each thread. The second column (Yolo) in Table 4 is used as a comparison experiment. From Fig. 5, the total time to complete a CNN inference consists of interval time (initialization time on ARM) and inference time. Assume that DPU0 and DPU1 are each responsible for 20 inferences. The total time for MobileNet-v2 and Inception-v1 to complete 20 inferences on DPUs is (7ms+4ms)×20 = 220ms and (8ms+7ms)×20 = 300ms, respectively.
However, when running Yolo-v3 and MobileNet-v2, the total time of DPU0 and DPU1 only increased by 47ms and 17ms compared to the comparison experiment, respectively. Similar results can be observed when running Yolo-v3 and Inception-v1. It is worth noting that the total time of DPU0 and DPU1 only increased by about 60ms even when running Yolo-v3, MobileNet-v2 and Inception-v1 simultaneously, which is much smaller than the theoretical total time of 20 inferences for MobileNet-v2 and Inception-v1 ((220 + 300) ms). The above results show that inserting small-scale CNNs into the interval time of large-scale CNNs is a feasible method to improve total schedule efficiency (hybrid CNNs with multithreading).
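The insertion argument above can be made concrete with a toy model. Assuming, hypothetically, that one small-scale inference is slotted into each interval window of the large-scale CNN, the extra DPU-busy time per cycle is only the portion of the small inference that overflows the window:

```python
def extra_time_per_cycle_ms(small_inference_ms, large_interval_ms):
    """Toy model: a small-scale inference inserted into a large-scale
    CNN's interval window adds time only if it overflows the window."""
    return max(0.0, small_inference_ms - large_interval_ms)

# Illustrative figures in the spirit of Fig. 5: MobileNet-v2 inference
# ≈ 4 ms against an assumed ≈ 7 ms interval window — insertion is free.
overhead_free = extra_time_per_cycle_ms(4, 7)    # 0.0 ms
# A 9 ms inference in the same window would overflow by 2 ms per cycle.
overhead_tight = extra_time_per_cycle_ms(9, 7)   # 2.0 ms
```

This is why the measured totals grow by only tens of milliseconds rather than by the full 220 + 300 ms of stand-alone small-CNN inference time.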

C. THE OBSERVATION-BASED TASK ASSIGNMENT FRAMEWORK
In the previous subsections, we evaluated the main causes of the low schedule efficiency problem. To satisfy the requirements of complex and changeable application scenarios, we improve schedule efficiency based on our observations and analysis in this subsection. Fig. 6 shows the relationship between threads and performance when the number of inferences is 40 for each CNN. In each case, the 40 inferences are equally assigned to each thread. When the number of threads increases from 1 to 4, schedule efficiency increases and the total time of 40 inferences decreases. However, this trend disappears as the number of threads continues to increase for each CNN. The reasons for the above conclusions have been analyzed in Section III-A, namely, the upper bound on threads is related to the number of processor cores.
Since each CNN has the same trend, the upper bound on threads for each CNN is set to 3 in our observation-based task assignment framework (minimizing the number of threads while ensuring sufficient performance for each CNN according to Fig. 6). Furthermore, the total number of threads in our framework is set to 6, because for each network, increasing the number of threads beyond this can no longer improve schedule efficiency (the upper bound on threads is related to the number of processor cores). The details of the observation-based task assignment framework (F observation) are explained as follows: (1) Firstly, the maximum number of threads for each CNN and for the framework is determined according to actual user requirements (single CNN or multiple CNNs). As mentioned earlier, the upper bound on threads for each CNN and for the framework is 3 and 6, respectively.
(2) Secondly, the computing tasks are transferred to ARM, DPUs or a candidate queue if multiple CNNs need to be processed.
(3) Thirdly, if the actual project requirements include CNNs with various scales, each large-scale CNN with 3 threads is selected one by one, and then combined with one small-scale or medium-scale CNN with 3 threads to insert into the interval time of the large-scale CNN (according to the order of entering the queue). Note that the inference tasks of each CNN are assigned equally to each thread to reduce the difference of schedule efficiency between DPU0 and DPU1.
(4) Finally, the inferences run on ARM and DPUs until all computing tasks in the candidate queue are completed.
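The pairing logic in step (3) can be sketched as follows; the function name, the dictionary layout and the string labels are our own illustrative choices, not an API from the platform:

```python
from collections import deque

def pair_for_insertion(cnns):
    """Pair each large-scale CNN (3 threads) with one small- or
    medium-scale CNN (3 threads) in queue order, so the latter can be
    inserted into the former's interval time; leftovers run alone."""
    large = deque(c for c in cnns if c["scale"] == "large")
    other = deque(c for c in cnns if c["scale"] != "large")
    pairs = []
    while large:
        lg = large.popleft()
        partner = other.popleft() if other else None
        pairs.append((lg["name"], partner["name"] if partner else None))
    # Remaining small/medium CNNs have no large partner and run alone.
    pairs.extend((c["name"], None) for c in other)
    return pairs

jobs = [{"name": "Yolo-v3", "scale": "large"},
        {"name": "MobileNet-v2", "scale": "small"},
        {"name": "ResNet-50", "scale": "medium"}]
```

With the three example networks above, the sketch pairs Yolo-v3 with MobileNet-v2 and leaves ResNet-50 to run alone.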

IV. THE EFFICIENT TASK ASSIGNMENT FRAMEWORK
Although the F observation improves schedule efficiency from observations and analysis, it still has three shortcomings: (1) It fails to guarantee ideal performance by just increasing the number of threads.
(2) It fails to improve schedule efficiency based on the features of CNNs with various scales, because the F observation is based on coarse-grained observations.
(3) It fails to support efficient inferences of multiple CNNs, because inferences are performed with multiple threads out of order.
In this section, we tailor different strategies to improve schedule efficiency for small-scale, medium-scale and large-scale CNNs based on their corresponding inference time and interval time. Then, we integrate our optimizations and propose an efficient task assignment framework to maximize performance on the DPU-based CNN acceleration platform. Our framework provides software designers with higher-level processing power and an efficient way to accelerate CNN inference.

A. SCHEDULE EFFICIENCY IMPROVEMENT FOR SMALL-SCALE CNNs
Due to data transfer between FPGA and ARM, the interval time between two inferences is unavoidable for CNNs. Although the F observation has improved schedule efficiency, the interval time is still hard to fully utilize. For small-scale CNNs, the inference time is far less than the interval time. If some tasks are inserted into the interval time between two inferences, the schedule efficiency can be effectively increased. Note that an inference in this paper represents the whole calculation process of a feature map (single-frame) from the first layer to the last layer of a CNN. Fig. 7 shows our schedule efficiency improvement strategy for small-scale CNNs, in which six threads are set up to accelerate the inferences of small-scale CNNs, called the three groups round-robin strategy (ThGRR). The details of the ThGRR strategy are as follows: (1) Firstly, all inference tasks are equally assigned to six threads, and the six threads are divided into three groups.
(2) Secondly, group one (thread 1 and thread 2) is launched concurrently to assign inference tasks to DPU0 and DPU1 in parallel. According to the rules of the S original, DPU0 has a higher priority than DPU1, but as long as DPU0 is busy, DPU1 immediately accepts subsequent tasks for execution. Therefore, setting up two concurrent threads as a group can ensure that the two DPUs work in a nearly synchronous manner. As the number of threads increases, the task assignment tends to become chaotic.
(3) Thirdly, three groups of threads round-robin between ARM and DPUs until all inference tasks are completed.
If only two threads (one group) are used, the tasks of the two concurrent threads are executed either in ARM or in DPUs at the same time, so it is difficult to make full use of the parallel computing power of the heterogeneous ARM and DPUs to perform initializations and inferences simultaneously (the S original in the top half of Fig. 7). Furthermore, in the case of four threads (two groups), when the inference tasks of one group are completed, the initializations of the other group are not completed, which still leads to low schedule efficiency. Our ThGRR strategy for small-scale CNNs improves the utilization of the heterogeneous ARM and DPUs, and provides a good balance between threads and performance (the F efficient in the bottom half of Fig. 7). Meanwhile, an inference task can be inserted into an interval time if its inference time is less than this interval time. For small-scale CNNs, T inference < T interval is certain, but 2T inference < T interval is not guaranteed. Therefore, the ThGRR strategy ensures high schedule efficiency and avoids competition between threads. As the thread groups continue to increase, many tasks that have completed initialization will not have an appropriate interval time to be inserted into. Otherwise, threads compete for DPUs, which usually leads to poor schedule efficiency.
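The ThGRR grouping described above can be sketched with Python threads; the locks and sleeps below merely stand in for the DPU runtime calls, which this paper does not expose, so all names and timings are illustrative assumptions:

```python
import threading
import time

DPU_LOCKS = [threading.Lock(), threading.Lock()]   # DPU0, DPU1

def group_thread(group_id, dpu_id, n_tasks, t_init_s, t_inf_s, log, log_lock):
    """One thread of a group: alternate ARM-side initialization with an
    exclusive inference slot on its DPU (round-robin over groups)."""
    for _ in range(n_tasks):
        time.sleep(t_init_s)                 # initialization on ARM
        with DPU_LOCKS[dpu_id]:              # inference on DPU0 or DPU1
            with log_lock:
                log.append((group_id, dpu_id))
            time.sleep(t_inf_s)

def run_thgrr(n_groups=3, tasks_per_thread=2, t_init_s=0.007, t_inf_s=0.004):
    """Launch n_groups groups of two threads (one per DPU) and wait."""
    log, log_lock = [], threading.Lock()
    threads = [threading.Thread(target=group_thread,
                                args=(g, d, tasks_per_thread,
                                      t_init_s, t_inf_s, log, log_lock))
               for g in range(n_groups) for d in (0, 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

While one group's threads hold the DPU locks, the other groups are initializing on ARM, which is the mechanism by which ThGRR hides the interval time.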

B. SCHEDULE EFFICIENCY IMPROVEMENT FOR MEDIUM-SCALE CNNs
For medium-scale CNNs, the inference time is about equal to the interval time. Different from the ThGRR strategy adopted for small-scale CNNs, medium-scale CNNs have longer inference times. Therefore, the three groups with delay round-robin strategy (ThGDRR) is proposed to avoid competition for DPUs between threads. Fig. 8 shows our ThGDRR strategy for medium-scale CNNs. Six threads (three groups) are set up to accelerate the inferences of medium-scale CNNs. Similar to small-scale CNNs, the three groups round-robin strategy is adopted as a basis, which has been introduced in the previous subsection. The difference is that the initialization time is about equal to the inference time for medium-scale CNNs, and a delay between thread groups should be added to ensure that there is no competition between inference tasks. The delay should satisfy:

T_delay ≥ T_inference − T_interval,  (2)

where T_interval and T_inference can be determined by the Xilinx performance monitoring tool Dsight, so as to ensure that inference tasks of medium-scale CNNs can be completely inserted into the new interval time. If no delay is added between thread groups, competition among tasks is inevitable, which affects performance (schedule efficiency cannot be guaranteed). Meanwhile, it is unnecessary to continue adding thread groups, because at most one inference task can be inserted in the interval time after adding a delay.
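The minimum inter-group delay follows directly from the insertion condition stated above (an inference fits only if T_inference ≤ T_interval + T_delay); the helper below is our own sketch of that bound, with the Dsight measurements passed in as plain numbers:

```python
def min_group_delay_ms(t_inference_ms, t_interval_ms):
    """Smallest inter-group delay such that one inference fits into the
    enlarged interval window (t_interval + delay >= t_inference).
    Derived from the ThGDRR insertion condition; times come from Dsight."""
    return max(0.0, t_inference_ms - t_interval_ms)

# ResNet-50 example figures from Fig. 5: inference ≈ 14 ms, interval ≈ 7 ms.
delay = min_group_delay_ms(14, 7)   # 7.0 ms between thread groups
```

For small-scale CNNs the bound degenerates to zero, which is consistent with ThGRR needing no delay.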

C. SCHEDULE EFFICIENCY IMPROVEMENT FOR LARGE-SCALE CNNs
For large-scale CNNs, the inference time is much longer than the interval time. Therefore, four threads are set up to accelerate the inference tasks of large-scale CNNs, which is enough to satisfy the requirement of high schedule efficiency. Fig. 9 shows our two groups round-robin strategy (TwGRR) for large-scale CNNs.
(1) Firstly, group one (thread 1 and thread 2) is launched simultaneously to execute inference tasks on DPU0 and DPU1 in parallel.
(2) Secondly, when the inference tasks of group one are completed, the inference tasks of group two (thread 3 and thread 4) are sent to DPU0 and DPU1 simultaneously, and group one starts to initialize its next inference tasks. Because the inference time is much longer than the interval time, the inference tasks of group two have been initialized before the inference tasks of group one are completed on the two DPUs. Therefore, the TwGRR strategy is enough to achieve high schedule efficiency.

D. AN EFFICIENT TASK ASSIGNMENT FRAMEWORK
In this subsection, an efficient task assignment framework (F efficient) is proposed. Different from the F observation, where multiple CNNs are mixed and executed out of order and competition between inference tasks is inevitable, the proposed F efficient takes the competition among threads into consideration, thus further improving schedule efficiency. In fact, our F observation only considers the impact of the increase in threads on performance. The F efficient pays more attention to the high performance implementation of each CNN (all CNNs are executed one by one in the F efficient). The details of the F efficient are as follows: (1) Firstly, the F efficient determines large-scale, medium-scale and small-scale CNNs according to the proportion of inference time and interval time, which is easily determined by using the Xilinx performance monitoring tool Dsight.
(2) Secondly, all CNNs enter the candidate queue and are executed one by one with the corresponding strategy (ThGRR, ThGDRR or TwGRR).
(3) Finally, all computing tasks in the candidate queue are completed.
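The scale determination in step (1) can be sketched as a ratio test on the two Dsight measurements; the 1.0 and 2.0 thresholds below are illustrative guesses consistent with the example timings in Section V (MobileNet-v2 4/7 ms, ResNet-50 14/7 ms), not values given by the paper:

```python
def classify_scale(t_inference_ms, t_interval_ms):
    """Classify a CNN by the ratio of inference time to interval time,
    both measurable with Dsight. Thresholds are illustrative only."""
    r = t_inference_ms / t_interval_ms
    if r < 1.0:
        return "small"    # inference shorter than interval -> ThGRR
    if r <= 2.0:
        return "medium"   # comparable -> ThGDRR
    return "large"        # inference dominates -> TwGRR
```

Each queued CNN would then be dispatched to the strategy matching its label.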

V. EXPERIMENTS
This study aims to address the low schedule efficiency problem on the DPU-based CNN acceleration platform by proposing an efficient task assignment framework (F efficient). The performance metric selected for comparison is the total time of running experiments on small-scale, medium-scale, large-scale and multiple-scale CNNs. The frameworks compared with the presented F efficient are the official original strategy (S original) and our observation-based task assignment framework (F observation). All inference tasks are executed on the Xilinx UltraScale+ MPSoC zcu104 platform with a heterogeneous Cortex-A53 ARM and ZU7 FPGA (two DPUs).

A. INFERENCES FOR SMALL-SCALE CNNs
In our study, CNNs are divided into three types: large-scale, medium-scale and small-scale. Our F efficient supports corresponding task assignment strategies (ThGRR, ThGDRR and TwGRR) for different CNNs.

Algorithm 1 The Observation-Based Task Assignment Framework (F observation)
4: Assign inference tasks equally to each thread;
5: Execution;
6: else if (multiple CNNs) then
7: All CNNs enter the queue;
8: flag_large=0;
9: flag_other=0;
10: while (queue!=null) do
11: if (a large-scale CNN in queue && flag_large==0) then
12: Select one as the inference tasks (3 threads);
13: Assign inference tasks equally to each thread;
14: flag_large=1;
15: end if
16: if (a medium-scale CNN or small-scale CNN in queue && flag_other==0) then
17: Select one as the inference tasks (3 threads);
18: Assign inference tasks equally to each thread;
19: flag_other=1;
20: end if
21: Execution;
22: if (the large-scale CNN is completed) then
23: flag_large=0;
24: else if (the other CNN is completed) then
25: flag_other=0;
26: end if
27: end while
28: end if
29: Obtain the results.

Algorithm 2 The Efficient Task Assignment Framework (F efficient)
Require: CNNs with various scales
Ensure: Classification/detection/segmentation results
1: All CNNs enter the queue;
2: while (queue!=null) do
3: Select one as the inference tasks according to the order of entering the queue;
4: if (small-scale CNN) then
5: Assign inference tasks equally to each thread;
6: Execution (ThGRR strategy);
7: else if (medium-scale CNN) then
8: Assign inference tasks equally to each thread;
9: Execution (ThGDRR strategy);
10: else if (large-scale CNN) then
11: Assign inference tasks equally to each thread;
12: Execution (TwGRR strategy);
13: end if
14: end while
15: Obtain the results.

Table 5 shows the total time of lightweight MobileNet-v2 and small-scale Inception-v1 in different task assignment frameworks; note that inference time is less than interval time for small-scale CNNs. In Table 5, forty inference tasks are evenly
assigned to two threads, three threads and six threads for S_original, F_observation and F_efficient, respectively. According to Fig. 5, the total time of one inference of MobileNet-v2 is composed of inference time of 4 ms and interval time of 7 ms, while the total time of one inference of Inception-v2 is composed of inference time of 7 ms and interval time of 8 ms. Using S_original with a single thread, it takes 440 ms and 600 ms to complete 40 inferences for MobileNet-v2 and Inception-v2, respectively; that is, the ARM (initializations) and DPU0 (inferences) execute serially, while DPU1 is idle. When using S_original with two threads, the total time to complete the 40 inference tasks is reduced to 220 ms and 300 ms for each DPU, respectively. Apparently, the multithreading strategy accelerates inferences of MobileNet-v2 and Inception-v2, and improves schedule efficiency significantly. However, the schedule efficiency of MobileNet-v2 and Inception-v2 can still be improved: although S_original makes use of DPU1, the interval time between two inferences still has potential to be exploited. Specifically, in S_original, the 40 inference tasks are equally assigned to the two DPUs (each DPU undertakes 20 inference tasks). For the above two small-scale CNNs, the total inference time of each DPU is 4 × 20 = 80 ms and 7 × 20 = 140 ms, and the total interval time of each DPU is 7 × 20 = 140 ms and 8 × 20 = 160 ms, respectively. Schedule efficiency is improved in our proposed F_observation and F_efficient. In our F_observation, the total time of each DPU is reduced to 164 ms and 245 ms for the above two small-scale CNNs, respectively; that is, the unused interval time of each DPU is shortened to 84 ms and 104 ms, compared with 140 ms and 160 ms in S_original. In our F_efficient, the total time of each DPU is further reduced to 115 ms and 210 ms for MobileNet-v2 and Inception-v2.
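The per-DPU totals quoted for S_original follow a simple serial cost model: each task costs one ARM-side interval (initialization) plus one DPU inference. A hedged sketch of that arithmetic, with the function name and signature our own rather than anything from the paper:

```python
def total_time_per_dpu_ms(t_inf, t_int, n_tasks, n_dpus=2):
    """S_original cost model: per DPU, each assigned task costs one
    interval (ARM init) plus one inference, executed serially."""
    tasks_per_dpu = n_tasks // n_dpus
    return (t_inf + t_int) * tasks_per_dpu

# MobileNet-v2: 4 ms inference + 7 ms interval, 40 tasks.
print(total_time_per_dpu_ms(4, 7, 40, n_dpus=1))  # 440 (single thread)
print(total_time_per_dpu_ms(4, 7, 40))            # 220 (two threads)
# Inception-v2: 7 ms inference + 8 ms interval, two threads.
print(total_time_per_dpu_ms(7, 8, 40))            # 300
```

The F_observation and F_efficient totals (164 ms, 115 ms, etc.) do not fit this serial model because those frameworks overlap interval time with inference; the text reports them as measurements rather than a closed-form formula.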
And the total interval time of each DPU is 35 ms (140 ms in S_original) and 70 ms (160 ms in S_original) for MobileNet-v2 and Inception-v2, respectively. Compared with S_original, the total time of each DPU is reduced by about 100 ms and 90 ms, respectively.

B. INFERENCES FOR MEDIUM-SCALE CNNs
Table 6 shows the total time of medium-scale ResNet-50 in different task assignment frameworks; for medium-scale CNNs, inference time is about equal to interval time. In Table 6, forty inferences are evenly assigned to two threads, three threads and six threads for S_original, F_observation and F_efficient, respectively. According to Fig. 5, the total time of one inference of ResNet-50 is composed of inference time of 14 ms and interval time of 7 ms. Using S_original with a single thread, it takes 840 ms to complete 40 inferences for ResNet-50. When using S_original with two threads, the total time to complete the 40 inference tasks is reduced to 420 ms for each DPU. In S_original, the total inference time of each DPU is 14 × 20 = 280 ms, and the total interval time of each DPU is 7 × 20 = 140 ms. Schedule efficiency is improved in our proposed F_observation and F_efficient. In our F_observation, the total time of each DPU is reduced to 361 ms for ResNet-50; that is, the unused interval time of each DPU is shortened to 81 ms, compared with 140 ms in S_original. Furthermore, in our F_efficient, the total time of each DPU is reduced to 295 ms, and the total interval time of each DPU is 15 ms for ResNet-50. Compared with S_original, the total time of each DPU is reduced by about 125 ms.

Table 7 shows the total time of large-scale Yolo-v3 and VGG-16 in different task assignment frameworks; for large-scale CNNs, inference time is much larger than interval time. In Table 7, forty inferences are evenly assigned to two threads, three threads and four threads for S_original, F_observation and F_efficient, respectively.

C. INFERENCES FOR LARGE-SCALE CNNs
According to Fig. 5, the total time of one inference of Yolo-v3 is composed of inference time of 28 ms and interval time of 8 ms, while the total time of one inference of VGG-16 is composed of inference time of 36 ms and interval time of 6 ms. Using S_original with a single thread, it takes 1440 ms and 1680 ms to complete 40 inferences for Yolo-v3 and VGG-16, respectively. When using S_original with two threads, the total time to complete the 40 inference tasks is reduced to 720 ms and 840 ms for each DPU, respectively. For the above two large-scale CNNs, the total inference time of each DPU is 28 × 20 = 560 ms and 36 × 20 = 720 ms, and the total interval time of each DPU is 8 × 20 = 160 ms and 6 × 20 = 120 ms, respectively. In our F_observation, the total time of each DPU is reduced to 602 ms and 755 ms for the above two large-scale CNNs, respectively; that is, the unused interval time of each DPU is shortened to 42 ms and 35 ms, compared with 160 ms and 120 ms in S_original. In our F_efficient, the total time of each DPU is further reduced to 578 ms and 725 ms for Yolo-v3 and VGG-16, and the total interval time of each DPU is 18 ms (160 ms in S_original) and 5 ms (120 ms in S_original), respectively. Compared with S_original, the total time of each DPU is reduced by about 142 ms and 115 ms, respectively.

D. INFERENCES FOR MULTIPLE CNNs WITH VARIOUS SCALES
Table 8 shows the total time of five CNNs with various scales in different task assignment frameworks. In Table 8, twelve inferences of each CNN are assigned to two threads, three threads and four (or six) threads for S_original, F_observation and F_efficient, respectively. Note that the upper bound on threads is 6 in F_observation, and schedule efficiency is not listed here because multiple CNNs are involved (the results of the first three experiments can be used as a reference). Using S_original with a single thread, it takes 1500 ms to complete the 12 × 5 = 60 inferences of the five CNNs with various scales.
When using S_original with two threads, the total time to complete the 60 inference tasks is reduced to 750 ms for each DPU. In our F_observation, the total time of each DPU is reduced to 638 ms, while in our F_efficient, the total time of each DPU is further reduced to 598 ms. Compared with S_original, the total time of each DPU is reduced by about 152 ms.

VI. CONCLUSION
This study aims to present a high performance task assignment framework built upon Xilinx hybrid CPU-FPGA SoC devices with the DPU IP. We first evaluate the main causes of the low schedule efficiency problem. Then, we explore the scheduler rules and improve schedule efficiency through purposeful observations and analysis. Finally, we integrate our optimizations and propose an efficient task assignment framework to maximize performance on the DPU-based CNN acceleration platform. Experimental results on the Xilinx Zynq UltraScale+ MPSoC zcu104 show that the efficient task assignment framework significantly boosts schedule efficiency for inferences of CNNs with various scales compared with the original schedule strategy.
In the future, we will address the problems that still exist on the current DPU-based acceleration platform: (1) the existing platform only supports the acceleration of deep learning-related functions even though there are still many computing resources available (the computing resources occupied by the DPU IP are adjustable according to actual needs), which requires software-hardware re-development and related driver support for the entire platform; (2) on the basis of (1), an efficient task scheduler among the ARM, DPUs, and custom hardware accelerators should be designed to maximize system performance.

Professor with the College of Information Engineering, Xiangtan University. He has published more than ten refereed journal articles. His current research interests include the Internet of Things, compressed sensing, wireless networks, 5G, and parallel and distributed systems. He is a member of the CCF.
JIANQI LI received the Ph.D. degree in control science and engineering from Central South University, Changsha, China, in 2013. He is currently a Professor with the Department of Communication and Electrical Engineering, Hunan University of Arts and Science. His main research interests include intelligent information processing, image processing, and pattern recognition.

VOLUME 8, 2020