ANNETTE: Accurate Neural Network Execution Time Estimation with Stacked Models

With new accelerator hardware for DNNs, the computing power for AI applications has increased rapidly. However, as DNN algorithms become more complex and optimized for specific applications, latency requirements remain challenging, and it is critical to find the optimal points in the design space. To decouple the architectural search from the target hardware, we propose a time estimation framework that models the inference latency of DNNs on hardware accelerators based on mapping and layer-wise estimation models. The proposed methodology extracts a set of models from micro-kernel and multi-layer benchmarks and generates a stacked model for mapping and network execution time estimation. For evaluation, we compare the estimation accuracy and fidelity of the generated mixed models and statistical models with the roofline model and a refined roofline model. We test the mixed models on the ZCU102 SoC board with DNNDK and on the Intel Neural Compute Stick 2 (NCS2) using a set of 12 state-of-the-art neural networks. The mixed models show an average estimation error of 3.47% for the DNNDK and 7.44% for the NCS2, outperforming the statistical and analytical layer models for almost all selected networks. For a randomly selected subset of 34 networks of the NASBench dataset, the mixed model reaches a fidelity of 0.988 in Spearman's rank correlation coefficient metric. The code of ANNETTE is publicly available at https://github.com/embedded-machine-learning/annette.


Introduction
Deep Neural Networks have become key components in many AI applications, including autonomous driving Fruhwirth-Reisinger et al. [2020], medical diagnosis Rajpurkar et al. [2017], Wess et al. [2017] and machine translation Zhang and Zong [2015]. The computational intensity of some AI applications based on DNNs prevents their use on embedded system platforms, as these algorithms often have to meet latency and performance requirements to fulfill their purpose.
Attempting to close the gap between the computational intensity of DNNs and the available computing power, a wide variety of hardware accelerators for DNNs and other AI workloads have emerged in recent years. A considerable amount of research has improved the efficiency of DNNs and reduced their memory consumption by applying methods such as pruning Tung and Mori [2018], Srinivas and Babu [2015], quantization Wess et al. [2018], Shin et al. [2016], Miyashita et al. [2016], and factorization Jaderberg et al. [2014], Tai et al. [2016]. Alternatively, a network architecture that is expected to work efficiently on the target device can be designed and trained directly. Networks like MobileNet Sandler et al. [2018] and ShuffleNet Zhang et al. [2018] are specifically designed to reduce the number of Multiply-Accumulate operations (MACs), but they contain specific layer types that are not necessarily optimal for all hardware types. In addition, computational efficiency depends largely on the specific architectural parameters of each layer and the hardware platform used Yang et al. [2018a].
Finally, the mapping toolchain that optimizes the original network graph for the selected hardware platform also has to be considered, since many hardware accelerators allow specific combinations of layers to be fused to reduce inter-layer data transfer and/or to optimize data flow. Therefore, when optimizing the network architecture towards "direct metrics" such as latency or energy consumption, "indirect metrics" such as the number of Floating Point Operations (FLOPs) or the memory footprint can serve as a starting point but do not take into account the platform-specific non-linearities. As a result, networks optimized towards "direct metrics" considerably outperform "indirect" optimized architectures in terms of the selected metrics Yang et al. [2018a], Cai et al. [2019]. On the one hand, the enormous design space for neural network architectures makes it difficult to design a network that runs at high efficiency on all hardware architectures. On the other hand, not all networks work with the same efficiency on a given platform. For example, Fig. 1 shows the effective compute performance of the 12 networks used for evaluation in this paper (see Table 2) when run on a Xilinx ZCU102 MPSoC evaluation board. Furthermore, the computational roofline shows the maximum reachable effective compute performance.
We can see the high variance of the effective compute performance for a variety of different network architectures when executed on the same hardware. Due to the large differences in effective compute performance, we can conclude that it is not sufficient to divide the number of operations of a network by the peak compute performance of the target device to achieve a satisfying estimation of the network execution time. So when aiming for field deployment, it is difficult to choose a specific hardware platform before deciding on the network architecture. As a result, there have been some recent attempts to predict network latency and performance on different hardware platforms. However, most of the work targets either server Graphics Processing Units (GPUs) Qi et al. [2017], Cai et al. [2017] or embedded Central Processing Units (CPUs) Yao et al. [2018], Velasco-Montero et al. [2019], leaving out a wide range of hardware accelerators such as Field Programmable Gate Arrays (FPGAs) and hardware specifically designed for AI tasks, e.g. the Xilinx ZCU102 and the Intel NCS2. In this work, we aim to model the performance of such DNN hardware accelerators. Also, the existing work does not take into account the graph optimizations undertaken by the compiler, which reduces the accuracy of the prediction.
Therefore, we propose a framework for generating stacked mapping models and layer models to estimate the network execution time. To our knowledge, this is also the first work in which the different approaches to modeling layer execution time and mapping models are systematically investigated and evaluated on a broad range of network architectures.
This paper makes the following key contributions:
• We introduce Accurate Neural Network Execution Time Estimation (ANNETTE), a time estimation framework that predicts the execution time of Deep Neural Networks on hardware accelerators based on a stacked modeling approach of mapping models and layer-wise estimation models.
• We propose mixed models for layer execution modeling, which reduce the model complexity that the statistical models would otherwise need to also cover computational utilization inefficiencies.
• We propose a methodology to extract mapping models and layer execution models from micro-kernel and multi-layer benchmarks. Our evaluation of the generated mapping models and layer models on a set of 12 state-of-the-art models shows a mean absolute percentage error of 3.47% for the ZCU102.
• We compare mixed layer models with statistical layer models, the roofline model, and a refined roofline model in terms of accuracy and fidelity.

[...] generate a super network, in which each cell is selected from a search space. They approximate latencies by building look-up tables for the selected blocks within the design space to save time. All three works could profit from a uniform estimation framework that accurately predicts performance for multiple platforms. In NetAdapt Yang et al. [2018b], empirical measurements on a Google Pixel 1 CPU are used to construct layer-wise look-up tables, which guide the shrinking of a pre-trained MobileNetV1 until the resource constraints are met, optimizing DNNs for inference on mobile devices. FBNetV3 uses multi-use predictors to power its neural architecture search algorithm by predicting architecture statistics such as accuracy and the proxy metrics FLOPS and number of parameters.
NeuralPower Cai et al. [2017] is an attempt to estimate execution latency, power, and, as a result, overall energy consumption based on layer-wise sparse polynomial regression for GPU platforms. In terms of execution time estimation, NeuralPower achieves an average accuracy of 88.24% on the networks VGG-16, AlexNet, NIN, Overfeat, and CIFAR10-6conv. In addition to the layer-wise time estimation, the same modeling method is also applied to estimate power and finally energy consumption with even higher accuracy. FastDeepIoT Yao et al. [2018] uses execution time models based on linear model trees to predict the layer execution time on the Nexus 5 and Galaxy Nexus devices, and finally compresses VGGNet for both devices, reducing the neural network execution time by 48% to 78% and the energy consumption by 37% to 69% compared with state-of-the-art compression algorithms. In PreVIous Velasco-Montero et al. [2019], the execution time models are based on linear regression and reach about 96% average accuracy for the layer-wise estimation on the Raspberry Pi 3 and Odroid-XU4 devices. These results lead us to believe that the task of estimating layer execution times for task-optimized computing architectures is significantly more challenging than for CPUs. Therefore, we propose a methodology for generating stacked mapping and layer execution time models for hardware accelerators and systematically compare the prediction accuracy of different modeling approaches.
Other than that, MLPAT Tang and Xie [2018] and DNN-Chip Predictor Zhao et al. [2020] propose white-box approaches to estimate timing, power, and energy. MLPAT reports only 10% error when predicting the power of the TPU-v1. The performance predicted by DNN-Chip Predictor differs from measurements of FPGA/ASIC implementations by no more than 17.6% when evaluated for two DNNs on three accelerator architectures.
Annette Architecture
Fig. 2 shows an overview of the proposed framework, allowing us to generate abstraction models for the hardware platform and the mapping toolchain. The Benchmark Tool generates networks, which are then optimized by the provided mapping toolchain for the selected platform. During the benchmark phase, we execute the generated models on the target device and extract detailed layer execution times. We rely on the provided platform tools for mapping, inference, and profiling. With the collected profiling information, the Model Generator can create abstraction models of the graph optimizations and the different layer types. These abstraction models are used in our Estimation Tool to predict the performance of a network without compiling and executing the model. Furthermore, detailed insights are gained to produce efficient networks for the modeled hardware devices.

Figure 2: Overview of the Annette architecture: In the benchmark phase (1), first the platform benchmarks are performed and then the platform models are generated. In the estimation phase (2), the Estimation Tool reads a network description graph, and provides an estimated network execution time, a detailed layer-wise execution time prediction table, and a predicted execution graph.

Benchmark Tool
For platform characterization, we make use of two kinds of benchmarks: micro-kernel benchmarks and multi-layer benchmarks. We aim to characterize the computational efficiency of a hardware platform when executing only a specific layer with the micro-kernel benchmarks. On the other hand, multi-layer benchmarks give us a deeper understanding of which kind of layers are executed separately and which layers can be fused, reducing the off-chip data movement. Fig. 3 depicts the workflow of the Benchmark Tool.
We define a benchmark as one parametric network graph profiled several times on the hardware with different input resolutions or kernel sizes. The network architecture stays the same within each benchmark, while only the layer parameter settings (e.g. number of channels and kernel size) are changed according to the configuration file. The input for each benchmark is a configuration file and a graph description file. The configuration file defines the parameter settings of each measurement. The graph description file defines the architecture of the benchmarked dummy network. The Graph Generator module builds the network models based on the description and configuration information and feeds them to the hardware-specific modules. In each hardware module, the network graph is initially optimized and compiled by the platform mapping toolchain. The optimized graph is then executed on the target device in a platform-specific benchmark application. Then, a report is generated with the help of the platform profiling tool. Finally, the report is parsed into a standard format, and the Graph Matcher compares the collected layer data with the original input network.

Figure 3: The Benchmark Tool profiles the provided set of benchmark models with different configuration settings on the hardware platforms and generates layer data files.

Running each benchmark separately is a time-consuming task of about three to five days per benchmark, as each model must go through the entire compilation toolchain before the desired measurements can be made. Therefore, we have developed network models that allow us to measure several kernels within one benchmark run. In this case, the models must be constructed so that the compiler cannot fuse layers and the computational effort for an operation is not increased; otherwise, we would measure more than just the desired micro-kernel. These linear network graphs still count as micro-kernel benchmarks since we still measure the execution time of each layer individually. It must be taken into account that when using graphs with more than one layer for micro-kernel benchmarking, the maximum allowable layer size may be smaller than when measuring a single layer. The choice of configuration parameters has an additional influence on the benchmark wall time. It also influences the insights that can be gained from the collected data and the understanding of how to model the accelerator. This topic is discussed in Section 5.
We use the same mechanism for the multi-layer benchmarks, with the difference that they have more configuration parameters. The goal of these benchmarks is primarily to model the mapping toolchain and understand which optimizations the graph optimization toolchain can perform, but also to be able to benchmark multiple layers at the same time.
The Graph Generator builds the network models, which are benchmarked on the target platform, based on the graph description and the configurations table. It iterates through the configurations table, generating one network model per parameter setting. We apply micro-kernel benchmarks for 2D convolution, 2D depth-wise separable convolution, max pooling, average pooling, and fully connected layers, with values in the range from 8 to 2048 for height ($h$), width ($w$), number of input channels ($c$), number of filters ($f$), and input and output neurons, kernel sizes ($k_h$, $k_w$) of 1, 3, 5, and 7, and pooling sizes from 2 to 10, resulting in a total of about 35k measurements per layer. Figure 4 illustrates the network architectures used for the multi-layer benchmarks. All convolution layers are followed by batch normalization and ReLU layers.
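The iteration over the configurations table can be sketched as follows. This is a minimal illustration under stated assumptions, not the ANNETTE implementation: the `ConvConfig` fields, the `sweep_configs` and `conv_ops` helpers, and the short value lists are hypothetical, and square inputs and kernels are assumed to keep the sweep small.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterable, Iterator


@dataclass(frozen=True)
class ConvConfig:
    """One parameter setting of a 2D convolution micro-kernel benchmark."""
    h: int  # input height
    w: int  # input width
    c: int  # number of input channels
    f: int  # number of filters
    k: int  # square kernel size (k_h = k_w), a simplifying assumption


def sweep_configs(sizes: Iterable[int], channels: Iterable[int],
                  filters: Iterable[int], kernels: Iterable[int]) -> Iterator[ConvConfig]:
    """Yield one configuration per parameter combination, like rows of a configurations table."""
    for size, c, f, k in product(sizes, channels, filters, kernels):
        yield ConvConfig(h=size, w=size, c=c, f=f, k=k)


def conv_ops(cfg: ConvConfig) -> int:
    """Multiply-accumulate count of a stride-1 convolution that keeps the spatial size."""
    return cfg.h * cfg.w * cfg.c * cfg.f * cfg.k * cfg.k


# Hypothetical value lists; the real benchmarks cover far more settings.
configs = list(sweep_configs(sizes=[8, 64], channels=[8, 2048],
                             filters=[16], kernels=[1, 3, 5, 7]))
print(len(configs))                      # 2 * 2 * 1 * 4 settings
print(configs[0], conv_ops(configs[0]))  # first setting and its MAC count
```

Each yielded setting would correspond to one generated network model that is compiled and profiled on the target device.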
The Hardware Modules are simple scripts that automatically call the platform optimization (Graph Optimizer) and compilation toolchain to prepare the benchmark models for inference. In the case of DNNDK, the development framework for the Deep Neural Network Processing Unit (DPU) hardware module on the ZCU102 MPSoC board, optimization and compilation functionality are provided by the Deep Compression Tool (DECENT) and the Deep Neural Network Compiler (DNNC), respectively Xilinx [2020a]. In the case of the NCS2, the graph is optimized and [...]. Similarly, we rely on the provided execution and platform-specific profiler applications (Profiler App) to extract the layer execution times for the compiled networks. To reduce the influence of measurement noise, we average the results of 20 iterations. Finally, a Report Parser extracts the layer-specific information and maps it back to the original graph, comparing the executed layers with the original layers by their names. Therefore, the Profiler App must provide execution times and layer names. The execution information is stored in a standardized format so that the Graph Matcher can process the provided data in the same way for each platform. These encapsulated hardware modules make it easy to add future hardware to the benchmarking tool.
In addition to the Report Parser, the Graph Matcher extracts information about the differences between the original input graph and the final net graph executed on the target device. While the parser merely ensures that there are no changes to the original naming scheme and provides a standardized output, the Graph Matcher extracts additional information about the optimization behavior of the mapping toolchain. The Graph Matcher creates a layer result file for each executed layer and an optimization mapping file for the entire benchmark. The layer result files contain information about the layer parameters, e.g., height, width, number of input channels, and the resulting execution times.
To track the behavior of the mapping toolchain, we also store ternary variables that indicate whether successive layers have been fused with the measured layer. This merging variable can store the following states: not-fused, fused, and possibly-fused. Possibly-fused is used because, for layers with multiple inputs, it is not possible to detect whether the layer has been merged or not. Fig. 5 shows an example of how a graph could be optimized by the mapping toolchain. In this specific example, we set the fused flags for the BiasAdd and Activation operations in the Convolution layers of blocks 1 and 2 to fused. In block 2, the fused flag for the Pooling operation is set to fused as well. Here, it is important to note that the pooling layer also has a set of parameters that define its execution policy, i.e., pooling height, pooling width, pooling stride, and pooling type. We therefore add those parameters to the parameters already stored for the convolution layer. This enables the graph optimizer modeler to extract rules that define for which parameter combinations the layers can be fused. Since the element-wise addition layer may have been matched to either block 1 or block 2, the fused flag for the Add operation is set to possibly-fused in both blocks.
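The ternary fused flags of the example can be sketched as a small data structure. The `Fused` enum, the `LayerResult` record, the `record_fusion` helper, and all parameter names and values are hypothetical illustrations and do not reflect the file format used by the Graph Matcher.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class Fused(Enum):
    """Ternary state of a successive operation relative to the measured layer."""
    NOT_FUSED = "not-fused"
    FUSED = "fused"
    POSSIBLY_FUSED = "possibly-fused"


@dataclass
class LayerResult:
    """Hypothetical layer result record: parameters, time, and fused flags."""
    name: str
    params: Dict[str, int]
    time_ms: float
    fused: Dict[str, Fused] = field(default_factory=dict)


def record_fusion(layer: LayerResult, op: str, candidate_blocks: int,
                  extra_params: Optional[Dict[str, int]] = None) -> None:
    """Flag `op` as merged into `layer`.

    The flag is FUSED when only one block can have absorbed the operation and
    POSSIBLY_FUSED when several blocks are candidates (multi-input layers).
    Parameters of the absorbed operation are stored with the layer.
    """
    layer.fused[op] = Fused.FUSED if candidate_blocks == 1 else Fused.POSSIBLY_FUSED
    layer.params.update(extra_params or {})


# Placeholder parameters and times, not measured values.
block1 = LayerResult("block1/conv", {"h": 56, "w": 56, "c": 64, "f": 64}, time_ms=0.0)
block2 = LayerResult("block2/conv", {"h": 56, "w": 56, "c": 64, "f": 64}, time_ms=0.0)

for block in (block1, block2):
    record_fusion(block, "BiasAdd", candidate_blocks=1)
    record_fusion(block, "Activation", candidate_blocks=1)
    record_fusion(block, "Add", candidate_blocks=2)  # could belong to either block

record_fusion(block2, "Pooling", candidate_blocks=1,
              extra_params={"pool_h": 2, "pool_w": 2, "pool_stride": 2})

print({op: flag.value for op, flag in block2.fused.items()})
```

Operations that never appear in a layer's `fused` dictionary can be read as not-fused, which keeps the record sparse.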

The generated layer data consists of a table for each layer type that for each measurement contains the parameter settings of the layer e.g. height, width, channels, kernel size as well as the measured execution time. This data is then fed to the Model Generator to extract optimization and layer models for the final estimation step.

Model Generator
This section explains how we model the graph optimizations of the mapping toolchain and the computational efficiency of the hardware platforms to achieve better overall latency estimation accuracy. As depicted in Fig. 1, not all networks are computed with the same efficiency when compared to the number of operations in the convolution layers. There are two leading causes of the non-linear relationship between the number of operations and the execution time. First, non-convolutional layers such as element-wise addition, concatenation, activation, or pooling cannot be neglected, although they are not counted in the commonly reported number of operations. For the execution time of these layers, it is crucial whether they are executed in isolation or in connection with a convolution layer Alwani et al. [2016]. Second, the utilization of computational resources for the same layer type can depend on the parameter settings of a specific layer (e.g., height, width). This means that two compute-bound layers with the same number of operations but with differently shaped input and weight tensors are not necessarily computed with the same efficiency Yao et al. [2018]. To cover all these aspects, we propose a stacked model approach to model the overall network execution time accurately. Fig. 6 shows how the different models are fused for the generation of the platform model. Tab. 1 describes the parameters for the models extracted from the benchmarks.

Table 1 (excerpt): Parameters of the models extracted from the benchmarks.
Symbol | Description
$u_{stat_n}$ | Statistically computed efficiency in layer n
$P_{eff_n}$ | Effective performance in layer n (ops/sec)
Predictions
$\hat{T}_n$ | Estimated time of layer n (sec)

The first benchmarks performed are input parameter sweeps to determine the unrolling parameters $s$ and $\alpha$. These parameters describe the number of multiplications performed in parallel per dimension of the compute architecture and the parallelization efficiency. With the help of these two parameter vectors, we can construct a model that describes the utilization efficiency of several compute architectures (e.g. systolic arrays). Additionally, preliminary values of $P_{peak}$ and $B_{peak}$ are determined, which describe the peak performance and the peak off-chip bandwidth.
These parameters are determined automatically based on measurements or knowledge of the computing architecture.
Once determined, the parameters are fed back to the Benchmark Tool to adjust the parameter settings for the succeeding benchmarks. The rest of the micro-kernel benchmark results are used to generate the Roofline Model by deducing the final values of $P_{peak}$ and $B_{peak}$, which, together with the previously determined unrolling parameters, construct the Refined Roofline Model. We combine the Statistical Model and the Refined Roofline Model in the Mixed Model.
For the final Platform Model, we add the Mapping Model, which covers optimizations performed on the graph before the actual execution.

Layer Execution Time Models
For the construction of layer-level execution time models, we rely on the measurements performed in the benchmarks. We construct parametric analytical models for the convolution, the depth-wise separable convolution, the fully connected, and the pooling layer. As in Qi et al. [2017] and Cai et al. [2017], the selection of these layers is motivated by the fact that they are the most computationally intensive layers and, therefore, the most critical. However, we will also show that for more complex network architectures it is crucial to model further layer types as well to achieve accurate results with high fidelity. While the simple roofline model describes most layers with satisfying accuracy, we refine the roofline model for the convolution layer to increase the estimation accuracy.

Analytical Models
For the estimation framework to always work with at least the simplest model, we implement the roofline model Williams et al. [2009] for all layer types as a fallback solution. In the roofline model, the smallest achievable execution time of each layer $n$ is limited either by the peak computational performance $P_{peak}$ or by the maximal bandwidth $B_{peak}$. For layer $n$, the amount of data to be transferred $D_n$ and the number of operations $f_n$ give the estimated execution time

$$\hat{T}_{roof_n} = \max\left(\frac{f_n}{P_{peak}}, \frac{D_n}{B_{peak}}\right) \qquad (1)$$

with the effective computation performance $P_{eff}$ equal to $P_{peak}$. Keeping in mind that for fused layers the term $D_n$ has to be corrected (see Section 5.2), this formulation of the roofline model can be applied to the four named layer types and will be denoted in the experimental section as the roofline model.
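As a worked illustration of the roofline bound, the following sketch takes the larger of the compute time $f_n / P_{peak}$ and the transfer time $D_n / B_{peak}$. The function name and the platform numbers are hypothetical and only chosen to show one compute-bound and one memory-bound case.

```python
def roofline_time(ops: float, data_bytes: float, p_peak: float, b_peak: float) -> float:
    """Lower bound on the layer time: the slower of compute and off-chip transfer."""
    if p_peak <= 0 or b_peak <= 0:
        raise ValueError("peak performance and bandwidth must be positive")
    compute_time = ops / p_peak          # seconds if fully compute-bound
    transfer_time = data_bytes / b_peak  # seconds if fully memory-bound
    return max(compute_time, transfer_time)


# Hypothetical platform: 1e12 ops/s peak performance, 1e10 bytes/s peak bandwidth.
P_PEAK = 1.0e12
B_PEAK = 1.0e10

# Many operations on little data: the compute term dominates.
compute_bound = roofline_time(ops=2.0e9, data_bytes=4.0e6, p_peak=P_PEAK, b_peak=B_PEAK)
# Few operations on the same data: the transfer term dominates.
memory_bound = roofline_time(ops=1.0e6, data_bytes=4.0e6, p_peak=P_PEAK, b_peak=B_PEAK)

print(f"compute-bound layer: {compute_bound:.2e} s")
print(f"memory-bound layer:  {memory_bound:.2e} s")
```

Because the bound ignores utilization losses, it tends to be optimistic, which is why it serves only as the fallback model here.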
However, as mentioned earlier, computational efficiency also depends on how the shapes of the input, weight, and output tensors are mapped onto the computing architecture. When incorporating the reduced utilization efficiency $u_{eff}$ in equation (1), we obtain

$$\hat{T}_{ref_n} = \max\left(\frac{f_n}{P_{peak} \cdot u_{eff}}, \frac{D_n}{B_{peak}}\right) \qquad (2)$$

Next, we aim to describe the utilization efficiency of a general compute architecture with an array of Processing Elements (PEs). The number of spatial dimensions $A$ and the number of PEs along each dimension $s \in \mathbb{N}^A$ define the compute architecture. For example, an array could be described with $A = 2$ and $s = (16, 12)$, which amounts to a total of 192 PEs. When computing a layer, the operations have to be mapped onto the array either spatially or temporally.
With the parameter settings of the layer as the feature vector $x$, we can approximate the utilization efficiency with

$$u_{eff} = \prod_{i=1}^{A} \frac{x_i / s_i}{\lceil x_i / s_i \rceil} \qquad (3)$$

Here, the size of the vector $x$ does not have to match the size of the vector $s$, as the operations can also be mapped in the temporal dimension. For example, when mapping a 2D 1x1 convolution layer with a 12x6x128 input feature map and 256 output channels, the feature vector describing the layer could be any permutation of $(12, 6, 128, 256, 1, 1)$, depending on the mapping of the layer onto the array. With equation (3), for the presented example case and the input feature map height and width mapped spatially onto the 16x12 array, we would get $u_{eff} = \frac{12/16}{\lceil 12/16 \rceil} \cdot \frac{6/12}{\lceil 6/12 \rceil} = 0.75 \cdot 0.5 = 0.375$.
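The example value of 0.375 can be reproduced with a few lines of code. The sketch assumes that the utilization efficiency is the product, over the spatially mapped dimensions, of $(x_i / s_i) / \lceil x_i / s_i \rceil$, which matches the worked example in the text; the function name is hypothetical.

```python
from math import ceil, prod
from typing import Sequence


def utilization_efficiency(x: Sequence[int], s: Sequence[int]) -> float:
    """Product over the spatial dimensions of (x_i / s_i) / ceil(x_i / s_i).

    `x` holds only the layer parameters that are mapped spatially, in the same
    order as the array dimensions in `s`.
    """
    if len(x) != len(s):
        raise ValueError("x and s must describe the same spatial dimensions")
    return prod((xi / si) / ceil(xi / si) for xi, si in zip(x, s))


# Example from the text: a 12x6 feature map mapped onto a 16x12 PE array.
print(utilization_efficiency(x=(12, 6), s=(16, 12)))   # 0.75 * 0.5 = 0.375
# Shapes that are multiples of the array size use every PE in every pass.
print(utilization_efficiency(x=(32, 24), s=(16, 12)))  # 1.0
# One row more than the array height needs a second, almost empty pass.
print(utilization_efficiency(x=(17, 12), s=(16, 12)))  # 0.53125
```

The last case shows the step-wise behavior: exceeding an array dimension by a single element nearly halves the efficiency along that dimension.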
It has to be mentioned that equation (3) neglects the overhead of control units and warming up as well as possible input parameter augmentation for $x_i < s_i$. For example, since the first layer in most DNNs has three input channels ($x_i = 3$), channel augmentation can often improve performance in the first layer of the neural network. To allow for further adjustment of the model to different efficiencies for each element of $s$, we add the unrolling efficiency vector $\alpha$ to get the final utilization efficiency of the refined roofline model. The coefficients $\alpha_i$ adjust the impact of the spatial unrolling. According to the terminology used in Chen et al. [2019], $\alpha_i$ allows us to adjust the impact of spatial and temporal fragmentation on the overall utilization efficiency. So far, we have identified no other method to derive the values of $\alpha$ from the system architecture than by measurement.
This refined version of the roofline model allows us to model not only the reduced utilization efficiency of n-D convolutions due to the mapping restrictions of existing compute architectures, but also jumps in utilization efficiency caused by higher-level features such as the number of input parameters, weights, or outputs.
We apply the simple roofline model with separately measured data throughput rate and peak performance to the pooling, depth-wise separable, and fully connected layers, under the presumption that the accuracy does not have to be as high as for the convolutional layers. However, it is still important to capture the execution time of those layers as well. Furthermore, for fused layers, we define the first term of equation (2) as the sum of the execution times of the convolution layer and the following fused layer. For the second term, we adjust the amount of transferred data to the overall amount of the fused layer. For example, a convolution layer with a succeeding pooling layer with a stride greater than one has a reduced number of output parameters.
Within the modeling framework, we determine the model parameters $P_{peak}$, $B_{peak}$, $s$, and $\alpha$ for all layers automatically based on the measurements of the Benchmark Tool. At first, we perform sweep benchmarks to measure the layer execution time while sweeping each of the parameters describing the layer. For example, in one sweep for a 2D convolution layer, we measure the execution time while incrementing the number of input channels in each measurement. These sweeps are performed for each parameter at multiple points, while the other layer parameters are held constant for the entire sweep. Based on these measurements, we can extract the preliminary values of $P_{peak}$ and $B_{peak}$ by finding the maximum performance and data throughput values. Next, we determine the values of $s_i$ and $\alpha_i$ by fitting the utilization efficiency of the refined roofline model to the collected data using mean squared error minimization, with the conditions $\alpha \in \mathbb{R}^A$, $0 \le \alpha_i \le 1$, and $s \in \mathbb{N}^A$. Lastly, with the determined values of $s$ and $\alpha$, we perform the rest of the benchmarks, preferably using layer settings with $u_{eff} = 1$, to determine the final values of $P_{peak}$ and $B_{peak}$.
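The extraction of the preliminary peak values can be sketched as taking the best observed throughput over all sweep measurements. The `Measurement` record, the helper name, and the numbers below are made up for illustration; the fit of $s$ and $\alpha$ is not covered by this sketch.

```python
from typing import Iterable, NamedTuple, Tuple


class Measurement(NamedTuple):
    ops: float         # operations executed by the layer
    data_bytes: float  # off-chip data moved by the layer
    time_s: float      # measured execution time in seconds


def preliminary_peaks(measurements: Iterable[Measurement]) -> Tuple[float, float]:
    """Return (P_peak, B_peak) as the best observed ops/s and bytes/s."""
    rows = list(measurements)
    if not rows:
        raise ValueError("at least one measurement is required")
    p_peak = max(m.ops / m.time_s for m in rows)
    b_peak = max(m.data_bytes / m.time_s for m in rows)
    return p_peak, b_peak


# Made-up sweep results, for illustration only.
sweep = [
    Measurement(ops=1.0e9, data_bytes=2.0e6, time_s=2.0e-3),
    Measurement(ops=4.0e9, data_bytes=4.0e6, time_s=5.0e-3),
    Measurement(ops=1.0e7, data_bytes=8.0e6, time_s=1.0e-3),
]
p_peak, b_peak = preliminary_peaks(sweep)
print(f"P_peak ~ {p_peak:.3e} ops/s, B_peak ~ {b_peak:.3e} bytes/s")
```

Note that the two maxima usually come from different measurements: compute-heavy settings reveal the peak performance, while data-heavy settings reveal the peak bandwidth.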

Statistical Models
Apart from the analytical estimation model, we also generate statistical regression models to estimate the performance for all benchmarked layer types. In general, we found that the statistical models produce more precise results when predicting the utilization efficiency rather than the resulting execution time. We estimate the utilization efficiency $u_{stat} = f(x)$, where $u_{stat} \in \mathbb{R}$ and $0 < u_{stat} \le 1$, for each layer separately based on a feature vector $x$ describing the layer's parameter settings. Similar to Yao et al. [2018], Cai et al. [2017], we include higher-level features such as the number of input parameters and the number of operations. For example, for the 2D convolutional layer we select the feature vector $x = (h, w, c, f, k_h, k_w, \mathrm{stride}, \#ops, \#in, \#out, \#weights)$.
We applied random forest regression for the statistical models of the network layers, which worked best for the data collected in the benchmarks. Although tree-based regression methods generally do not extrapolate well, they have the useful property that the output values do not explode but remain constant when the input values are outside the training data range. In the case of the $u_{stat}$ estimate, this behavior does not degrade the quality of the estimate. For the final prediction of the layer execution time, we then apply the roofline model with the statistically computed utilization efficiency:

$$\hat{T}_{stat_n} = \max\left(\frac{f_n}{P_{peak} \cdot u_{stat_n}}, \frac{D_n}{B_{peak}}\right)$$

Due to the large number of architectural parameters for the convolution layer, we have to carefully select the configuration parameter settings for which to perform the measurements. This is important since the points of measurement influence the quality of the resulting statistical models. To find the best points of measurement for our statistical model, we generate three datasets. For the first dataset, we aim to model the surface of points with the best utilization efficiency. Therefore, we reduce the space of measurements to points with utilization efficiency equal to 1. For the generation of the second dataset, we add Gaussian noise to the parameters with $s_i > 1$ to also cover cases with utilization efficiency < 1. The third dataset is the union of datasets 1 and 2.
The experimental results show that, depending on the selected statistical model, too many measurement points would be required to model the entire surface of dataset 3 correctly. Therefore, we use dataset 1 for the generation of the statistical models and follow a third approach: we combine the generated statistical models with the refined roofline model from Section 5.1.1 to achieve higher accuracy for the points with utilization efficiency < 1.

Mixed Models
To combine the advantages of the statistical and analytical models, we also implement a mixed modeling approach by stacking the statistical model and the refined roofline model. The execution time of the mixed model $\hat{T}_{mix_n}$ for the layer $n$ can be expressed as

$$\hat{T}_{mix_n} = \max\left(\frac{f_n}{P_{peak} \cdot u_{eff_n} \cdot u_{stat_n}}, \frac{D_n}{B_{peak}}\right)$$

Decoupling the modeling of $u_{eff}$ and $u_{stat}$ has the advantage that the necessary model complexity for the estimation of $u_{stat}$ is reduced, as the model only needs to correctly estimate the points with $u_{eff_n} = 1$. Fig. 7 shows how combining the statistical model and the refined roofline model results in the mixed model.
The analytical part of the model, namely the refined roofline model, covers the step-wise linear shape of the target surface. Based on the refined roofline model, we can determine at which points we want to perform measurements for the statistical model. As mentioned in Section 5.1.2, we only select points with $u_{eff} = 1$ for computing the regression model for $u_{stat}$. Therefore, the refined roofline model improves the statistical model twofold: by refining the area with $u_{eff} < 1$ and by guiding the selection of points for the measurements.
Due to the better choice of data points, the statistical model will produce a better result with a lower risk of overfitting. This also explains why the regression model based on dataset 1 is outperforming the models with additional data points. However, thanks to the analytical part of the mixed model, we can still model the local shape of the surface. We can say that while the analytical part is responsible for modeling inefficiencies of the computational architecture, the statistical model covers the memory architecture.
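The stacking of the two efficiencies can be sketched in a few lines. The sketch assumes that the mixed model scales the peak performance in the compute term of the roofline bound by the product of the analytical efficiency and the statistical efficiency; the function name and all numbers are hypothetical.

```python
def mixed_time(ops: float, data_bytes: float, p_peak: float, b_peak: float,
               u_eff: float, u_stat: float) -> float:
    """Roofline bound with the compute term scaled by both efficiencies."""
    for name, u in (("u_eff", u_eff), ("u_stat", u_stat)):
        if not 0.0 < u <= 1.0:
            raise ValueError(f"{name} must be in (0, 1], got {u}")
    compute_time = ops / (p_peak * u_eff * u_stat)
    transfer_time = data_bytes / b_peak
    return max(compute_time, transfer_time)


# Hypothetical layer and platform values.
layer = dict(ops=2.0e9, data_bytes=4.0e6, p_peak=1.0e12, b_peak=1.0e10)

t_ideal = mixed_time(**layer, u_eff=1.0, u_stat=1.0)    # plain roofline bound
t_mixed = mixed_time(**layer, u_eff=0.375, u_stat=0.8)  # with both inefficiencies

print(f"roofline: {t_ideal:.3e} s, mixed: {t_mixed:.3e} s")
```

In this arrangement the statistical part only has to be accurate where `u_eff` equals 1, since the analytical factor accounts for the mapping losses elsewhere.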

Mapping Models
The last estimation module we present is the mapping model. Its main objective is to predict whether two successive layers have been fused or not. This is important for cases where $T_{total} \neq T_1 + T_2$, where $T_{total}$ is the total execution time of layers 1 and 2, and $T_1$ and $T_2$ are the execution times of the two layers when executed separately. As mentioned above, this difference is mainly due to reduced off-chip data transfer and pipelining effects. For the generation of the mapping models, we use the input feature vectors $x$ previously defined for the statistical model and aim to predict the values of the fused flags extracted by the Graph Matcher (Section 4). We rely on Decision Tree Classifiers to determine the rules for the mapping prediction. For example, Fig. 8 shows a simplified version of the decision tree for the fusion of a convolution layer followed by a max-pooling layer. In this example, the decision whether the two layers are merged depends mainly on whether a certain number of channels and filters in the convolution layer is exceeded. We apply the same concept to all fused layer combinations we were able to find in our evaluation networks in Tab. 2.
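The kind of rule such a decision tree encodes can be sketched in a few lines of Python. The thresholds, the direction of the comparison (fusion only when neither limit is exceeded), and the names `ConvPoolPair` and `predict_fused` are hypothetical; a trained tree would learn the actual splits from the fused flags in the benchmark data.

```python
from dataclasses import dataclass


@dataclass
class ConvPoolPair:
    channels: int  # input channels of the convolution layer
    filters: int   # output filters of the convolution layer


# Hypothetical split values; a trained decision tree would learn these.
MAX_CHANNELS = 512
MAX_FILTERS = 512


def predict_fused(pair: ConvPoolPair) -> bool:
    """Walk a two-level tree: predict fusion only if neither limit is exceeded."""
    if pair.channels > MAX_CHANNELS:
        return False
    if pair.filters > MAX_FILTERS:
        return False
    return True


small = predict_fused(ConvPoolPair(channels=64, filters=128))
large = predict_fused(ConvPoolPair(channels=1024, filters=128))
```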

Estimation
For the network-level estimation, we apply the stacked model presented in Section 3 to a network description graph. First, we apply the mapping models to reconstruct the mapping performed by the platform's mapping toolchain. For this, we iterate through all directly connected layers and check whether they should be fused or not. Afterwards, we apply the layer-level models to each remaining layer of the optimized graph. The network execution time estimate $\hat{T}_{total}$ is the sum of all estimated layer times $\hat{T}_n$.
Because different models are available for each layer, we implement the estimation framework such that the preferred model type can be selected, while the roofline model always serves as a fallback so that execution times are estimated for as many layers as possible.
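The two-stage procedure described above (fusion pass, then layer-wise summation with a roofline fallback) can be sketched as follows. This is a minimal illustration that assumes a linear chain of layers represented as dictionaries; the function names, the dictionary layout, and the toy models are hypothetical and do not reflect the actual ANNETTE API.

```python
def fuse_pass(layers, should_fuse):
    """Merge a layer into its predecessor when the mapping model predicts fusion.

    Assumes a linear chain of layers; a real graph needs a traversal of all edges.
    """
    fused = []
    for layer in layers:
        if fused and should_fuse(fused[-1], layer):
            # Record the absorbed layer; the merged node keeps the first layer's features.
            fused[-1]["absorbed"].append(layer["type"])
        else:
            fused.append({**layer, "absorbed": []})
    return fused


def estimate_network(layers, should_fuse, preferred_model, roofline_model):
    """Sum per-layer estimates, falling back to the roofline model when needed."""
    total = 0.0
    for layer in fuse_pass(layers, should_fuse):
        t = preferred_model(layer)      # returns None if no model covers this layer
        if t is None:
            t = roofline_model(layer)   # fallback keeps every layer estimated
        total += t
    return total


# Hypothetical three-layer network and toy models (all numbers are made up).
layers = [
    {"type": "conv", "ops": 2e6},
    {"type": "pool", "ops": 1e5},
    {"type": "fc", "ops": 4e6},
]
should_fuse = lambda a, b: a["type"] == "conv" and b["type"] == "pool"
preferred = lambda l: l["ops"] / 1e9 if l["type"] == "conv" else None
roofline = lambda l: l["ops"] / 5e8

total_time = estimate_network(layers, should_fuse, preferred, roofline)
```

In this toy setup the pooling layer is absorbed into the convolution, so only two layers are costed: the convolution by the preferred model and the fully connected layer by the fallback.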

Results and Performance Analysis
To quantify the accuracy of the latency estimation methods presented in Section 3, we compare the estimated results to measured times for 12 state-of-the-art DNNs listed in Tab. 2 from Xilinx Model Zoo Xilinx [2020b] and a randomly selected subset of 34 networks from the models generated in NASBench Ying et al. [2019a] on target devices.

Experimental Setup
All experiments were performed with batch size 1 to achieve the lowest possible latency. The method could also be extended to larger batch sizes by adding the batch size as an additional input parameter of the benchmark dataset and to the input feature vector of the estimation models. For the Xilinx DPU, we used a ZCU102 evaluation board with a DPU configuration of 4096 MAC units. Measurements on the NCS2 were performed in synchronous mode with an Intel i5-4590 3.3 GHz host processor equipped with 16 GB of RAM. For both platforms, we used the provided tools for mapping and compilation. To assess the estimator performance, we use two test sets.
Test Set 1 contains the 12 DNNs listed in Table 2, and we use it to evaluate in detail the performance for commonly used networks. With Test Set 2, we aim to understand whether ANNETTE could be used for a hardware-oriented neural architecture search. Therefore, we randomly select 34 models of the NASBench Ying et al. [2019b] neural architecture search dataset, which contains a large variety of different architectures with similar sizes, and evaluate the accuracy and fidelity of our estimator.

Figure 9: The experimental setup for prediction accuracy evaluation

Figure 9 shows the experimental setup. In the first phase, the benchmarks from Section 4 are executed on the target platforms. The execution times of the layers are extracted using the provided profiling tools and stored together with the configuration files of the benchmarks. With the Model Generator (Section 5), the mapping and hardware abstraction models are derived and made available to the Estimation Tool (Section 6). In the second phase, the network graphs are fed into the estimator. For evaluation, the resulting estimated times are compared with the execution times measured on the target device. The detailed information provided by the profiling tools allows us to compare not only the total execution times of the networks but also the execution time of each layer.

Layer Execution Time Models
First, we evaluate the accuracy of the previously presented layer execution time models. Tab. 3 reports the Mean-Absolute-Error (MAE), the Mean-Absolute-Percentage-Error (MAPE) and the Root-Mean-Square-Percentage-Error (RMSPE) of the different layer models for all convolution layers of the networks in Table 2. The results were estimated and measured for both the NCS2 and the ZCU102 SoC-board. Additionally, we also report the accuracy of other state-of-the-art execution time prediction methods Qi et al. [2017], Cai et al. [2017]. The mixed model outperforms the other model types for both platforms in terms of MAE, MAPE and RMSPE. It is noticeable that for the ZCU102, the refined roofline model has a lower MAE than the statistical model. Since the MAE is a non-weighted error metric, we conclude that for the ZCU102, the refined roofline model predicts larger layers more accurately than the statistical model.
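For reference, the three error metrics can be computed as in the following Python sketch. It uses the common textbook definitions, which are assumed here to match those behind Tab. 3; the sample values are made up.

```python
import math


def mae(measured, estimated):
    """Mean absolute error, in the unit of the inputs."""
    return sum(abs(e - m) for m, e in zip(measured, estimated)) / len(measured)


def mape(measured, estimated):
    """Mean absolute percentage error, relative to the measured values."""
    return 100.0 * sum(abs(e - m) / m for m, e in zip(measured, estimated)) / len(measured)


def rmspe(measured, estimated):
    """Root mean square percentage error, relative to the measured values."""
    mean_sq = sum(((e - m) / m) ** 2 for m, e in zip(measured, estimated)) / len(measured)
    return 100.0 * math.sqrt(mean_sq)


# Hypothetical layer times in ms: one small and one large layer, same absolute error.
measured = [1.0, 10.0]
estimated = [1.5, 10.5]

err_mae = mae(measured, estimated)
err_mape = mape(measured, estimated)
err_rmspe = rmspe(measured, estimated)
```

Both layers contribute the same 0.5 ms to the MAE, while the small layer dominates the MAPE and RMSPE. This is why a model with a lower MAE but higher MAPE tends to be the one that predicts large layers more accurately.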
For a fair comparison to other state-of-the-art works, it has to be mentioned that the reported numbers were measured on a different set of networks and for a different set of target devices. While Paleo Qi et al. [2017] and NeuralPower Cai et al. [2017] target server GPUs (Titan X), our work targets prediction for specific accelerators for neural networks. However, even in this case, the statistical prediction method outperforms the analytical model. Nevertheless, analytical models are easier to understand and can be easily adapted to similar architectures, whereas a statistical model can only be based on measurements. Additionally, we applied the NeuralPower estimation method to our collected data for the NCS2 and ZCU102, but we were not able to produce any useful results with a MAPE lower than 1000%, so we do not list the results of this approach in Tab. 3. In our view, these results are a consequence of the poor extrapolation behavior of polynomial functions, which are used for estimation in NeuralPower.

Mapping Models
We evaluate the performance of the mapping models on the dataset consisting of the layers from the example networks generated by the Benchmark Tool. For the training data set, we consider only the layer pairs that contain the target layer. For example, for training the decision tree that predicts whether a pooling layer is fused or not, we include only layer pairs in which at least one layer is a pooling layer. Then we select 80% of the samples for training and 20% for validation. Tab. 4 shows the F1 score and the Matthews Correlation Coefficient (MCC) for the fusing of element-wise addition and pooling layers.
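Both scores can be computed directly from the confusion-matrix counts of the fused/not-fused classification. The sketch below uses the standard definitions; the counts are hypothetical and chosen to show a classifier that almost always predicts "fused" on a data set with very few negatives.

```python
import math


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; true negatives do not appear."""
    return 2 * tp / (2 * tp + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient; uses all four confusion-matrix entries."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: return 0 when a row or column of the matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts: 90 true positives, 9 false positives, 1 false negative,
# and no correctly identified negatives.
f1 = f1_score(tp=90, fp=9, fn=1)
corr = mcc(tp=90, tn=0, fp=9, fn=1)
```

With these counts, the F1 score is high although the classifier has learned nothing about the negative class, which the MCC close to zero reveals.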
Since the F1 score ignores true negatives, the MCC, which depends on all four confusion matrix categories, should be preferred for the evaluation of binary classification Chicco and Jurman [2020]. The mapping prediction works well for both platforms. However, the prediction for the DNNDK (ZCU102) achieves a higher F1 score and MCC than the prediction for the NCS2 for both layer types. We assume that the reasons for this are that the DNNDK is generally more capable of merging several layers and that the optimization behavior of the OpenVINO toolkit depends more on the architecture of the whole network than only on the parameter settings of the individual layers.

To evaluate the generated platform models of the NCS2 and DNNDK, we perform the mapping and layer-wise estimation for the models listed in Table 2. Then, we compare the predicted network execution time with the measured time. Table 5 shows the MAE and MAPE of all presented models for the executed networks for the ZCU102 and NCS2. Fig. 10 and Fig. 11 show the estimation accuracy of the platform models. Due to moderate parallelization effects on the NCS2, the roofline model and the refined roofline model have similar performance. However, in some cases, the refined roofline model provides slightly better predictions. Also for the NCS2, the statistical and the mixed model achieve almost the same performance, with a MAPE of 7.92% and 7.44%, respectively. Overall, the mixed model consistently performs best for the NCS2. Similarly, for the ZCU102, the mixed model provides the most accurate predictions with a MAPE of only 3.47%. Interestingly, in the case of the ZCU102, for some of the networks, the refined roofline model estimates the network execution time more accurately than the statistical model.
Since the refined roofline model mainly covers reduced utilization efficiency due to the computational architecture, we can conclude that for those cases, the main inefficiency lies in the low utilization efficiency of computational resources due to a parameter not aligning with the number of available multiplier resources (see Section 5.1.1). The comparison to other state-of-the-art execution time estimators, which are also listed in Tab. 2, is difficult since the necessary complexity of the model and the resulting accuracy highly depend on the target device. In addition, the evaluation performed in this work includes larger and more complex networks with more different layer types than in other works.

To evaluate the accuracy of the estimations for design space exploration, we perform the estimation for a randomly selected subset of 34 network architectures generated for the NASBench dataset. We select this dataset since it contains several networks with similar sizes that were constructed for the same task. Therefore, it is more appropriate to evaluate the fidelity of the estimation tool on Test Set 2. We assess the performance on Test Set 2 for the NCS2, which performed worse on Test Set 1. Table 6 provides the MAE, MAPE and Spearman's rank correlation coefficient ρ as fidelity metric. A perfect Spearman correlation of +1 occurs when the variables are a perfect monotonically increasing function of each other. This property makes ρ a valid measure for fidelity Javaid et al. [2010].

On Test Set 2, there is no difference between the results of the roofline and the refined roofline model. Hence, the statistical and mixed models also achieve the same results. The mixed/statistical modeling approach reaches a Spearman's rank correlation coefficient of almost +1 and outperforms the analytic models by more than 20 percentage points in MAPE.
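To make the fidelity metric concrete, the following Python sketch computes Spearman's ρ as the Pearson correlation of the ranks, with tied values sharing their average rank. The helper names and the sample latencies are hypothetical.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical latencies: estimates are far off in absolute terms
# but preserve the ordering of the measured values.
measured = [1.0, 2.0, 3.0, 4.0]
same_order = [10.0, 25.0, 90.0, 400.0]
one_swap = [10.0, 90.0, 25.0, 400.0]

rho_same = spearman(measured, same_order)
rho_swap = spearman(measured, one_swap)
```

The first estimate set has a large percentage error yet perfect fidelity, since only the ranking matters for ρ. This is the property that makes the metric useful when an estimator is used to order candidate architectures during a search.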

Conclusion
We propose a framework for execution time estimation for neural network hardware accelerators. It is based on stacked models, consisting of mapping models and mixed layer models. We generate the models based on micro-kernel and multi-layer benchmark results and evaluate the performance on two sets of networks for two selected hardware accelerators. Overall, the mixed models perform best. For a set of 12 state-of-the-art DNNs, the estimation with mapping models and mixed models reaches a MAPE of only 3.47% on the Xilinx ZCU102 SoC and 7.44% on the Intel NCS2 when estimating total network execution times. For the use case of design space exploration, we evaluate the fidelity of the generated models by applying the estimation method to a randomly selected subset of 34 models of the NASBench dataset. The estimation with mapping models and mixed layer models reaches a fidelity of 0.988 in terms of Spearman's rank correlation coefficient ρ. The evaluation demonstrates the advantages of applying mixed models for the selected hardware platforms. In the future, we aim to extend the evaluation to additional embedded hardware, such as the Nvidia Jetson platform, to gain additional insights for a different class of accelerators.
Due to the large parameter space of DNNs, one crucial point for the development of the estimation framework is to make assumptions about the computing architecture to exclude as many non-meaningful measurement points as possible. An essential clue is the step-wise linear nature of architecture resources, such as an array of multipliers or caches. They follow a linear performance trend until the cache or the multiplier array is fully allocated. Besides, for a precise estimation, it is important to consider not only the individual layers in isolation but also how they are executed in the overall context.
We are confident that accurate estimation methods can significantly facilitate informed decision-making. In particular, it is in the area of neural architecture search where estimation can make a critical contribution, either to a hardware-specific search or to the right choice of networks and hardware in advance of application development.