I. Introduction
Advances in multi-modal and multi-task learning have brought a variety of heterogeneous DNN model architectures composed of multiple types of computation (e.g., convolution layers, fully connected layers, and transformer layers), as shown in Figure 1(a). These heterogeneous layers favor different hardware accelerator designs for high performance [1]-[3]. To provide low latency and high energy efficiency, various heterogeneous SoC designs [1], [4] have been proposed recently. A heterogeneous SoC is typically composed of multiple types of accelerators connected spatially through a network-on-chip (NoC), as shown in Figure 1(b).

The major features of such a heterogeneous SoC can be summarized in two aspects. For computation, the accelerators employ different dataflows [2] (e.g., weight-stationary, output-stationary) and different hardware resources to accelerate different workloads. For memory, each accelerator has a local scratchpad memory that stores a tile of input/output data for computation. Since scratchpad capacity is limited, the accelerators must communicate with each other over the NoC to fetch the required data. Accelerators that are placed close to each other incur lower data-transmission delay (fewer NoC hops) than those that are farther apart, as the sketches below illustrate.
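To make the dataflow distinction concrete, the loop nests below contrast a weight-stationary and an output-stationary schedule on a plain matrix multiply. This is a minimal illustrative sketch: the matrix-multiply workload, the names A, B, C, M, N, K, and the loop structure are our assumptions, not details taken from the referenced accelerator designs.

#include <stddef.h>

enum { M = 4, N = 4, K = 4 };  /* illustrative problem sizes */

/* Weight-stationary: each weight B[k][n] is held in a register and
 * reused across all M rows of the output before the next weight is
 * fetched (assumes C is zero-initialized, since it accumulates). */
void matmul_weight_stationary(const float A[M][K],
                              const float B[K][N],
                              float C[M][N]) {
    for (int k = 0; k < K; k++)
        for (int n = 0; n < N; n++) {
            float w = B[k][n];           /* weight stays resident */
            for (int m = 0; m < M; m++)
                C[m][n] += A[m][k] * w;
        }
}

/* Output-stationary: each partial sum stays in a register until it is
 * fully accumulated, so every output is written back only once. */
void matmul_output_stationary(const float A[M][K],
                              const float B[K][N],
                              float C[M][N]) {
    for (int m = 0; m < M; m++)
        for (int n = 0; n < N; n++) {
            float acc = 0.0f;            /* output stays resident */
            for (int k = 0; k < K; k++)
                acc += A[m][k] * B[k][n];
            C[m][n] = acc;
        }
}

The contrast highlights what each dataflow keeps stationary in the innermost loop, which is exactly the reuse pattern an accelerator's PE array and register files are specialized to exploit.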
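The hop-based communication cost can likewise be expressed as a first-order latency model. The sketch below assumes a 2D-mesh NoC with dimension-ordered routing; the coordinate struct and the parameters cycles_per_hop and bytes_per_flit are illustrative assumptions, not values from the SoC designs cited above.

#include <stdlib.h>

/* Illustrative tile coordinate on a 2D-mesh NoC (assumed topology). */
typedef struct { int x, y; } Tile;

/* Manhattan distance gives the hop count between two accelerators
 * under dimension-ordered (X-Y) routing. */
int noc_hops(Tile src, Tile dst) {
    return abs(src.x - dst.x) + abs(src.y - dst.y);
}

/* First-order transfer latency: head latency grows with hop count,
 * serialization latency grows with message size in flits. */
int transfer_cycles(Tile src, Tile dst, int bytes,
                    int cycles_per_hop, int bytes_per_flit) {
    int flits = (bytes + bytes_per_flit - 1) / bytes_per_flit;
    return noc_hops(src, dst) * cycles_per_hop + flits;
}

Under this model, a transfer between adjacent tiles costs one hop plus serialization, while a transfer across the mesh scales linearly with Manhattan distance, matching the observation that nearby accelerators communicate with lower delay than distant ones.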