Heterogeneous Systolic Array Architecture for Compact CNN Hardware Accelerators

Compact convolutional neural networks have become a hot research topic. However, we find that systolic array accelerators are extremely inefficient when dealing with compact models, especially when processing the depthwise convolutional layers in these networks. To make systolic arrays more efficient for compact convolutional neural networks, we propose the heterogeneous systolic array (HeSA) architecture. It introduces heterogeneous processing elements that support multiple dataflow modes, which further exploit the data reuse opportunities of depthwise convolutional layers without changing the scale or structure of the naïve systolic array. By increasing the utilization rate of the processing elements in the array, the HeSA improves performance, throughput, and energy efficiency compared to the standard baseline. In addition, we design a flexible buffer structure for the communication between the computing array and the external buffer. Through flexible routing, the HeSA can achieve a large-scale array design while maintaining a high processing-element utilization rate and low communication costs. Based on our evaluation with typical workloads, the HeSA improves the utilization rate of the computing resources in depthwise convolutional layers by 4.5×–11.2× and achieves a 1.6×–3.1× total performance speedup compared to the standard systolic array architecture. In the large-scale array design, the HeSA reduces data traffic by 40% while maintaining the same performance as the scaling-out method. By improving the on-chip data reuse opportunities and reducing data traffic, the HeSA saves over 20% in energy consumption. Meanwhile, the area of the HeSA is essentially unchanged compared to the baseline due to its simple design.


INTRODUCTION
The compact model is one of the popular convolutional neural network (CNN) compression approaches [1], which aims to reduce the cost of memory and computation in CNNs. Because convolution is the most expensive part of CNNs, compact models (compact CNNs) replace the standard convolution (SConv) with other simplified network components. A popular choice is the depthwise convolution (DWConv) [1], [2], [3].
Depthwise convolution applies a single filter to each input channel, which means one spatial convolution is performed independently over each channel of an input feature map (ifmap) [1]. The fundamental hypothesis behind DWConv is that spatial and channel correlations can be sufficiently decoupled and realized separately. Compared to SConv, DWConv saves a large number of parameters and multiply-accumulate operations (MACs) while maintaining high accuracy. Due to these advantages, compact CNNs using DWConv have become a hot research topic, and a series of compact CNN models have been widely used [4], [5], [6], [7], [8], [9].
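As a rough illustration of those savings, the following sketch (our own back-of-envelope model; all names are illustrative) compares the parameter and MAC counts of an SConv layer with a depthwise-separable replacement (DWConv plus a 1×1 pointwise convolution):

```python
# Hypothetical sketch: parameter/MAC savings of depthwise-separable
# convolution vs. standard convolution (stride 1, R x R square ofmaps).

def sconv_cost(C_in, C_out, K, R):
    """Standard convolution: every filter spans all input channels."""
    params = C_out * C_in * K * K
    macs = params * R * R          # one MAC per weight per output pixel
    return params, macs

def dwconv_pwconv_cost(C_in, C_out, K, R):
    """Depthwise (one K x K filter per channel) + 1x1 pointwise conv."""
    dw_params = C_in * K * K
    pw_params = C_in * C_out       # 1x1 kernels
    params = dw_params + pw_params
    macs = params * R * R
    return params, macs

sp, sm = sconv_cost(C_in=128, C_out=128, K=3, R=56)
dp, dm = dwconv_pwconv_cost(C_in=128, C_out=128, K=3, R=56)
print(f"param ratio: {dp / sp:.3f}, MAC ratio: {dm / sm:.3f}")
```

For a 3×3 layer with 128 input and output channels, the separable form needs roughly an order of magnitude fewer parameters and MACs, which matches the motivation above.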
DWConv should bring fewer operations and lower latency on mobile/edge hardware accelerators. However, we find that the popular systolic array (SA) design [10], [11], [12], [13] is very inefficient when computing compact CNNs. Fig. 1 shows this surprising result. We count the number of floating-point operations (FLOPs) in three state-of-the-art compact CNNs and record their latency breakdown on a 16×16 SA. We find that the FLOPs of DWConv account for only about 10% of the total, but cause over 60% of the latency. This result stems from the architecture and dataflow of the SA. The SA architecture is composed of a two-dimensional grid of simple, homogeneous processing elements (PEs). Through the dataflow that maps data to the array, the SA can process and accelerate general matrix multiplications (GEMMs) and accumulations, which account for more than 95% of the operations in the SConv layers of traditional CNNs [14]. Usually, the scale of the GEMMs in SConv layers is large and a multiple of the array size. Tiling and processing these GEMMs (grey boxes) can fully utilize the PEs (red boxes) in the SA [15], as shown in Fig. 2a. However, the GEMM shrinks to a matrix-vector multiplication (MV) in the DWConv layers. Thus, the tiles of the MV leave many PEs idle (white boxes), which significantly decreases the performance, as shown in Fig. 2b. Moreover, when the SA processes DWConv layers, the larger the SA, the lower the PE utilization rate. Fig. 2c shows a typical situation: if the array is scaled up further, performance is greatly affected. The performance of the SA is not only affected by the MV but also limited by the size of the model, which results in extremely low utilization. Therefore, the scalability of the SA also affects the final performance of the hardware accelerator.
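The idle-PE effect can be illustrated with a back-of-envelope utilization model (our simplification, ignoring pipeline fill and drain cycles):

```python
# Rough sketch of PE utilization when an M x N output tile is mapped
# onto an S x S output-stationary array (idealized model, ours).

def tile_utilization(M, N, S):
    return (min(M, S) * min(N, S)) / (S * S)

# SConv: large GEMM outputs, tiles fill the array.
print(tile_utilization(256, 256, 16))  # -> 1.0
# DWConv per channel reduces to a matrix-vector product (M = 1):
print(tile_utilization(1, 256, 16))    # -> 0.0625 (one active row)
# The larger the array, the worse it gets:
print(tile_utilization(1, 256, 32))    # -> 0.03125
```

This toy model reproduces both observations in the text: MV tiles leave most PEs idle, and scaling the array up only lowers the utilization further.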
To solve the inefficiency of computing DWConv, the dataflow of the SA must be changed [16]. However, the architecture of the SA is very simple: it supports only one fixed dataflow. Some related works [16], [17] try to change the mapping method and modify the architecture, but they all introduce too much hardware overhead. These changes wipe out the advantages of the SA design and are not suitable for deployment on resource-limited mobile/edge platforms.
To overcome these challenges, we propose the heterogeneous SA (HeSA), which supports multiple dataflows. Different from previous solutions [16], [17], the HeSA maintains the original structure of the SA and only changes the PEs in the array. When processing SConv, the HeSA still behaves as a traditional SA, maintaining high on-chip reuse opportunities and low resource consumption. When dealing with DWConv, the HeSA switches the dataflow and data mapping of the SA by changing the data paths in the heterogeneous PEs. The HeSA increases the performance of DWConv by 4.5×–11.2× without adding additional data paths or increasing external/internal bandwidth. It accelerates compact CNNs by 1.6×–3.1× and improves energy efficiency by 1.1× compared to a naïve SA design.
At the same time, we delve into the scalability of the SA. A large-scale PE array often has a very low PE utilization rate when processing compact CNNs. To address this challenge, we use a flexible buffer structure for the communication between the PE array and external storage. Through flexible routing connections, our design not only guarantees a high PE utilization rate but also reduces data traffic. Compared with the traditional scaling-up solution, the performance of the array is improved by nearly 2×. Compared with the radical scaling-out method, data traffic is reduced by 40%. Part of this research was presented at DATE 2021 [18].
The main contributions of the paper are summarized as follows:
• We propose a new dataflow for the SA, which processes DWConv more efficiently. It increases the data reuse opportunities of DWConv in the SA design and can map more data to the PE array.
• We design the heterogeneous systolic array architecture to support multiple dataflows. Flexible switching of dataflows makes better use of the data reuse opportunities in CNNs. In this way, accelerators can better process and accelerate compact CNNs.
• We design the flexible buffer structure, which is responsible for the communication between the array and the external buffer. It ensures that when the array size is increased, the performance improves accordingly and the data traffic is reduced as much as possible.
• Our experimental results demonstrate the validity of the HeSA design. We show the advantages of our design compared to the traditional SA in terms of the PE utilization rate, scalability, performance, area, energy consumption, etc.

BACKGROUND

Compact CNNs
CNNs use convolutional layers to extract different features of the input data [19], [20]. The biggest difference between compact CNNs and traditional CNNs is the implementation of the convolutional layers.
Standard Convolution (SConv) is the standard convolution algorithm. There are multiple channels and filters (kernels or weights) to extract different local features of the ifmaps. The operation of SConv can be described as a 6-nested loop (regardless of batch size) [21], as seen in Algorithm 1, where m is the m-th channel of the output feature maps (ofmaps; some papers [22], [23] also call m the number of filters), c is the c-th channel of the ifmaps, r and r' index the height and width of the ofmaps (height equals width), and k and k' index the height and width of the kernels.

Algorithm 1 Standard Convolution (SConv)
1: for m = 0 to M do
2:   for c = 0 to C do
3:     for r = 0 to R do
4:       for r' = 0 to R do
5:         for k = 0 to K do
6:           for k' = 0 to K do
7:             O[m, r, r'] += W[m, c, k, k'] * I[c, r + k, r' + k']

To facilitate computing and deployment, SConv is converted into a GEMM through the im2col algorithm. The dimensions of the GEMM are large due to the deep and wide structure of traditional CNNs.
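The im2col conversion can be checked with a minimal runnable sketch (ours, assuming stride 1 and no padding; function names are illustrative):

```python
import numpy as np

def sconv(I, W):
    """Direct 6-nested-loop SConv as in Algorithm 1."""
    C, H, _ = I.shape          # C channels of square H x H ifmaps
    M, _, K, _ = W.shape       # M filters, each C x K x K
    R = H - K + 1
    O = np.zeros((M, R, R))
    for m in range(M):
        for c in range(C):
            for r1 in range(R):
                for r2 in range(R):
                    for k1 in range(K):
                        for k2 in range(K):
                            O[m, r1, r2] += W[m, c, k1, k2] * I[c, r1 + k1, r2 + k2]
    return O

def sconv_im2col(I, W):
    """Same convolution expressed as one GEMM: (M x CK^2)(CK^2 x R^2)."""
    C, H, _ = I.shape
    M, _, K, _ = W.shape
    R = H - K + 1
    # Each column of the patch matrix is one flattened C x K x K receptive field.
    cols = np.stack([I[:, r1:r1 + K, r2:r2 + K].ravel()
                     for r1 in range(R) for r2 in range(R)], axis=1)
    return (W.reshape(M, -1) @ cols).reshape(M, R, R)

rng = np.random.default_rng(0)
I = rng.standard_normal((3, 6, 6))
W = rng.standard_normal((4, 3, 3, 3))
assert np.allclose(sconv(I, W), sconv_im2col(I, W))
```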
Depthwise Convolution (DWConv) replaces SConv in compact CNNs. Unlike SConv, where all channels of a kernel are convolved with all ifmaps to produce one ofmap, DWConv convolves one filter channel with only one ifmap to produce one ofmap [1]. This means the reduction over input channels disappears, as seen in Algorithm 2.

Algorithm 2 Depthwise Convolution (DWConv)
1: for c = 0 to C do
2:   for r = 0 to R do
3:     for r' = 0 to R do
4:       for k = 0 to K do
5:         for k' = 0 to K do
6:           O[c, r, r'] += W[c, k, k'] * I[c, r + k, r' + k']

Furthermore, composed of pointwise convolution (PWConv, a small-scale SConv) and DWConv, compact CNNs can maintain high accuracy that even surpasses the accuracy of dense CNNs [5]; examples include MobileNetV3 [24], MixNet [4], and EfficientNet [5].
Fig. 4 shows the architecture of the SA. It comprises several PEs (usually MAC units), and each PE stores the incoming data in an internal register and then forwards the same data to the neighboring PE [14]. This store-and-forward behavior results in significant savings in external storage and read bandwidth, and can very effectively exploit the reuse opportunities provided by GEMM [15]. Note that this data movement and reuse is achieved without the help of control or router units. These advantages make the SA a popular choice for accelerator design [15].
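The depthwise loop nest (Algorithm 2) can be sketched in the same style (ours, stride 1 and no padding). The check at the end shows why DWConv starves a GEMM engine: per channel, the im2col form is a (1 × K²) × (K² × R²) product, i.e. a matrix-vector multiplication rather than a large GEMM:

```python
import numpy as np

def dwconv(I, W):
    """Depthwise convolution: one K x K filter per channel."""
    C, H, _ = I.shape
    _, K, _ = W.shape
    R = H - K + 1
    O = np.zeros((C, R, R))
    for c in range(C):
        for r1 in range(R):
            for r2 in range(R):
                for k1 in range(K):
                    for k2 in range(K):
                        O[c, r1, r2] += W[c, k1, k2] * I[c, r1 + k1, r2 + k2]
    return O

rng = np.random.default_rng(0)
I = rng.standard_normal((3, 5, 5))
W = rng.standard_normal((3, 3, 3))
O = dwconv(I, W)

# Per channel: a single filter row times a K^2 x R^2 patch matrix (an MV).
for c in range(3):
    cols = np.stack([I[c, r1:r1 + 3, r2:r2 + 3].ravel()
                     for r1 in range(3) for r2 in range(3)], axis=1)
    assert np.allclose(O[c].ravel(), W[c].ravel() @ cols)
```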

Systolic Arrays
Dataflows decide the schedule of GEMM tiles in the SA. The output-stationary (OS) dataflow is the most commonly used one and achieves high ifmap and weight data reuse [1]. Fig. 4 also shows the operation of the OS dataflow, where the two input matrices of a GEMM tile are shifted horizontally and vertically, respectively, and the output matrix is fixed on the array.
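A minimal cycle-level sketch of this behavior (ours, with the input skew modeled by index offsets rather than explicit shift registers) shows each PE holding its own output while operands stream past:

```python
import numpy as np

def os_systolic_matmul(A, B):
    """Output-stationary S x S array computing C = A @ B (square operands).

    A rows enter from the left, B columns from the top; PE(i, j) sees the
    k-th operand pair at cycle t = i + j + k and accumulates its fixed C[i, j].
    """
    S = A.shape[0]
    C = np.zeros((S, S))
    for t in range(3 * S - 2):         # enough cycles for the farthest PE
        for i in range(S):
            for j in range(S):
                k = t - i - j          # skew: data reaches PE(i, j) late
                if 0 <= k < S:
                    C[i, j] += A[i, k] * B[k, j]
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
assert np.allclose(os_systolic_matmul(A, B), A @ B)
```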

Scalability
Scalability is important in the SA design. Intuitively, the more PEs the SA contains, the higher its performance will be. However, hardware scaling must also consider various issues such as the PE utilization rate, data traffic, area, and energy consumption. Therefore, it is a challenge for the SA to improve performance by increasing the array size.
Scaling-up [14], [15] is a common method. It is very simple and only enlarges the PE array. Google's TPU series uses this method to design large-scale SAs [25]. However, it is hard to map compact CNNs and small-scale workloads to a large SA. The bigger the array, the more idle PEs compact CNNs cause. So the PE utilization rate of the SA decreases as the array size becomes larger.
Scaling-out [14], [15], [16] can handle these problems. This method uses more small arrays rather than making the arrays bigger, so it can maintain a higher PE utilization rate. Some works [11], [12], [14] choose this method to design their accelerators. However, scaling-out incurs more hardware overhead and increases the data traffic.

Related work
Systolic arrays have become the mainstream of hardware CNN accelerators deployed in both edge devices [11], [12], [16], [26], [27] and data-center servers [25], [28], [29], [30]. Google designed the deep learning accelerator TPU [25], [28] based on a SA and achieved great success, making the two-dimensional SA a hot research field. NVIDIA also integrated the SA into its GPU architecture and proposed the Tensor Core architecture [29], [30] to accelerate deep learning. Coupled with low-precision calculations and network sparsity, the Tensor Core architecture further improves the performance of the GPU and makes it the most popular deep learning accelerator. In academia, Pham et al. [10] deploy the weight-stationary dataflow in the PE array and use the systolic mode to reuse data. However, because the array size is limited by the size of the kernels, its scalability is poor. Du et al. [11] use the OS dataflow to make the PE array more scalable. Their architecture is similar to a SA; by reusing data between PEs, their design achieves high performance, but it is not optimized for depthwise separable convolution. Lu et al. [31] and Chen et al. [26] propose new dataflows and change the systolic architecture. These designs compute efficiently, but their data transmission depends completely on bus interconnection, which means the implementations may have difficulty achieving timing closure at a high clock frequency, or even passing place and route [15]. Moreover, we find that these designs are relatively inefficient when computing compact CNNs.
Some proposals try to overcome these problems. Chen et al. [16] and Liu et al. [17] try to change the dataflow of the SA to improve the PE utilization rate. However, both of them introduce many storage units, routers, data paths, and control units, which increases the hardware overhead and defeats the original intention of the SA's simple and efficient design. Jha et al. [13] try to change the structure of compact CNNs to make them more suitable for the SA, but this requires re-training a trained compact CNN and is impractical for edge/mobile devices.
Of course, there are other designs for the SA. For example, Kung et al. [32] and He et al. [33] use column combining to train and compress models, and Lym et al. [14] use structured compression to train CNNs. However, none of these designs consider compact CNNs.
In this paper, we consider the design of edge/mobile platforms with a small SA, because mobile devices are more likely to face the challenge of compact CNNs and their hardware resources are limited.

ARCHITECTURAL CHALLENGES FROM THE COMPACT CNNS
The architecture and dataflow of the traditional SA lead to the inefficiency of calculating DWConv. However, through experiments and analysis, we find that the structure of the SA still has room for optimization in compact CNNs.

Analysis
For further analysis, we record the PE utilization rate of a 16 × 16 SA processing MobileNetV3, as shown in Fig. 5a. These experimental results are all obtained through a configurable, systolic-array-based, cycle-accurate DNN accelerator simulator [15]. The PE utilization rate of most SConv layers exceeds 90% due to the SA's high efficiency in processing GEMM. However, the average PE utilization rate of DWConv is only about 6%, and only 3% in the worst case. This increases the computation delay and reduces the performance of the accelerator.
To find a solution, we analyze the roofline model of the 16 × 16 SA. We sweep every layer of MobileNetV3, and Fig. 5b shows the roofline data points for the models. Most SConv layers are in the region of compute-bound and near the roofline, which means they make full use of the SA. DWConv layers are in the region of memory-bound, which means the reused space of data is not enough to activate every PE in the array.
More importantly, our roofline analysis shows that the performance of DWConv layers only accounts for 10% of the theoretical performance. It means that the current dataflow cannot make full use of the computing resources of the SA and the PE utilization rate of the DWConv layers has room for further improvement.

Discussion of dataflows
In Section 2.2, we introduced how the OS dataflow schedules and maps data. Fig. 6a shows the data space of the ofmaps when an S × S SA processes SConv layers with OS. It processes ofmaps from S channels and S activations (pixels of the ofmaps) at one time [21]. Meanwhile, it achieves high data reuse of ifmaps and weights (Fig. 6d). We call this OS version OS-M (M for multi-channel). However, as described in Section 2.1, DWConv has no opportunity to reuse data in the ofmap channel dimension (between multiple filters). If OS-M is adopted directly, the ofmap data space is only S (activations) for the SA (Fig. 6b). Fig. 6e also shows that the ifmap data cannot be shared between filters, resulting in a large number of idle PEs when an SA with the OS-M dataflow processes DWConv. So we use a variant of the OS dataflow (OS-S, S for single-channel), which processes ofmap data from the same single channel at a time to maximize the data reuse opportunities of DWConv [1]. Fig. 6c shows the ofmap data space when processing DWConv with the OS-S dataflow. Ideally, it can increase the reuse opportunities of the output data by S× compared with the OS-M dataflow.
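The S× gain in ofmap data space can be written down directly (a deliberately trivial model of the mapping described above; ours):

```python
# Idealized ofmap data space an S x S OS array can pin per mapping round
# when processing DWConv (illustrative model of the text, not cycle-accurate).

def osm_dwconv_space(S):
    # OS-M pins S channels x S activations, but DWConv offers no
    # cross-channel reuse, so only one row of S activations is useful.
    return S

def oss_dwconv_space(S):
    # OS-S fills the whole array with activations of a single channel.
    return S * S

S = 16
print(oss_dwconv_space(S) // osm_dwconv_space(S))  # -> 16, an S-fold gain
```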
Mapping single-channel ofmap data to the SA requires not only horizontal ifmap data transmission but also vertical transmission [11]. Because computing one ofmap datum requires ifmap data from multiple rows and columns, ifmap data can be reused between multiple rows and columns of PEs (in both the horizontal and vertical directions). If the ofmap data is directly mapped to the PE array according to its position, one PE needs the ifmap data from the PEs of the next row (Fig. 6f). This cannot be achieved in the traditional SA, which only supports input data propagating in a single direction. But the HeSA can support it, as we illustrate in Section 4.1.

THE HESA ARCHITECTURE

Fig. 7 shows the overview of the HeSA architecture. It is composed of a SA, on-chip buffers, a control unit, and the connection to the host device. The SA uses heterogeneous PEs to support the OS-M and OS-S dataflows. On-chip buffers provide fmap and weight data for the SA. The control unit communicates with the host device, moves data, and controls the working state of the SA. We describe these structures below.

Operation process of the HeSA
Since we adopt different dataflows in the SA, the traditional operating procedure must be adjusted accordingly. The operation process and data mapping of the OS-M dataflow remain unchanged, so we do not discuss them in detail.
However, the data mapping of the OS-S dataflow differs from that of the OS-M dataflow, so we make many changes and adjustments to the SA operating procedure to adapt to it. The new operation process has two main features. First, the ifmap data needs to be transmitted both horizontally and vertically between PEs, as described in Section 3.2. To illustrate this process more clearly, we take a SA that applies the OS-S dataflow to compute a simple convolution as an example. Fig. 8a shows the convolution, which contains a 3 × 3 ifmap, a 2 × 2 kernel, and a 2 × 2 ofmap. The OS-S dataflow needs to map and fix the ofmap data to the PE array. However, the ofmap data is not directly mapped to the PEs according to its positions: it needs to be rotated by 180° and then mapped, as shown in Fig. 8b. The advantage of this adjustment is that ifmaps can be reused and spread from top to bottom, rather than from bottom to top (Fig. 6f). In this way, there is no need to design a separate upward data path for the ifmap data.
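The 180° rotation of the ofmap mapping can be sketched in a couple of lines (an illustrative example, ours; the labels stand for ofmap positions):

```python
import numpy as np

# OS-S pins the ofmap to the PE array rotated by 180 degrees, so reused
# ifmap data can flow top-to-bottom instead of bottom-to-top.
ofmap_positions = np.array([["O00", "O01"],
                            ["O10", "O11"]])
mapped = np.rot90(ofmap_positions, 2)   # k=2 quarter-turns = 180 degrees
print(mapped)
# [['O11' 'O10']
#  ['O01' 'O00']]
```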
After the data mapping is completed, the operation process of the PE array enters the preloading/computing phase. To allow the SA to support the OS-S dataflow, we also modify the PE structure in the array. A traditional PE is composed of the weight register (REG1 in Fig. 8b), the input data register(s) (REG2 in Fig. 8b), and the MAC unit, which calculates and stores the output data in the traditional SA. However, in the OS-S dataflow the PEs require additional input data registers to cache ifmap data that needs to be propagated vertically to other rows. So we add an extra register (REG3 in Fig. 8b) to the PEs. Note that the PEs in the last row of the array do not need REG3. Fig. 9 shows how a 2 × 2 SA with the OS-S dataflow computes the convolution. For conciseness, we only describe the operation process of six cycles; the accelerator can complete the work by repeating this process. In addition, we use arrows to indicate where the data will flow at the next cycle.
Cycle #i: Data preparation. The external buffer prepares the input data. Note that the data is skewed (the data in the next buffer row has one more bubble than the previous row).
Cycle #i+1: Preloading. The PEs of the first row in the array start preloading the input data. At the same time, the external storage prepares the weight data for the next cycle (the number of cycles required for preloading is the array width − 1). Note that the weight data is the same for each column of PEs.
Cycle #i+2: Calculation starts. PE00 and PE01 finish preloading data and start the calculation. The PEs in the second row are still in the preloading phase because of the skew in the input feature data. In addition, PE00 and PE01 also need to store the input data of this cycle into their REG3 registers for data reuse by the next row of PEs.
Cycle #i+3: Calculation continues. At this cycle, PE00 and PE01 complete the calculation of the first row of input feature data and need the next row of input feature map data. So their input path switches to the storage above the array for the new row of input feature data. Meanwhile, PE10 and PE11 receive the data and start the calculation.
Cycle #i+4: Calculation continues. The buffer on the left side of the array continues to provide the input feature map data for PE00 and PE01. PE10 and PE11 switch to the next row of input feature map data, so their input data is provided by REG3 in the first row of PEs. At the same time, the data in REG3 is updated. (In this case, this update has no effect, but it is necessary for larger kernel sizes.)
Cycle #i+5: Calculation is finished. PE00 and PE01 complete their calculation on this channel of feature maps and weights, so the external buffer starts preparing the data of the next channel of input feature maps and weights. This corresponds to entering a new preloading phase (Cycle #i'). The second row of PEs is still calculating and needs one more cycle to complete.
By pipelining and looping over these phases, the SA with OS-S can load and compute the entire workload.
In the toy example presented above, the HeSA uses heterogeneous PEs (the structure of the PEs differs between rows) to achieve multi-directional input data transmission and support OS-S. Unlike Chen et al. [16] and Liu et al. [17], the HeSA still uses the systolic data transmission mode, and it can also operate as a standard SA supporting the OS-M dataflow. However, this toy example introduces some components, like REG3, vertical input data paths, and additional external storage for input data from above. These components increase the hardware overhead and bring challenges to the hardware accelerator design of edge/mobile devices.

Heterogeneous PEs
In Section 4.1, we concluded that the HeSA PEs need additional data paths and registers. But we can reuse some components of the traditional PE to hide this hardware overhead. Fig. 10a shows the structure of the traditional PE. It reads the weight (input) data from above (the left) and transfers the weight (input) data downward (to the right). The PE also accumulates the product of the input and weight data, and the Psum register stores the partial sum temporarily. After the PE computation has finished, the output data is passed into the output register and is shifted across vertical PEs to the corresponding buffer.
Since the data paths and registers for vertical output transmission are always idle during the computation, we reuse these components for vertical input transmission. Fig. 10b shows the design of the PEs in the HeSA. We connect the output data path with the input data path (red lines in Fig. 10b) and control the data transmission with a multiplexer (MUX, manipulated by the control unit). When the HeSA uses the OS-M dataflow, the MUX closes the red path, and the structure of the PE in the HeSA is the same as the traditional PE. When the HeSA uses the OS-S dataflow, the MUX chooses the red path, and the output register is also used to cache the input data. In addition, the SA with OS-S still needs an extra storage unit (like a register set) with additional control logic and data paths for the top PEs (like the PEs in row 0 of Fig. 9e, Cycle #i+4). This brings additional hardware overhead to the SA architecture, as shown in Fig. 11a. To avoid this overhead in the HeSA design, we use the PEs at the top of the array as the register set (these PEs are no longer used for calculation) for preloading the input data for the next row of PEs, as shown in Fig. 11b. Although this affects performance, it saves the hardware cost of the storage and does not require modifying the SA architecture. Meanwhile, the performance penalty of this design is acceptable.

Buffer and controller
In the HeSA, the on-chip local buffers adopt a double-buffering strategy. Double buffering overlaps the computation of the PEs with memory accesses and allows very simple coarse-grain control of data transfers between buffers and memory. The control unit is responsible for the connection with the host device, data distribution, and arbitration in the standard SA. In the HeSA, the control unit can also control the data paths in the PEs to switch dataflows flexibly. In the compilation stage, we specify which dataflow is used by each layer of the network. Since we only add one MUX unit per PE, there is only one more bit of control signal, and the overhead is negligible.
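The double-buffering idea can be captured in a few lines (a minimal sketch, ours; `fetch` stands in for the DMA and `compute` for the PE array):

```python
# Minimal sketch of double buffering: while the array computes on one
# buffer, the next tile is fetched into the other, and the roles swap.

def run_tiles(tiles, fetch, compute):
    bufs = [None, None]
    bufs[0] = fetch(tiles[0])                      # prologue: fill buffer 0
    for idx in range(len(tiles)):
        cur = idx % 2
        if idx + 1 < len(tiles):
            bufs[1 - cur] = fetch(tiles[idx + 1])  # prefetch overlaps compute
        compute(bufs[cur])

out = []
run_tiles([1, 2, 3], fetch=lambda t: t * 10, compute=out.append)
print(out)  # -> [10, 20, 30]
```

In real hardware the fetch and compute of one iteration proceed concurrently; the sequential sketch only shows the ping-pong buffer management.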

SCALABILITY
Because the workload of compact CNNs is usually small, it brings challenges to the scalability of the SA.

Scaling-up and scaling-out
There are two major ways to enlarge the design scale of a systolic array. One is scaling-up, as shown in Fig. 12a. It is a very simple method that just adds more PEs to the array. The advantage of scaling-up is that it needs less bandwidth: when the array is scaled up by a factor of N, the bandwidth required by the array increases only by √N. This reduces the data traffic between storage and the computing array. In fact, it is equivalent to selecting a large-scale array for the accelerator, and it has been widely used in industry [25].
However, there are problems with the scaling-up method. As the array becomes larger, the PE utilization rate of the SA decreases once the data size is small or the workload parallelism is limited, as shown in Fig. 2c. It is very unfriendly to the accelerator, especially for the inference task, which needs low latency. So maintaining a high PE utilization rate and performance is more important.
Another way of scaling is the scaling-out method, as shown in Fig. 12b. It uses many small arrays to replace a large-scale array and is more suitable for compact CNNs. From another perspective, each PE array can be regarded as a separate computing node, so the design is more flexible in data mapping. Compared with scaling-up, scaling-out can maintain a higher PE utilization rate and better performance on workloads with less parallelism.
However, the scaling-out design also has problems. Each array contains separate buffers, and such distributed storage means additional data read and write overheads (such as data replication and synchronization). In addition, the scaling-out design needs more bandwidth than scaling-up: when the array is scaled by a factor of N, the required bandwidth also increases by N times. Moreover, an accelerator with multiple arrays requires the host or control unit to allocate computation and data to the arrays reasonably, which is also a challenge.
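The two bandwidth growth laws stated above can be sketched directly (back-of-envelope model, ours, for an N-fold increase in PE count):

```python
import math

def scale_up_bandwidth(base_bw, N):
    # One (sqrt(N)*S) x (sqrt(N)*S) array: the fed edge grows by sqrt(N).
    return base_bw * math.sqrt(N)

def scale_out_bandwidth(base_bw, N):
    # N separate S x S arrays, each with its own full set of input edges.
    return base_bw * N

print(scale_up_bandwidth(1.0, 4))   # -> 2.0
print(scale_out_bandwidth(1.0, 4))  # -> 4.0
```

Quadrupling the PE count thus doubles the bandwidth need under scaling-up but quadruples it under scaling-out, which is the trade-off the FBS design below targets.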

Flexible buffer structure
In our design, we try to combine the advantages of both scaling-up and scaling-out while avoiding their shortcomings. So we design the flexible buffer structure (FBS), which uses a crossbar to connect multiple small arrays and buffers in the HeSA; Fig. 13 shows the HeSA using the FBS design. It is very similar to the scaling-out design, but we optimize the buffer design and add the crossbar unit.
The crossbar unit has three connection modes: one-to-one unicast (Fig. 14a), one-to-two multicast (Fig. 14b), and one-to-all broadcast (Fig. 14c). In this way, we can combine multiple arrays so that they share one buffer, achieve a unified storage space, and reduce data traffic. Compared with the scaling-up design, it is more flexible, because by adjusting the routing scheme, a variety of array combinations can be implemented simply to adapt to more task scenarios. Compared with the scaling-out design, our design can exploit data reuse and parallelism as much as possible to reduce buffer read and write costs.
Because it only needs to support three connection modes, the structure of the crossbar unit is very simple. Fig. 15 shows the crossbar structure. Users only need to configure the crossbar to switch the communication mode. For example, turning on the red path in Fig. 15 realizes a one-to-four broadcast.
To further illustrate the optimization, we take scaling an 8 × 8 array to a 16 × 16 array as an example. The scaling-out design uses four 8 × 8 small arrays, and the scaling-up design uses one 16 × 16 array. But the flexible buffer structure in the HeSA can realize multiple allocations of the PE arrays, as shown in Fig. 16. It can achieve not only the array sizes of the scaling-up (f in Fig. 16) and scaling-out (a in Fig. 16) designs, but also PE arrays of other sizes. This is equivalent to flexible switching between a large-scale array and multiple small-scale arrays according to the workload. Users can achieve this by properly configuring the crossbar in the flexible buffer structure, which is also shown in Fig. 16.
The biggest advantage of the FBS is that it realizes the adjustment and allocation of bandwidth through configuration. Fig. 17 compares the normalized maximum bandwidth of the three scaling methods. Scaling-out has the largest maximum bandwidth, for flexibility; scaling-up has a small maximum bandwidth. Since the FBS is configurable, it has the most flexible bandwidth options, ranging from the largest to the smallest bandwidth. It can even configure the bandwidth of the ifmap and weight data independently according to the task requirements. This ensures our design can achieve better performance/energy efficiency than scaling-up and scaling-out.
We also design the FBS configuration algorithm. For a given layer of the CNN, the algorithm estimates the PE utilization rate of every configuration and selects the optimal scheme with the maximum PE utilization rate and minimum bandwidth, as shown in Procedure 3. Note that estimating the PE utilization rate of the SA is very simple and can rely on simulators [15] or mathematical calculations [34].

Procedure 3 FBS configuration algorithm
1: for each candidate configuration C do
2:   estimate PE_Util_C and BW_MAX_C
3:   if PE_Util_Max <= PE_Util_C then
4:     if BW_MAX_opC >= BW_MAX_C then
5:       opC = C; PE_Util_Max = PE_Util_C
6: return opC
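A hypothetical sketch of this configuration search follows (ours; `estimate_util`, `peak_bandwidth`, and the configuration tuples are assumptions standing in for the simulator- or model-based estimates):

```python
# Pick the configuration with the highest estimated PE utilization,
# breaking ties toward the lowest peak bandwidth (sketch of Procedure 3).

def choose_fbs_config(configs, estimate_util, peak_bandwidth):
    best, best_util, best_bw = None, -1.0, float("inf")
    for cfg in configs:
        util = estimate_util(cfg)
        bw = peak_bandwidth(cfg)
        if util > best_util or (util == best_util and bw < best_bw):
            best, best_util, best_bw = cfg, util, bw
    return best

# Toy candidates: (name, estimated utilization, normalized peak bandwidth)
configs = [("1x(16x16)", 0.40, 1.0),
           ("4x(8x8)",   0.80, 4.0),
           ("2x(8x16)",  0.80, 2.0)]
pick = choose_fbs_config(configs,
                         estimate_util=lambda c: c[1],
                         peak_bandwidth=lambda c: c[2])
print(pick[0])  # -> "2x(8x16)": same utilization as 4x(8x8), half the bandwidth
```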

EVALUATION
In our experiments, we use three compact CNNs as workloads -MobileNetV3 [24], MixNet [4] and EfficientNet-lite0 [5]. These models are up-to-date and widely used in applications for mobile or embedded devices. Moreover, due to the characteristics of the mobile/edge platforms, we apply 8-bit quantization for all of the network models and specify that the mini-batch is 1.
We simulate the HeSA and the standard SA with SCALE-Sim [15], a configurable systolic-array-based cycle-accurate DNN accelerator simulator. Using this simulator, we can obtain accurate information on the runtime of the array and the PE utilization rate.
For scalability, we evaluate three solutions: scaling-up, scaling-out, and the flexible buffer structure (FBS). We design two scenarios: 8 × 8 scaling to 16 × 16, and then scaling to 32 × 32. These array sizes are all used in current common designs [11], [12], [13], [15], [16], [31], [32]. Moreover, the model sizes of CNNs are usually multiples of 32 or 16, so designs with these array sizes are also the most efficient. We focus on testing the PE utilization rate and data traffic when the design adopts different scaling methods. The PE utilization rate reflects performance and latency, and the data traffic reflects bandwidth and energy consumption.
For the area, we implement the HeSA in RTL and synthesize it with Synopsys Design Compiler under a commercial 45 nm library. For the energy consumption, we use Aladdin [35], a power-performance accelerator simulator, to evaluate the power consumption of the HeSA design. We also compare the HeSA with other hardware accelerators; the configurations are listed in Table 1. For a fair comparison, we keep the implementation parameters the same across designs. We choose Gemmini [12] for the RTL design of the standard SA. Gemmini is an open-source TPU-like systolic array generator that produces optimized RTL according to parameter requirements [12]. Eyeriss [26] is a classic CNN hardware accelerator.

PE utilization
We first show how the HeSA improves the PE utilization of the array by switching dataflows. Fig. 18 shows the PE utilization rate of different layers of MixNet processed by an 8 × 8 SA with different dataflows. SA-OS-M represents a standard SA that only uses the OS-M dataflow. SA-OS-S represents a variant SA that only uses the OS-S dataflow [11], and the HeSA is our design.
The results show that for SConv layers (R × R K × K PW in Fig. 18), the average PE utilization rate of SA-OS-M is about 90%, while the PE utilization rate of SA-OS-S is relatively low, mostly only 70%. However, for DWConv layers (R × R K × K DW in Fig. 18), the PE utilization rate of SA-OS-M is only about 11%, while SA-OS-S stays above 45% and even reaches 75% at maximum. The HeSA keeps a high PE utilization rate in every layer by switching dataflows according to the scenario.
For further evaluation, Fig. 19a shows the DWConv and total PE utilization rates of different compact CNNs on an 8 × 8 standard SA and the HeSA. Compared with the low PE utilization rate (only about 11%) of the DWConv layers in the standard SA, the HeSA raises the rate to about 60%. Thanks to this large improvement in the DWConv layers, the total PE utilization rate of the three compact CNNs is also improved in the HeSA.
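The PE utilization rates above follow the usual definition: useful MAC operations divided by the total PE-cycles the array offers. A minimal sketch of the metric (the numbers in the example are illustrative, not measurements from the paper):

```python
def pe_utilization(useful_macs, rows, cols, cycles):
    """Fraction of PE-cycles that perform a useful MAC:
    useful_macs / (rows * cols * cycles)."""
    return useful_macs / (rows * cols * cycles)

# An 8 x 8 array busy for 1000 cycles offers 64,000 PE-cycles; a
# mapping that issues only 7,040 useful MACs is ~11% utilized, which
# is the regime DWConv layers hit on the standard SA.
```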
Moreover, the PE utilization rate of the standard SA decreases as the SA size grows, but the HeSA with the FBS can solve this problem. Fig. 19 shows the PE utilization rate of the standard SA and the HeSA under different scaling methods when the SA size is increased to 16 × 16 and 32 × 32. The standard SA suffers a heavy impact: its DWConv PE utilization rate drops below 6% (16 × 16) and 3% (32 × 32), and its total utilization rate falls below 30% (16 × 16) and 20% (32 × 32).
In contrast, the HeSA with scaling-out and the HeSA with the FBS keep the DWConv PE utilization rate above 50% (16 × 16) and 30% (32 × 32), with a total utilization rate of up to 85% (16 × 16) and 65% (32 × 32) during the test. The HeSA with scaling-up performs only moderately, because a large monolithic array lacks the flexibility to handle small-scale workloads.
Meanwhile, the FBS also saves data traffic. Fig. 20 shows the data traffic of the HeSA under the different scaling methods. The scaling-out method consumes the most data traffic, 1.8× that of the scaling-up design, but the HeSA with the FBS reduces data traffic by 40% compared to the scaling-out design. It can flexibly switch the array size and allocate bandwidth according to the workload, thereby saving communication costs and data traffic.
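Putting the Fig. 20 numbers together: with scaling-up traffic normalized to 1.0, scaling-out costs 1.8×, and the FBS cuts 40% off that, so the FBS ends up close to the scaling-up traffic while keeping scaling-out-level utilization. A quick arithmetic check:

```python
traffic_up = 1.0                        # normalized baseline (scaling-up)
traffic_out = 1.8 * traffic_up          # scaling-out: 1.8x of scaling-up (Fig. 20)
traffic_fbs = traffic_out * (1 - 0.40)  # FBS: 40% less than scaling-out
# traffic_fbs comes to 1.08, i.e. within ~8% of the scaling-up traffic
```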

Performance
The increase in PE utilization translates into an increase in performance. We sum the latencies of all layers, and Fig. 21 shows the speed-up of the HeSA over the standard SA. The HeSA achieves an average 4.5× - 11.2× speed-up on the DWConv layers compared to the standard SA, and the total performance is 1.6× - 3.1× better.
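The total speed-up is the ratio of summed per-layer latencies, so it is bounded by the fraction of time spent in DWConv layers (an Amdahl-style effect), which is why the end-to-end gain is smaller than the DWConv-only gain. A sketch with made-up latencies:

```python
def total_speedup(base_latencies, hesa_latencies):
    """End-to-end speed-up: total baseline cycles / total HeSA cycles."""
    return sum(base_latencies) / sum(hesa_latencies)

# Hypothetical two-layer model: the SConv layer is unchanged and the
# DWConv layer gets an 8x per-layer speed-up, yet the end-to-end gain
# stays well below 8x because the SConv time is unaffected.
base = [200, 800]   # [SConv, DWConv] cycles on the standard SA
hesa = [200, 100]   # the DWConv layer runs 8x faster on the HeSA
```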

Area
We lay out the HeSA with the FBS design (16 × 16); its total area is 1.84 mm². We compare the area of the HeSA with other accelerators in Fig. 22. Thanks to its simple architecture, the standard SA has the smallest area. The area of the HeSA increases by only 3% over the standard SA, because the HeSA only adds some control units and data paths, with no extra storage or computing requirements. Eyeriss has the largest area: although it adopts a systolic structure, it requires large storage in each PE. Fig. 22 also shows the area breakdown. The PEs in Eyeriss take over half of the total area, 2.7× larger than those in the standard SA and the HeSA.

Energy
We illustrate the energy consumption of the standard SA and the HeSA in Fig. 22. When processing the three compact CNNs, the energy consumption of the HeSA is always more than 20% lower than that of the standard SA. Thanks to the improved reuse opportunities in DWConv layers, the HeSA reduces the number of reads and writes to buffers and to registers inside the PEs; as a result, about 30% and 40% of the energy consumption is saved in the PEs and buffers, respectively. However, the DRAM energy is almost unchanged, because DRAM only stores and moves the fmaps and weight data of the network models and thus hardly benefits from the optimization. The energy consumption of the on-chip network and control units is negligible.
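The breakdown above follows a standard access-count energy model: each component's energy is its access count times a per-access cost, so a component whose access count the HeSA does not change (DRAM) sees no benefit. A minimal sketch with hypothetical counts and relative costs:

```python
def total_energy(accesses, cost_per_access):
    """Sum of access_count x per-access energy over all components."""
    return sum(n * cost_per_access[c] for c, n in accesses.items())

# Hypothetical relative per-access costs: DRAM dominates per access.
cost = {"dram": 200.0, "buffer": 6.0, "pe_reg": 1.0}
base = {"dram": 1_000, "buffer": 50_000, "pe_reg": 400_000}
hesa = {"dram": 1_000, "buffer": 30_000, "pe_reg": 280_000}  # fewer buffer/reg accesses
```

With these illustrative numbers the DRAM term is identical in both designs, while the buffer and PE-register terms shrink, mirroring the breakdown reported above.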

CONCLUSION
We design a heterogeneous SA architecture (HeSA) for CNN hardware accelerators, which can switch between different dataflows to exploit the data reuse opportunities of compact CNNs. It greatly mitigates the mismatch between the SA architecture and the depthwise convolutional layers of compact CNNs. On typical workloads, our design provides 4.5× - 11.2× speed-up over the baseline on DWConv layers and 1.6× - 3.1× on whole models. Moreover, our design follows a simple and efficient design principle without changing the structure of the array, so the area of the HeSA is essentially the same as that of the standard SA.
Thanks to the increased reuse opportunities, the energy efficiency of the HeSA improves by about 10% over the baseline.