A Compressed Data Partition and Loop Scheduling Scheme for Neural Networks

Neural networks (NNs) have been widely adopted across application domains. Deeper NNs greatly improve output accuracy, but complex NNs with more parameters incur intensive memory accesses, and the data usually need to be partitioned because they may exceed the on-chip storage. However, no prior work considers the co-design of NN data partitioning and scheduling. In this paper, we propose a sparse NN data partition and loop scheduling scheme. We establish a compression efficiency model for matrix sparsification algorithms and design a partition selection method based on the sparsity characteristics analyzed by this model. We further design a loop scheduling scheme based on the selected partition size. The experimental results show that the average memory access of each layer can be compressed to 68% of the original, and the throughput of AlexNet, VGG, and VGG19 is improved by an average of 1.66 times.

Since the weights of even a single convolutional layer can exceed the local storage capacity, researchers have proposed data transformation schemes. Graph partitioning uses the synchronous dataflow model: it splits the graph into subgraphs along convolutional layers and maps each subgraph to a different bitstream; however, this scheme requires FPGA (Field-Programmable Gate Array) reconfiguration when data flow to the next subgraph.

Folding is another effective partitioning method: it folds the input by a factor, so a convolutional layer is split into multiple subgraphs that each execute a fraction of the total convolution. The interim results are accumulated to generate the output, and the storage requirement is thus reduced by the folding factor. Folding methods are further divided into coarse-grain and fine-grain folding. Coarse-grain folding fully unrolls the major operations of every layer and provides the highest possible throughput. Fine-grain folding time-multiplexes different operations and therefore uses far fewer hardware units.

In addition, the parameter matrix of a neural network is usually sparse, and compression can effectively reduce the memory footprint.

The sparse coefficient of the CSR algorithm is two. Ideally, only values and column indices are stored, and the storage overhead is $2K$. However, in practice it cannot be guaranteed that all nonzero values lie in the same row, nor that the data are evenly distributed across rows. Therefore, the CSR algorithm requires $2K + c$ storage, where $c$ is the total number of rows occupied by data.

In addition to storing the numerical values, SCNN [3] also stores, for each value, the number of zeros between it and the previous nonzero value, plus one extra entry representing the number of nonzero values in the matrix. Therefore, the sparse coefficient of SCNN is two, and the storage requirement is $2K + 1$.

The Swallow [4] scheme records the number of nonzero values in each row for every channel and uses an offset bit to record the column index of each value. The sparse coefficient of the Swallow scheme is two, and the sparse offset is $R$, the number of rows.
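As a concrete illustration, the following minimal sketch encodes the three storage-cost models above in Python; the function names are ours, and the unit is stored elements, matching the paper's counting.

```python
def csr_cost(K, c):
    """CSR: values plus column indices (2K) and one entry per
    occupied row, giving 2K + c."""
    return 2 * K + c

def scnn_cost(K):
    """SCNN [3]: value plus zero-run length per nonzero (2K), and one
    extra entry for the number of nonzeros, giving 2K + 1."""
    return 2 * K + 1

def swallow_cost(K, R):
    """Swallow [4]: per-row nonzero counts and column offsets (2K),
    with a sparse offset of R rows, giving 2K + R."""
    return 2 * K + R

# Example: a matrix with K = 300 nonzeros spread over c = 64 rows, R = 64.
print(csr_cost(300, 64), scnn_cost(300), swallow_cost(300, 64))
```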

Scheduling research on CNN (Convolutional Neural Network) accelerators has a long history. Since the 1980s, a series of as-early/as-late-as-possible algorithms [5], degrees-of-freedom-based algorithms [6], and mobility-based scheduling algorithms [7], [8], [9], [10], [11] have been proposed. Researchers then found that the scheduling problem is not only how to split an operator across the multiplier-accumulators, but also how to implement a multi-batch pipeline across different convolutional layers [12]. Therefore, optimization is carried out from two aspects, resource allocation and dataflow [13]: resource allocation methods mainly focus on improving resource utilization, while dataflow methods address how to transport data to achieve the highest performance.

In terms of resource allocation optimization, the ACT laboratory extracts the control-component hyperparameters from the perspective of the Processing Element (PE) configuration, imitating the von Neumann control mode and relying on a large number of expert-predefined optimization templates to add instructions that guide dynamic scheduling on the FPGA [5].
[14] compared the size of the convolution kernel matrix with the memory bandwidth, thereby providing a basis for data partitioning and alleviating uneven resource allocation. Loop tiling and loop unrolling consider both on-chip data reuse and external storage and can be used to improve mapping efficiency and resource utilization [15]. [12] noted that providing a dedicated convolution processing unit for each stage of the convolution computation [16] is inefficient, and therefore proposed multiple multiprocessing structures with different computing capabilities to pipeline the processing. [17] designed a scheduling algorithm based on the maximum value to improve resource utilization under multi-processing-element cooperative computing. A reloadable neural network architecture was proposed in [18]. These schemes are simple and direct, but they are only effective for specific operators and generalize poorly.

Addressing scheduling problems can greatly optimize the model mapping so that both computing and storage resources are fully utilized. However, the existing scheduling optimization schemes lack generality, and most can only optimize specific operators. Therefore, it is necessary to find a general solution that handles various neural networks consisting of complex operators and their variants [24], [37].

The amount of data transmission in an NN is the key to optimizing performance. To determine the actual quantity of data to be transported, it is necessary to obtain the matrix size, density, and sparse-algorithm features. Therefore, we analyze the storage characteristics of each matrix according to the network model and dataset, and find the optimal compressed storage scheme by comparing the original storage amount with the sparsely compressed storage amount. After the compression scheme is determined, the memory access footprint of the model is calculated, and the partition size with the largest throughput is selected according to this footprint. Finally, enumeration is used to select the loop scheduling scheme.
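A high-level sketch of this flow is given below. The four callables are hypothetical placeholders for the modules described in the following subsections, not APIs defined in the paper; they are passed in as parameters so the sketch stays self-contained.

```python
def schedule_layer(layer, capacity, select_compression, compute_footprint,
                   select_partition, enumerate_schedules):
    """Orchestration of the four steps described above (hypothetical names)."""
    # 1. Analyze size/density/sparsity and pick the cheapest storage scheme.
    scheme = select_compression(layer)
    # 2. Compute the memory access footprint under that scheme.
    footprint = compute_footprint(layer, scheme)
    # 3. Choose the partition sizes (P_in, P_out) with the largest throughput.
    p_in, p_out = select_partition(footprint, capacity)
    # 4. Enumerate candidate loop schedules and keep the best one.
    return enumerate_schedules(layer, p_in, p_out)
```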

The feature analysis module calculates the storage amount $G_{comp}$ of the compressed matrix, compares it with the original uncompressed storage amount $G_{orig}$, and selects the strategy that occupies the smallest storage space. $G_{comp}$ is a linear function of the number of nonzero values. The matrix feature extraction module obtains the matrix features listed in Table 2.
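A minimal selection sketch combining the cost models above with the dense baseline $G_{orig} = R \cdot C$; the scheme names and the dictionary layout are our own framing.

```python
def select_storage(R, C, K, c):
    """Compare G_orig with each G_comp and keep the smallest.
    R, C: matrix dimensions; K: nonzeros; c: rows occupied by data."""
    candidates = {
        "dense":   R * C,        # G_orig, no compression
        "csr":     2 * K + c,    # values + columns + row entries
        "scnn":    2 * K + 1,    # values + zero-run lengths + count
        "swallow": 2 * K + R,    # per-row counts + column offsets
    }
    return min(candidates, key=candidates.get)

# A dense 64x64 input-side matrix with many nonzeros keeps dense storage.
print(select_storage(64, 64, 2500, 64))
```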

The layers close to the input layer retain more of the original information, so the input-side matrices are denser and their compressed storage cost is relatively high. It is therefore necessary to judge whether the compressed storage amount exceeds the original size, that is, whether $G_{comp} > G_{orig}$. If so, the original uncompressed storage is selected.

The size of a neural network model usually exceeds the on-chip storage, so a matrix partitioning technique is needed to transfer data to the FPGA. Ideally, the partitioned input matrix, output matrix, and weight matrix are held in the on-chip buffer until the partial sums are completely accumulated, and the result is then written back to memory. The total execution latency consists of memory access time and computation time. The memory access time is the product of the single-bit transmission delay and the amount of transmitted data. The quantity of transmitted data is the sum of the input, weight, and output matrix transmissions. The transmitted amount of each matrix $X$ is the product of its number of transmissions $\lambda_X$ and its single-transmission data amount $\mu_X$, as shown in Table 3. $P_{in}$ is the partition size of the input and $P_{out}$ is the partition size of the output; our goal is to find the best partition sizes $P_{in}$ and $P_{out}$.

The total transmission volume is calculated as shown in Equation (5). The memory access time MAT is calculated as shown in Equation (6), where $A$ is the transmission time of single-bit data, which depends only on the system bit width and bandwidth.

The computing time is obtained by multiplying the total number of computing partitions by the computing time of each partition, where the computing time of each partition is the data size $\gamma$ divided by the frequency $f$, and $\gamma$ is obtained by dividing the total network parameter size by the memory bandwidth $B$, as shown in Equation (7). The final computation time TCT is expressed in Equation (8). Throughput is the total amount of transferred data divided by the latency. Data transmission and computation are performed in parallel, so the latency is the maximum of TCT and MAT; the throughput is then given by Equation (9).
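The sketch below implements this latency and throughput model. The exact forms of $\lambda_X$ and $\mu_X$ come from Table 3, which is not reproduced here, so they are taken as inputs; the numbers in the usage example are made up.

```python
def latency_and_throughput(lam_mu, A, num_partitions, param_size, B, f):
    """Latency/throughput model of Equations (5)-(9).
    lam_mu: (lambda_X, mu_X) pairs for the input, weight, and output matrices.
    A: single-bit transmission time; B: memory bandwidth; f: frequency."""
    total_data = sum(lam * mu for lam, mu in lam_mu)   # Eq. (5)
    mat = A * total_data                               # Eq. (6)
    gamma = param_size / B                             # per-partition data size
    tct = num_partitions * gamma / f                   # Eqs. (7)-(8)
    latency = max(mat, tct)                            # transfer/compute overlap
    return latency, total_data / latency               # Eq. (9)

# Example: three matrices, 4 partitions, 1 GB/s bandwidth, 200 MHz clock.
print(latency_and_throughput([(4, 1024), (4, 2048), (1, 512)],
                             A=1e-9, num_partitions=4,
                             param_size=1e6, B=1e9, f=2e8))
```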
Partition selection is a multi-objective optimization problem; its constraints are shown in Equation (10).
The objective function is shown in Equation (11).

Our goal is to find the partition sizes $P_{in}$ and $P_{out}$; the partition size $P_{in}$ can be represented by Equation (12). Let $y'(T_n) = 0$; the right root of this quadratic equation is the maximum point $T_{n0}$ of the original function, given in Equation (14).

To determine whether the maximum point of $P_{in}$ falls in the range $(-\infty, V_1] \cap [1, N]$ while ensuring an integer solution for $P_{out}$ satisfying Equation (10-3), we distinguish the following cases. If $P_{in0} \in (-\infty, V_1] \cap [1, N]$, compare $y(P_{in})|_{P_{in}=1}$ with $y(P_{in})|_{P_{in}=P_{in0}}$ and choose the larger. If $P_{in0} \in (N, +\infty)$, compare $y(P_{in})|_{P_{in}=1}$ with $y(P_{in})|_{P_{in}=\min(N, V_1)}$ and choose the larger as the final result. If $P_{in0} \in (-\infty, 1)$, then $P_{in} = 1$ is the final result, because $y$ decreases monotonically on $[1, N]$ when $P_{in} > P_{in0}$.

When $P_{in} \in (V_2, +\infty) \cap [1, N]$, MAT > TCT, so the throughput of the system is decided by MAT; the formula for the MAT cost is shown in Equation (15).
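The case analysis above can be expressed compactly as follows. This is a sketch under our own simplifications: $y$ is the throughput model derived from Equations (12)-(14), passed in as a callable, and rounding $P_{in0}$ to an integer candidate is our addition.

```python
def best_p_in(y, p_in0, N, V1):
    """Clamp the unconstrained maximizer P_in0 of y into [1, N],
    following the three cases in the text."""
    if p_in0 < 1:
        return 1                        # y decreases monotonically past P_in0
    if p_in0 <= V1 and p_in0 <= N:
        candidates = [1, round(p_in0)]  # compare y(1) with y(P_in0)
    elif p_in0 > N:
        candidates = [1, min(N, V1)]    # compare y(1) with y(min(N, V1))
    else:
        return None                     # P_in0 in (V1, N]: handled by the
                                        # MAT / random-dominance cases below
    return max(candidates, key=y)
```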
It can be seen that MAT depends only on $P_{out}$: the larger $P_{out}$ is, the smaller MAT becomes. Therefore, it suffices to make $P_{out}$ as large as possible: let $P_{out} = M$ and use Equation (10-3) to solve for the largest feasible $P_{out}$, which is the proper partition for maximum throughput.

Random dominance occurs when $P_{in} \in [V_1, V_2] \cap [1, N]$: there exists a $P_{out}$ that makes MAT = TCT, and enumeration is used to traverse the partition value range and find this partition value.
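A minimal sketch of the enumeration step, assuming integer bounds $V_1$, $V_2$ and callables implementing the MAT and TCT models of Equations (6) and (8); the signature is our own.

```python
def enumerate_partition(mat_fn, tct_fn, V1, V2, N, M):
    """Random-dominance region: scan P_in over [V1, V2] intersected with
    [1, N] and P_out over [1, M], keeping the pair whose MAT and TCT are
    closest (ideally equal), which maximizes transfer/compute overlap."""
    best, best_gap = None, float("inf")
    for p_in in range(max(1, V1), min(N, V2) + 1):
        for p_out in range(1, M + 1):
            gap = abs(mat_fn(p_in, p_out) - tct_fn(p_in, p_out))
            if gap < best_gap:
                best, best_gap = (p_in, p_out), gap
    return best
```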

The above optimal solution is a theoretical result; in practical systems, storage capacity limitations mean that some theoretical values may not be reachable, which requires adjusting the results to meet the system resource constraints. The adjustment scheme relies on the fact that, for any convolutional layer, the storage overhead cannot exceed the memory capacity when $P_{in} = P_{out} = 1$.

Based on the above analysis, the inverse solution function $P_{xV}(\cdot)$ is used when the current partition space has no solution. Given the partition size in one dimension, it returns the partition size of the other dimension obtained by taking the equality in the constraint formula. For example, $P_{xV}(P_{in})$ returns the maximum value of $P_{out}$, and inputting $P_{out}$ returns the maximum value of $P_{in}$.

Taking the solution adjustment of the MAT-dominated interval as an example, the throughput is related only to the value of $P_{out}$, and $P_{out}$ needs to be as large as possible. However, $P_{in}$ has a lower boundary, so $P_{out}$ cannot exceed $P_{xV}(P_{in})$.

The memory footprint parameters of each matrix are listed in Table 4, and the footprint calculation formula is shown in Equation (16).
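A sketch of the inverse solution function. The paper's constraint formula is not reproduced here, so `buffer_fn` is a hypothetical callable standing in for it, and we assume buffer usage grows monotonically with the partition size.

```python
def pxv(p_fixed, capacity, buffer_fn, upper):
    """Inverse solution P_xV: largest partition size in the free dimension
    that still satisfies the on-chip buffer constraint, given the other
    dimension. buffer_fn(p_fixed, p_free) -> buffer bytes required."""
    best = 1   # P_in = P_out = 1 always fits, per the adjustment rule above
    for p in range(1, upper + 1):
        if buffer_fn(p_fixed, p) <= capacity:
            best = p
        else:
            break   # monotonic growth: larger partitions cannot fit either
    return best
```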

After the matrix is sparsely compressed, denote $K_X$ as the sparse storage footprint factor of matrix $X$; the footprint calculation formulas are shown in Equations (17)-(18). The total memory consumption under a loop schedule is given by Equation (20), where $Trace_X$ is the memory footprint of matrix $X$, $Trace$ is the total memory consumption of the memory mapping, and $n_{L_j}$ is the iteration count of loop $L_j$.
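Equation (20) is not reproduced here; the sketch below encodes one plausible reading of it, in which each matrix is re-fetched once per iteration of every loop outside its reuse scope, so its traffic scales by the product of those loops' iteration counts. The loop names in the usage example are hypothetical.

```python
from math import prod

def total_trace(trace_x, reload_loops, n):
    """Total memory consumption Trace: per-tile footprint Trace_X of each
    matrix X, multiplied by the iteration counts n[L_j] of the loops L_j
    that force it to be re-loaded (our reading of Eq. (20))."""
    return sum(t * prod(n[lj] for lj in reload_loops[x])
               for x, t in trace_x.items())

# Example: weights re-loaded by the batch loop, inputs by the
# output-channel loop, outputs kept on chip (hypothetical schedule).
print(total_trace({"in": 4096, "w": 9216, "out": 2048},
                  {"in": ["L_out"], "w": ["L_batch"], "out": []},
                  {"L_out": 8, "L_batch": 4}))
```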
We compared our scheme with the average partition scheme. The average partition scheme does not consider the importance of matrix channels but only the matching of the number of input matrices and convolution kernels, so it sets $T_m = T_n$ to reduce the computational time overhead and increase throughput. This section compares the throughput and memory access overhead of the different partition methods at each layer of the network. The results are shown in Figures 4-6.

The following conclusions can be drawn from the experimental results. First, our maximum partition method achieves higher system throughput than the average partition method.