Hardware Acceleration of a Generalized Fast 2-D Convolution Method for Deep Neural Networks

The hardware acceleration of Deep Neural Networks (DNN) is a highly effective and viable solution for running them on mobile devices. The power of DNNs is now available at the edge in a compact and power-efficient form factor with the aid of hardware acceleration. In this paper, we introduce an architecture that uses a generalized method called Single Partial Product 2-Dimensional Convolution (SPP2D Convolution) which calculates a 2-D convolution in a fast and expedient manner. We demonstrate that the SPP2D architecture prevents the re-fetching of input weights for the calculation of partial products, and it can calculate the output of any input size and kernel with low latency and high throughput compared to other popular techniques. SPP2D based architecture can reduce the memory access and execution time related to input reuse by at least <bold>three</bold> times in comparison with the work done in Ardakani <italic>et al.</italic> (2018) and approximately <bold>nine</bold> times that of the standard sliding window approach. We have implemented the generalized SPP2D architecture on the Xilinx KC705 Kintex-7 evaluation board to illustrate that the new SPP2D algorithm is well-suited for the hardware acceleration of DNNs. We implemented LeNet-5 and VGGNet-16 using the SPP2D architecture. We demonstrate that the SPP2D based LeNet-5 has a high throughput of <bold>5 GOP/s</bold> and <bold>14.8 GOP/s/W</bold> and <bold>42 GOP/s/W</bold> for the convolution operation using the SPP2D IP. Our LeNet-5 design achieves a similar throughput to Zhou and Jiang (2015) however using <inline-formula> <tex-math notation="LaTeX">$\mathbf {3.3}\times $ </tex-math></inline-formula> fewer DSPs and an even smaller memory and lookup table (LUT) footprint. 
The SPP2D based VGGNet-16 network has a latency of <bold>91.3 ms</bold>, which is <bold>79%, 97%, 17%</bold> and <bold>95%</bold> less than contemporary designs respectively, while running at a low power of <bold>298 mW</bold>, which is similar to the power level of these designs. The total processing time of our design with a parallelism factor of nine is <bold>3.93</bold> s, which is <bold>70%</bold> less than that in Ardakani <italic>et al.</italic> (2018) and <bold>24%</bold> less than that in Panchbhaiyye and Ogunfunmi (2021). The SPP2D based LeNet-5 and VGGNet-16 accelerators provide a low-latency design with reduced memory access, thus leading to a low-power design. As a result, SPP2D convolution is very well suited for hardware acceleration of DNNs.


I. INTRODUCTION
In recent years, Deep Neural Networks (DNNs) have become ubiquitous, and many variants exist to serve a myriad of applications. These applications range from object classification to natural language processing (NLP). The robustness of DNNs to distortions and simple geometric transformations makes them highly effective for processing images [1].
The associate editor coordinating the review of this manuscript and approving it for publication was Qi Zhou.
DNNs are becoming increasingly large and complex. Therefore, the hardware architectures that implement them need to keep pace with their growth while delivering cutting-edge throughput with minimum latency. The demand for low power and low memory access has also become more challenging with their growth. Hardware acceleration of DNNs on FPGAs and ASICs is more energy efficient and portable than GPU implementations (at least for inference) [2].
Hardware imposes tremendous limitations on the design of DNNs owing to their high computational complexity. Low latency concerns and the need for a high memory-access bandwidth also pose challenges in CNN accelerator designs. It is challenging to reap the complete benefits of the logic resources within the architecture even after implementing pipelining and time-efficient designs, which leads to a sophisticated architecture being under-utilized. Implementing DNN architectures involves meeting the peak performance in the face of the aforementioned limitations. Therefore, the hardware implementation of DNNs calls for extensive design exploration.
VOLUME 10, 2022 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Recently there has been a migration of inference applications to FPGA devices. FPGAs provide a flexible sandbox for the development and deployment of DNNs for inference. This provides an excellent platform for testing rapidly evolving DNNs [20]. FPGAs are also cost effective in terms of energy savings. They deliver better performance for the same amount of energy spent to perform DNN computations compared with GPUs [21]. For all these reasons, FPGA hardware acceleration is appropriate for inference. The CNN accelerator described in [7] is a combination of hardware and software design processes. In [17] the authors explored the design space of the available loop blocking parameters using the roofline model. They determined the optimum loop unrolling parameters based on design space exploration. The loop unrolling factors are variable in nature, and the authors select a fixed unroll loop parameter that tolerates a degradation of 5% in throughput and performance. In [22] we focused on increasing the computational throughput using a novel processing unit design called the nested processing element (NPE), using the variable unroll parameters calculated in [17] to achieve a throughput enhancement of 94%. Eyeriss [18] is a design in which the authors limit external memory accesses using a memory hierarchy. For this purpose, they used local scratch pads and global buffers. They achieved energy efficiency by adopting a reuse scheme known as row-stationary. Energy efficiency is the result of reducing external data access.
There are many designs that approach the problem from both the software and hardware fronts. Studies such as [23] have investigated the effect of quantization on the accuracy of models. Optimizations such as quantization minimize the storage of input feature maps and weights so that they can fit in the on-chip memory [24]-[26]. Our work in [27] explored the effect of fixed-point quantization on the accuracy of DNNs. The pruning and quantization of weights results in sparse networks. Sparse networks generate associative data that is inconvenient to handle. Their increased complexity makes it difficult to use traditional methods such as matrix multiplication and calls for tailored solutions. Therefore, there is a need for simple solutions that can exploit existing dense computation techniques. All of the techniques described above aim to reduce the number of memory transfers in creative ways.
A commonly used operation in Deep Neural Networks is 2-D convolution. The number of calculations required to implement a deep neural network is of the order of millions [28], [29]. Many solutions such as [30]-[33] exploit sparseness to overcome this impediment. Works such as [34]-[41] reduce the inputs and weights to binary values, leading to a smaller memory footprint and less memory transfer. Works such as [31], [42] use Winograd's filter technique to calculate convolutions [16]-[19] in a fast manner. A fast Fourier transform can also be used to calculate the convolutions in an expedient manner [43].
The convolution operation is at the heart of deep neural networks, and therefore research into alternate ways to implement this operation is relevant and opportune owing to the rising trend of deep learning. There are several methods for performing the convolution operation. Convolutions can be implemented using matrix multiplication and vector multiplication [2]. However, convolution using matrix multiplication introduces redundant operations by converting the input matrix into a Toeplitz matrix, and convolutions using vector multiplication take a long time if done serially or require large memory transfers if the operation is carried out in parallel. Solutions such as [18] propose the reuse of input weights to avoid fetching them from the off-chip memory. Reuse is an important mechanism in implementing DNNs, as the traditional sliding window method without any reuse results in fetching some input pixels and weights multiple times. In [16], the authors presented an architecture that aimed to reduce the latency of networks by implementing the reuse of input pixels. In [44] the authors designed a convolution method that removed redundant multiplication operations from both 1-D and 2-D convolution computations at the cost of increased addition operations. In [45], we performed a redundancy analysis of the weights in order to avoid repeatedly sending similar data from off-chip to on-chip. In [46], we implemented an architecture to perform SPP2D convolution. The design was able to perform a 2-D convolution of an input of size 5 × 5 with a kernel of size 3 × 3 in approximately 9 clock cycles. Although fast, the design had several limitations. We address those limitations in this work to deliver a complete SPP2D convolution-based CNN architecture.
In this paper we present an SPP2D based architecture on Xilinx's KC705 Kintex-7 evaluation board. We implemented the LeNet-5 and VGGNet-16 networks using the SPP2D architecture. This design can implement a 2-D convolution of an input of any size with a kernel of any size. Our architecture design addresses the two prominent concerns in this area of research: computational complexity, and the latency and power consumption due to memory movement. In this work, we make the following contributions:
• We have designed an architecture based on SPP2D convolution which has a generalized and improved processing engine that can compute the outputs of a 2-D convolution while avoiding the re-fetching of any input for the calculation of any partial product.
• The SPP2D architecture supports the convolution of an input of any size and a kernel of any size which leads to a network-agnostic architecture design.
• We implemented the LeNet-5 and VGGNet-16 networks using the SPP2D architecture. We implemented VGGNet-16 using parallelism factors of 1 and 9. The LeNet-5 design handled the same workload as [47] with a smaller resource footprint and a higher throughput than in [48], in which the authors presented a design with a reduced number of parameters. The SPP2D based VGGNet-16 design achieved a low total latency of 91.3 ms, which is lower than [11]-[13], [16], [18], [19], at 298 mW for a parallelism factor of 9.
This promising concept helps resolve latency issues previously experienced and observed in other popular designs such as [16], because it avoids the re-fetching of input pixels for the calculation of partial products. This results in a low overall number of operations required to compute a frame. The remainder of this paper is organized as follows: In Section III we explain the concept of the Single Partial Product 2-D Convolution, and we analyze it in Section IV. We explain the hardware architecture in Section V and the implementation process of LeNet-5 and VGGNet-16 in Section VI. This is followed by the results and conclusion in Sections VII and VIII, respectively.

III. SINGLE PARTIAL PRODUCT 2-D CONVOLUTION
A. 2-D CONVOLUTION OPERATION
The 2-D convolution operation can be described using a sliding window operation. Consider an input I of size N × N and a kernel W of size k × k. We can describe the output O of size M × M, where M = N − k + 1, as the output of the convolution of input I and kernel W with a stride of 1 (Figure 1). A lot can be gleaned about the movement of data from the sliding window operation. By analyzing the interaction of the input pixels with the kernel elements, we can determine how many times an input is required for calculating the output. This has broad implications for memory requirements and for the burden of data movement in the CNN architecture design process.
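As a point of reference, the sliding window operation can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware design: the function name `conv2d_sliding_window` and the test data are hypothetical, and the kernel is applied without flipping, as is usual for CNNs. For a stride of 1 and no padding the output dimension works out to N − k + 1, so a 5 × 5 input and a 3 × 3 kernel give a 3 × 3 output.

```python
def conv2d_sliding_window(inp, kernel):
    """Direct 2-D convolution (no kernel flip), stride 1, no padding."""
    n, k = len(inp), len(kernel)
    m = n - k + 1  # output dimension for stride 1 without padding
    out = [[0] * m for _ in range(m)]
    for r in range(m):
        for c in range(m):
            # Every output pixel re-reads a full k x k window of the input.
            out[r][c] = sum(inp[r + i][c + j] * kernel[i][j]
                            for i in range(k) for j in range(k))
    return out


# Hypothetical test data: pixel i_n holds the value n, for n = 0 .. 24.
inp = [[5 * r + c for c in range(5)] for r in range(5)]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
out = conv2d_sliding_window(inp, kernel)
print(out)
```

Because neighbouring windows overlap, most input pixels are read several times in this formulation, which is the behaviour the rest of this section quantifies.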

B. FREQUENCY OF REUSE
The convolution operation involves the reuse of the input pixels in the calculation of partial products. We define the number of times an input pixel is required to calculate the output as the frequency of reuse of that input pixel. The reuse of the input pixels in calculating the output pixels influences the design of the memory traffic infrastructure in hardware for acceleration. We denote the frequency of reuse as N(x), where x denotes the frequency itself. Note: here N(x) is not the same as N, which is the size of the input. The frequency of reuse for all the pixels in a 5 × 5 input convolved with a kernel of size 3 × 3 is shown in Figure 2a. The key takeaway is that the maximum frequency of reuse is N(9) and the minimum frequency of reuse is N(1). The same holds for the input shown in Figure 2b: the maximum frequency is N(9) and the minimum frequency is N(1). The pixels with the maximum frequency of reuse lie in the "middle" of the input and all the others lie on the boundary. Consider two more inputs of size 9 × 9 and 10 × 10 and a kernel of size 5 × 5 (Figure 2c and Figure 2d). The maximum frequency of reuse is N(25) and the minimum frequency of reuse is N(1). Some conclusions can be drawn from the above examples:
• This pattern of frequency of reuse is universal for a kernel of size k × k and an input of size greater than or equal to (2k − 1) × (2k − 1) with a stride of 1.
• The minimum frequency of reuse is N(1) and the maximum frequency of reuse is $N(k^2)$ for a kernel of size k × k and an input of size greater than or equal to (2k − 1) × (2k − 1) (Figure 3).
• The maximum frequency always lies in the "middle".
• All other frequencies are present at the periphery of the input, spanning (k − 1) pixels along the entire boundary, where k is the kernel dimension.
• The "corner" pixels are fixed in number; increasing the size of the input has no effect on the number of these pixels. The "edge" and "middle" pixels grow with the increase in input size (Figure 2b and Figure 2d). For example, in Table 1, we can see how the "edge" and "middle" pixels grow in number with the increase in the size of the input while the "corner" pixels remain fixed.
Figure 3 shows how the number of pixels in each region can be determined, and Table 2 summarizes the number of input pixels that correspond to each region ("middle", "edge" and "corner") for a convolution between an input of any size and a kernel of any size. The size of the "middle" region is given by $(N - 2(k-1))^2$. The "edge" pixels number $(N - 2(k-1))(k-1)$ along each edge, and there are four edges. Each of the four sets of "corner" pixels contains $(k-1)^2$ pixels, where N is the size of the input and k is the size of the kernel.
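The frequency-of-reuse pattern and the region sizes can be checked numerically with a short sketch. The helper names `reuse_frequency` and `region_sizes` are hypothetical, and the model assumes a stride of 1, no padding, and an input of at least (2k − 1) × (2k − 1).

```python
from collections import Counter


def reuse_frequency(n, k):
    """Frequency of reuse of every pixel of an n x n input (k x k kernel, stride 1)."""
    m = n - k + 1
    # Along one axis, position p is covered by the window offsets
    # max(0, p - k + 1) .. min(p, m - 1); the 2-D count is the product.
    axis = [min(p, m - 1) - max(0, p - k + 1) + 1 for p in range(n)]
    return [[axis[r] * axis[c] for c in range(n)] for r in range(n)]


def region_sizes(n, k):
    """Pixel counts of the middle, edge and corner regions (n >= 2k - 1)."""
    side = n - 2 * (k - 1)
    return {
        "middle": side ** 2,
        "edge": 4 * side * (k - 1),
        "corner": 4 * (k - 1) ** 2,
    }


freq = reuse_frequency(5, 3)
histogram = Counter(f for row in freq for f in row)
for row in freq:
    print(row)
print(dict(histogram))
print(region_sizes(5, 3), region_sizes(6, 3))
```

For the 5 × 5 input the three region counts add up to 25, and going from 5 × 5 to 6 × 6 changes only the "middle" and "edge" counts, in line with the observations above.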
The work done in [16] deploys the reuse of input pixels to combat the re-fetching of input pixels for the calculation of partial products. In their design, they calculated the output by preserving the partial products calculated between two consecutive outputs and only calculating the new partial products introduced due to the stride. They reused input pixels in their design; however, there is still room for further reuse. That work also reported a large latency as the size of the input increased. Based on our observations of the input pixels and their frequency of reuse, we can avoid re-fetching the input pixels completely. We designed an architecture that exploits this strategy to avoid the re-fetching of the input pixels and to extract the maximum use of each input pixel while it is in the on-chip memory or buffers. The strategy involves calculating all the partial products that an input pixel would generate while it is in the buffer or on-chip memory, before it is discarded. The process of simultaneous generation of partial products is described in the following paragraph.
Following the example of an input of size 5 × 5 being convolved with a kernel of size 3 × 3 (Figure 4), we organize the input pixels in descending order of their frequency of reuse and multiply them with the respective kernel weights necessary to produce their respective partial products, as shown in Table 3. In theory, we can aggregate the partial products being generated to arrive at the output in N × N iterations, where N is the size of the input; in Table 3 we arrive at the output in 25 aggregation iterations. There are a few design-critical takeaways from Table 3. The first takeaway is that 9 weights are used with 9 multipliers, and input pixel i12 occupies all these multipliers while all the other input pixels leave gaps, that is, they do not occupy all the multipliers. The second takeaway is that there are complementary sets of inputs that together can occupy all 9 multipliers; they are highlighted with similar colors in Table 3. This gives us an opportunity to combine the inputs in each complementary set and parse them into the 9 multipliers with the weights, as presented in Table 4. In this manner, we can engage all 9 multipliers, one for each kernel weight. The combined process of simultaneously generating partial products and combining the inputs into complementary sets helps us reduce the aggregation cycles from 25 to 9. The color scheme in Table 4 represents the aggregation pattern: the accumulation of all the partial products of the same color in Table 4 provides one output pixel. For example, the accumulation of partial products w0i0, w1i1, w2i2, w3i5, w4i6, w5i7, w6i10, w7i11 and w8i12 gives the output pixel o0 (Figure 4).
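The idea of producing every partial product of a pixel while that pixel is held on chip can be modelled in software as a scatter-style convolution. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the names `conv2d_reference` and `conv2d_single_fetch` and the test data are hypothetical, and the packing of complementary sets into shared cycles is not modelled, only the total number of partial products.

```python
def conv2d_reference(inp, kernel):
    """Sliding-window reference used only to check the result."""
    n, k = len(inp), len(kernel)
    m = n - k + 1
    return [[sum(inp[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(m)] for r in range(m)]


def conv2d_single_fetch(inp, kernel):
    """Visit each input pixel once and scatter all of its partial products."""
    n, k = len(inp), len(kernel)
    m = n - k + 1
    out = [[0] * m for _ in range(m)]
    products = 0
    for r in range(n):
        for c in range(n):
            pixel = inp[r][c]  # the only read of this pixel
            for i in range(k):
                for j in range(k):
                    orow, ocol = r - i, c - j  # output fed by weight (i, j)
                    if 0 <= orow < m and 0 <= ocol < m:
                        out[orow][ocol] += pixel * kernel[i][j]
                        products += 1
    return out, products


inp = [[5 * r + c for c in range(5)] for r in range(5)]  # pixel i_n = n
kernel = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]               # w0 .. w8 = 1 .. 9
out, products = conv2d_single_fetch(inp, kernel)
print(out == conv2d_reference(inp, kernel), products, products // 9)
```

For the 5 × 5 example the model generates 81 partial products in total; with 9 multipliers kept fully occupied, that corresponds to the 9 aggregation cycles quoted above.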

C. TYPES OF INPUT PIXELS
Examining the pattern of frequency of reuse for inputs of different sizes, we can classify the pixels into 3 broad categories: "middle", "edge" and "corner". In addition to the broader classification shown in Figure 3, we divided the pixels into finer categories. This gives rise to $(2k-1)^2$ categories of pixels, that is, 25 types of pixels for an input of any size and a kernel of size 3 × 3, and 81 types of pixels for an input of any size and a kernel of size 5 × 5. Figure 5 depicts the categories of the input pixels T0 to T24. The pixel of type T12 lies at the center of the input. The types of input pixels are classified to facilitate the organization of the input data for hardware processing. Pixel-type classification gives us a second level of discrimination of pixels in addition to the frequency of reuse. As the size of the input increases (and the kernel is 3 × 3), the number of pixels of type T12 with frequency N(9) also increases. This type of pixel needs to be combined with all 9 weights. Therefore, the T12 type pixel can be classified as the "middle". A large chunk of the input must be multiplied by all 9 weights. The second major categories of input pixels are the "edge" and "corner" pixels. As stated earlier, the number of pixel types depends on the size of the kernel k and is given by $(2k-1)^2$.
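A possible software model of the pixel-type classification is sketched below. The row-major numbering of the types is an assumption; it is chosen so that T12 is the center type for a 3 × 3 kernel. The function names `axis_class` and `pixel_type` are hypothetical.

```python
def axis_class(p, n, k):
    """Class of position p along one axis, in the range 0 .. 2k - 2."""
    if p < k - 1:            # one of the first k - 1 positions
        return p
    if p > n - k:            # one of the last k - 1 positions
        return p - (n - (2 * k - 1))
    return k - 1             # interior position


def pixel_type(r, c, n, k):
    """Type index of pixel (r, c), assuming row-major type numbering."""
    return axis_class(r, n, k) * (2 * k - 1) + axis_class(c, n, k)


# Type map of a 6 x 6 input for a 3 x 3 kernel.
for r in range(6):
    print([pixel_type(r, c, 6, 3) for c in range(6)])
```

In this model a 5 × 5 input has exactly one pixel of each of the 25 types, while a 6 × 6 input has four T12 pixels, which reflects how the "middle" types multiply as the input grows.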

D. SPP2D CONVOLUTION OPERATION
We can explain SPP2D convolution using an example. Consider an input of size 5 × 5 and a kernel of size 3 × 3, shown in Figures 6a and 6b. The convolution operation is done with a stride of 1. We highlight the input pixels using a color scheme that denotes the frequency of reuse. The kernel weights are used in an unfolded manner, and they can be aligned in the form of a vector multiplier. The input pixels are combined with the kernel weights in the optimized order illustrated in Figure 7a. Once the partial products are generated, they must be sorted into their respective output pixels. The aggregation order is given by the color scheme described in Figure 7b. For example, the output o0 is the aggregation sum{(3 × 1), (8 × 1), (7 × 3), (2 × 3), (2 × 2), (5 × 2), (5 × 1), (8 × 1), (6 × 2)} = 77, as seen in Figure 6c. The final output shown in Figure 6c is the result of the convolution of the input matrix (Figure 6a) and the kernel (Figure 6b).

IV. ANALYSIS OF THE SPP2D
We can calculate the number of times the inputs will be operated on by the sliding window technique by aggregating the frequency of reuse for each input pixel. This is a strong estimate of the number of cycles required to perform the convolution operation using the sliding window technique, and it is shown in Table 5, columns (a) and (d). The authors in [16] described an equation to calculate the number of clock cycles required to perform the convolution operation using their architecture; it is reported in columns (b) and (e) of Table 5. Following the example of an input of size 5 × 5 being convolved with a kernel of size 3 × 3, we can analyze the number of clock cycles required to complete the SPP2D convolution based on the size of the kernel. We obtained the clock cycle estimate using the following formulas:

$$\text{No. of inputs with } N(9) = (N - 2(k-1))^2 \tag{2}$$

$$\tfrac{1}{2}\,\bigl(\text{No. of inputs with } N(6) + \text{No. of inputs with } N(3)\bigr) = 2\,(N - 2(k-1))(k-1) \tag{3}$$

$$\tfrac{1}{4}\,\bigl(\text{No. of inputs with } N(4) + \text{No. of inputs with } N(2) + \text{No. of inputs with } N(1)\bigr) = (k-1)^2 \tag{4}$$

$$\text{SPP2D clock cycles} = (N - 2(k-1))^2 + 2\,(N - 2(k-1))(k-1) + (k-1)^2 \tag{5}$$

where N is the dimension of the input and N(x) is the frequency of reuse. In Equation 3, we use the factor 1/2 since the inputs with frequency N(6) and N(3) ("edge" pixels) are used together, and therefore we need to divide their sum by 2. In Equation 4, we use the factor 1/4 since the inputs with frequency N(4), N(2) and N(1) ("corner" pixels) are used together, and therefore we need to divide their number by 4 as we use them 4 at a time (N(2) is used 2× in a corner arrangement). Finally, the number of clock cycles for the SPP2D convolution is given by summing Equations 2, 3 and 4; the total is given by Equation 5. We can calculate the number of clock cycles for an SPP2D convolution of an input of any size with a kernel of any size and a stride of 1 using Equation 5. Table 5 compares the number of clock cycles of the SPP2D convolution with those of the sliding window convolution technique and of the architecture described in [16].
We have demonstrated that convolving a 5 × 5 input with a 3 × 3 kernel takes 9 clock cycles to generate an output of size 3 × 3 (column c in Table 5) and 25 clock cycles if the output is of size 5 × 5 (column f in Table 5). This can be scaled to inputs with higher dimensions, such as the input dimensions of the layers of the VGGNet-16 network. Columns a, b and c indicate the number of clock cycles required to perform a convolution operation on an input of a given dimension using the sliding window, [16], and SPP2D convolution (w/o padding), and columns d, e and f indicate the number of clock cycles required by the sliding window, [16], and SPP2D convolution (with padding). From Table 5 we see that the SPP2D convolution requires approximately 3× (columns (b/c), (e/f)) fewer clock cycles than [16] and 9× (columns (a/c), (d/f)) fewer clock cycles than the sliding window technique for both operation types, w/o padding and with padding.
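The clock cycle estimate can be evaluated with a small script. This is a sketch under stated assumptions: the SPP2D count is taken as the sum of the "middle" pixels, half of the "edge" pixels and a quarter of the "corner" pixels, using the region sizes from Section III-B; the sliding window count is taken as the sum of the frequencies of reuse; and padding is modelled by enlarging the input by `pad` pixels on every side. The function names are hypothetical, and the equation of [16] is not reproduced.

```python
def spp2d_cycles(n, k, pad=0):
    """SPP2D clock cycle estimate for an n x n input and k x k kernel, stride 1."""
    n = n + 2 * pad
    side = n - 2 * (k - 1)
    if side < 1:
        raise ValueError("input must be at least (2k - 1) x (2k - 1)")
    middle = side ** 2                   # each middle pixel fills all multipliers
    edge = 4 * side * (k - 1) // 2       # edge pixels are used two at a time
    corner = 4 * (k - 1) ** 2 // 4       # corner pixels are used four at a time
    return middle + edge + corner


def sliding_window_cycles(n, k, pad=0):
    """Sum of the frequencies of reuse: k * k reads for each output pixel."""
    m = n + 2 * pad - k + 1
    return m * m * k * k


for n, pad in [(5, 0), (5, 1), (224, 1)]:
    spp = spp2d_cycles(n, 3, pad)
    sw = sliding_window_cycles(n, 3, pad)
    print(n, pad, sw, spp, sw / spp)
```

Under these assumptions the SPP2D estimate always equals the number of output pixels, and the ratio to the sliding window count is k × k, which is 9 for a 3 × 3 kernel.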

V. HARDWARE ARCHITECTURE
The architecture of the SPP2D engine is described in Figure 8. The design is implemented on Xilinx's Kintex-7 KC705 board. The architecture has many features that are common to most computer architectures, such as pipelining. However, it also has some custom components that are designed to facilitate SPP2D convolution. It has input and weight buffers to store part of the input and the kernel weights on the board. They are read from external memory, which is the off-chip memory. The Input Stream block organizes the input pixels in the order which is optimal for processing them. It classifies inputs of all sizes into the types of pixels (which are T0 to T24 for a kernel of size 3 × 3 and T0 to T80 for a kernel of size 5 × 5). We have implemented VGGNet-16 and LeNet-5, which have kernels of size 3 × 3 and 5 × 5 respectively. The Mux Array takes the input pixels as they are delivered in the form of pixels of type T0 to T24 and selects the complementary sets from them. The output of the Mux Array is fed into the Multiplier along with the weights from the Weight Buffer.
The Decoder block sorts the output of the Multiplier. After sorting, the partial products are accumulated in the Accumulator block. After accumulation, they are stored in the Output Buffer and eventually written back to External Memory.

A. INPUT STREAM
The input stream block of the architecture controls the data-flow and parsing of the input pixels. The input is divided into rows, and the order in which it needs to be parsed is determined by the frequency of reuse of the pixels, such as N(3), N(6) and N(9). The input pixels have the property of being fixed or variable in number, and the strategy for parsing the input pixels is based on this property. We align the inputs so that rows 0, N − 2, 1 and N − 1 are processed first, and within them the pixels that are fixed in number are processed before the pixels that are variable in number. Once we have processed rows 0, N − 2, 1 and N − 1, we can process rows 2 to N − 3. The input parsing scheme is illustrated in Figure 9c. Figure 10 shows the optimized input stream order for an input of 5 × 5 and a kernel of 3 × 3. The order of the optimized stream is based on using the rows with fixed pixels first and the rows with variable types of pixels last. The parsing scheme is similar for pixels within a row: the fixed pixels are used first, followed by the variable pixels. It takes 9 compute stages to arrive at the output. The strategy of processing a fixed number of pixels followed by the variable number of pixels in the complementary rows can be adopted to process an input that is convolved with a kernel of any size. Consider an input of any size that is convolved with a kernel of any size (Figure 11a): rows 0, 1, N − 2 and N − 1 are processed first (Figure 11b), followed by rows 2 to N − 3 (Figure 11c). For these rows, the fixed pixels are processed first, followed by the variable pixels. This makes the process of organizing the input data general, irrespective of the size of the input and of the kernel with which it is being convolved.
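The row ordering used by the input stream can be written down as a short sketch for a 3 × 3 kernel. The function names are hypothetical. Applying the same ordering to the columns within a row is an assumption made for illustration; the text states only that the fixed pixels precede the variable ones, and the actual stream is defined by Figure 10.

```python
def boundary_first_order(n):
    """Indices 0, n-2, 1, n-1 first, then 2 .. n-3 (3 x 3 kernel)."""
    if n < 5:
        raise ValueError("a 3 x 3 kernel needs an input of at least 5 x 5 here")
    return [0, n - 2, 1, n - 1] + list(range(2, n - 2))


def pixel_stream(n):
    """Pixel coordinates in stream order.

    Assumption: columns within a row follow the same fixed-then-variable order.
    """
    order = boundary_first_order(n)
    return [(r, c) for r in order for c in order]


print(boundary_first_order(5))
print(boundary_first_order(6))
print(pixel_stream(5)[:5])
```

The rows that hold only a fixed number of boundary pixels (0, N − 2, 1, N − 1) come out first, and the rows that multiply with the input size follow.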
The input stream organization works by classifying the pixels into the types of pixels shown in Figure 5. Given any size of input, the input stream is organized into 25 categories for a kernel of size 3 × 3. In general, the input is classified into $(2k-1)^2$ types for a kernel of size k. In Figure 12a, we see an example of a 5 × 5 input (Figure 12b) classified into the 25 types of pixels. The parsing of complementary rows 0 and N − 2 of an input of size 5 × 5 is described in Figures 12a and 12c. In Figure 12d we see an input of size 6 × 6 (Figure 12e) classified into the 25 types of pixels. The parsing of complementary rows 0 and N − 2 of an input of size 6 × 6 is described in Figures 12d and 12f. The input pixels are parsed in a generalized manner. The pixel stream corresponds to the compute stages defined in Figure 10, as can be seen in Figure 12a. There are 9 compute stages for an input of size 5 × 5. As the input size increases, the compute stages will have echoes, as the types of pixels that are variable in nature grow in number. Looking at Figure 12d, we can see that compute stages 1 and 2 are followed by compute stages 3 and 4. Compute stages 3 and 4 process the complementary pixel set {T2, T17}. There are 2 pairs of pixels in this complementary set. This repeats for compute stages 7 and 8. Furthermore, rows 3 and 4 have multiple pixels of each of the types T10, T11, T12, T13 and T14. Therefore we can see the entire row echoing in the computation.
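Whether a group of pixel types forms a complementary set can be checked in software by listing the kernel weights that each type multiplies. The sketch below assumes row-major type numbering (T12 at the center for a 3 × 3 kernel) and row-major weight numbering w0 to w8. The helper names are hypothetical, and the corner quadruple used in the example is derived from this model rather than quoted from the paper.

```python
def weights_used(pixel_type, k):
    """Indices of the kernel weights that a pixel of this type is multiplied by."""
    side = 2 * k - 1
    row_class, col_class = divmod(pixel_type, side)

    def axis_weights(cls):
        if cls < k - 1:                  # near the leading boundary
            return range(0, cls + 1)
        if cls > k - 1:                  # near the trailing boundary
            return range(cls - k + 1, k)
        return range(k)                  # interior: every weight along this axis

    return {i * k + j for i in axis_weights(row_class)
            for j in axis_weights(col_class)}


def is_complementary(types, k):
    """True if the types use disjoint weights that together cover all k * k."""
    used = [weights_used(t, k) for t in types]
    total = sum(len(u) for u in used)
    return total == k * k and set().union(*used) == set(range(k * k))


print(sorted(weights_used(2, 3)), sorted(weights_used(17, 3)))
print(is_complementary([2, 17], 3), is_complementary([0, 3, 15, 18], 3))
```

In this model T2 uses w0 to w2 and T17 uses w3 to w8, so the pair {T2, T17} keeps all 9 multipliers busy in a single cycle.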

B. MUX ARRAY
The mux array processes its input only if all the inputs are present and available. The number of multiplexers in the mux array is given by $k^2$, where k is the kernel size. In general, we design the mux array based on the largest kernel size in the architecture. For the example of a 5 × 5 input being convolved with a 3 × 3 kernel, the mux array has nine 9-to-1 multiplexers. The output of the mux array delivers the input pixels to the multipliers. The inputs to all the muxes in the Mux Array are in0 to in8, as seen in Figure 13. The order in which the inputs are delivered to the muxes follows the optimized input stream described in Table 4 and depicted in Figure 13.
Suppose the pixels from complementary rows 0 and N − 2 (Figure 9a) are available at the output of the input stream block. These pixels are arranged at the in0 inputs of all the muxes of the Mux Array (Figure 13). The next set of complementary rows is arranged at the in1 inputs of all the muxes of the Mux Array. The entire arrangement of the Mux Array is based on the parsing strategy described in Section V-A. The design is such that we can send a select signal to the Mux Array that steps through all the mux inputs in ascending order.

C. MULTIPLIER
The multiplier block shown in Figure 14 is fairly straightforward. Its function is to accept the input pixels delivered by the mux array and multiply them by the weights. The partial products generated are combined in a bus for sorting and sent to the decoder block. Booth multipliers were used to perform the multiplication.
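For readers unfamiliar with Booth multiplication, a bit-level software sketch of the basic radix-2 scheme follows. The paper states only that Booth multipliers were used; the radix, the 8-bit default width and the function name `booth_multiply` are assumptions made for illustration, and the code does not describe the actual DSP48-based implementation.

```python
def booth_multiply(multiplicand, multiplier, bits=8):
    """Radix-2 Booth multiplication of two signed integers.

    `multiplier` must fit in `bits`-bit two's complement.
    """
    if not -(1 << (bits - 1)) <= multiplier < (1 << (bits - 1)):
        raise ValueError("multiplier does not fit in the given width")
    pattern = multiplier & ((1 << bits) - 1)  # two's-complement bit pattern
    product = 0
    previous_bit = 0  # implicit bit to the right of the least significant bit
    for i in range(bits):
        bit = (pattern >> i) & 1
        if bit == 1 and previous_bit == 0:
            product -= multiplicand << i  # start of a run of ones: subtract
        elif bit == 0 and previous_bit == 1:
            product += multiplicand << i  # end of a run of ones: add
        previous_bit = bit
    return product


print(booth_multiply(-7, 13), booth_multiply(25, -128), booth_multiply(3, 127))
```

The recoding replaces each run of ones in the multiplier by one subtraction and one addition, which is why it handles signed operands without a separate sign correction step.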

D. DECODER AND ACCUMULATOR
The decoder component of the architecture sorts the outputs generated by the multiplier. Once the partial products are generated, they must be combined with the other partial products that belong to the same output pixel. The sorting process is described in Figure 7b. The hardware used to sort the output of the convolution of an input of size 5 × 5 with a kernel of size 3 × 3 is described in Figure 15. It consists of nine 9-to-1 multiplexers that accept all nine partial products (pp[8:0]) generated by the multipliers. The select signals for these muxes are provided by the input stream block. The mux select codes are known from the analysis done in Section III-D. Once the partial products are sorted into z0 to z8, they can be aggregated to obtain the final outputs o0 to o8.
If the input size increases, the output size will also increase, and thus sorting the partial products becomes more complicated. Consider an input that generates an output larger than 3 × 3. In Figure 16, we see that compute stages 1, 2 and 3, which are also described in Figure 10, generate a pattern of partial product delivery to the designated output locations. We can see that compute stages 1 and 2 lead to partial products that occupy the output corners. These positions are fixed for these compute stages. However, compute stage 3 can be variable, and the output pixels will slide along the columns to accommodate the pixels generated with the complementary set {T2, T17} (Figure 16). This is similar for compute stages 4, 5 and 6 (Figure 17). Once we have dealt with rows 0, 1, N − 1 and N − 2, we process rows 2 to N − 3. Rows 2 to N − 3 engage compute stages 7, 8 and 9 (along with their echoes). These rows contain pixels that can grow in number with the size of the input. For compute stages 7 and 8 (and their echoes) we populate output pixels on the edge (Figure 18). Therefore, the output pixels for compute stages 7 and 8 slide row-wise. For compute stage 9 (and its echoes) the partial products slide along both rows and columns. This pattern holds for an input of any size being processed by a kernel of any size. For an input of any size and a kernel of any size, the sort pattern for rows 0 and N − 2 is given in Figure 19. The hardware accelerator for an input of any size and a kernel of any size produces $k^2$ partial products at a time ($pp_0$ to $pp_{k^2-1}$). We see that the fixed pixels generate partial products that go to fixed locations, and the variable pixels generate partial products that contribute to variable locations. The sorting pattern for rows 1 and N − 1 is given in Figure 20. The sorting pattern for rows 2 to N − 3 is given in Figure 21.
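The destination of each partial product can be derived from the pixel coordinates alone, which gives one way to picture the select codes used by the decoder. The sketch below is a software model only; `decoder_routes` is a hypothetical name, and weights and outputs are assumed to be numbered in row-major order (w0 to w8, o0 onwards).

```python
def decoder_routes(r, c, n, k):
    """Map weight index -> flat output index for input pixel (r, c).

    The product of pixel (r, c) and weight (i, j) belongs to output
    (r - i, c - j), provided that position exists in the m x m output.
    """
    m = n - k + 1
    routes = {}
    for i in range(k):
        for j in range(k):
            orow, ocol = r - i, c - j
            if 0 <= orow < m and 0 <= ocol < m:
                routes[i * k + j] = orow * m + ocol
    return routes


# 5 x 5 input: the center pixel i12 feeds all nine outputs.
print(decoder_routes(2, 2, 5, 3))
# 6 x 6 input: the two top-row "edge" pixels of the same type feed
# destinations shifted by one column.
print(decoder_routes(0, 2, 6, 3), decoder_routes(0, 3, 6, 3))
```

In the 6 × 6 case the two pixels of the same type use the same three weights, but their destinations are shifted by one output column, which matches the sliding behaviour described for the variable compute stages.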

VI. IMPLEMENTATION
We implemented LeNet-5 [49] and VGGNet-16 using the SPP2D architecture on Xilinx's Kintex-7 KC705 development board. We implemented the LeNet-5 design without any parallelism as it is a small network. The VGGNet-16 network was implemented both with and without parallelism. The LeNet-5 architecture consists of two pairs of convolutional and average pooling layers, followed by two fully-connected layers (Figure 22). The design is implemented at a frequency of 200 MHz. The design has 25 DSP48s owing to the size of the kernel (k × k), which is 25. The other resources used in the design were 1901 LUTs, 3073 FFs, and 8 BRAM blocks.
The VGGNet-16 architecture consists of 13 convolutional layers and 3 fully connected layers, as illustrated in Figure 23. The VGGNet-16 design is implemented at a frequency of 100 MHz. We implement the design with a parallelism factor of 1 (Figure 24) and of 9 (Figure 25). The design with a parallelism factor of 9 has 9 parallel SPP2D engines, so it can process 9 kernels of size 3 × 3 simultaneously, whereas the design with a parallelism factor of 1 processes one kernel at a time.

VII. RESULTS
The throughput and performance of LeNet-5 and VGGNet-16 are described in Table 6 and Table 7, respectively. In Table 8 and Table 9 we compare the results for LeNet-5 and VGGNet-16 with other contemporary designs. In Table 6, we report the following: the parameters required for each layer, the multiply-accumulate operations (MACs) for each layer and combination, the execution time for each layer and combination, the performance (the reciprocal of the execution time), and the GOP/s for each layer and combination. Finally, we report GOP/s/W for the entire system and GOP/s/W for the SPP2D IP. The performance metric indicates how much faster our design is than a reference machine with an execution time of 1 sec. Table 6 breaks the results down into individual layers as well as the following combinations: both convolution layers (Conv1, Conv2), both fully connected layers (F1, F2), and all the layers (Conv1, Conv2, F1, F2). For both convolution layers, the design gives a throughput of 14.8 GOP/s/W given the on-chip power, or 42.7 GOP/s/W if we consider the power consumption of the SPP2D IP without the peripherals. We considered the throughput for the convolution layers since the SPP2D architecture applies to the convolutional layers. The power consumption of the system is 0.337 W and that of the SPP2D IP is 0.117 W. The SPP2D implementation of the LeNet-5 architecture is compared with the two designs described in [47], [48] in Table 8. The design in [48] is based on reducing the parameters in the CNN design, which decreases the footprint of the design while also increasing the throughput. The design in [47] is implemented at 150 MHz and has a throughput of approximately 14.8 GOP/s. Our design uses fewer DSPs than [47] with a comparable workload and throughput (GOP/s). Compared to [48], our design achieves a better GOP/DSP. We use more DSP48s than [48], but this is a small concession given the abundance of DSP48 blocks on the device. The on-chip power consumed by the SPP2D architecture is 0.337 W, which is competitive with current designs.
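As a quick consistency check on the LeNet-5 figures above, the two efficiency numbers can be converted back into a throughput by multiplying by the corresponding power. The sketch below uses only values reported in this paper (14.8 and 42.7 GOP/s/W, 0.337 W and 0.117 W, 25 DSP48s); the helper name `implied_gops` and the per-DSP figure derived from it are illustrative assumptions, not values taken from Table 6 or Table 8.

```python
SYSTEM_POWER_W = 0.337  # reported on-chip power of the whole system
IP_POWER_W = 0.117      # reported power of the SPP2D IP alone
NUM_DSP48 = 25          # reported DSP48 count for the 5 x 5 kernel


def implied_gops(gops_per_watt, watts):
    """Throughput (GOP/s) implied by an efficiency figure and a power figure."""
    return gops_per_watt * watts


# Both efficiency figures describe the same convolution workload, so they
# should imply roughly the same throughput (about 5 GOP/s).
system_gops = implied_gops(14.8, SYSTEM_POWER_W)
ip_gops = implied_gops(42.7, IP_POWER_W)

# Hypothetical derived metric: throughput per DSP48 slice.
gops_per_dsp = system_gops / NUM_DSP48

print(f"system: {system_gops:.2f} GOP/s, IP: {ip_gops:.2f} GOP/s")
print(f"per DSP48: {gops_per_dsp:.3f} GOP/s")
```

Both products land close to 5 GOP/s, which agrees with the convolution throughput quoted in the abstract; the small difference between them comes from rounding in the reported figures.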
We compare our VGGNet-16 implementation with several contemporary designs. The results of this implementation are described in Table 7 and Table 9. Table 7 describes the throughput and performance of each layer of the VGGNet-16 network as well as of the complete network. The total latency of the network is 91.3 ms. The number of operations the network has to perform per frame is 15.5 GOP. These metrics remain the same for the network with parallelism factors of 1 and 9. The parallelism allows multiple kernels to be processed simultaneously against the same input. For a parallelism factor of 1, the throughput is 0.448 GOP/s with an execution time of 34.6 secs. For a parallelism factor of 9, the throughput is 3.96 GOP/s with an execution time of 3.93 secs. Finally, we compare the performance of the SPP2D based implementation with other relevant designs [11]-[13], [15], [16], [18], [19] in Table 9. The SPP2D based VGGNet-16 design has a low latency of 91.3 ms, which is 79%, 97%, 17% and 95% less than that of the conventional designs [11], [15], [16], [18], respectively, while the power consumption is 291 mW and 298 mW for parallelism factors of 1 and 9, respectively. The power consumption is low and comparable to or better than that of other designs. Our design has a lower overall operation count per frame of 15.5 GOP since it aims to reduce the re-fetching of data and to limit the number of calculations. Thus, the throughput of our network is lower than that of other designs (0.448 GOP/s, 0.0289 frames/s for a parallelism factor of 1 and 3.96 GOP/s, 0.225 frames/s for a parallelism factor of 9). The total processing time of our design with a parallelism factor of 9 is 3.93 secs, which is 70% less than [16] and 24% less than [15]. The advantage of our design is that we avoid re-reading input pixels, thus reducing the power consumption caused by heavy memory traffic. However, as a result, we need to wait for all the partial products to be calculated before the output is available. This puts a lower limit on the latency of our design for each input size. In addition, we need a buffer the size of the largest output feature map. This can strain the SRAM budget of a design. SPP2D based architectures are suitable for an input and kernel of any size, and we have demonstrated this for a stride of 1. We plan to update the SPP2D architecture to incorporate variable strides in a future iteration. In summary, we demonstrated that an SPP2D based hardware accelerator can deliver low latency with low power and can be a competitive choice for implementing deep neural networks.
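The relationship between the operation count, the execution time, and the throughput for the two parallelism settings can be reproduced with a few lines of arithmetic. The sketch below uses the figures reported above (15.5 GOP per frame, 34.6 secs and 3.93 secs); the function name `throughput_gops` is hypothetical, and because the reported inputs are rounded, the recomputed values match the reported throughputs only approximately.

```python
OPS_PER_FRAME_GOP = 15.5  # reported operations per frame
EXEC_TIME_P1_S = 34.6     # reported execution time, parallelism factor 1
EXEC_TIME_P9_S = 3.93     # reported execution time, parallelism factor 9


def throughput_gops(ops_gop, exec_time_s):
    """Throughput in GOP/s for a given workload and execution time."""
    return ops_gop / exec_time_s


gops_p1 = throughput_gops(OPS_PER_FRAME_GOP, EXEC_TIME_P1_S)
gops_p9 = throughput_gops(OPS_PER_FRAME_GOP, EXEC_TIME_P9_S)

# Measured speedup from running nine SPP2D engines in parallel.
speedup = EXEC_TIME_P1_S / EXEC_TIME_P9_S

print(f"P=1: {gops_p1:.3f} GOP/s, P=9: {gops_p9:.2f} GOP/s")
print(f"speedup: {speedup:.2f}x with 9 engines")
```

The speedup comes out slightly below the ideal factor of 9, which is what one would expect if part of the execution time does not scale with the number of engines.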

VIII. CONCLUSION AND FUTURE WORK
We have presented the SPP2D convolution algorithm, which is a fast and efficient method of computing 2-D convolutions. We have demonstrated that it can process an input and kernel of any size. SPP2D based architectures deliver a more generalized design for implementing any 2-D convolution. Compared to the reuse calculation described in [11], [15], [16], [18], [47], they provide substantial savings in re-fetching input pixels for computing partial products, as well as a low-latency and high-throughput solution for calculating convolutions. The SPP2D based LeNet-5 and VGGNet-16 validate the concept introduced in [46] and presented in detail in this paper. In future work, we plan to extend this research to other networks such as MobileNet and ResNet-50 to demonstrate that the SPP2D architecture is network agnostic and adaptable to any design.
ANAAM ANSARI (Graduate Student Member, IEEE) received the bachelor's degree in electronics engineering from the University of Mumbai. She is currently pursuing the Ph.D. degree with the Electrical Engineering Department, Santa Clara University. Her work involves researching new and efficient hardware architectures for convolutional neural networks. Immediately after her undergraduate studies, she enrolled in a graduate program in electrical engineering at San José State University (SJSU). She spent her graduate career focusing on wireless communications theory. She has participated in several projects ranging from robotics to wireless communication using software-defined radios. Before becoming a full-time Ph.D. student, she also worked at LSI Corporation as a Systems Engineer with the SerDes Architecture Team. She has also interned at autonomous driving companies, such as Velodyne and Waymo. Her current research interests include new architecture paradigms for performing convolutional neural networks.