Energy-Efficient High-Speed ASIC Implementation of Convolutional Neural Network Using Novel Reduced Critical-Path Design

Convolutional neural networks (CNNs) play an important role in several machine learning tasks related to speech, image, and video processing. The increasing demand for faster processing in real-time applications requires high-speed implementation of CNNs. However, CNNs generally suffer from high latency due to the computationally intensive convolutional layers. While state-of-the-art architectures provide an efficient dataflow for the convolutional operations, this paper proposes a hardware-efficient, high-speed convolution block for ASIC implementation of the CNN algorithm. The proposed convolution block is designed using a novel bit-level-multiply-accumulator (BLMAC) with a modified Booth encoder and a Wallace reduction tree. The critical path of the overall architecture is significantly shortened by the time-optimized implementation of the proposed BLMAC, which is the main component of the convolution process. A critical path analysis and a dedicated dataflow strategy are also provided to demonstrate the acceleration achieved by the proposed design. The proposed architecture was synthesized using Synopsys Design Compiler to evaluate its speed. The ASIC synthesis results of the proposed architecture using a 65-nm standard cell library show at least a 53.2% reduction in latency, a 52.2% reduction in area-delay product, and a 54.2% reduction in power-delay product compared to the state-of-the-art architecture.

The CNN is typically composed of multiple convolutional layers followed by a pooling layer and a fully connected (FC) layer [8]. A large number of convolutional layers is often required in a CNN in order to increase the functional accuracy [9]. However, as the number of convolutional layers increases, so do the computational complexity and the cost of implementation. Since convolution accounts for more than 95% of the CNN operations [10], several efforts have been made to accelerate the convolution process efficiently. Ardakani et al. [11] propose an efficient architecture to realize the convolutional layers of VGGNet. In addition, they coherently integrate the dataflow of the convolutional process with the computational core of the FC layer of the state-of-the-art VGG architecture. Based on their validated dataflow, this paper proposes an area-time-power efficient processing element (PE) for convolutional layer computations. The proposed architecture employs a modified Wallace reduction tree (WRT) and a modified Booth encoder (MBE) with bit-level pipelining to accelerate the processing. The contributions of this paper are outlined as follows:
1) A novel high-speed bit-level-multiply-accumulator (BLMAC) based on the modified WRT and MBE is designed to reduce the latency of CNN operations.
2) A novel dual-clock strategy is proposed to improve the hardware utilization as well as the overall latency, where the MAC operations are accelerated by a clock with a short period while their accumulation operates with a longer clock period.
3) An area-time-power efficient hierarchical structure of the processing element with bit-level error correction is proposed.
4) An efficient dataflow strategy is proposed to utilize neurons efficiently with lower latency.
5) Precise critical path analysis is performed, and the critical path delay is significantly reduced to accelerate the computation of the convolutional layer.
6) The effectiveness of the proposed architecture is demonstrated by the results estimated through design synthesis.
The rest of the paper is organized as follows. Section II introduces the CNN structure and research background for several CNN architectures. The VGG16 network and its convolutional behavior are also described in this Section. The architectures of the proposed BLMAC and processing element are presented in Section III. In Section IV, the dataflow design and latency analysis of the proposed architecture are presented. Section V provides the critical path analysis to find the appropriate timing constraints for effective hardware utilization and high-speed processing of convolution operation. In Section VI, the result of ASIC synthesis of the proposed architecture is discussed and compared with the existing design. Applications of the proposed architecture are also discussed in this Section. Finally, Section VII concludes this paper.

II. ALGORITHM AND ARCHITECTURE OF CNN
This section summarizes the operation of the CNN algorithm and the 16-layer VGGNet architecture, namely VGG16. A widely used reference hardware architecture for VGG16 implementation is also introduced.

A. CNN AND VGG16 ARCHITECTURES
The CNN extracts the relevant features through a series of convolutional layers, max pooling layers, and fully connected (FC) layers. After passing through these layers, the input is turned into a single vector, which is easy to process for tasks such as recognition and prediction. The convolutional layer is the core building block of the CNN since it extracts feature information by convolving the output of the previous layer with the weights of the current layer. From a functional point of view, the convolution process applies a set of learnable filters called kernels to the input to create a feature map [12]. Fig. 1 depicts the typical convolutional layer functionality of passing an input image through the kernel to produce an output image. The input and output images of the convolution operation are called the input feature map (IFMap) and the output feature map (OFMap), respectively. Considering the 3×3 2-D kernel shown in Fig. 1, the width and height of the IFMap are 2 pixels larger than those of the OFMap, which takes into account the padding of the boundary. The pooling layer is inserted after each convolutional layer; it reduces the number of parameters and speeds up computation by down-sampling adjacent pixels to retain the most relevant features. Features extracted by the convolution and pooling layers are flattened into a single vector whose elements encode details of the input image as high-level features. While the extracted high-level features could be connected directly to the output layer, an FC layer is finally used to map the extracted features to the desired outputs.
VGG16 is a CNN architecture proposed by Simonyan and Zisserman in 2015 [13]. At the ILSVRC 2014 competition, VGG16 showed outstanding results, taking second place in the overall event. The model achieved 92.7% top-5 test accuracy on ImageNet, a dataset of over 14 million images belonging to 1000 classes. The VGG16 network architecture is summarized in Table 1 [14]. The VGG16 architecture contains 13 convolutional layers grouped into five convolution sets. Each set uses 3×3 convolution filters. In Table 1, Conv(3×3×m, N) denotes the convolution of the input image with N kernels of size 3×3×m to create N output images. After each convolution set, a 2×2 max pooling is performed to reduce the output image size. Three FC layers follow the last pooling layer, and the final Softmax function converts the values to indicate their relative importance.
In [15], Liu et al. demonstrated that VGG16 has the fastest convergence behaviour, the shortest training time, and the highest functional accuracy among the CNN models they compared. Based on this comparative study, VGG16 is recommended as one of the most suitable models for the realization of CNN.

B. CONVOLUTION OPERATION
As the name of the algorithm implies, the convolutional layer is the most important and computation-intensive module in VGG16 [16]. The convolution operation extracts features of the image by multiplying each element of the kernel by the corresponding pixel of the input image and adding the results together. The kernel moves across the whole IFMap with a predetermined stride. To avoid loss of boundary information, the VGG16 architecture pads zeros around the boundary of the IFMap.
Algorithm 1 shows a pseudo code of a 2-dimensional operation in the convolutional layer, where element-wise multiply-accumulate (MAC) operation between the IFMap and the kernel is depicted. Note that the 2-D kernel size and stride in VGGNet are fixed to 3×3 and one pixel for the convolutional process, respectively.

Algorithm 1: Pseudo Code for 2-D Operation in Convolution Layer
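The element-wise MAC loop described above can be sketched in Python as follows. This is a minimal illustration, assuming a zero-padded IFMap, a 3×3 kernel, and a stride of one pixel; the function names `zero_pad` and `conv2d_3x3` are hypothetical and are not taken from the original listing.

```python
def zero_pad(image):
    """Add a one-pixel border of zeros around a 2-D image (list of rows)."""
    width = len(image[0])
    padded = [[0] * (width + 2)]
    padded += [[0] + list(row) + [0] for row in image]
    padded += [[0] * (width + 2)]
    return padded


def conv2d_3x3(ifmap, kernel):
    """Slide a 3x3 kernel over an already padded IFMap with stride 1."""
    k = 3
    out_h = len(ifmap) - (k - 1)   # OFMap is 2 pixels smaller in each dimension
    out_w = len(ifmap[0]) - (k - 1)
    ofmap = [[0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            acc = 0
            for i in range(k):          # kernel row
                for j in range(k):      # kernel column
                    acc += ifmap[r + i][c + j] * kernel[i][j]  # one MAC
            ofmap[r][c] = acc
    return ofmap


image = [[1, 2], [3, 4]]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
print(conv2d_3x3(zero_pad(image), identity))
```

Each output pixel costs nine MAC operations, which is why the MAC unit dominates the cost of the layer.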
As shown in Algorithm 1, CNN is fundamentally based on MAC computation. Therefore, from a hardware implementation point of view, the optimization of the MAC has a strong potential to speed up the entire CNN processing. Many studies have proposed efficient hardware implementations based on various CNN structures to achieve speed improvement [11], [17]-[20]. This paper focuses on optimizing the MAC computation of CNN and the PEs that contain the MAC unit. Therefore, a reference architecture of CNN is needed to verify the proposed MAC and the PE architecture. Ardakani et al. [11] have proposed a novel computation of the convolutional layer using VGGNet with a 2-D kernel size of 3×3, and demonstrated better performance than other approaches [17]-[20]. We have therefore used the structure of [11] as a reference architecture to verify the performance of the proposed PE. However, since CNN algorithms are fundamentally based on MAC computations, the proposed PE can also be applied to other types of CNN implementations.

III. ARCHITECTURE OF PROPOSED PROCESSING ELEMENT

A. REFERENCE ARCHITECTURE OF THE CONVOLUTION ENGINE
As can be seen in the literature, CNN hardware accelerators are of two basic types, namely, (i) the fine-grained structure and (ii) the coarse-grained structure [21]. While the fine-grained implementation uses a large number of small PEs, the coarse-grained architecture uses fewer PEs with more computing power. In this work, we propose a hierarchical architecture that can improve the overall performance by combining coarse-grained PEs with a finer granularity of operation within the MAC. Fig. 2 shows the reference architecture of a convolution engine (CE) using the proposed PE architecture. The CE consists of a weight generator and three PEs for 3×3 convolution, where each PE performs the MAC computations within one kernel window. Each of the PEs receives an input pixel and its weight from the weight generator. The dataflow indicating the order in which the input pixels and weights are fed in each clock cycle is described in Section IV.

B. ARCHITECTURE OF PROPOSED BIT-LEVEL MAC
Since the MAC operations comprise the core computation of the PEs, in order to overcome the low throughput drawback of the existing PE architectures, we propose here a speed-optimized bit-level MAC, namely BLMAC. The MAC unit generally performs multiplication followed by successive accumulation. In an integrated form, both the operations can be combined by realizing it through three different sections: (i) partial product (PP) generation, (ii) partial product addition, and (iii) output accumulation. We present here a novel architecture of MAC unit which is the core computing unit of the proposed PE as shown in Fig. 3. The MAC unit multiplies the pixel values in each row of the IFMap by a weight value, and then adds the product value to the accumulated result. Specifically, the Booth encoder can be used to generate the partial products (PPs) corresponding to the multiplication of an input pixel with a kernel value. The modified Booth encoder (MBE) proposed in [22] is used to reduce the number of PPs in half, i.e., m/2 instead of m, for m-bit multiplication. The performance of the CNN algorithm converges in 16-bit resolution according to [16], thus in the proposed BLMAC, m is set to 16. The pseudo code for generating eight PPs is given in Algorithm 2. Also, Fig. 4 shows the bits of each of the eight PPs generated by the 16-bit MBE.
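To illustrate how Booth recoding halves the number of partial products, the following Python sketch models radix-4 (modified) Booth recoding at the integer level. It is a behavioral model only, not the gate-level MBE of [22] or the paper's Algorithm 2, and the function names are hypothetical.

```python
def booth_radix4_digits(y, bits=16):
    """Recode a signed `bits`-wide multiplier into bits/2 digits in {-2..2}."""
    y &= (1 << bits) - 1            # two's-complement bit pattern of y
    digits = []
    prev = 0                        # implicit bit y[-1] = 0
    for i in range(0, bits, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def booth_partial_products(x, y, bits=16):
    """Return the bits/2 partial products whose sum equals x * y."""
    digits = booth_radix4_digits(y, bits)
    return [d * x * 4**i for i, d in enumerate(digits)]


pps = booth_partial_products(12345, -6789)
print(len(pps), sum(pps) == 12345 * -6789)
```

For m = 16 the recoding yields eight digits, so only eight partial products enter the reduction tree instead of sixteen.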
Algorithm 2: Pseudo Code for Partial Product Generation of 16-bit MBE

The product value is obtained by adding up all eight PPs produced by the MBE. The simplest way to do this is to use an adder tree composed of seven ripple carry adders (RCAs). However, the propagation delay of a k-bit RCA increases proportionally with k, since

$$T_{RCA} = k \cdot T_{FA} \qquad (1)$$

where $T_{FA}$ is the propagation delay of a full adder (FA). Therefore, the delay of multiplication increases proportionally with the number of PPs. In order to accelerate processing in the proposed architecture, the Wallace reduction tree (WRT) is used to reduce the m/2 PPs to two PPs [23]. The main task of the WRT is to group two or three bits at the same bit position and use a half adder (HA) or a full adder (FA) to reduce them to two bits over two consecutive bit positions. This process continues until the PPs are reduced to two words by the WRT. The dot diagram of the WRT for reducing the eight PPs generated by the 16-bit MBE to two words over four stages is shown in Fig. 5(a).

The proposed BLMAC architecture does not employ an RCA to add the final two words generated by the WRT, in order to avoid the long RCA propagation delay given by (1). In the proposed structure, these two words are stored in two separate registers (Register 1 and Register 2 as shown in Fig. 3). In the next clock cycle, two other words, corresponding to another multiplication, are generated by the WRT and added to the previous result stored in Register 1 and Register 2. The process of reducing the four words to two is carried out by a 4-to-2 compressor denoted as CP1 (as shown in Fig. 3). The detailed dot diagram of the compressor is shown in Fig. 5(b). Considering a 3×3 kernel filter, this process is performed over three clock cycles. The accumulated results are passed to two other registers, marked as Registers 3 and 4, while Registers 1 and 2 are reset to receive the partial results corresponding to the next MAC operation.
Finally, the proposed BLMAC computes three consecutive multiplications followed by accumulation to generate two partial results, $PP_1$ and $PP_2$, and passes them to the next module to obtain the convolution sum with the 3-D kernel. Note that the product of two 16-bit numbers requires 32 bits, and the sum of three such products requires an additional 2 bits to avoid overflow. Thus, $PP_1$ and $PP_2$ are each set to 34 bits. Also, note that Registers 1 and 2 in Fig. 3 use a different clock source than Registers 3 and 4. Specifically, the word-clk period is set to three times the bit-clk period, so that $PP_1$ and $PP_2$ are updated each time three products have been computed.
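The idea of keeping the result as two words and merging new products with a 4-to-2 compressor can be sketched in Python as below. This is an integer-level model under stated assumptions: the 4-to-2 compressor is built from two chained 3-to-2 (carry-save) steps, products are sign-extended to 34 bits, and the WRT output is represented as a sum word equal to the product with a zero carry word. Because of the sign extension, the $2^{32}$ bias of the actual hardware does not appear here. All names are hypothetical.

```python
MASK = (1 << 34) - 1  # 34-bit datapath, as stated for PP1 and PP2


def compress_3_to_2(a, b, c):
    """Carry-save step: a + b + c == sum_word + carry_word (mod 2**34)."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word & MASK, carry_word & MASK


def compress_4_to_2(a, b, c, d):
    """4-to-2 compressor modeled as two chained 3-to-2 steps."""
    s, carry = compress_3_to_2(a, b, c)
    return compress_3_to_2(s, carry, d)


def blmac_three_products(pixels, weights):
    """Accumulate three products without ever performing a carry-propagate add."""
    reg1, reg2 = 0, 0                      # Registers 1 and 2
    for x, w in zip(pixels, weights):
        product = (x * w) & MASK           # stand-in for the WRT sum word
        reg1, reg2 = compress_4_to_2(reg1, reg2, product, 0)
    return reg1, reg2                      # handed to Registers 3 and 4


pp1, pp2 = blmac_three_products([3, -5, 7], [2, 4, -6])
print((pp1 + pp2) & MASK == (3 * 2 - 5 * 4 - 7 * 6) & MASK)
```

The only place a full carry-propagate addition is needed is outside this loop, which is what keeps the bit-clk critical path short.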

C. ARCHITECTURE OF PROPOSED PROCESSING ELEMENT
Based on the proposed BLMAC design, a high-speed PE architecture for the CNN is proposed (shown in Fig. 6). The proposed architecture consists of one BLMAC, two RCAs, a shift register, a 3-to-2 compressor (CP2), and a 5-to-1 multiplexer (MUX).
The BLMAC only produces the sum of three products. The number of BLMAC outputs that must be accumulated to create one output pixel depends on the size of the kernel. For example, for a 3×3×64 kernel, the output of the BLMAC must be accumulated 192 times. However, these outputs are not generated in successive clock cycles. Therefore, a suitable dataflow is required so that the corresponding accumulated value can be obtained from the shift register successively. In VGG16, as shown in Table 1, up to 4,608 (= 3×3×512) 32-bit product values need to be added. Therefore, the bit width of the shift register is set to 45 bits. The 3-to-2 compressor (CP2) and RCA1, whose dot diagrams are shown in Figs. 7(a) and 7(b), respectively, are used to add the two outputs of the BLMAC and the output of the shift register.
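The 45-bit figure is consistent with a common accumulator sizing rule, in which summing n values of b bits each needs b + ⌈log2 n⌉ bits. The short Python check below applies that rule; the rule itself is an assumption about how the width was chosen, and the function name is hypothetical.

```python
def accumulator_width(product_bits, num_products):
    """Bits needed to sum `num_products` values of `product_bits` bits each."""
    growth = (num_products - 1).bit_length()   # equals ceil(log2(num_products))
    return product_bits + growth


print(accumulator_width(32, 3 * 3 * 512))  # largest VGG16 kernel
print(accumulator_width(32, 3))            # three products inside the BLMAC
```

The same rule gives 34 bits for the three products accumulated inside the BLMAC.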
The outputs $PP_1$ and $PP_2$ of the WRT (shown in the last stage of Fig. 5(a)) have bit indices of [31:0] and [31:5], respectively. Since the result of a 16-bit multiplication can be represented in 32 bits without overflow, the carry out can be discarded after adding these two partial results. However, in the proposed structure, the two partial results are not added, but are stored separately in two registers of the BLMAC in order to limit the critical path. The pair of partial results ($PP_1$ and $PP_2$) of the BLMAC is added to the previous BLMAC results available in the 45-bit shift register. Note that the carry out, which should have been discarded earlier during the addition of $PP_1$ and $PP_2$, is therefore accumulated. If the two inputs of the MBE have the same sign, then $2^{32}$ corresponding to the carry out needs to be subtracted. Even if the two signs are different, this subtraction should be done to prevent the sign of the multiplication result from changing due to zero extension. That is, a bias of $2^{32}$ occurs each time a multiplication is performed, regardless of the signs of the two MBE inputs. The number of multiplications is equal to the kernel size of the convolution layer, and VGG16 has five different kernel sizes as shown in Table 1. Thus, depending on the convolutional layer being performed, one of the five precalculated correction vectors can be selected by a 5:1 MUX and subtracted from the output of RCA1. The second ripple carry adder in Fig. 6, namely RCA2, performs this subtraction, and the corresponding dot diagram is shown in Fig. 7(c). Table 2 lists the binary representation of the correction vectors used in the five different types of convolution layers. For example, for the 3×3×64 kernel, 576 multiplications are performed to produce one output pixel, so a correction vector corresponding to −576 needs to be added.
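The bias correction can be illustrated with a small Python model. It assumes that each multiplication contributes exactly one extra $2^{32}$ to a 45-bit accumulator and that the correction vector for a kernel of depth m is therefore −(3×3×m)·$2^{32}$ in 45-bit two's complement. The depths 3, 64, 128, 256, and 512 are the VGG16 input channel counts; the values in Table 2 are not reproduced here, and all names are hypothetical.

```python
ACC_BITS = 45
ACC_MASK = (1 << ACC_BITS) - 1
BIAS = 1 << 32                      # assumed bias added by every multiplication

KERNEL_DEPTHS = (3, 64, 128, 256, 512)


def correction_vector(depth):
    """45-bit two's-complement constant that cancels the accumulated bias."""
    n_mults = 3 * 3 * depth
    return (-n_mults * BIAS) & ACC_MASK


# Behaves like the 5:1 MUX: one precomputed constant per kernel depth.
CORRECTION_LUT = {d: correction_vector(d) for d in KERNEL_DEPTHS}


def corrected_sum(products, depth):
    """Accumulate biased products, then apply the selected correction (RCA2)."""
    biased = sum(p + BIAS for p in products) & ACC_MASK
    return (biased + CORRECTION_LUT[depth]) & ACC_MASK


print(corrected_sum([5] * 576, 64) == (5 * 576) & ACC_MASK)
```

Because the correction depends only on the number of multiplications, it can be computed in advance and does not lengthen the data-dependent part of the critical path.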

IV. DATAFLOW DESIGN AND LATENCY ANALYSIS
The convolution layer consists of as many neurons as the number of pixels in the 2-D OFMap, and each neuron performs a MAC operation. In other words, considering the 2-D convolution of the VGG16 convolution layer, a total of $W_O \times H_O$ neurons (the number of neurons needed to generate all the pixels in the OFMap) is required. Assuming that the neurons read the input pixels in the IFMap sequentially, one per clock cycle, and perform one multiplication-and-accumulation in a given clock cycle, the number of clock cycles required to obtain the whole 2-D OFMap is $W_I \times H_I \times C_I$, where $W_I$, $H_I$, and $C_I$ are the width, height, and number of channels of the IFMap, respectively. Also, the number of clock cycles required to generate all $C_O$ output channels is

$$W_I \times H_I \times C_I \times C_O$$

When all these neurons are used in parallel, they occupy a large silicon area. In addition, the number of cycles in which one neuron actively performs MAC operations during the entire computation is $W_W \times H_W \times C_I \times C_O$, which is the size of the 3-D filter multiplied by the number of output channels. Therefore, the utilization of one neuron is significantly low despite the use of a large silicon area.
The reference architecture continuously recycles a limited number of neurons to reduce the silicon area [11]. That is, every pixel in the OFMap is produced by serially reusing a subset of N neurons, and the number of clock cycles required depends on N. For example, if N = 7 is used in the first convolutional layer, then 3×9×3×64×224×224/7 = 37,158,912 clock cycles are consumed, whereas 226×226×3×64 = 9,806,592 clock cycles are used in a parallel architecture. This means that the serial architecture increases the latency by 3.79 times compared to the parallel architecture, but can significantly reduce the silicon area. The value of N can be 7, 14, 28, 56, 112, or 224 to fit the number of neurons to the OFMap sizes in the VGG16 architecture. Note that the selection of the subset size N is a trade-off between silicon area and throughput. As shown in Fig. 2, the reference architecture consists of three PEs, and every pixel in the OFMap is created with these three PEs regardless of the value of N.

The proposed BLMAC unit speeds up the MAC operation by using the bit-clk with a short clock period, while the results of three consecutive MAC operations are synchronized to the word-clk, whose period is 3 times the bit-clk period, and stored in Registers 3 and 4 in Fig. 3. The outputs of Registers 3 and 4 are added to the result of the previous BLMAC operation stored in the shift register in Fig. 6. Because of this hierarchical structure with two clocks of different speeds, the proposed structure requires a different dataflow than that of [11]. As an example, Table 3 shows the dataflow of the 3×3 2-D convolution operation of the 11th, 12th, and 13th convolution layers, that is, Conv(3×3×512, 512), when N = 14. The size of the 2-D IFMap becomes 16×16 considering zero padding at the border, and the size of the 2-D OFMap becomes 14×14. Table 3 contains the details of the dataflow generating the 14 output samples of the first row of the OFMap, i.e., from $O_{0,0}$ to $O_{0,13}$.
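The latency comparison quoted above for the first convolutional layer can be checked with a few lines of Python. The serial count simply reproduces the product given in the text for N = 7 and is not a general formula; the function names are hypothetical.

```python
def parallel_cycles(w_i, h_i, c_i, c_o):
    """Cycles for the fully parallel case: W_I x H_I x C_I x C_O."""
    return w_i * h_i * c_i * c_o


def serial_cycles_first_layer(n):
    """Product quoted in the text for the first layer: 3*9*3*64*224*224 / N."""
    return 3 * 9 * 3 * 64 * 224 * 224 // n


serial = serial_cycles_first_layer(7)
parallel = parallel_cycles(226, 226, 3, 64)   # 224 + 2 pixels of zero padding
print(serial, parallel, round(serial / parallel, 2))
```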
The BLMAC in PE1 performs MAC operations over the 3 bit-clk cycles #1, #2, and #3, and stores the result in the shift register. As a result, in the case of PE1, the calculations involving the first row of the IFMap should be stored in bit-clk cycles #3, #6, #9, #12, and #15. The edge of the word-clk that operates Registers 3 and 4 should occur accordingly. On the other hand, the BLMAC in PE2 produces results after the 4th, 7th, 10th, 13th, and 16th bit-clk cycles, and PE3 produces results after the 5th, 8th, 11th, and 14th bit-clk cycles. Therefore, each of the three word-clks that operate PE1, PE2, and PE3 has a phase difference of one bit-clk period with respect to that of the preceding PE.
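The staggered schedule of the three PEs can be expressed compactly. The sketch below assumes that PE k (k = 1, 2, 3) starts one bit-clk later than PE k−1 and delivers a result every three bit-clk cycles, which matches the cycle numbers listed above; the function name is hypothetical.

```python
def pe_result_cycles(pe_index, count):
    """bit-clk cycles after which PE `pe_index` (1-based) has a 3-MAC result."""
    first = 3 + (pe_index - 1)   # PE1 after cycle 3, PE2 after 4, PE3 after 5
    return [first + 3 * k for k in range(count)]


print(pe_result_cycles(1, 5))
print(pe_result_cycles(2, 5))
print(pe_result_cycles(3, 4))
```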
If PE1 starts the processing of the second row of the IFMap (i.e., $I_{1,0}$ to $I_{1,15}$) in bit-clk cycle #17, then the first BLMAC result of the second row is generated after bit-clk #19. Since the word-clk for PE1 is synchronized with the bit-clk at Clk #3n in Table 3, the results generated after bit-clk #19 cannot be taken from Registers 3 and 4. Therefore, the operation of the second row should start from the 19th bit-clk, and the BLMAC result should be scheduled to appear in the 21st bit-clk. This output value is added to the one stored in bit-clk cycle #3 by CP2 and RCA1 (shown in Fig. 6), and stored back in the shift register. After bit-clk #39, a total of 9 MAC operations of the 3×3 kernel are completed, resulting in the first OFMap pixel, $O_{0,0}$. This operation continues, and the last pixel in the first row, $O_{0,13}$, is generated during the 52nd bit-clk cycle.
The proposed dataflow for a subset of N = 28 is shown in Table 4, considering that the number of IFMap pixels in a row is less than N. As a result, the convolution of the first row of the IFMap with the first row of the kernel is completed using the first 14 neurons. Since 28 neurons are available, to avoid wasting the remaining neurons, we use them for the convolution of the second row of the IFMap (i.e., $I_{1,0}$ to $I_{1,15}$) with the first row of the kernel. Then, the output pixels of the first row of the OFMap (i.e., $O_{0,0}$ to $O_{0,13}$) are generated between bit-clk cycles #75 and #88.
The length of the shift register in each PE depends on the value of N. The value of M in Fig. 6 shows the length of the shift register for each value of N. As can be seen from the figure, as N increases, the length of the shift register increases, so the silicon area also increases. On the other hand, as the value of N increases, the number of cycles needed to complete the convolution process decreases. An appropriate value of N can be selected to obtain the desired trade-off between cost and speed, depending on the design objective.

A. CRITICAL PATH ANALYSIS
The proposed structure follows a hierarchical design using two clock sources, the bit-clk and the word-clk, and the timing constraint is set to $T_{\text{word-clk}} = 3T_{\text{bit-clk}}$. A precise analysis of the critical paths is therefore required to set appropriate timing constraints for each of the clocks. In this paper, we aim at a high-throughput implementation of CNN, and to achieve that it is essential to reduce $T_{\text{bit-clk}}$ by minimizing the critical path delay of the BLMAC. However, if the critical path of the circuit operated by the word-clk (the circuit outside the BLMAC in Fig. 6) becomes long, the optimization of the BLMAC in terms of throughput becomes ineffective. Therefore, to achieve high throughput, we minimize $T_{\text{bit-clk}}$, and $T_{\text{word-clk}}$ is designed to be no more than $3T_{\text{bit-clk}}$. A critical path delay model is developed accordingly to support the proposed strategy for high-throughput implementation of CNN.
Modeling the critical path and accurately predicting propagation delays are not straightforward. In particular, since the propagation delay of each gate varies with the temperature and the load capacitance, the actual circuit delay inevitably differs from the delay obtained through analysis. It is also difficult to predict the exact delay of the synthesized design because the synthesis tool sometimes makes minor changes in the selection of library cells during various optimization stages. To overcome the difficulty of evaluating the exact propagation delay, in this paper the propagation delay is expressed not as an absolute value in ns, but as a relative value in multiples of the gate delay ($nT_G$). During the analysis, we check whether the critical path obtained from the synthesis report matches our prediction, whether the critical path is effectively reduced, and whether the timing constraint $T_{\text{word-clk}} = 3T_{\text{bit-clk}}$ is satisfied.
Let us first analyze the critical path in the BLMAC operating with the bit-clk (shown in Fig. 3). The critical path of the MBE is the path that computes $\alpha_0$ (the 17th bit of the first partial product in Fig. 4). When the process of obtaining $\alpha_0$ in Algorithm 2 is traced back, it can be seen that the delay is the sum of the delays of 2 XOR gates, 1 AND gate, and 1 OR gate. To simplify the subsequent analysis, let us define the unit gate delay as $T_G$, and assume that the delays of the XOR, AND, and OR gates are $2T_G$, $T_G$, and $T_G$, respectively. Then, the total propagation delay of the MBE becomes $6T_G$. Note that this critical path delay is fixed regardless of the input bit width of the MBE.
Since both the WRT and CP1 in Fig. 3 are implemented only with FAs and HAs, it is also necessary to estimate their delays in terms of $T_G$. The propagation delays from each input to each output of the FA and HA are summarized in Table 5. A denotes one of the pair of inputs, $C_{IN}$ denotes the carry input of the FA, and the sum and carry-out outputs of the FA and HA are denoted by S and $C_{OUT}$, respectively. Note that the delay of the XOR gate, $T_{XOR}$, is taken as $2T_G$ and the delay of the AND gate is taken as $T_G$.

[Table 5: propagation delays of the FA and HA; column headers: Adder, Path]

The operation of an FA in one stage of the WRT is not affected by the results of other FAs or HAs in the same stage. Since the FAs and HAs belonging to the same stage can operate simultaneously, the propagation delay of one stage of the WRT cannot be greater than the propagation delay of an FA. The delay of an n-stage WRT can therefore be expressed as

$$T_{WRT} = n \cdot T_{FA} = 4nT_G$$

where n is the number of stages of the WRT. The delay of the 4-stage WRT shown in Fig. 5(a) can be estimated as $16T_G$. Similarly, since the number of stages in CP1 shown in Fig. 5(b) is 2, its delay becomes $8T_G$. The propagation delays of the different components of the BLMAC and the total propagation delay of the BLMAC are listed in Table 6. If the propagation delay of the register is ignored, it takes a total of $30T_G$ to perform one MAC operation. Therefore, if the period of the bit-clk is set to be greater than $30T_G$, the timing constraint will not be violated.

Table 6: Propagation delays of the BLMAC components
Unit | Delay
MBE | $6T_G$
WRT | $16T_G$
CP1 | $8T_G$
Total | $30T_G$

Table 7: Propagation delays of the components operated by the word-clk
Unit | Delay
CP2 | $4T_G$
RCA1 | $79T_G$
RCA2 | $4T_G$
Total | $87T_G$

To analyze the critical path of the proposed structure operated by the word-clk, note that the single-stage CP2 has a delay of one FA, that is, $4T_G$, as shown in the dot diagram of Fig. 7(a). In addition, the delay of RCA1 is given by (3), in which $T_{FAC}$ is the delay from A to $C_{OUT}$ of the FA in bit position 2 in Fig. 7(b) and the second term is $32T_{FCC}$. Using the values in Table 5, the delay of RCA1 in (3) can be expressed as $79T_G$. Meanwhile, of the two inputs of RCA2, the correction vector can be computed in advance. Once the output S of RCA1 is available at any bit position, the full adder of RCA2 in the same bit position can start its operation. That is, once the output S in the most significant bit position of RCA1 is available, the operation of RCA2 can be completed after one full adder delay of $T_{FAS}$. Therefore, the propagation delay of RCA2 is $4T_G$. The sum of the delays of CP2, RCA1, and RCA2, which is equal to $87T_G$ as shown in Table 7, amounts to the critical path delay of the second stage of the PE governed by the word-clk.
As mentioned earlier, one word-clk period is set to 3 bit-clk periods. If the bit-clk period is set to $30T_G$ as in Table 6, the word-clk period can be set to $90T_G$, which is 3 times the bit-clk period. Note that this setting does not violate the timing constraint, since the delay of the critical path using the word-clk is only $87T_G$, which is shorter than the word-clk period.
VOLUME 1, 2020
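The delay budget derived in this section can be tallied with a short Python sketch. All delays are in units of $T_G$ and are taken from the text; the RCA1 delay of 79 is used as a given constant rather than derived, and the function and variable names are hypothetical.

```python
# Gate and adder delays in units of T_G, as assumed in the analysis
T_XOR, T_AND, T_OR = 2, 1, 1
T_FA = 4                      # one FA stage, as used for the WRT, CP1 and CP2


def mbe_delay():
    """Critical path of the MBE: 2 XOR gates, 1 AND gate, 1 OR gate."""
    return 2 * T_XOR + T_AND + T_OR


def compressor_delay(stages):
    """Delay of a compressor tree with the given number of FA stages."""
    return stages * T_FA


# First pipeline stage (bit-clk): MBE -> 4-stage WRT -> 2-stage CP1
bit_path = mbe_delay() + compressor_delay(4) + compressor_delay(2)

# Second pipeline stage (word-clk): CP2 -> RCA1 -> RCA2
RCA1_DELAY = 79               # value stated in the text
RCA2_DELAY = 4                # one T_FAS after the MSB of RCA1 settles
word_path = compressor_delay(1) + RCA1_DELAY + RCA2_DELAY

word_period = 3 * bit_path    # T_word-clk = 3 * T_bit-clk
print(bit_path, word_path, word_period, word_period - word_path)
```

With these numbers the word-clk path has a slack of $3T_G$, so the bit-clk path is the one that determines the clock period.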

B. DESIGN INNOVATIONS FOR HIGH-SPEED IMPLEMENTATION
A sequential multiplier requires several clock cycles to perform one multiplication. On the other hand, the Booth algorithm and WRT-based parallel multiplier used in this paper can generate multiplication results every clock cycle, which is essential to achieve high throughput rate. We have used the modified Booth algorithm to effectively reduce the number of PPs in WRT and avoid sign extension, thereby effectively reducing the area as well as the computation-time.
Although a parallel multiplier can perform a multiplication in each clock period, its overall delay is limited by the vectormerging addition of the sum word and the carry word at the last stage of WRT. Since CNN accumulates the results of multiplication multiple times during a convolution operation, it involves a MAC operation with feedback path, making it difficult to utilize pipelining to speed-up the processing. The implementation of pipelining in the feedback path for the MAC operation in the proposed architecture is a critical design challenge but opens up potential for dramatic improvements in speed which is discussed in the following.
• The logic for MAC operation of VGG16 is designed in a hierarchical structure. Specifically, BLMAC in Fig. 6 is driven by bit-clk and the other components are subsequently operated by word-clk. Through this hierarchical design, the effect of realizing a two-stage pipelined structure in which pipeline registers are inserted after each MAC operation could be obtained. Since VGG16 is based on a 3×3 convolution filter, the proposed architecture is designed in such a way that three consecutive MAC operations become the first pipeline stage, and their accumulation is performed in the second pipeline stage. • In parallel multipliers, the critical path is lengthened due to the vector-merging addition of the last two reduced words (corresponding to the sum bits and carry bits at the last reduction stage). To overcome this, the last two reduced words are not added but stored separately in two registers. In the subsequent clock cycle, another pair of words are generated and together with the pair of words stored in the registers in the previous cycle form a set of 4 words. This set of 4 words are reduced by a 4-to-2 compressor to produce two reduced words, and prevent an increase in the number of reduced words. Also, two pipeline registers driven by word-clk are additionally placed to switch the clock domain between the two pipeline stages. Although there is an additional cost of area loss due to the use of additional resistors, the speed can be greatly improved by this design strategy. • Critical path analysis is performed so that the propagation delay of the second pipeline stage is three times that of the first, considering the size of 3×3 convolution filter. For this purpose, the propagation delay of each pipeline stage is expressed as the number of gate delays. This allows us to design two pipeline stages for maximum utilization of hardware by suitable choice of the duration of the clock period and maximization of throughput rate. 
• The error caused by not adding the partial result pairs PP1 and PP2 in the BLMAC is corrected. Specifically, as shown in Fig. 6, the design corrects the error using a MUX regardless of the size of the convolution filter. Since RCA2, the adder used for error correction, is placed adjacent to RCA1, it adds a delay of only 4T_G, as shown in Table 7. The proposed architecture therefore increases the throughput rate without compromising accuracy.
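The deferred vector-merging addition described in the list above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the authors' RTL: the names `csa`, `compress_4_to_2`, and `mac_carry_save` are hypothetical, the 4-to-2 compressor is modeled as two chained 3-to-2 carry-save stages, and the multiplier output is replaced by an arbitrary pair of words whose sum equals the product. The PP1/PP2 error-correction path of Fig. 6 is not modeled.

```python
WIDTH = 48                      # accumulator width, taken from the 48-bit register in the text
MASK = (1 << WIDTH) - 1

def csa(a, b, c):
    """3-to-2 carry-save stage: a + b + c == s + cy (mod 2**WIDTH)."""
    s = a ^ b ^ c
    cy = (((a & b) | (b & c) | (a & c)) << 1) & MASK
    return s, cy

def compress_4_to_2(a, b, c, d):
    """4-to-2 compressor, modeled here (assumption) as two chained 3-to-2 stages."""
    s1, c1 = csa(a, b, c)
    return csa(s1, c1, d)

def to_signed(x):
    """Interpret a WIDTH-bit pattern as a two's-complement integer."""
    return x - (1 << WIDTH) if x >> (WIDTH - 1) else x

def mac_carry_save(pixels, weights):
    """Accumulate products while keeping the running total as a (sum, carry) pair."""
    acc_s, acc_c = 0, 0
    for x, w in zip(pixels, weights):
        prod = (x * w) & MASK
        # Stand-in for the Wallace tree output: two words on disjoint bit
        # positions whose sum equals the product (illustrative assumption).
        new_s = prod & 0x5555_5555_5555
        new_c = prod & 0xAAAA_AAAA_AAAA
        # No carry-propagate addition inside the loop, only compression.
        acc_s, acc_c = compress_4_to_2(acc_s, acc_c, new_s, new_c)
    # The single vector-merging addition happens once, after the loop.
    return to_signed((acc_s + acc_c) & MASK)

print(mac_carry_save([3, -5, 7], [2, 4, -6]))
```

In this model the loop body contains only XOR/AND/OR logic, which mirrors why the critical path of the fast stage no longer includes a carry-propagate adder.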

A. SYNTHESIS RESULTS
The proposed architecture is coded at the register transfer level (RTL) in VHSIC hardware description language (VHDL). It is then synthesized by the Synopsys Design Compiler with the TSMC 65-nm CMOS standard cell library [24]. During synthesis, the input delay and output delay are set to 0.01 ns. The timing constraint is adjusted to find the minimum bit-clk period for which the slack remains positive, and the word-clk period is then set to three times the minimum bit-clk period. The power consumption is estimated at a clock period equal to the reciprocal of the maximum usable frequency. We compare the hardware complexity by implementing circuits for N = 7, 14, 28, 56, and 112. To demonstrate the effectiveness of the proposed architecture, the state-of-the-art implementation [11] is used for comparison. Following [16], the inputs and kernel weights are quantized to 16 bits, which balances precision against area. The partial results of the MAC are stored in a 48-bit shift register to prevent overflow even with the maximum convolution size. The performance of the proposed architecture obtained from the synthesis results and that of the reference architecture, in terms of data arrival time (DAT), maximum usable frequency (MUF), latency, area, area-delay product (ADP), power consumption, and power-delay product (PDP), are listed in Table 8. The values of pADP and pPDP are also listed in the table to show the percentage improvement in ADP and PDP, respectively, over the reference architecture.
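As a rough sanity check of the 48-bit accumulator width, the worst-case number of bits needed to sum signed 16-bit by 16-bit products can be computed directly. The sketch below assumes that the largest accumulation in VGG16 is a 3×3 kernel over 512 input channels; this figure comes from the standard VGG16 definition and is not stated in this section. The function name `accumulator_bits` is hypothetical.

```python
def accumulator_bits(word_bits, num_products):
    """Signed bit width that can hold the sum of num_products worst-case products."""
    # Largest product magnitude of two signed word_bits operands: (-2^(n-1))^2.
    max_product = (1 << (word_bits - 1)) ** 2
    max_sum = num_products * max_product
    # One extra bit for the sign.
    return max_sum.bit_length() + 1

# Assumed worst case: 3x3 kernel over 512 input channels (VGG16).
terms = 3 * 3 * 512
required = accumulator_bits(16, terms)
print(required, required <= 48)
```

Under these assumptions the requirement is 44 bits, so a 48-bit register leaves a small margin.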
The synthesis results show that the DAT of the proposed architecture is 1.04 ns, regardless of the value of N. This DAT is the critical path delay of the BLMAC driven by bit-clk, shown in Table 6. To take full advantage of this short DAT, the bit-clk period can be set equal to the DAT, so that an MUF of up to 961 MHz can be used. The DAT of the proposed structure is only 41.6% of that of the reference architecture, which demonstrates the acceleration achieved by the proposed structure. In the proposed structure, the size of the shift register grows with N, which increases the silicon area. The area of the proposed structure is smaller than that of the reference when N is 7, 14, and 28, but larger when N is 56 and 112. However, since the DAT is greatly reduced, the proposed architecture has an advantage in terms of ADP for every value of N. Specifically, ADP is reduced by 52.2% to 69.2% depending on the value of N. The power consumption of the proposed architecture is also lower than that of the reference architecture, except for N = 112. Although power consumption varies with the clock period, the power-delay product (PDP), obtained by multiplying the power consumption by the delay, tends to remain constant. Therefore, in this paper, PDP is used as the metric for comparison with existing structures. Specifically, the proposed architecture has at least 54.2% lower PDP than the reference architecture. The proposed structure therefore provides significantly better efficiency not only in terms of area-delay product but also in terms of power consumption.
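The clock figures quoted above follow from simple arithmetic, sketched below. Only the 1.04 ns DAT, the 41.6% ratio, and the factor of three between word-clk and bit-clk come from the text; the helper names `muf_mhz` and `percent_reduction` are hypothetical, and the reference DAT is derived here rather than read from Table 8.

```python
DAT_PROPOSED_NS = 1.04   # data arrival time reported in the text
DAT_RATIO = 0.416        # proposed DAT as a fraction of the reference DAT (from the text)

def muf_mhz(period_ns):
    """Maximum usable frequency in MHz for a clock period given in ns."""
    return 1e3 / period_ns

def percent_reduction(reference, proposed):
    """Percentage by which `proposed` is smaller than `reference`."""
    return 100.0 * (1.0 - proposed / reference)

bit_clk_ns = DAT_PROPOSED_NS                   # bit-clk period set equal to the DAT
word_clk_ns = 3 * bit_clk_ns                   # word-clk is three bit-clk periods
dat_reference_ns = DAT_PROPOSED_NS / DAT_RATIO # derived, not a reported value

print(f"MUF       : {muf_mhz(bit_clk_ns):.1f} MHz")
print(f"word-clk  : {word_clk_ns:.2f} ns")
print(f"ref. DAT  : {dat_reference_ns:.2f} ns (derived)")
print(f"DAT saving: {percent_reduction(dat_reference_ns, DAT_PROPOSED_NS):.1f} %")
```

The same `percent_reduction` form applies to pADP and pPDP when the ADP and PDP values of Table 8 are substituted for the delays.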
Power consumption is affected not only by the frequency of the operating clock, but also by the toggle rate (TR) and the probability that the input is in logic state 1 (P1). Most papers report the power estimated at the operating clock frequency but do not report the TR or P1 values, which makes a fair comparison of power consumption with other papers difficult. In this paper, we set P1 = 0.1 and TR = 0.1 times the clock frequency, which are reasonable because they are the default values set by general synthesis tools. In the proposed architecture, the power consumption can be effectively reduced by gating the input of RCA2, which is used only for the correction at the end of the overall calculation of the 3D kernel.
Based on the dataflow of the two architectures, the delay of each convolution set of the reference and proposed architectures is given in Tables 9 and 10. The latency of both structures decreases as the value of N increases. In particular, since the clock period of the proposed structure is significantly shorter than that of the reference architecture, the latency required to complete the operations of all convolution layers is also significantly reduced.

B. APPLICATION OF THE PROPOSED ARCHITECTURE
The purpose of the proposed design is to achieve a significant increase in throughput using gate-level pipelining. When the proposed architecture is used as a computing core for an existing neural network, it can be flexibly adjusted to changes in the kernel size, bit width, and N value. The proposed architecture can be adapted to the design specifications of a CNN as follows:
• For a shorter bit-width implementation, pruning can be applied to the Wallace tree in Figs. 5 and 7 of the proposed structure; conversely, for a longer bit-width implementation, bit extension can be applied to the Booth encoder and the Wallace tree.
• For a generic W×W convolution kernel other than 3×3, in the BLMAC in Fig. 3, the transfers from Registers 1 and 2 to Registers 3 and 4 are performed after W accumulations instead of 3. When W is greater than 3 and the accumulation count increases, sign extension is applied in the Wallace tree to prevent overflow. In this case, the timing constraint is not violated if the period of word-clk is set to W×bit-clk.
• If the value of N changes, the size of the shift register in Fig. 6 can be changed accordingly.
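The adjustment for a generic W×W kernel listed above can be summarized as a small parameter calculation. This is a sketch only: the function `clock_plan` is hypothetical, and the guard-bit count ceil(log2 W) is a standard bound for summing W signed words that is assumed here, since the text does not state how many sign-extension bits are added.

```python
def clock_plan(bit_clk_ns, kernel_width):
    """Derive word-clk and accumulation settings for a W x W kernel."""
    if kernel_width < 1:
        raise ValueError("kernel_width must be at least 1")
    return {
        # word-clk period is W bit-clk periods, as stated in the text.
        "word_clk_ns": kernel_width * bit_clk_ns,
        # Register transfer happens after W accumulations.
        "accumulations_per_transfer": kernel_width,
        # Assumption: ceil(log2(W)) extra sign bits cover W accumulated words.
        "guard_bits": (kernel_width - 1).bit_length(),
    }

for w in (3, 5, 7):
    print(w, clock_plan(1.04, w))
```

With the 1.04 ns bit-clk reported in the synthesis results, W = 3 gives the 3.12 ns word-clk used for VGG16.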
Based on the throughput requirement of a specific CNN, the parallelism of the convolution engine (CE) can be modified as shown in Fig. 8. Let us denote the parallelism index as p. Each CE produces the OF map of one output channel. Each CE also receives the same IF map but different weights from different filters. In the parallel architecture, the latency of computation is decreased by a factor of p; however, the area and power consumption are increased by almost the same factor p. The synthesis results of the proposed parallel architecture with p = 32 and of some existing CNN accelerator architectures are shown in Table 11.
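The scaling behavior described above can be written out as an idealized model. The sketch assumes exact factors of p, whereas the text says area and power grow by "almost" that factor, and the baseline numbers are placeholders chosen for illustration, not values from Table 11. The function name `scale_parallel` is hypothetical.

```python
def scale_parallel(latency, area, power, p):
    """Idealized scaling of one convolution engine replicated p times."""
    return {
        "latency": latency / p,   # p output channels are computed concurrently
        "area": area * p,         # p copies of the engine
        "power": power * p,
    }

# Placeholder single-engine figures (hypothetical, arbitrary units).
base = {"latency": 64.0, "area": 1.5, "power": 0.25}
par = scale_parallel(base["latency"], base["area"], base["power"], p=32)

adp_base = base["latency"] * base["area"]
adp_par = par["latency"] * par["area"]
print(par, adp_base, adp_par)
```

Under this idealized model the area-delay and power-delay products are unchanged by p, so parallelism trades area and power for latency rather than improving ADP or PDP.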
The architecture of Moons et al. [25] was synthesized using a more recent fully depleted silicon-on-insulator (FD-SOI) technology library than the others. The work of [25] uses dynamic fixed-point arithmetic to perform CNN computation with a flexible bit width ranging from 1 to 16 bits, depending on the precision requirement of the application. When running VGG16, this architecture provides a throughput of 1.67 fps and consumes just 26 mW, but our design has a 13.03% lower gate count and provides 3.53 times higher throughput. The proposed hardware accelerator has 2.57 times higher throughput with about 1.77 times lower PDP than the MAefficient design of [11]. It is also worth mentioning that the proposed accelerator provides a peak performance of 181 giga operations per second (GOPS), which is 2.38 times higher than that of [11]. The architecture of Chen et al. [18], which is implemented in 65 nm CMOS technology, consumes 1.67 times less power by using data compression and network sparsity techniques, but the proposed CNN hardware accelerator outperforms the work of [18] in terms of throughput (8.4× higher), latency (25.5× lower), gate count (8.4% lower), and nominal frequency (3.8× higher).

VII. CONCLUSION
In this paper, we have proposed a novel MAC and PE architecture for high-speed hardware implementation of a CNN accelerator. The proposed structure greatly improves the speed by reducing the computational delay of the MAC operation through a novel bit-level design using modified two-stage Wallace reduction and modified Booth recoding. The proposed design works with a dual-clock strategy in a two-stage hierarchical design, where the MAC operations are accelerated with a faster clock while their accumulation for the convolution operation runs with a longer clock period. Precise critical path analysis is performed to set the timing constraints appropriately in the hierarchical structure. To compensate for the inevitable errors caused by the hierarchical addition, correction vectors are calculated concurrently on a side path and added to the result. Dataflow and latency analyses of the proposed structure have also been presented. The synthesis results show that the proposed structure is efficient in terms of area, throughput, and power consumption, outperforming the existing architecture in terms of ADP and PDP. The proposed structure is therefore suitable for running CNNs in high-speed real-time processing applications. In addition, it can be applied to various CNN algorithms with different network architectures based on MAC computation.