Energy-Efficient Precision-Scaled CNN Implementation With Dynamic Partial Reconfiguration

A convolutional neural network (CNN) classifies images with high accuracy. However, CNN operation requires a large number of computations, which consume a significant amount of power when implemented in hardware. Precision scaling has recently been used to reduce hardware requirements and power consumption. In this paper, we present an energy-efficient precision-scaled CNN (EEPS-CNN) architecture. Furthermore, the Field Programmable Gate Array (FPGA) is reconfigured at run time using Dynamic Partial Reconfiguration (DPR). If the battery level decreases, the EEPS-CNN design with the most appropriate power consumption is configured on the FPGA. DPR enables recognition applications to run within a low power budget at the cost of a minor accuracy loss instead of terminating. The proposed architecture is implemented on a Xilinx XC7Z020 FPGA and evaluated on three datasets: MNIST, F-MNIST, and SVHN. The results show 2.2X, 2.39X, and 2.38X reductions in energy consumption, respectively, while using only 7 bits to represent all inputs and network parameters. The accuracy of the proposed EEPS-CNN is only 0.53%, 3.67%, and 0.88% lower than that of 32-bit floating-point architectures for MNIST, F-MNIST, and SVHN, respectively. Moreover, the results show up to 92.91X and 4.84X reductions in the power and energy consumption of the proposed EEPS-CNN compared to related designs developed for the MNIST dataset.

proposed design. In [17], the hls4ml library was used to implement a neural network trained to recognize the MNIST dataset on an FPGA. In [18], ResNet-18 was implemented on the Xilinx XC7VX690T, and 16-bit fixed-point arithmetic was used. In [19], the AlexNet network was implemented on a ZYNQ-702 FPGA, and the Vivado 2015.1 tool was used for synthesis. This design was tested with and without pipelining to assess the time and power consumption. In [20], a hardware accelerator design was developed to recognize the MNIST dataset. The Python programming language was used to model the provided deep neural network, and the Register Transfer Level (RTL) functionality was tested on ModelSim. Finally, the architecture was implemented on a Xilinx Zynq ZC-702. In [21], a hardware-software co-design was developed for neural network applications on the PYNQ-Z2 board. For this aim, convolutional IP cores were implemented and used as Python overlays. In addition, the convolutional IP core was used to accelerate the recognition of the MNIST dataset.

A new structure for binary convolution was proposed in [22] with the aim of decreasing the consumed power and the hardware resources. In addition, a full-BNN (Binary Neural Network) and a mixed-precision BNN were proposed. Finally, the MNIST dataset was used to test the two proposed neural networks on the DSP + Xilinx 352T FPGA board. In [23], a new approach for implementing a Fully Connected Deep Neural Network (FC DNN) and a convolutional neural network on FPGA was proposed. For the FC DNN, a minimum number of computational units was used, while for the CNN, parallel processing as well as a systolic architecture were exploited. A CNN architecture was trained using different floating-point formats in [24]. In addition, the MNIST dataset was used to verify the proposed accelerator engine on FPGA. Verilog was used to implement the proposed design at RTL, and it was verified using the Vivado Simulator. In [25], the LeNet-5 CNN was implemented on FPGA. Moreover, the CNN was accelerated by parallelizing the operations. The MNIST and other datasets were used to evaluate the proposed design. Furthermore, partial reconfiguration was utilized to overcome the FPGA resource limitations. This design was tested on three datasets, namely CIFAR-10, CIFAR-100, and SVHN. A CNN was implemented on an Intel Cyclone 10 FPGA to recognize the MNIST dataset's handwritten digits in [26]. Fixed-point representation was used for all the network weights and all the intermediate operations. A Xilinx XC7A100T FPGA was used to implement a CNN to recognize the MNIST dataset in [27]. The CNN was trained using MATLAB 2018. Multiplication and addition operations were performed using fixed-point representation.

In this paper, we present the energy-efficient precision-scaled CNN (EEPS-CNN) architecture.

The remainder of the paper is organized as follows. Section II presents the proposed EEPS-CNN architecture. The details of the hardware architecture developed to implement the proposed EEPS-CNN are discussed in Section III. The experimental results are presented in Section IV. Finally, we conclude the paper in Section V.

In this section, we present the proposed energy-efficient CNN architecture and how precision scaling is used to reduce its energy consumption.

At the pooling layer, the input is divided into rectangular regions. Then, an average or a maximum of each region is generated at the output. The pooling layer is used to down-sample the input representation. This reduces the computational complexity and the memory usage of the network. The pooling layer is described by its kernel size and stride. In fully connected layers, the neurons are fully connected by different weights.
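For illustration, the following is a minimal NumPy sketch of the 2 × 2, stride-2 max pooling used later in the proposed architecture; the helper is ours and is written only to make the operation concrete:

```python
import numpy as np

def maxpool2x2(feature_map: np.ndarray) -> np.ndarray:
    """Down-sample an (H, W) feature map by taking the maximum of each 2x2 region."""
    h2, w2 = feature_map.shape[0] // 2, feature_map.shape[1] // 2
    blocks = feature_map[:h2 * 2, :w2 * 2].reshape(h2, 2, w2, 2)
    return blocks.max(axis=(1, 3))

# Example: a 4x4 map is reduced to 2x2, keeping the maximum of each region.
x = np.array([[1, 2, 0, 1],
              [3, 4, 1, 0],
              [0, 1, 5, 6],
              [2, 0, 7, 8]])
print(maxpool2x2(x))  # [[4 1]
                      #  [2 8]]
```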

In this paper, we propose the energy-efficient precision-scaled CNN (EEPS-CNN) architecture, which is structured as two alternating convolutional and pooling layers, followed by two fully connected layers, as shown in Figure 1. The proposed EEPS-CNN architecture is then used to design three CNNs for the three considered datasets, resulting in EEPS-CNN-1 for the MNIST dataset, EEPS-CNN-2 for the F-MNIST dataset, and EEPS-CNN-3 for the SVHN dataset. The MNIST dataset consists of 28 × 28 grayscale images of handwritten digits [33]. The F-MNIST (Fashion MNIST) dataset consists of 28 × 28 grayscale images of Zalando's articles [34]. The SVHN (Street View House Numbers) dataset is a real-world image dataset obtained from house numbers in Google Street View images [35]; it represents the real-world problem of recognizing digits and numbers and consists of 32 × 32 red-green-blue (RGB) color images.

The designs of the proposed EEPS-CNN architecture have a first convolutional layer with 2 filters for the MNIST dataset and 4 filters for the F-MNIST and SVHN datasets, with a 3 × 3 dimension and a stride length of one. The input image to the first convolutional layer is padded to preserve its spatial size. The second convolutional layer has 4 filters for the MNIST dataset and 8 filters for the F-MNIST and SVHN datasets. Each convolutional layer is followed by a ReLU activation function. A 3 × 3 filter dimension is used because a small filter size captures the fine details of the image, while a bigger filter size leaves out small details in the image. The selected 3 × 3 filter dimension also needs a small number of multiplications, which reduces the power consumption of the hardware implementation. The pooling layer is MaxPool with a 2 × 2 dimension and a stride length of two. The first fully connected layer has 20 neurons for the MNIST dataset and 256 neurons for the F-MNIST and SVHN datasets and is followed by a ReLU activation function. The second fully connected layer has 10 neurons and is followed by softmax. Softmax is another activation function that is applied to the CNN layer [36]; it converts the output of the last layer of the CNN into a probability distribution.
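Putting these layer choices together, the MNIST variant (EEPS-CNN-1) can be sketched as follows. This is a minimal PyTorch illustration; the framework, class name, and the use of one-pixel padding are our assumptions, since the paper only states that Python was used to model and train the networks:

```python
import torch
import torch.nn as nn

class EEPSCNN1(nn.Module):
    """Sketch of EEPS-CNN-1: two conv+pool stages followed by two fully connected layers."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 2, kernel_size=3, stride=1, padding=1),  # 2 padded 3x3 filters
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),                # 28x28 -> 14x14
            nn.Conv2d(2, 4, kernel_size=3, stride=1, padding=1),  # 4 padded 3x3 filters
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),                # 14x14 -> 7x7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(4 * 7 * 7, 20),  # first fully connected layer: 20 neurons
            nn.ReLU(),
            nn.Linear(20, 10),         # second fully connected layer: 10 classes
        )

    def forward(self, x):
        # softmax converts the last layer's output into a probability distribution
        return torch.softmax(self.classifier(self.features(x)), dim=1)
```

The F-MNIST and SVHN variants follow the same pattern with 4 and 8 filters in the two convolutional layers, 256 neurons in the first fully connected layer, and, for SVHN, three input channels.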

The network parameters and the number of multiplications for each layer in each EEPS-CNN design are shown in Table 1.

Each layer output is examined to find the maximum and minimum values, which decide the number of bits m needed to represent the integer part, as shown in Figure 3, using (2). This choice avoids the overflow that may result from accumulation and avoids reserving unneeded bits at the expense of the fraction part.
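Equation (2) itself is not reproduced in this excerpt. As an illustrative assumption, the integer part can be sized from a layer's observed extremes roughly as follows, where the returned m is the number of integer bits excluding the sign bit:

```python
def integer_bits(layer_min: float, layer_max: float) -> int:
    """Illustrative sizing of the integer part m from a layer's observed extremes.

    Grows m until the signed range [-2**m, 2**m) covers the observed values,
    so that no extra bits are taken away from the fraction part.
    """
    magnitude = max(abs(layer_min), abs(layer_max))
    m = 1
    while (1 << m) <= magnitude:
        m += 1
    return m
```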

The test function applies repetitive truncation after each multiplication and addition step, as shown in Figure 5. Such repetitive truncation guarantees that the numbers generated from the software test functions can be represented with 2n bits. In addition, it guarantees that the fraction part can be represented using 2n − m − 1 bits. Finally, the last accumulation result, which is represented by 2n bits, is truncated to n bits. Truncation is needed to use the same hardware with the same bitwidth for the following layer operations.
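A small software model of this repetitive truncation is sketched below under the bit layout described above (1 sign bit, m integer bits, and the remaining bits for the fraction). The helper names and the use of flooring to drop the low-order fraction bits are our assumptions:

```python
import math

def truncate(value: float, total_bits: int, m: int) -> float:
    """Snap a value to a (1 sign, m integer, total_bits - m - 1 fraction) fixed-point grid
    by flooring away the lower-order fraction bits."""
    frac_bits = total_bits - m - 1
    scale = 1 << frac_bits
    return math.floor(value * scale) / scale

def mac_with_truncation(inputs, weights, n: int, m: int) -> float:
    """Multiply-accumulate with truncation to the 2n-bit grid after every multiplication
    and addition, followed by a final truncation of the accumulated result to n bits."""
    acc = 0.0
    for x, w in zip(inputs, weights):
        acc = truncate(acc + truncate(x * w, 2 * n, m), 2 * n, m)
    return truncate(acc, n, m)  # the following layer reuses the same n-bit hardware
```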

In this section, we present the hardware architecture that we develop to implement the proposed EEPS-CNN. Then, we discuss how we apply dynamic partial reconfiguration to the developed architecture.

The computation and ADD2 units are intended for performing the multiplications and additions of the convolutional and fully connected layers. The computation unit contains three processing elements (PEs), as shown in Figure 7. A single PE consists of nine multipliers and four adders, as shown in Figure 8. Therefore, three vector multiplication-addition operations are performed at the same time by the three processing elements. Figure 8 is a data flow graph (DFG), which demonstrates how the functional units are reused in each cycle. Initially, the operands are multiplied using the nine multipliers. The results are then added using the four adders. After that, some of these adders are reused to finish the addition operations. The final accumulation is achieved using the ADD 2-3 units shown in Figure 6, which contain another three adders.

The floor planning of the dynamic part for the MNIST dataset is shown for the cases of 16 bits and 5 bits. The static part is configured using a full bit-stream at boot time, while the dynamic part is configured using partial bit-streams at run time. The dynamic part consists of one or more reconfigurable partitions (RPs). Each RP is reconfigured with different partial bit-streams without changing the static part. Sharing the same programmable logic between multiple Reconfigurable Modules (RMs) reduces the needed hardware resources. Reconfiguring the system from one operating design to another requires a reconfiguration time, which is a significant factor in DPR. The reconfiguration time is proportional to the size of the partial bit-stream, which in turn is proportional to the size of the reconfigured region.

For the implementation of the proposed EEPS-CNNs, the FPGA platform is reconfigured with the appropriate power-level design during run time using DPR. Figure 13 shows the block diagram of the developed DPR system. The required partial bit-streams are transferred from the DDR memory to the ICAP by the processing system (PS); the ICAP then reconfigures the RPs. The required partial bit-streams are determined according to the power available at the battery.
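A control-flow sketch of this policy is given below. The battery thresholds, file names, and function names are hypothetical illustrations; the paper specifies only that the PS moves the selected partial bit-stream from DDR to the ICAP:

```python
# Hypothetical mapping from battery level to the partial bit-stream of the
# EEPS-CNN design with the most appropriate power level (illustrative values).
BITSTREAMS = [
    (75, "eeps_cnn_16bit.bin"),  # high battery: highest-accuracy design
    (50, "eeps_cnn_10bit.bin"),
    (25, "eeps_cnn_7bit.bin"),
    (0,  "eeps_cnn_5bit.bin"),   # low battery: lowest power, largest accuracy loss
]

def select_partial_bitstream(battery_percent: float) -> str:
    for threshold, path in BITSTREAMS:
        if battery_percent >= threshold:
            return path
    return BITSTREAMS[-1][1]

def reconfigure(battery_percent: float, icap_write) -> None:
    # icap_write stands in for the PS routine that streams the chosen file from DDR to the ICAP
    icap_write(select_partial_bitstream(battery_percent))
```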

In this section, we first use Python to train and test the performance of the three proposed EEPS-CNN designs. Then, we implement them on an FPGA platform to evaluate their accuracy and hardware characteristics. The training setup is summarized in Table 2.
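For concreteness, a minimal training and evaluation loop of the kind implied above is sketched here. The optimizer, learning rate, batch size, and epoch count are our assumptions for illustration only; the settings actually used are those referred to in Table 2:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def train_and_test(model: nn.Module, epochs: int = 10) -> float:
    """Train on MNIST and return the test accuracy in percent (illustrative settings)."""
    to_tensor = transforms.ToTensor()
    train_set = datasets.MNIST("data", train=True, download=True, transform=to_tensor)
    test_set = datasets.MNIST("data", train=False, download=True, transform=to_tensor)
    train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
    test_loader = DataLoader(test_set, batch_size=256)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.NLLLoss()  # the sketched model outputs probabilities, so train on their log

    for _ in range(epochs):
        model.train()
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(torch.log(model(images) + 1e-8), labels)
            loss.backward()
            optimizer.step()

    model.eval()
    correct = 0
    with torch.no_grad():
        for images, labels in test_loader:
            correct += (model(images).argmax(dim=1) == labels).sum().item()
    return 100.0 * correct / len(test_set)

# Example usage with the EEPS-CNN-1 sketch from Section II:
# accuracy = train_and_test(EEPSCNN1())
```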

The resulting accuracy and accuracy loss are listed in Table 3. The accuracy loss is the difference between the accuracy obtained for the 32-bit floating-point operation (which is given as a percentage) and the accuracy obtained for the n-bit fixed-point operation (which is also given as a percentage).

Consequently, the accuracy loss is itself given as a percentage and is calculated as the difference between the two percentages:

accuracy loss = accuracy(32-bit floating point) − accuracy(n-bit fixed point).

For the F-MNIST design, the accuracy loss is negligible down to 8 bits.

Here, we present the hardware setup used to implement and evaluate the performance of the tested EEPS-CNNs. The hardware architecture is modeled in VHDL, designed using Xilinx Vivado (v.2015.2), and implemented on a Zynq-7000 evaluation board, which contains an xc7z020clg484-1 FPGA. The proposed hardware architecture is synthesized to recognize the MNIST, F-MNIST, and SVHN datasets. Figure 18a, Figure 18b, and Figure 18c show the floor planning of the MNIST, F-MNIST, and SVHN designs in the 16-bit case, respectively.

The FPGA resource utilization for the MNIST design is shown in Table 4. For the F-MNIST and SVHN EEPS-CNN implementations, Table 5 and Table 6 summarize their FPGA resource utilization. Our three designs use a small number of multipliers, which saves more energy and power compared to existing architectures (such as [12]) while achieving a reasonable recognition time.

The proposed hardware architecture achieves energy reductions for the 12-, 10-, 8-, 7-, 6-, and 5-bit cases compared to the 16-bit case. As the number of bits decreases, the switching activity decreases, and hence the consumed power and energy decrease. More specifically, the energy reductions for the MNIST design are shown in Figure 19. Moreover, the proposed EEPS-CNN for the MNIST dataset in the 16-bit case achieves 92.91X and 4.84X reductions in the power and energy consumption compared to [12], as its consumed power and energy are 33 mW and 9.05 µJ, respectively, as shown in Table 8. In contrast, the power consumption of the design presented in [12] is 5 W with static power and 3.066 W without static power, and the energy consumed by that design is 71 µJ with static energy and 43.8 µJ without static energy. Moreover, the power and energy consumption reductions compared to [27] are 29.55X and 4.42X, respectively. Table 9 compares the power, energy per image, and recognition time of the proposed EEPS-CNN design for the MNIST dataset in the 16-bit case with those of the ANN in [12] and the CNN in [27]. The power reduction of the proposed EEPS-CNN is higher than the energy reduction because the recognition time of the proposed EEPS-CNN design is higher than that of both [12] and [27] (by 19.18X and 6.69X, respectively).
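As a quick consistency check (our own arithmetic, using only the identity that energy equals power multiplied by time and the reduction factors quoted above), the ratio of the power reduction to the energy reduction gives the corresponding increase in recognition time: 92.91 / 4.84 ≈ 19.2 with respect to [12] and 29.55 / 4.42 ≈ 6.69 with respect to [27], matching the recognition-time ratios stated above.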

As mentioned in Section III, DPR is used to reconfigure the FPGA during run time with the design that has the most appropriate power level. The ICAP processor is used to reconfigure the FPGA during run time, and its throughput is 10 MBps. Hence, the reconfiguration time is given by dividing the size of the partial bit-stream by the ICAP throughput. The partial bit-stream file size is equal to 1.27 MB for the MNIST implementation and to 2.15 MB for both the F-MNIST and SVHN implementations.

The proposed designs have achieved reductions in the power and energy consumption while having less than a 1% loss in accuracy compared to existing hardware implementations. We have further exploited DPR to reconfigure the FPGA with the design with the most appropriate power level during run time if the battery level decreases. Such DPR has ensured continuity instead of termination at the expense of image recognition accuracy.

Finally, it is worth mentioning that the uniform quantization method optimized in this paper for the widely used MNIST, F-MNIST, and SVHN datasets can be applied to other CNN architectures in which the difference in sensitivity across the CNN layers is not significant. However, for CNN architectures in which different layers have different sensitivities, non-uniform quantization might be needed. Our future work will investigate the generalization of the quantization for other networks and datasets while relating the CNN layers' sensitivities to the used quantization approach.

The authors would like to acknowledge the support of the Cloud Computing Center of Excellence at the Electronics Research Institute (ERI) in Egypt for providing access to the center's Cloud and High Performance Computing facilities used to conduct the research presented in this paper.