A Configurable Architecture for Running Hybrid Convolutional Neural Networks in Low-Density FPGAs

Convolutional neural networks have become the state of the art of machine learning for a vast set of applications, especially for image classification and object detection. There are several advantages to running inference on these models at the edge, including real-time performance and data privacy. The high computing and memory requirements of convolutional neural networks have been major obstacles to the broader deployment of CNNs on edge devices. Data quantization is an optimization method that reduces the number of bits used to represent weights and activations of a network model, minimizing storage requirements and computing complexity. Quantization can be applied at the layer level, by using different bit widths in different layers: this is called hybrid quantization. This article proposes a new efficient and configurable architecture for running CNNs with hybrid quantization in low-density Field-Programmable Gate Arrays (FPGAs) targeting edge devices. The architecture has been implemented on the Xilinx ZYNQ7020/45 devices and is running the AlexNet and VGG16 networks. Running AlexNet, the architecture has a throughput up to 508 images per second on the ZYNQ7020 device, and 1639 images per second on the ZYNQ7045 device. Considering VGG16, the architecture delivers up to 43 images per second on the ZYNQ7020 device, and 81 images per second on the ZYNQ7045 device. The proposed hybrid architecture achieves up to $13.7\times $ improvement in performance compared to state-of-the-art solutions, with small accuracy degradation.


I. INTRODUCTION
A convolutional neural network (CNN) is a type of deep learning network used for image classification [1] and many other applications in computer vision. The CNN model mimics the structure of the human brain, with the neurons organized in a series of layers, and whose interconnections have associated weights that enhance specific characteristics of the image. A training process is used to determine the weights, after which the network is able to classify images other than those used during the training process. A trained network will classify a new image as belonging to one of the classes considered in a process known as inference.
The associate editor coordinating the review of this manuscript and approving it for publication was Ting Li.
CNNs demand large storage and computational resources and therefore are typically run on high-performance computing platforms. However, a vast application set can be enabled by running CNN models on edge devices [2]. These smart embedded devices can take real or almost real-time decisions based on the analysis of locally collected data.
The fact that edge devices are normally resource-constrained makes it difficult to run CNN models on them with acceptable accuracy and response time. To solve this problem, many optimization methods have been devised and applied to network models and architectures in order to reduce the computational complexity without significant accuracy degradation. Data quantization is one of these methods. Data quantization reduces the computational load by using fixed-point instead of floating-point hardware and smaller data words. It has been shown that using different data sizes in different layers can considerably reduce the hardware complexity, improve the performance, and still keep an acceptable accuracy of the results. This method is called hybrid data quantization [3].
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Designing a dedicated architecture to run CNNs with hybrid data quantization is challenging since different networks will require different data quantizations in different layers. Hence, the devised architecture must be flexible enough so that different data sizes in different layers may be configured. A few works have used Field-Programmable Gate Arrays (FPGAs) to implement hybrid architectures [3], [4], since the FPGA can be reprogrammed for each specific network. However, these approaches considered complete pipeline structures, with a dedicated module for each layer, and their size only allowed mapping on medium to large FPGAs.
In this paper a different approach is proposed. The proposal consists in using modular hybrid computational cores, which allow further reducing the data sizes and storage requirements of the configurable architecture, while improving the inference run time. Such an architecture can efficiently run CNNs in low-density FPGAs, enabling smart embedded computing [5]. The proposed hardware modules can efficiently run dot-products of different data sizes in the FPGA, and effectively support hybrid data quantization. The experimental results show throughput improvements of up to 13.7× when compared to state-of-the-art approaches that tackle the same problem.
The paper is organized as follows. Section II describes related work on CNN FPGA implementations and optimization methods. Section III describes the baseline architecture without hybrid quantization. Section IV describes the proposed architecture for hybrid CNNs. Section V describes the area and performance results of the architecture running well-known CNNs. Section VI concludes the paper.

II. RELATED WORK
The ability of CNNs to classify images comes from the convolutional layers that run 3D convolutions of weight kernels and input feature maps (IFM). Each 3D convolution produces an output feature map (OFM) to be used by the next layer. Different kernels are used to identify different characteristics of the image stored in different output feature maps. Successive layers correlate more complex features, allowing the identification of particular classes of objects. Complex features are finally fully correlated with fully-connected layers, where all neurons are connected to all neurons of the previous layer. A final layer called softmax outputs the result of the inference process. The softmax layer uses a node for each class of object and produces the relative probability of the input image belonging to the object class. In all layers the neuron outputs are driven by an activation function. Several activation functions have been proposed but recently the Rectified Linear Unit (ReLU) function and its variations are the most commonly used due to their simplicity, advantages during training and good inference results.
The size of the layers, number of kernels and the interconnections between layers determine the accuracy of the network for each class of classification problems. As a consequence, many different CNNs have been proposed in the last years for improving the accuracy in different sets of images. LeNet [6] was one of the first CNNs used for classification of hand-written digits in 32 × 32 images, and achieved a very high (over 99%) accuracy. LeNet used 2 convolutional and 3 fully-connected layers, in a total of 60K weights.
The evolution of high-performance computing platforms made possible the training of networks with 1000× more weights than those used in LeNet. This way, larger images could be classified. The AlexNet network [7] has 5 convolutional and 3 fully-connected layers, in a total of 61M weights. This increase in the number of weights required much more memory and many more operations (1.5 GOp) to process a single image. AlexNet won the 2012 ImageNet Challenge (for the classification of images of size 224 × 224 × 3 in 1000 different classes), achieving top-5 and top-1 error rates of 17% and 37.5%, respectively.
The achievements of AlexNet have opened the road to new, larger and more accurate networks. VGG16 [8], a version of VGG with 16 layers, has 2.2× more parameters than AlexNet, and executes 31 GOp to run inference over an image. This large model achieved a top-5 error rate around 10% in the 2014 ImageNet Challenge. GoogleNet [9] is an irregular CNN with 22 convolutional layers and a new type of group layer (the inception layer consisting of parallel convolutions). The model needs 7M parameters and 1.58 GOp and achieved a top-5 error of 7% in the 2015 ImageNet Challenge. ResNet [10] introduced a new composite module containing an identity connection to reduce the complexity of the training. Several versions of ResNet have been designed with a number of convolutional layers ranging from 53 to 155. With up to 11.3 GOp of workload, ResNet reduced the top-5 error from 6.7% to 5.8%. ResNet was the first CNN exceeding human-level accuracy in the ImageNet Challenge.
CNNs, like any other deep neural network, are very compute-intensive and therefore are typically run on high-performance platforms. However, high-performance devices are not appropriate for smart embedded computing, since they are energetically inefficient and too expensive. Current embedded Central Processing Units (CPUs) achieve only a few dozen Giga FLoating-point Operations per Second (GFLOPS) with insufficient power efficiency for real-time or almost real-time processing of CNNs. Graphics Processing Units (GPUs) offer thousands of GFLOPS at the cost of high energy consumption, disqualifying them for embedded computing. Finally, dedicated high-performance units (Application-Specific Integrated Circuits) offer high performance and energy efficiency, but they are not programmable and are unable to map different networks or adapt to the constant evolution of deep neural networks.
Reconfigurable computing devices, in particular FPGAs, are used to implement accelerators for CNN inference [11], [12]. They are more energy-efficient than GPU- and CPU-based platforms and achieve better performance than CPUs. The reconfigurability of FPGAs allows designing a dedicated datapath for a specific CNN model [13], and efficiently implementing optimization strategies. Designing a dedicated accelerator for CNN inference requires some hardware expertise when following the traditional FPGA design flow. To speed up the design process, a few works have proposed automatic frameworks and high-level synthesis (HLS) tools that automatically convert a specification of the network model into a dedicated FPGA architecture [14]. HLS tools allow the fast design of architectures but are not optimized for best performance and resource usage. A template-based approach considers configurable cores or architectural templates that can be configured for a particular network [15]. Solutions obtained from an architectural template are more optimized but less flexible. A middle-ground solution considers an overlay processor [16], which is a programmable hardware structure that runs a compiled CNN model, but adds some overhead associated with the overlay. Since our proposal targets constrained devices, it follows a template-based approach, where a generic and configurable architecture can be tailored for each specific network and target device.
Initial FPGA implementations of small CNNs [17], [18] have been followed by FPGA implementations of all layers of a CNN model [19], [20]. CNN models are made of different types of layers, and each layer has filters and maps of different sizes. Therefore, the architectures proposed are flexible enough to run both fully connected and convolutional layers of different shapes and sizes. One of the main problems when designing a flexible enough accelerator is how to deal with different convolution window sizes from layer to layer. To solve this problem, convolutions are converted to matrix multiplications by rearranging the input feature maps [20]. To reduce the overhead associated with map rearrangement, dedicated hardware units are used to convert the input maps into a matrix [21].
The utilization of a single monolithic hardware module to run layers with different characteristics may lead to low throughput, since different layers require different computing patterns. Alternatively, the module may be designed allowing for various options and exceptions, which becomes overly complicated. In this paper, 3D convolutions are calculated as long inner products, without any additional computational effort for rearranging the IFM. As such, the hardware structure to run 3D convolutions does not change with the window size.
The first generation of CNN implementations took performance as the main optimization metric. Recently, a few works based on the single-module approach [22]-[24] have started to consider other metrics such as area and power, to enable design trade-offs.
In [22], only convolutional layers are considered, while in [23], [24], convolutional and fully-connected layers are considered but the same module is used in both. Instead of using the same hardware module in all layers, pipelines of layer-specific modules have been proposed [25], [26]. These architectures are quite efficient, since unique modules are optimized for each layer. However, they require significant memory resources to store all intermediate maps and weights, which may easily consume all available on-chip memory in low-density FPGAs. A trade-off between a single hardware module for all layers and a pipeline of layer-specific modules was proposed in [27], where layer subsets are mapped to a fixed set of modules. This solution sacrifices performance for resource efficiency and consists of several modules, each mapped to a subset of the CNN layers. This system achieved a 2× speedup for SqueezeNet in a Virtex7 FPGA compared to state-of-the-art architectures.
To further reduce the number of computing modules, the present proposal considers only two different modules: one for convolutions and the other for fully-connected layers. Since the structure is the same regardless of the window size, it achieves the same efficiency as a solution with multiple modules while reducing the required area.
A pipelined accelerator for CNNs on low-density FPGAs was proposed in [28]. To reduce the required on-chip memory, it applies a layer-fusion technique. With dedicated modules for each layer, the solution runs AlexNet at 80 Giga Operations per Second (GOPS) on a ZYNQ7020 FPGA. The problem with this approach is that it requires on-chip memory in all layers. Given the finite on-chip memory resources, it needs a complex circuit to access external memory in an orderly manner, which reduces the resources available to implement the computing cores. This imbalance between computing and memory resources reduces the performance efficiency of the solution for small FPGAs. Our approach also manages to map large CNNs into low-density FPGAs and achieves better performance efficiency.
Data quantization is an optimization technique that changes the type and reduces the size of the numeric representation of parameters in the model, and consequently reduces the required storage size and computing resources. In [29], 8-bit fixed-point representations are shown to guarantee an accuracy close to that obtained using 32-bit floating-point numerical representations. While fixed-point quantization schemes have proliferated, a new and notable floating-point representation, called the block floating-point scheme, using 8 and 16-bit data widths has been proposed [30]. The new representation reduces the accuracy loss by using simplified floating-point operations. The proposal has been tested using the VGG16 network on a Virtex 7 VX690T and has achieved a performance of 760 GOPS.
The data widths can be fixed for all layers or optimized for each layer. In [3], an implementation of a mixed-precision VGG16 network on a ZYNQ7045 FPGA achieved a performance of 316 GOPS, almost three times better than previous approaches with fixed data sizes for all layers. In [4], a hybrid quantization scheme uses 8-bit fixed-point and shift quantization (powers of 2) to represent weights, with fixed 8-bit activations in all layers. The solution reduces the hardware area due to shift quantization. However, as in [3], the system is implemented as a pipelined architecture and mapped to high-density FPGAs.
In [31], hybrid quantization is explored in implementations both for low-density FPGAs (edge computing) and high-density FPGAs (cloud computing). However, the adopted bit-serial matrix multiplication architecture for low-density FPGAs has a negative impact on the system performance. The hybrid approach in [3] has the best performance efficiency overall, but its pipelined implementation of layers cannot be mapped into low-density FPGAs. In [32], a core that uses 8 and 2-bit weights is proposed.
The solution proposed in this paper is able to run networks with hybrid representations in low-density FPGAs, since the same hardware module is able to run the inner products of vectors with different bit sizes.
In extreme quantization implementations, CNNs are converted to Binary Neural Networks (BNNs) [33], [34]. In BNNs, weights or both activations and weights are represented with a single bit, further reducing memory requirements. Networks having 1-bit weights can lose more than 10% accuracy for large networks. To obtain an accuracy comparable to using floating-point, a BNN needs from 2 to 11× more weights and operations [33]. It is also known that the first and last layers require full precision, forcing the architecture to support both representations. The accuracy gets even worse when both weights and activations are represented with a single bit. BNNs can be efficiently implemented with FPGA Look-Up Tables (LUTs), sparing the DSPs for performing additions only, which greatly reduces the resource utilization.
Our solution supports 1-bit weights but not 1-bit activations. Since we are targeting large CNNs, for which BNNs show considerable accuracy losses, we do not use fully binary networks.
A few works have considered the implementation of CNNs in a ZYNQ7020 FPGA with data quantization. [15] reported the implementation of a small CNN in this device, where the representation of weights and activations with 16-bit fixed-point data limits the average performance to 13 GOPS. In [35], AlexNet and VGG16 are implemented in the same ZYNQ7020 FPGA using 8-bit fixed-point data. The performance improves to 84 GOPS, partly because the data is represented with half the size compared to [15]. In [36], a low-power 8-bit network implemented in the same ZYNQ7020 device manages an average performance of 41 GOPS. A pipelined architecture using 16-bit fixed-point data, and pruning applied to fully connected layers, achieved a performance of 76 GOPS [28]. These works targeted low-density FPGAs but achieved a relatively low performance because a homogeneous data representation has been considered. Our proposal also targets low-density FPGAs but can achieve over 1 Tera Operations per Second (TOPS) of average performance using hybrid quantization.
In this paper, a configurable architecture to execute CNNs with hybrid data quantization is proposed. The proposal consists of a two-stage pipeline architecture with two separate modules to process the two types of regular layers: convolutional and fully-connected layers. Compared to previous CNN implementations, the proposed architecture improves on the inference performance and resource efficiency, making it suitable for low-density FPGAs as well as high-density ones.

III. BASELINE ARCHITECTURE OVERVIEW
The architecture proposed in this paper applies hybrid quantization to a state-of-the-art architecture that implements large CNNs in low-density FPGAs using an 8-bit fixed-point representation format. This baseline architecture is a follow-up of the work presented in [5], [37] and is described in this section. The baseline architecture has one dedicated hardware module to run convolutional layers and another to run fully-connected layers in parallel (see Figure 1). This allows for different optimization techniques to be applied to convolutional or fully-connected layers. Each block executes one layer at a time and must be configured before running each layer. The configuration specifies the number of kernels, their size, memory addresses, the existence of the pooling layer and its window size, and the scale of the fixed-point format. The input image and the feature maps are loaded to on-chip memory in order to be processed. Depending on the size of the image and intermediate feature maps, they may or may not fit in the on-chip memory. If they do not fit, the image or the feature maps are divided and processed in pieces. The output feature maps generated by a layer are stored back in external memory, and reloaded for the next layer. This increases the volume of data communication with the external memory but allows for the execution of CNNs whose maps do not fit in the available on-chip memory.
Both the convolutional and the fully connected modules have a cluster of Processing Elements (PEs) comprising local memories and computing units. Pooling layers and activation functions are implemented outside the PE clusters, since they are applied only to the result of convolutions and inner products. The execution flow of a CNN model in the baseline architecture consists of the sequential application of the following steps to each layer: loading feature maps and weights to on-chip memory, executing convolutions or inner-products, and storing output maps in external memory. The process takes into consideration the size and number of on-chip memories. If the input maps do not fit in the feature map memory then they are loaded in smaller chunks. If the number of kernels is higher than the number of weight memories, they are loaded in small groups.
The architecture is controlled at three levels. The convolutional and fully connected modules each have a local configuration controller, and the complete accelerator has a general controller. The configuration of each layer is done by a host processor, including the configuration of the direct memory access (DMA) modules that read kernels from external memory and read/write feature maps to external memory. The PE cluster of the convolutional module has an array of cores organized by lines and columns. Each line of cores is connected to a single port of the feature map memory and therefore executes over the same block of activations. Each core of the line receives a different weight kernel. Therefore, each core in a line generates an activation of a different output map. Cores in the same column receive the same weight kernel. Each core in a column receives a different block of activations from the feature map memory. Thus, cores in the same column generate activations of the same feature map (figure 2). Each core has a parallel multiply-add (MAD) unit to calculate inner-products. Multiple weights and activations are read and processed in parallel by the MAD units. The baseline architecture has been implemented with 8-bit fixed-point data. Memory ports are 64-bit wide. Hence, 8 activations and weights are read and processed in parallel by each parallel MAD unit.
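The line/column organization of the convolutional PE cluster can be illustrated with a small software model. This is only a sketch under stated assumptions: the function names `dot` and `pe_cluster_step` are hypothetical, and the model ignores timing, memory ports and the 8-element parallelism of the MAD units. It only shows which activation block and which kernel reach each core.

```python
def dot(activations, weights):
    """Inner product of two equally long vectors."""
    return sum(a * w for a, w in zip(activations, weights))


def pe_cluster_step(activation_blocks, kernels):
    """Model one pass of the PE cluster.

    activation_blocks[line] is the block shared by every core in that line.
    kernels[col] is the weight kernel shared by every core in that column.
    result[line][col] is one activation of output map `col`, computed from
    the activation block assigned to `line`.
    """
    return [[dot(block, kernel) for kernel in kernels]
            for block in activation_blocks]


# Two lines (two activation blocks) and three columns (three kernels).
blocks = [[1, 2], [3, 4]]
kernels = [[1, 0], [0, 1], [1, 1]]
result = pe_cluster_step(blocks, kernels)
```

In this model, each column of `result` collects activations of the same output feature map, and each row collects activations of different output maps computed from the same input block, mirroring the description above.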
The baseline architecture can execute convolutions with different window sizes with the same hardware module. This is done by reading activations and weights as long vectors. The inner product of these two vectors is computed by the parallel MAD units.
Let us consider a set of input feature maps with size $x_p \times y_p \times z_p$, where $x_p \times y_p$ is the size of the maps and $z_p$ is the number of maps. One output activation is the result of the inner product between a kernel with size $x_k \times y_k \times z_k$ and the corresponding input activations. Both 3D kernels and activations are read sequentially from memory following the order z, x, y. The addresses to read the activations in this order are generated by address generators configured before running the layer. Formally, the inner product between a kernel and the corresponding activations is given by:
$$DP = \sum_{i=0}^{y_k-1} \sum_{j=0}^{x_k z_k-1} W_{i x_k z_k + j} \times A_{firstAddr + i x_p z_p + j} \qquad (1)$$
where firstAddr is the address of the first activation of the block being convolved. The complete convolution is obtained by sliding the kernel over the input maps with a particular stride, while applying equation 1 to each position. The output activation is then stored back in the feature map memory, or first pooled over the pooling window and then stored in the map memory. The advantage of this method is that it works regardless of the size of kernels or convolution windows.
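The inner product of equation 1 can be checked against a direct 3D window sum with a short sketch. The assumptions here are ours: data are stored with z varying fastest, then x, then y (following the z, x, y read order stated above), so the activation at (x, y, z) sits at address (y · x_p + x) · z_p + z. The helper names `flatten_zxy`, `inner_product` and `direct_window` are hypothetical.

```python
def flatten_zxy(volume):
    """Flatten volume[y][x][z] into a list with z fastest, then x, then y."""
    return [value for row in volume for pixel in row for value in pixel]


def inner_product(W, A, first_addr, x_k, y_k, z_k, x_p, z_p):
    """Equation 1: one output activation as a long inner product.

    Each kernel row is a contiguous run of x_k * z_k values; the matching
    activation rows are x_p * z_p addresses apart in the feature map memory.
    """
    total = 0
    for i in range(y_k):
        for j in range(x_k * z_k):
            total += W[i * x_k * z_k + j] * A[first_addr + i * x_p * z_p + j]
    return total


def direct_window(kernel, ifm, x0, y0):
    """Reference: explicit 3D sum over the window whose corner is (x0, y0)."""
    return sum(kernel[ky][kx][kz] * ifm[y0 + ky][x0 + kx][kz]
               for ky in range(len(kernel))
               for kx in range(len(kernel[0]))
               for kz in range(len(kernel[0][0])))
```

The same `inner_product` routine serves any window size, since only the loop bounds and the row stride change. That is the property the text relies on to keep the hardware structure independent of the kernel shape.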
The convolution of the 3D kernel with the 3D input map is given by Algorithm 1, considering the set of all 2D input feature maps (which together form a 3D input map of size $x_p \times y_p \times z_p$), a 3D kernel of size $x_k \times y_k \times z_k$, a pooling window of size $x_{pool} \times y_{pool}$ and a stride of size t. In the algorithm, poolFunction is the maximum or average pooling function. The startAddr function adds the correct offset to the address pointer of the feature map memory, depending on the next activation to be calculated. The startAddr function takes as arguments the size of the input feature map, the pooling size, and the stride value, and is computed relative to the first memory address where the feature map is stored (initialAddr).
Algorithm 1 Convolution With a 3D Kernel
Input: Set of all input 2D feature maps (3D input map) and one 3D kernel of weights
Output: One output feature map, the result of the convolution of the 3D input map with the 3D kernel
for r = 0 to $y_p/t - 1$ do
    for m = 0 to $x_p/t - 1$ do
        poolVar ⇐ 0
        for l = 0 to $x_{pool} - 1$ do
            for k = 0 to $y_{pool} - 1$ do
                $dp = \sum_{i=0}^{y_k-1} \sum_{j=0}^{x_k z_k-1} W_{i x_k z_k + j} \times A_{startAddr(r,m,l,k) + i x_p z_p + j}$
                poolVar ⇐ poolFunction(poolVar, dp)
            end for
        end for
        neuron(m,r) ⇐ poolVar
    end for
end for
As described in Algorithm 1, each output neuron is calculated as the inner product of the kernel and an IFM block of activation values using equation 1. For example, considering a kernel of size 3 × 3 × 128, each output neuron is calculated as the inner product of the kernel and a 3 × 3 × 128 activation block. The 3D kernel slides over the whole input map, spanning the $x_p \times y_p$ space. When the stride is greater than one, the 3D kernel slides over the input map, skipping over input activation values according to the stride size. For example, a stride of two means that every other activation is skipped, reducing the output map to half the size of the input map in each dimension. The algorithm also includes the optional pooling step. In this case, the output neuron is generated only after computing all neurons in the pooling window. The algorithm sequentially calculates all neurons in the pooling window in order to merge the pooling layer with the convolutional layer.
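A software rendering of Algorithm 1 may help to follow how pooling is merged into the convolution loop. The sketch below is not the authors' implementation. In particular, the body of `start_addr` is our own assumption (non-overlapping pooling windows, no padding, z-fastest storage), and the loop bounds are computed for a valid convolution instead of being copied from the algorithm. As in Algorithm 1, the pooling variable starts at 0, so with max pooling any negative result is clamped to 0.

```python
def start_addr(r, m, l, k, initial_addr, x_p, z_p, x_pool, y_pool, t):
    """Assumed address of the first activation of the block used for output
    position (m, r) and pooling-window offset (l, k)."""
    x = (m * x_pool + l) * t
    y = (r * y_pool + k) * t
    return initial_addr + (y * x_p + x) * z_p


def conv_pool(W, A, initial_addr, x_p, y_p, z_p, x_k, y_k,
              x_pool=1, y_pool=1, t=1, pool_function=max):
    """Convolve one 3D kernel with a 3D input map and pool the result."""
    z_k = z_p  # the kernel spans all input maps
    out_x = ((x_p - x_k) // t + 1) // x_pool
    out_y = ((y_p - y_k) // t + 1) // y_pool
    ofm = [[0] * out_x for _ in range(out_y)]
    for r in range(out_y):
        for m in range(out_x):
            pool_var = 0
            for l in range(x_pool):
                for k in range(y_pool):
                    base = start_addr(r, m, l, k, initial_addr,
                                      x_p, z_p, x_pool, y_pool, t)
                    dp = 0
                    for i in range(y_k):
                        for j in range(x_k * z_k):
                            dp += W[i * x_k * z_k + j] * A[base + i * x_p * z_p + j]
                    pool_var = pool_function(pool_var, dp)
            ofm[r][m] = pool_var
    return ofm


# 4 x 4 single-channel map holding 0..15, stored row by row.
feature_map = list(range(16))
pooled = conv_pool([1], feature_map, 0, 4, 4, 1, 1, 1, x_pool=2, y_pool=2)
summed = conv_pool([1] * 9, feature_map, 0, 4, 4, 1, 3, 3)
```

With a 1 × 1 kernel of weight 1, the first call reduces to plain 2 × 2 max pooling. The second call is a 3 × 3 box filter without pooling.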
The PE cluster of the fully-connected module consists of a matrix of cores with a structure identical to the PE cluster for convolutions. Each line of cores receives the activations of an image in the batch memory. Compared to the convolutional module, there is a major simplification in the address generators. Instead of an image convolution, a single dot product of the complete batched image and the kernel is performed. Parallel computation of dot products of a batched image and different kernels is possible by using multiple cores in a cluster line. The number of cores is limited by the available memory bandwidth, and is in general much lower than in the convolution PE cluster. This architectural difference between the two PE clusters is the main reason for having independent modules for convolutional and fully-connected layers, as it improves the hardware efficiency.

IV. HYBRID ARCHITECTURE
This work modifies and extends the baseline architecture to support the execution of layers with different weight sizes (8, 2 and 1 bit). Two new hybrid cores are proposed, which use these weight sizes and are able to efficiently calculate multiple dot-products. The size of the activations is kept fixed, since the precision of activations has a higher impact on the network accuracy than the precision of weights. The architecture described considers 8-bit activations but the design can straightforwardly support other activation sizes.
In the baseline architecture, each core receives eight 8-bit activations and eight 8-bit weights of a kernel in parallel. The new hybrid core always receives 64 bits of activations and 64 bits of weights. The dot-product level of parallelism is always 8, since there are only 8 different activations available in each cycle. However, in layers whose weights are represented with fewer bits, the 64 bits of weights can represent more than 8 weights. Since there are only 8 activations available, the 64-bit weight word is used to represent weights of different kernels.
The hybrid cores proposed in this work are able to calculate a number of dot-products in parallel, depending on the weight sizes. Consider, for example, a CNN where some layers have 8-bit activations and 8-bit weights and other layers have 8-bit activations and 2-bit weights. The hybrid core will execute one dot-product between eight 8-bit activations and eight 8-bit weights, in the first case, and four distinct dot-products between eight 8-bit activations and eight 2-bit weights, in the second case.
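The relation between weight size and the number of parallel dot-products follows from simple arithmetic on the 64-bit weight word, as the sketch below illustrates. The bit layout is purely an assumption made for illustration (the text does not specify how weights of different kernels are ordered inside the word), and `unpack_weight_word` is a hypothetical helper. Fields are returned as raw unsigned values, since their numeric interpretation depends on the core.

```python
def unpack_weight_word(word, weight_bits):
    """Split a 64-bit weight word into groups of eight raw weight fields.

    Each group feeds one dot-product against the same eight activations,
    so the number of groups is the number of kernels served per cycle:
    1 group for 8-bit weights, 4 for 2-bit weights, 8 for 1-bit weights.
    Assumed layout: field k of group n starts at bit (n * 8 + k) * weight_bits.
    """
    if weight_bits not in (8, 2, 1):
        raise ValueError("supported weight sizes are 8, 2 and 1 bit")
    n_groups = 64 // (8 * weight_bits)
    mask = (1 << weight_bits) - 1
    groups = []
    for n in range(n_groups):
        group = []
        for k in range(8):
            shift = (n * 8 + k) * weight_bits
            group.append((word >> shift) & mask)
        groups.append(group)
    return groups


eight_bit = unpack_weight_word(0x0807060504030201, 8)
two_bit = unpack_weight_word(0xE4, 2)
one_bit = unpack_weight_word(0b101, 1)
```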
The two hybrid cores proposed herein are the following: one for 8- and 2-bit weights, named C8:82, and another for 8- and 1-bit weights, named C8:81. Their architectures are described in the following two sub-sections.

A. HYBRID CORE FOR 8-BIT ACTIVATIONS AND 8/2-BIT WEIGHTS - C8:82
This subsection details the hybrid core that supports the execution of two different activation × weight representations: 8×8 and 8×2 (C8:82). As stated above, using a different fixed size for activations is straightforward, since the architecture is generic for this parameter.
The approach proposed for the hybrid calculation consists in designing the core to compute four 8 × 2 dot-products of eight terms each, and to treat their outputs as four partial products when the layer requests an 8 × 8 dot-product.
Let us consider eight 8-bit activations, $A_0, A_1, \ldots, A_7$, that are read in parallel from the on-chip memory in each clock cycle. Weights are read from the weight memory. With 2-bit weights, 4 groups of eight 2-bit weights are read in parallel: $W_{00}, W_{01}, \ldots, W_{07}$, $W_{10}, W_{11}, \ldots, W_{17}$, $W_{20}, W_{21}, \ldots, W_{27}$ and $W_{30}, W_{31}, \ldots, W_{37}$. Then, the 4 dot products, $DP_0$, $DP_1$, $DP_2$ and $DP_3$, are calculated as:
$$DP_0 = \sum_{k=0}^{7} A_k \times W_{0k} \qquad (3)$$
$$DP_1 = \sum_{k=0}^{7} A_k \times W_{1k} \qquad (4)$$
$$DP_2 = \sum_{k=0}^{7} A_k \times W_{2k} \qquad (5)$$
$$DP_3 = \sum_{k=0}^{7} A_k \times W_{3k} \qquad (6)$$
These calculations are done using 4 dot-product units, each implemented with 8 cascaded MAD units (Figure 3). With 8-bit weights, the eight weights $W_0, W_1, \ldots, W_7$ are set as the concatenation of four 2-bit sections, $W_k = W_{3k} W_{2k} W_{1k} W_{0k}$, where $k = 0, 1, 2, \ldots, 7$. Then, the 4 DP core outputs correspond to partial products that must be added to calculate the final 8 × 8 dot-product:
$$DP = DP_3 \times 2^6 + DP_2 \times 2^4 + DP_1 \times 2^2 + DP_0 \qquad (7)$$
Equation 7 is executed only after calculating the complete convolutions. Therefore, it is implemented outside the core, and the addition is shared by all cores of a PE cluster line, followed by the ReLU activation function (not shown in figure 3).
Each cascaded MAD unit determines $R = X + A \times W$, where $A = \sum_{i=0}^{7} a_i \times 2^i$ is an 8-bit activation, $W = w_1 \times 2 + w_0$ is a 2-bit weight and $X = \sum_{i=0}^{9} x_i \times 2^i$ is the 9-bit output of the previous MAD.
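The combination of the four partial dot-products into an 8 × 8 dot-product (Equation 7) can be verified numerically. The sketch below assumes 8-bit two's complement weights, with the three lower 2-bit sections read as unsigned values and the top section read as a signed value, as described for mode 0 of the core. The function names are hypothetical.

```python
def split_weight(w):
    """Split a signed 8-bit weight into four 2-bit sections, least
    significant first. The top section is signed ({-2, -1, 0, 1}); the
    other three are unsigned ({0, 1, 2, 3})."""
    u = w & 0xFF
    sections = [(u >> (2 * n)) & 0b11 for n in range(4)]
    if sections[3] >= 2:
        sections[3] -= 4
    return sections


def dot_8x8_from_partials(activations, weights):
    """Compute four 8 x 2 partial dot-products and merge them."""
    sections = [split_weight(w) for w in weights]
    dp = [sum(a * sections[k][n] for k, a in enumerate(activations))
          for n in range(4)]
    # Weighted sum of the partials: section n has weight 2^(2n).
    return dp[3] * 64 + dp[2] * 16 + dp[1] * 4 + dp[0]


activations = [1, 2, 3, 4, 5, 6, 7, 8]
weights = [-128, 127, -1, 0, 5, -77, 64, -3]
merged = dot_8x8_from_partials(activations, weights)
direct = sum(a * w for a, w in zip(activations, weights))
```

Because the merge is linear, it can be applied once to the accumulated partial sums of a whole convolution instead of after every group of eight products. This is consistent with the addition being placed outside the core.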
The 2-bit sections represent different numbers depending on the layer weight size. If the weight size is 8 bits (mode 0 of operation), the 2-bit sections are partials of the 8-bit weight and represent the numbers {0, 1, 2, 3}, except for the most significant two bits which represent the numbers {−2, −1, 0, 1}. If the weight size is two bits (mode 1 of operation) then the 2-bit sections represent the numbers {−1, 0, 1}. Let us first consider the case of weights represented with the 2-bit signed numbers {−1, 0, 1} (mode 1). In this mode, X is added with one of the multiples of A in {-A, 0, A}, depending on the 2-bit weight value, as shown in table 1.
The 6-input LUTs of FPGAs can implement two 5-input functions as long as the inputs are common. The addition of X with one of these multiples can be efficiently implemented with (m + 3) 6-input LUTs and the carry-chain, where m = 8 is the number of bits of the activation. Note that the generate and propagate signals of the carry chain are functions of only four variables ($w_0$, $w_1$, $x_i$, $a_i$). The LUTs produce the appropriate generate and propagate signals (and carry-in) to conditionally subtract/add A (or 0) from/to X.
For the multiplication by 8-bit weights (mode 0), one first determines the partial products of the activation by two bits of the weight. In this mode, the two bits of the weight represent the unsigned numbers {0, 1, 2, 3}, except the most significant 2 bits, which represent the signed numbers {−2, −1, 0, 1}.
When $W$ is a signed 2-bit number, we need the multiples $A$ and $2A$ to implement $X + A \times W$. In this case, the generate and propagate signals are functions of the five variables $(w_0, w_1, x_i, a_i, 2a_i)$, and thus can be implemented with the same number of LUTs as with the signed 2-bit independent weights described before. However, in the case of the multiplication by the unsigned 2-bit weights {0, 1, 2, 3}, the generate and propagate signals are functions of the six variables $(w_0, w_1, x_i, a_i, 2a_i, 3a_i)$, since the $3\times$ multiple is also needed, and cannot be implemented with a single LUT.
However, using a modified Booth recoding algorithm [38] and the implementation method proposed in [39], it is possible to avoid the 3× multiple, and implement the addition of a variable using a 5-variable function with a single level of LUTs.
The modified Booth algorithm recodes two bits at a time with overlapping 3-bit groups (a '0' is appended to the right of the number). The 3-bit recoding is done according to table 2. With this recoding, the product is expressed as a sum of multiples taken from $\{-2A, -A, 0, A, 2A\}$, of the form $(\ldots + (-2) \times 2^4 + 2 \times 2^2 + (-1)) \times A$. In this case, the propagate and generate functions of the multiply-accumulate operation $R = X + A \times W$ are still functions of the six variables $w_0, w_1, w_{-1}, a_i, 2a_i, x_i$, since recoding requires 3-bit inputs. However, using the method proposed in [39], variable $x$ can be added to a 5-variable function with a single level of LUTs.

Finally, both functions must be implemented with a single MAD, which implements both operation modes, 8-bit weights (mode 0) and 2-bit weights (mode 1). Therefore, an extra input is needed to specify the mode and, again, an addition of variable $x$ with a 6-variable function is needed. To avoid this 6-input function, a different recoding is considered that takes into account the operation mode (see recoding in table 3). Given the mode of operation, the weights are recoded into three bits $(wr_2, wr_1, wr_0)$: one bit specifies the sign ($wr_2$) and the other two encode the multiples 0, $A$ and $2A$. In this method, the weights are first recoded and then sent to the MAD unit, which applies a function $f_i(wr_2, wr_1, wr_0, 2a_i, a_i)$. Input $wr_2$ is the carry-in bit needed to implement the 2's complement of the multiples. The recoded weights are shared by all cores in the same column of the PE cluster, and a single LUT can generate two recoded bits. This way, the resource overhead associated with recoding is very low.

The final C8:82 circuit diagram is illustrated in figure 5. The core inputs are the activation multiples $A$ and $2A$, the 2-bit weights and the MAD's second operand sign. In addition, a pipeline level has been added to improve the throughput of the circuit (other pipeline levels can be easily considered). The chain of eight MAD units has been broken into two sub-chains of four MAD units each, in order to reduce the delay of the complete cell.
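How modified Booth recoding removes the $3\times$ multiple can be illustrated with a short sketch. The code below implements the standard radix-4 modified Booth recoding, which is an assumption here: it is not necessarily identical to the recoding of table 2, and it does not model the mode-dependent recoding of table 3. The function name `booth_radix4` is hypothetical.

```python
def booth_radix4(weight, bits=8):
    """Recode a signed weight into radix-4 digits in {-2, -1, 0, 1, 2}.

    Digits are returned least significant first, so that
    weight == sum(d * 4**j for j, d in enumerate(digits)).
    """
    pattern = weight & ((1 << bits) - 1)  # two's-complement bit pattern
    padded = pattern << 1                 # append a '0' to the right of the number
    digits = []
    for j in range(bits // 2):
        group = (padded >> (2 * j)) & 0b111  # overlapping 3-bit group
        b_hi, b_mid, b_lo = (group >> 2) & 1, (group >> 1) & 1, group & 1
        digits.append(-2 * b_hi + b_mid + b_lo)
    return digits


# Every signed 8-bit weight is rebuilt from multiples 0, +/-A and +/-2A only.
for w in range(-128, 128):
    digits = booth_radix4(w)
    assert all(-2 <= d <= 2 for d in digits)
    assert sum(d * 4**j for j, d in enumerate(digits)) == w
```

For instance, the value 3, which would need the $3A$ multiple as an unsigned 2-bit section, is recoded as $4 - 1$, so only $A$ and its negation are required.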
The cores of the baseline architecture have been replaced with the new hybrid C8:82 core. Recode units have been added after the weight memories in both the convolutional and fully connected modules. The conditional addition of the partial products has been included before the shift and activation function. Figure 6 illustrates the new architecture of the convolutional module. The fully connected module follows a similar structure, as explained.

B. HYBRID CORE FOR 8-BIT ACTIVATIONS AND 8/1-BIT WEIGHTS -C8:81
The design of the hybrid C8:81 core, supporting the product of 8-bit activations by 8-bit weights, 8 × 8 (mode 0) and the product of 8-bit activations by 1-bit weights, 8 × 1 (mode 1), follows a similar approach as that for the design of core C8:82.
The C8:81 core generates 8 dot products, conditionally added as partials of an 8 × 8 product for 8-bit weight layers. For this core, a more efficient hardware solution is implemented by multiplying and adding two activations in a single unit.
Each double multiply-add unit (MAD2) multiplies and accumulates two activations and two 1-bit weights, and adds the result of the previous MAD2. Four MAD2 units work together to calculate the dot-product between 8 activations and 8 weights (see Figure 7). Considering 8-bit weights, the dot-product $A \cdot W$ between eight activations and eight 8-bit weights $W_0, W_1, \ldots, W_7$ is calculated by considering each weight as $W_k = W_{k7} W_{k6} W_{k5} W_{k4} W_{k3} W_{k2} W_{k1} W_{k0}$, where $k = 0, 1, 2, \ldots, 7$, and then adding the eight partial products as:

$$A \cdot W = \sum_{j=0}^{7} DP_j \times 2^j, \quad DP_j = \sum_{k=0}^{7} A_k \times W_{kj} \tag{9}$$

As before, equation 9 is implemented outside the core and before applying the activation function. Each MAD2 unit implements the dot-product between two 8-bit activations, $A_a$ and $A_b$, and two 1-bit weights, $W_a$ and $W_b$, and adds it to the output of the previous MAD2, $X$. Formally, it calculates:

$$R = X + A_a \times W_a + A_b \times W_b$$

The 1-bit weight has different interpretations depending on the weight size. If the size is 1 bit, it represents the numbers {−1, 1}; if the size is 8 bits, the 1-bit weights are partials of the 8-bit weight multiplication and represent the numbers {0, 1}, except for the most significant bit, which represents the numbers {0, −1} (2's complement property).
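The bit-level decomposition of mode 0 in the C8:81 core can be sketched behaviourally. This is an illustrative Python model under stated assumptions, not the circuit: the names `bit_planes`, `mad2` and `dot_product_bitwise` are hypothetical, activations are assumed unsigned, and the weighted addition that the text places outside the core is folded into the same function for brevity.

```python
def bit_planes(weight):
    """Return the eight 1-bit partials of a signed 8-bit weight, LSB first."""
    pattern = weight & 0xFF
    bits = [(pattern >> j) & 1 for j in range(8)]
    bits[7] = -bits[7]  # most significant bit represents {0, -1}
    return bits


def mad2(x, a_a, a_b, w_a, w_b):
    """One double multiply-add: R = X + A_a * W_a + A_b * W_b."""
    return x + a_a * w_a + a_b * w_b


def dot_product_bitwise(activations, weights):
    """8x8 dot product assembled from eight 8x1 dot products."""
    planes = [bit_planes(w) for w in weights]
    total = 0
    for j in range(8):                # one dot product per weight bit
        x = 0
        for k in range(0, 8, 2):      # four cascaded MAD2 units
            x = mad2(x, activations[k], activations[k + 1],
                     planes[k][j], planes[k + 1][j])
        total += x * 2**j             # weighted addition of the partials
    return total


activations = [12, 200, 0, 255, 7, 90, 33, 128]  # arbitrary unsigned 8-bit values (assumption)
weights = [-128, 127, -1, 3, 64, -77, 0, 5]      # arbitrary signed 8-bit values
assert dot_product_bitwise(activations, weights) == sum(
    a * w for a, w in zip(activations, weights))
```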
Let us consider two 8-bit activations, $A_a$ and $A_b$, two 1-bit weights, $W_a$ and $W_b$, and the selector mode to choose the weight size or mode of operation. The operations to be executed depend on the weights according to table 4. According to these functions, $X$ is added to or subtracted from the 7-variable functions $A_a$, $A_b$, $(A_b + A_a)$ or $(A_b - A_a)$. As explained above, this function has to be reduced to five variables so it can be implemented with a single level of LUTs. Let us consider $Y = X + (A_a + A_b)$ and rewrite the expressions in table 4 in terms of $Y$. The result is in table 5. For the first three combinations of weights, some function $f$ or its double $2 \times f$ is subtracted from $Y$. For $W_a = W_b = 1$, there is no subtraction (0 is subtracted from $Y$). Hence, 3 MAD2 inputs are selected using 3 multiplexers depending on the mode (see figure 8). Then, a 5-input function ($W_a$, $W_b$ and the 3 multiplexer outputs) is subtracted from $Y$.
Being a 5-input function, it can be implemented with a single level of LUTs.
In the first MAD2 of the cascade, $X = 0$, making $Y = A_a + A_b$, which is already the input of the first multiplexer. However, $X$, which is the result of the previous MAD2 block, needs to be added to $A_a + A_b$ before being input to the MAD2. This would require an extra adder between MAD2s whose outputs are not shared with the MAD2s of other cells. A simpler solution is to do all the required additions beforehand, such that the input of the first MAD2 block is $Y = \sum_{i=0}^{k-1} A_i$, where $k$ is the number of activations in the (sub-)chain of MAD2s. Doing this addition at the beginning avoids adders between MAD2s, and allows $Y$ to be shared by the other $DP_i$ cells in the core.
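The rewriting in terms of $Y$ and the pre-addition of all activations can be verified with a small model. The sketch below is an assumption-laden reconstruction, since tables 4 and 5 are not reproduced here: it covers 1-bit weights in {−1, 1} (mode 1) and the unsigned {0, 1} bit sections of mode 0, and leaves out the most significant bit of mode 0. The names `correction` and `mad2_chain` are hypothetical.

```python
def correction(a_a, a_b, w_a, w_b, mode):
    """Amount one MAD2 subtracts from Y; zero when W_a = W_b = 1."""
    low = -1 if mode == 1 else 0    # the weight value that triggers a subtraction
    scale = 2 if mode == 1 else 1   # mode 1 subtracts the doubled function
    f = (a_a if w_a == low else 0) + (a_b if w_b == low else 0)
    return scale * f


def mad2_chain(activations, weights, mode):
    """Dot product computed by subtracting corrections from the pre-added sum Y."""
    y = sum(activations)  # all additions done beforehand, shareable across cells
    for k in range(0, len(activations), 2):
        y -= correction(activations[k], activations[k + 1],
                        weights[k], weights[k + 1], mode)
    return y


activations = [12, 200, 0, 255, 7, 90, 33, 128]  # arbitrary unsigned 8-bit values (assumption)
for pattern in range(256):  # all combinations of eight 1-bit weights
    bits = [(pattern >> k) & 1 for k in range(8)]
    signs = [1 if b else -1 for b in bits]
    assert mad2_chain(activations, bits, 0) == sum(
        a * w for a, w in zip(activations, bits))
    assert mad2_chain(activations, signs, 1) == sum(
        a * w for a, w in zip(activations, signs))
```

The check relies on the identity $\sum_k A_k W_k = \sum_k A_k - \sum_k A_k (1 - W_k)$, in which $(1 - W_k)$ is 0 or 1 for the {0, 1} sections and 0 or 2 for weights in {−1, 1}.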
The extra logic required to add the activations before the MAD2 sub-chain and to implement the multiplexers is shared by all cells of the core and by all cores of the PE cluster. This sharing considerably reduces the overhead associated with the extra logic and makes the solution quite efficient. The final C8:81 circuit diagram is illustrated in figure 9. The MAD2 units have the same structure as the MAD units used in the C8:82 core (see figure 4), but in this case function $f_i$ operates on the signals $i_n$, which correspond to the MAD2 inputs shown in figure 9. As in the C8:82 core, the MAD2 chain is divided into two sub-chains of two MAD2 units each, in order to reduce the delay of the complete cell.

V. RESULTS
The proposed hybrid cores have been implemented using the Vivado 2019.1 Design Suite and run on the ZedBoard FPGA card. The ZedBoard card comprises a low-density ZYNQ7020 device belonging to the Artix-7 FPGA series, which contains a dual-core ARM Cortex-A9 CPU. The circuit operates at 200 MHz. To demonstrate its scalability, the proposed hybrid architecture has also been implemented in a Xilinx SoC ZC706 Evaluation Kit, containing a ZYNQ7045 device of the Kintex-7 FPGA series, also featuring a dual-core ARM Cortex-A9 CPU. With this device, an operating frequency of 240 MHz has been used. ZYNQ FPGAs have four 64-bit High-Performance ports working at 150 MHz, which give the programmable logic direct access to the external memory. The measured external memory bandwidth is 3.3 GBytes/s on the ZedBoard and 4.2 GBytes/s on the ZC706 board.
All architectures have been described in VHDL, simulated and implemented with Vivado. In particular, the hybrid cores have been described by direct instantiation of the FPGA primitive instances LUT6_2 and CARRY4. The AlexNet and VGG16 networks have been trained for each hybrid size combination using a framework developed in the scope of this work, which is an extension of Ristretto [29] integrated with Caffe [40]. To reduce the training time of a large set of CNNs with different hybrid sizes, only the new architectures 8:8228, 8:8218 and 8:8118 have been considered. The framework also allowed validating the results of the FPGA implementations. On a first validation level, the outputs of all layers have been compared with their expected values, in order to determine the correctness of the internal node values. On a second validation level, the classification results for each image have been compared with the framework inference results. All networks have been trained for the ImageNet data set [41]; their achieved accuracy is indicated in table 6. There is an accuracy degradation of less than 4.5% between the baseline architecture and the 8:8118 architecture. This is expected since there is a considerable reduction in weight size.

A. AREA RESULTS
The convolutional and fully connected cores determine the FPGA area occupation of the architecture. The cores have been designed so that both LUT and DSP resources of the FPGA are used in a balanced way. To achieve this, different implementations of the cores have been considered with different numbers of DSPs. Table 7 shows the resources occupied by the cores in 2 alternative implementations, A and B. The overall resource utilization of the architectures when implemented in the ZYNQ7020 and ZYNQ7045 devices is shown in tables 8 and 9, respectively. The architectures have been configured for best performance and tailored for the target CNN network. All architectures have been designed with a comparable number of resources, for a fair comparison. The batch sizes are larger for AlexNet, since the number of weights of the fully connected layers is almost 30× greater than the number of weights of the convolutional layers. For the VGG16 network, the distribution of weights is more even, allowing for a smaller batch. Additionally, since the number of resources of the hybrid cores varies with the weight size, the total number of cores of the architectures also varies with the weight size. For lower weight sizes, the hybrid core is larger, and fewer cores can be mapped to a particular FPGA device. The PE clusters occupy most of the total resources of the architecture (around 90%), with an area given by the number of cores multiplied by the resources of a single core (tables 8 and 9).

B. PERFORMANCE RESULTS
The performance results obtained when running the inference step for AlexNet are shown in table 10 for all architectures. The hybrid architectures achieve throughputs ranging from 448 to 508 images per second on the ZYNQ7020 device, and from 1429 to 1639 images per second on the ZYNQ7045 device. These results clearly show that, by using hybrid data quantization, it is possible to run large CNNs in low-density FPGAs while meeting real-time requirements (>30 images per second). Compared to the baseline architecture, the best hybrid architecture improves the image throughput by 2.2× on the ZYNQ7020 device and 2.1× on the ZYNQ7045 device. The highest measured performance with AlexNet is 735 GOPS for the ZYNQ7020 and 2.38 TOPS for the ZYNQ7045. Considering VGG16, the measured performance increases to 1.34 TOPS for the ZYNQ7020 and 2.5 TOPS for the ZYNQ7045. These results come at the cost of a small accuracy reduction: from 1.5 to 3.7%.
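The relation between the reported GOPS figures and the image throughputs can be made explicit by dividing one by the other, which gives the implied number of operations per image. The sketch below uses only figures quoted in this article (the VGG16 throughputs of 43 and 81 images per second are those given in the abstract). It assumes that each peak performance and peak throughput refer to the same configuration, and the results are approximate because the inputs are rounded. The name `gop_per_image` is hypothetical.

```python
# (measured performance in GOPS, throughput in images per second)
reported = {
    ("AlexNet", "ZYNQ7020"): (735.0, 508),
    ("AlexNet", "ZYNQ7045"): (2380.0, 1639),
    ("VGG16", "ZYNQ7020"): (1340.0, 43),
    ("VGG16", "ZYNQ7045"): (2500.0, 81),
}


def gop_per_image(gops, images_per_second):
    """Implied giga-operations needed to classify one image."""
    return gops / images_per_second


implied = {key: gop_per_image(*value) for key, value in reported.items()}
for (network, device), gop in implied.items():
    print(f"{network} on {device}: about {gop:.2f} GOP per image")
```

Under these assumptions, the two devices should yield nearly the same value for a given network, which serves as a consistency check on the reported numbers.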
In terms of performance efficiency, measured as performance/kLUT and performance/DSP, the results show that the ZYNQ7020 is more efficient. The reason is that, despite the 4-fold increase in the resources used when moving from the ZYNQ7020 to the ZYNQ7045, the external memory bandwidth does not increase proportionally: it only grows from 3.3 to 4.2 GBytes/s. This shows the importance of the memory bandwidth in the design of CNN accelerators.
For the VGG16 network, the performance results follow a similar trend (table 11). VGG16 is a larger network than AlexNet and has a larger computation/communication ratio, which explains the better performance efficiency when running VGG16. In spite of this, the 4× more resources of the ZYNQ7045 device do not lead to a 4× faster design, since, as already mentioned, the memory bandwidth does not scale proportionally.
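The scaling argument can be quantified with the figures already quoted. The following sketch compares the throughput gain obtained when moving from the ZYNQ7020 to the ZYNQ7045 with the resource and bandwidth ratios. It assumes the peak throughputs stated in the abstract and takes the resource ratio as the approximate 4× mentioned in the text. The helper `ratio` is hypothetical.

```python
# Peak throughput in images per second: (ZYNQ7020, ZYNQ7045)
throughput = {"AlexNet": (508, 1639), "VGG16": (43, 81)}
bandwidth_gbytes_per_s = (3.3, 4.2)  # measured external memory bandwidth
resource_ratio = 4.0                 # approximate figure quoted in the text


def ratio(pair):
    """Gain of the second (larger) device over the first."""
    return pair[1] / pair[0]


print(f"bandwidth gain: {ratio(bandwidth_gbytes_per_s):.2f}x")
for network, pair in throughput.items():
    gain = ratio(pair)
    print(f"{network}: throughput gain {gain:.2f}x, "
          f"{gain / resource_ratio:.0%} of the resource gain")
```

In both cases the throughput gain lies between the bandwidth gain and the resource gain, which is consistent with a design that is partly limited by external memory bandwidth.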
To better understand the influence of the hybrid cores over the execution times of each layer, the average processing times of each layer for the various architectures are shown in figure 10. The first layer always uses 8-bit weights in all architectures. Since the baseline implementation has more cores, the execution time of AlexNet's first layer in the baseline architecture is lower than that of the hybrid architecture. In all other layers, the hybrid architectures achieve the best execution times.
When running VGG16, the hybrid architectures achieve the best execution times for almost all layers. The execution time of the convolutional layers decreases with the weight data size as expected. Exceptions exist for the first and second layers running on the ZYNQ7045 device. For the first layer, the communication time for weights exceeds the execution time. For the second layer, the fact it has only 64 different kernels limits the exposed parallelism, making it useless to increase the number of operations.
In the fully connected layers, a significant difference between the baseline and the hybrid architectures is observed, caused by the time to transfer weights from external to internal memories. Another observation is that the 8:8118 hybrid architecture is slower than the other hybrid architectures in the last layers, when running VGG16 on the ZYNQ7045 device. Once again, this is due to the inability to use all the available MAD units in parallel.

C. COMPARISON WITH THE STATE-OF-THE-ART
The performance and area of the proposed hybrid architectures have been compared with the implementations of previous works using the same FPGAs. The overall comparison results are shown in table 12.
For the ZYNQ7020 device, the proposed hybrid architectures increase the image throughput by more than 5.5× compared to the best state-of-the-art approaches for running AlexNet with 8-bit fixed-point data. All the proposed hybrid architectures achieve a measured performance which is 2.5× higher than the binary network architecture for classifying CIFAR-10 images proposed in [43]. For running the VGG16 network, the proposed 8:8228 and 8:8218 architectures achieve an image throughput which is 12× higher than the architecture proposed in [35], while keeping the same accuracy. The 8:8118 architecture achieves a slightly higher throughput (13.7×), at the cost of a 3.5% accuracy degradation.
The architectures have also been compared with previous works implemented on a ZYNQ7045 device, as shown in table 13. For the AlexNet network, the proposed solutions almost double the image throughput with better performance efficiencies. For the VGG16 network, the proposed hybrid architectures also show better performance than previous works, achieving a throughput 1.5× higher than the pruned solution proposed in [46], and 9.3× higher than the solution proposed in [45].

VI. CONCLUSIONS
This work proposes a new configurable architecture for the execution of CNNs, which efficiently supports hybrid data quantization. The architecture targets low-cost FPGAs, and constitutes an advantageous trade-off between performance (or performance efficiency) and accuracy: it significantly boosts performance and performance efficiency in exchange for a low accuracy degradation. Furthermore, its scalability allows taking advantage of larger FPGAs where more cores can be deployed with a proportional performance increase.
Running the AlexNet network, the proposed hybrid quantization architecture achieves a performance of 735 GOPS on a ZYNQ7020 FPGA, and 2.375 TOPS on a ZYNQ7045 FPGA. For the VGG16 network, the measured performance increases to 1.341 TOPS on a ZYNQ7020 FPGA, and 2.513 TOPS on a ZYNQ7045 FPGA. These results clearly show that it is possible to run large high-performance CNNs on low-density FPGAs for embedded devices, which is a technological contribution to enable the deployment of large CNNs on edge nodes and end devices.
To extend the applicability of the proposed solution to irregular networks, the architecture is now being modified to support particular convolutional layers that exist in irregular networks. Two interesting such layers are the GoogleNet Inception and the ResNet Residual layers. Also, hybrid quantization of both weights and activations is being studied in terms of network accuracy and architectural design.