Efficient, Geometry-Based Convolution

Several computationally intensive applications in machine learning, signal processing, and computer vision call for convolution between a fixed vector and each of the incoming vectors. Often, the convolution need not be exact because a subsequent processing unit, such as an activation function in a neuron network or a visual unit in image processing, can tolerate a computational error, hence allowing the optimization of the convolution algorithm. This paper develops a method of approximate convolution and quantifies its performance in software and hardware. The key idea is to take advantage of the known fixed vector, view a convolution as a dot product, and approximate the angles between the fixed vector and an incoming vector geometrically. We evaluate the proposed method in terms of the accuracy, running time complexity, and hardware power consumption on the field programmable gate array (FPGA) and application-specific integrated circuit (ASIC) hardware platforms. In a benchmark test, the accuracy of the approximate convolution is 3.7% lower than that of the exact convolution, a tolerable loss for machine learning and signal processing. The proposed method reduces the number of operations in the hardware and reduces the power consumption of conventional convolution and the existing approximate convolution by approximately 20% and 10%, respectively, while maintaining the same throughput and latency. We also test the proposed method on 2D convolution and convolutional neural network (CNN). The proposed method reduces the complexity, power consumption for 2D convolution, power consumption for CNN of the conventional method by approximately 22%, 25%, and 13%, respectively. The proposed method of approximate convolution trades off accuracy with running time complexity and hardware power consumption, and it has practical utility in computationally intensive tasks that tolerate a margin of convolutional error.


I. INTRODUCTION
Discrete-time convolution is an important operation in machine learning, signal processing, and computer vision [1]. For example, the neural network architecture [2] AlexNet [1] uses five embedded convolution layers for image recognition to achieve a recognition rate of 84.7%. ResNet-152 [3] uses 152 convolution layers to improve classification accuracy and achieve a recognition rate of 96.5%. The more convolution layers there are, the better the performance is, leading to a massive use of convolution operations.
Although extensive convolution improves performance in one area, it undesirably leads to high computational complexity [4] and power consumption. AlexNet takes 1.4 giga operations per second (GOPS) to process a single 224×224-pixel 2 image, while ResNet takes 22.6 GOPS [5]. This high computational complexity prevents massive convolution from taking place on ubiquitous small devices and Internet of Things (IoT) nodes that have limited computational resources. An extensive number of convolutions also leads to high power consumption, mainly due to memory access and processing elements in the hardware. To see the amount of power used, consider a typical example of massive signal filtering, which performs convolution 10 million times, each time between two 1000-element vectors. On hardware using 45 nm CMOS technology, one memory access consumes 5 pJ of power on a 32-bit SRAM cache and 640 pJ on DRAM [6]. Massive convolution is not possible on a fast SRAM cache because the power consumption of (10M − 1000)(1000)(5pJ) ≈ VOLUME 4, 2016 0.005W on the local memory is too large. In this example, the convolution must be performed on a slower DRAM, consuming (10M − 1000)(1000)(640pJ) ≈ 6.4W of power, an amount similar to CPU power usage [7]. High computational complexity and power consumption are bottlenecks in hardware and require a new design strategy for convolution.
Applications that require massive convolution have a common characteristic that can be exploited for efficiency and lower power consumption, namely, a tolerance for computational errors [8]- [10]. In image processing, a small visual error in an output image is acceptable and often undetectable to humans due to limitations in human perception. In deeplearning classification, exact computation of convolution at a neuron is less significant than the method of inference that is being used. Convolutional neural networks tolerate errors from approximated schemes [11] and accept computations with 4-bit precision without sacrificing inference accuracy [12]. Furthermore, many digital signal processing applications rely on probabilistic models, which have been designed to deal with noisy data. A trade-off between accuracy and hardware-software efficiency is the key to developing a method of convolution.
In this paper, we develop a method of approximate convolution that is sufficiently accurate and consumes low power. This method preprocesses the known, fixed vector in convolution to save subsequent computational effort, treats the convolution as a dot product between two vectors, and efficiently approximates the angles between the two vectors geometrically. The main contributions of this paper are the following: • A method of approximate convolution that is suitable for massive error-tolerant applications. • The hardware architecture of the proposed method for field programmable gate array (FPGA) and applicationspecific integrated circuit (ASIC) platforms. • A comparison of the proposed method with state-of-theart methods in terms of accuracy, running time complexity, number of hardware operations, througput, latency and power consumption.
The proposed method reduces the number of operations in hardware and reduces the power consumption and the area by approximately 25% and 13%, respectively, while incurring a loss of approximately 3.8% in accuracy compared to conventional convolution. The proposed method has practical utility in machine learning, signal processing, and computer vision applications that require massive convolution and can tolerate computational errors. The rest of the paper is organized as follows. We review the existing literature in Section II and give the problem statement in Section III. We develop the proposed method of approximate convolution in Section IV and describe the hardware implementation architectures for 1D and 2D convolutions in Section V. Finally, we evaluate and discuss the performance of the proposed method in Section VI, and conclude the important findings in Section VII.

II. RELATED WORK
Methods of reducing the computational complexity and power consumption of convolution can be broadly classified into two categories: exact convolution and approximate convolution. Exact convolutions are suitable for precise applications, which cannot tolerate errors, and they reduce power consumption by optimizing multiplications in the hardware. Approximate convolutions are suitable for error-tolerant applications, such as those focused on in this paper. We review the related work in both categories.
Exact convolutions divide a convolution operation into subproblems that involve the multiplication of the input vector with constants [13]. To optimize multiplication, related work [14]- [17] uses an adder graph tree structure that expresses addition, subtraction, and bit shifting. The nodes of the graph represent an adder or subtractor, and the edge weights belong to the bit shifts. These tree structures dominate the number of multiplications, while the number of addition and subtraction operations significantly increases, which is the main issue for constant multiplication. M. Kumm et al. [16] obtains the optimal number of addition or subtraction operations by reducing the adder graph. To implement this design, M. Kumm et al. [17] uses a pipeline architecture to optimize the adder depth and timing constraints.A fast, exact multiplication results in a fast, exact convolution.
In addition to research work that optimizes a multiplication, several studies treat an exact convolution as a matrixvector multiplication [18]- [26]. To reduce the complexity, a kernel expansion is used with the approximate Fastfood transform [19]. Other methods that use a fast convolution algorithm include the FFT convolution and Winograd convolution, which transform matrix multiplications into the Hadamard products [20]. Fast-convolution algorithms reduce the number of operations from Θ(n 2 ) to Θ(n log 2 n) and incur some errors from the kernel transformation, where n is the number of rows and the number of columns of the convolution matrix for the Fastfood and FFT convolutions. For the Winograd convolution [25], many transformed matrices exist to provide fast results, although memory issues may occur when amortizing those matrices [21]- [24]. Another issue of the Winograd convolution is numerical instability [26], which occurs, for example, in an implementation [27] due to huge kernels. Fast convolution methods can be implemented in hardware using a parallel and pipeline architecture, which results in a high throughput. However, a drawback of fast convolution methods is in large power consumption. Related work in the category of exact convolution focuses on resource utilization and throughput.
Approximate convolutions use approximate adders or approximate multipliers in circuits. K. Du [28] introduces carry-select adders, which gain speed at the expense of circuit area and achieve mean relative error distances of approximately 2% to 10% compared to a 16-bit adder. Another architecture is the approximate full adder, which modifies the truth table of the full adder circuit [29], [30]. This architecture occupies a small circuit area and reduces the delay of its throughput by approximately 50%, but it has an accuracy of approximately 60%. This level of accuracy is considered low and may cause errors to accumulate in a massive convolution operation. On the other hand, approximate multipliers can be achieved with one of these two approaches. The first approach is to approximate a partial product by using the K-Map for a 2 × 2 bit multiplier to create a 4 × 4 bit multiplier [31]. This approach also includes the approximate multiplication of [32], which generates the partial product by shifting the bits of a multiplicand and has an approximately 10% mean relative error. The second approach is to approximate a multiplier by using approximate compressors [33]- [42], such as the 4-2 approximate compressor. Although the accuracy of the approximate multiplier depends on the multiplication architecture, the mean relative error distance is often 5-15%, while the power reduction compared to exact multiplication is approximately 10-50%. In other words, the greater the power reduction is, the lower the accuracy of the multipliers.
Existing methods of approximate convolution are limited in a fundamental way. The number of addition and multiplication operations cannot be reduced due to the computation equation, which accounts for the restriction of power consumption improvement while maintaining the accuracy of the system. A method that addresses these limitations will break through the restriction on the number of operations, leading to power reduction while sacrificing little accuracy.

III. PROBLEM STATEMENT
We are interested in exploring a convolution implementation that yields better performance per watt by sacrificing some computational accuracy. The performance of an implementation of an algorithm can be measured by means of throughput and latency. To gain better performance, we usually have to pay a price in terms of the area and power consumption of the hardware. In general, for a given computation task, multiple algorithms usually exist by which correct computation results are produced with different performance levels. In turn, there exist various approaches for implementing a given algorithm on a chosen computation platform, resulting in a very large space of possible solutions. Hence, we need to scope down this solution space.
In this paper, we target our convolution implementation on an embedded computation platform whose computational resources are limited in terms of both area and available power, while the needs for high throughput and low latency continue to grow. One approach for overcoming these challenges is to implement the computation of interest as hardware, which is a digital system specifically designed to perform the computation problem. In particular, we target our design for a field programmable gate array (FPGA) or application-specific integrated circuit (ASIC) chip.
With the target hardware setup, we are interested in convolution computation by the following definition. Let x and h denote two discrete-time functions acting as the input signal and the impulse response, respectively. We are interested in applying convolution in cases where the impulse response h has a finite number of tabs (weights) that are predetermined. The input signal x is a window of a streaming discrete-time signal. Moreover, this implementation is for applications that can tolerate some computational error at each sample point of the convolution. In fact, such errors are already inherited in the digital abstraction due to the finite representations of numbers. Furthermore, since we aim for an embedded implementation, we adopt an n-bit fixed-point representation in which the zero level is 2 n−1 for each sample of h and x. Based on this setup, we define the computation task as follows. Let x[j] denote the jth sample point of the streaming input x, j = 0, 1, . . . , M − 1 and let h[m] denote the mth coefficient (tab) of the filter impulse function h, for m = 0, 1, 2, ..., N − 1. Without loss of generality, and by switching the computation task of interest is the convolution of the function x of M points with the function h of N points, which can be described by Eqs. 1 and 2: where h is a vector of size N that stores the coefficients (tabs) of the filter impulse function h (Eq. 3) and x k is a vector storing the N points of data collected from the sliding window covering the input samples from point k − N + 1 to k as described in Eq. (4), where x[j] = 0 for j < 0 and for j ≥ M : However, this nice relationship does not help reduce the complexity of the dot product computation; i.e., it still requires a number of basic computations, multiplication and accumulation (MAC), on the order of Θ(N ). In this paper, we aim to exploit this relationship to reduce the computational complexity by sacrificing some computation accuracy.
Our proposed idea is based on the geometric definition of the dot product following Eq. (5): where a and b are two vectors of the same size. Applying the dot product's geometric interpretation to the computation of y[k] and y[k + 1], we have where θ k is the angle between vectors x k and h.
As |h| are precomputed because the vector h is known, we need to calculate |x k |, cos θ k , |x k+1 |, and cos θ k+1 . We can exploit the relationship of x k and x k+1 in computing |x k | and |x k+1 | as follows: Due to the shifting property of x k and x k+1 , we can reduce the number of squares in computing the two magnitudes from 2N to N + 2 by amortizing a large part of |x k+1 | 2 in |x k | 2 . Moreover, since we have to compute at least N consecutive dot products, we can reduce the square computations from N 2 to 3N −2. Combined with the two multiplications among |h|, |x k |, and cos θ k for each y[k], the overall number of multiplications for computing N -point convolution is approximately 5N given that cos θ k is computed separately. Although this computation is on the same order as the original computation, the hardware architecture for implementing this computation is simple. This advantage, however, comes with the price that we need to compute the value of cos θ k for each computation of y k . Next, we explain the proposed approach to tackle cos θ k . Finding cos θ k is a challenge when the vector's dimension is large. A known algorithm for finding θ is by using a rotation matrix [43]. However, this method is only practical in 2D space because finding a rotation matrix is difficult for a high-dimensional space. Moreover, the complexity of applying the rotation matrix to find θ is also high. Given that the coordinates of the two vectors are known, the best way to find the angle θ between the two vectors is to use the dot product's geometric definition as in Eq. (10): However, in our case, the dot product is what we want to compute. We need a clever method to apply to Eq. (10). We propose an approximate computation of cos θ k . Based on Eq. (5), we observe the effects of the drift of θ on cos θ by plotting the percent errors of cos θ against the drift in θ denoted by ∆θ, as shown in Fig. 1. We partition the variable θ from 0 • to 90 • in two parts, 0 • < θ ≤ 45 • and 45 < θ ≤ 90 • . Since the rate of change in cos θ increases as θ does, the error in cos θ of the upper part of the angle is higher, as shown in Fig. 1. Nevertheless, if we can maintain the drift in θ at approximately 5 degrees, the average absolute difference for cos θ and cos(θ + ∆θ) is approximately 5% and 10% for the lower and higher ranges of θ, respectively.These results indicate that approximation of θ k can tolerate higher errors.
Given that h is known, the two keys of our proposed implementation of convolution computation are as follows: (1) we propose an efficient method for approximating θ k so that it can be applied in computing convolution using the geometric definition of the dot product with the assumption that some computation error is allowed, and (2) the computation of the magnitude of x k+1 , the input vector for generating y[k + 1], is amortized due to the relation of x k+1 and x k . We explain the proposed method for the θ k approximation in the next subsection. Then, the overall convolution computation based on the geometric definition of the dot product is described.

B. VECTOR ANGLE APPROXIMATION
Given a constant vector h in a vector space of dimensions, we want to estimate the angle θ of vector x with reference to h. Let us consider the case of vectors in the 3-dimensional vector space shown in Fig. 2 as an illustration of the proposed idea. To simplify the discussion and without loss of generality, we will assume that the coordinates of two vectors, the constant vector h = (h 1 , h 2 , h 3 ) T and the input vector x = (x 1 , x 2 , x 3 ) T , lie in the positive cube. This assumption is generalized by the fact that we represent data using an n abit unsigned number; i.e., all components of the two vectors, h 1 , h 2 , h 3 , x 1 , x 2 , and x 3 , are positive. The proposed angle approximation method is based on the following ideas: • We start with the idea of replacing x with a vector, denoted byx, that has the same direction as x, by which the angle can be computed following Eq. (11): The first step in this approach is to findx such that the computationx · h has less complexity. • To this end, we propose using a binary vector of size N , denoted by x bin which approximatex and, whose coordinates can be expressed as a 1-bit number in each dimension. Fig. 2 illustrates this idea for the case of 3D space. With this idea, the vector x is quantized to one of the seven nonzero corner vectors of the unit cube.
Extending the idea to an N -dimensional space, x bin can be generalized as Eq. (12).
Using x bin , we eliminate the multiplications in the computation of the dot product x bin · h as described in Eq. 14 since the element b j in dimension j of x bin is either 0 or 1; i.e., the dot computation is just a sum of the elements j of h for all j such that b j = 1. As a result, the approximation of θ, denoted byθ(x dot ), can be calculated following Eq. (15): θ For example, in the 3D space, x bin will be one of the seven nonzero coordinates at the corners of the unit cube, as shown in Fig. 2; i.e., for the input vector where x j is encoded as an n a -bit unsigned number, the estimated vector x bin of x will be one of the seven nonzero 3-bit binary combinations The idea is to quantize each vectors in the -dimensional space to one of the 2 − 1 vectors. Although this seems to be a rough estimate for the angle θ, it yields a much better estimate for cos θ, especially for θ ≤ 45 • , because cos θ changes slowly for small θ; i.e., with a limited n abit representation of numbers, cos θ is the same for a wide range of θ. Moreover, θ is quantized in the lookup table (LUT) implementation of cos θ, as shown in Fig. 3. However, this estimation method will have a greater effect for θ k > 45 • . • While using x bin reduces the computational complexity of x dot = x bin · h, we still need to deal with the computation of arccos xdot |xbin||h| . To do so, we rely on the fact that although x varies, the vector h is known and predetermined. Using this fact, we estimate θ by a linear function of x dot as described in Eq. (16) using linear curve fitting to find the parameters P 1 and P 0 for all cases of x dot given h: That is, for a given h, we computeθ(x dot ) using Eq. (15) for all possible cases of x bin . Then, parameters P 1 and P 0 are determined using linear regression curve fitting, as shown in Fig. 4 for the case of random h with dimension = 20. • Note that the parameters P 1 and P 0 are calculated with VOLUME 4, 2016 Algorithm 1 Algorithm for estimating the angle of vector x with reference to a constant vector h Require: Parameters P 1 , P 0 , and B are predetermined given reference to x bin , a vector whose direction estimates that of vector x. Therefore, to further improve the estimation of θ, we take random samples of x and compute the actual θ using Eq. (11) and its estimateθ (1) using the parameters P 1 and P 0 from the previous step. We find that there are shifts ofθ (1) from the actual θ. As a result, we compensate for this bias with the parameter B, which is the average of the drifts, resulting in the final equation for angle estimation in Eq. (17): Based on these ideas, the proposed algorithm for estimating the angle θ between a vector x and a constant vector h is described in Algorithm 1. The next subsection explains the proposed method of finding x bin used in the function BinarizedX in Algorithm 1.

C. FINDING X BIN
Given x = (x 1 , x 2 , . . . , x ) T , where x j is an n a -bit unsigned number, we want to quantize x j to the 1-bit b j . Using the case of 3D space as an example, we want to quantize Our proposed method is as follows. First, we choose the dominant element by comparing the most significant bits of x j , which depends on the position of the first nonzero bit of x j starting from the left-most bit, bit n a − 1. In other words, we find the significant bits of x j by counting the number of leading zeros in the binary representation of , the smaller the number of leading zeros, the more significant. Let us denote the leading zeros of x j as lz j , and let lz m be the minimum of lz j ; i.e., lz m = min ∀j z j . Then, we find b j of x bin using the rule described in Eq. (18): Although, theoretically, this method maps the direction of x k to one of 2 N − 1 possible directions, the mapping does not perform well in practice for large N because using the same z m in Eq. (18) globally leads to some locally large x j being quantized to 0. We address this issue by grouping the x j 's into a group of three. Using the 6-dimensional 34,17,45,22) T with an 8-bit representation of x j as an example, we find that its x bin is (0, 1, 0, 0, 0, 0) T , as x 2 = 126 dominates. This is not a good estimation of the direction of x k because while the actual vector x k has values in all dimensions, the estimated vector x bin lies in one dimension. To mitigate the detrimental effect of large N , we propose to quantize each group of x k separately. The nimberof elements in each group is 3 or small number, as a rule of thumb. For example, by separating the x j s into two groups of three, x 1 k = (56, 126, 34) T and x 2 k = (17, 45, 22) T , where the superscript indicates the group number, we find that x bin = (0, 1, 0, 0, 1, 0) T as x 1 k and x 2 k are quantized to (0, 1, 0) T and (0, 1, 0) T , respectively. The modified BinarizeX(x) for finding x bin is described in Algorithm 2.

D. GEOMETRY-BASED APPROXIMATE CONVOLUTION
The geometry-based approximation of the dot product of vector x with a constant vector h is used as a core part of the proposed implementation of the convolution computation. Since the precomputed parameters P 1 and P 0 are based on generating all 2 − 1 cases of x bin of size , we cannot practically apply the method to a large vector size. To address this limitation, we use the superposition property of the dot product computation; i.e., the dot product of two vectors of the same size is the sum of the dot products of their corresponding subvectors. Applying this property to compute y[k] = h · x k , we partition h and x k into Q parts of size as described in Eq. (19) and (20) %G is the number of groups 4: % R is the number of last group's elements 6: x bin ← (0, 0, . . . , 0) T 7: for j ← 1 to 3 do 10: Then, y[k] = h · x k can be described by Eq. (21): Here h i and x k,i are the vectors of elements taking from the ith part of h and x k , respectively, for i = 0, 1, 2, . . . , Q − 1: Vectors h Q and x k,Q contain the remaining R elements at the end of h and x k , respectively: As a convention, if R = 0, we define h Q = x k,Q = 0. By computing each dot product h i · xk i using the method proposed in the previous subsection, we can describe the dot product h · x k for computing y[k] in Algorithm 3. x 0 = (0, 0, . . . , 0) T

8:
% elements just being removed with N zeros 9: for k ← 0 to M − 1 do 10: 11: % Compute |xk i | 16: if R > 0 then 27: xk 28: With this setup, we arrange the computation of y[k] = h · x k in 3 parts. The first part computes the magnitude of xk i , subvector i of the input vector.
As described in the problem analysis section, we compute the square of the magnitude of x k+1 using the relationship of two consecutive input vectors, x k+1 and x k , representing respective sliding windows of the input samples. Since, in the proposed procedure, each input vector x k is partitioned into Q parts of size , we have to compute |xk i | 2 of each part i at time k (line 14). Because xk i at clock k and k + 1 are two vectors from a sliding window of size , the computation of |xk i | 2 follows Eq. 9, in which we need (1)  The second part estimates the angle of vector xk i using the function FindAngle described in Section IV-B. In the final part, the results from the first two parts are used to compute yk = h i · xk using geometric definitions, and then, yk is combined with the current y[k]. Note that the first two parts can be done in parallel, which is suitable for the hardware implementation described in Section V. Although the procedure computes y[k] for k from 0 to M − 1, it can easily be adjusted to compute N additional samples to flush out all samples of x * h.

V. HARDWARE IMPLEMENTATION OF THE PROPOSED CONVOLUTION
Although, in general, the proposed approximate convolution computation can be implemented in many computation platforms, the proposed procedure is targeted for an embedded environment in which the computing resources and available power are limited. To this end, we propose a hardware architecture targeting FPGA or ASIC. Fig. 5 shows the overall datapath architecture of the hardware implementation for computing x · h of size . Note that the size is limited since in the computation of the parameters P 1 , P 0 , and B following the proposed method, we need to generate all 2 − 1 cases of x bin . We show in Section VI that the suitable is 20.
In this architecture, the streaming x[k] at the input port x is fed to two parallel parts for computing |x k |, the signal xmag, and cos θ k , the signal cos_theta. The results for these two parts are multiplied together with |h| in the final part to produce the streaming y[k] at the output port y. All |h|s are precomputed and stored in a lookup table (LUT) denoted as hmagTab in the diagram. If the size N of h (the number of tabs) is larger than , this hardware will be used to produce xk i ·h i , i = 0, 1, . . . ,Q−1, whereQ = N/ is the number of sections. The results of the xk i · h i s from all sections are added together to produce y[k]. This modular design allows us to implement the system based on the available resources. If we have sufficient resources, multiple instances of this architecture can be implemented to produce multiple xk i · h i in parallel. Otherwise, a single instance can be controlled to produce y[k] everyQ cycles of the system clock. As a result, in this minimal hardware case, the maximum streaming rate is f c /Q, where f c is the system clock's frequency.

A. MAGNITUDE COMPUTATION HARDWARE
The computation of x k 's magnitude follows Eq. (9); i.e., , the sample entering the window, is the new element of x k and x[k − ], the element leaving the window, is the last element of x k−1 . As shown in the upper part of Fig. 5, to implement the magnitude computation, we need the following components: (1) two multipliers or specially designed circuits of the square operation for computing x[k] 2 and x[k − ] 2 , whose results are the signals x1_sq and x0_sq, respectively, (2) two adders for the operation of a + b − c, (3) a register for storing |x k−1 | 2 , the signal xmag_sq_p, (4) cascading nbit registers for storing x[k − ], the element leaving the window, and (5) a square root unit.
Note that the incoming data sample, x[k], the signal x, is the newest element, whose square, x1_sq, is added with the previous magnitude, xmag_sq_p. It is also fed into the cascading n-bit registers, causing a delay of steps; i.e., its output is x[k − ], whose square, x0_sq, is subtracted from the sum of x1_sq and xmag_sq_p, resulting in the square of the magnitude at cycle k in the signal xmag_sq. Finally, the sum of |x k−1 | 2 , x[k] 2 , and x[k − ] 2 is fed into a square root unit for streaming out |x k | in the signal xmag.
This architecture is suitable for implementation with the pipeline technique, allowing simple control of the system by simply pumping the data in and out at a constant rate. To this end, a pipeline design technique can be applied for a specific hardware target with the constraint that the number of stages of the pipeline matches that of the cos θ implementation.

B. ANGLE ESTIMATION HARDWARE
To gain the benefits of the proposed convolution procedure, we need to design specific hardware for the approximation of cos θ k . The first specific circuit shown in Fig. 6 generates a vector x bin for each vector x k . The circuit takes a group of 3 consecutive samples that are converted to 3 respective bits of x bin . At the entry of the module, an n-input priority encoder circuit is adopted as the leading zero computation of the entering sample x, the x[k] sample. Its result is then fed to the cascading registers, named LZ0, LZ1, LZ2, . . . , LZ{ − 1}. As the outputs of these registers are the leading zeros of consecutive samples, the minimum of the 3 registers' outputs is found by the module LZmin. Finally, the stored leading zeros are compared with their respective references from the LZmin modules to generate the binary elements of x bin stored in the output registers.   The generated x bin is fed to the second specific circuit shown in Fig. 7 for computing x dot = x bin · h, denoted as xdot. For conventional implementation of the dot product, the number of additions required for the computation of x dot is on the order of Θ( ). In the proposed implementation, the module, denoted as XdotGen, does the job in the order of Θ(n), where n is the number of bits representing h[j], an element of h; i.e., the order of computation is constant with reference to the problem size, which is specified by N and M , the number of filter tabs, and the number of data samples. To explain the proposed implementation, let us define x bin and h as follows: Then, as described in Eq. (29), the computation of x dot can be expressed as a summation of c j 2 j , j = 0, 1, . . . , n − 1, where the coefficient c j =  Based on this analysis, we propose a design of a specific circuit, as shown in Fig. 7, that produces c j 2 j , where , for the input x bin , provided that h is known. This design is modular and is parameterized with and n. By storing bit j of all tabs as -bit data, H j = (a to produce c j . This is followed by the module shL{j}, which implements c j 2 j by shifting c j to the left by j bits, which can be done by wiring for a given j. The results, denoted as c0x1, c1x2, . . ., c{n − 1}x{w} (w = 2 n−1 ), are added together in the module adderTree, which employs a tree structure of n − 1 adders to implement n−1 j=0 c j 2 j , resulting in x dot . The estimated θ k is produced by the module LinearFunc, which implements the computation of P 1 x dot + P 0 − B, whereas P 1 , P 0 , and B for a given h are stored in the module ParameterTAB. Finally, cos θ k is produced by an LUT storing the cosine function.

C. APPLICATIONS TO 2D CONVOLUTION
Our hardware is considered a scalable array of processing elements (PEs), which are the base elements for building the applications. In this section, we explore applications of our hardware blocks to 2D convolution. First, we design a 2D engine based on our scalable hardware. Then, we apply the developed hardware to a state-of-the-art convolutional neural network.

1) Two-Dimensional Convolution
To develop a 2D geometry-based convolution, we design a specialized, low-complexity, low-power-consumption hardware. Let x denote the input matrix of size M × N and w denote a filter of size n×n, where n is a positive, odd number, i.e., n = 2a + 1 for some non-negative integer a. 1 . Elements of the output matrix y from the 2D convolution between x and w are given by where i and j are row and column indices, respectively. As i and j vary, the filter's center visits every pixels of x once. The 2D convolution can be considered as a summation of 1D convolutions. Therefore, a hardware for 2D convolution can be obtained from our 1D-convolution hardware. Based on our hardware in Fig. 5, the overall architecture for 2D convolution is shown in Fig. 8. The architecture distributes each row of the input matrix into the row of FIFO and parallelizes the computation by interleaving matrix rows over the PEs. Every PE stores a partition of weights, i.e., the convolution constants, in the BRAM and performs a computation of row convolution with those values. Finally, the results of each row from each convolution are added together using a parallel tree-structured hardware in the addX unit in the figure. The proposed architecture has m levels of parallelism, where m is the number of 1D convolutions, which is also the number of PEs parallelized in the engine. The notation GC2(m) in the figure denotes the overall 2D convolution engine. In the proposed architecture, the more PEs, the more levels of parallelization.
The number of PEs in the proposed architecture should be a multiple of n, i.e., m = βn, where β is the amount of output per cycle, so that the proposed hardware can simply  For illustration, we let β = 1. To generate the first output row, the first m rows of the matrix, which are row 0 to row m − 1, are streamed and split into each PE for computing the 1D convolution. After that, the 2D convolution is performed by adding those results together in the unit labelled addX in the figure. When the first output row is finished, the second output row is computed by moving the data to the FIFO queue, i.e., fetching row 1 to row m of the input matrix to each GC unit. In the remaining steps, the input matrix will be scheduled in the same fashion, by moving the input matrix row by row until the final row is added to the FIFO queue. This procedure is handled by the unit row schedule in Fig. 8.

2) Convolutional Neural Network
This subsection applies 2D convolution to CNN models. Given that most CNN architectures use the convolution in a similar way, we select the AlexNet CNN model as an illustration. This subsection focuses on applications of our purposed method to complex CNNs. We begin on how a 2D convolution is embedded in a CNN. For illustration, we consider a 4D kernel tensor K, where element K i,j,k,l is the connection between a unit in channel i of the output and a unit in channel j of the input, with an offset of k rows and l columns. Furthermore, we let V denote the input data, where element V i,j,k is the value of the input unit at row j, column k, and channel i.
We consider a simple case in which the output Z takes the same format as the input V and comes from a multichannel convolution: where the summations cover all indices l, x and y. Parameter s is the stride of the convolution. Eq. (31) captures the main According to Eq. (31), each output Z i,j,k is the sum of multichannel 2D convolutions with respect to the number of filters in each layer and the stride s. To develop a convolution layer, we propose to use the hardware in Fig. 9 instead of the conventional convolution hardware, for less power consumption and less complexity. The hardware in Fig. 9 consists of many 2D convolution GC2(m) units.
In Fig. 9, the proposed architecture takes a stream of multiple layers of V, together with the configuration parameters, which consist of the current number of filters and the number of layers. Each channel of input V is interleaved into each PE in a scheme of 2D. Meanwhile, the PE clusters execute the convolutions. The weights and stride s are self-distributed to each PEs with respect to the sequence of computation. After that, the stride selection hardware is operated, for downsampling the convolution. The stride selection is built from the counter. Finally, the parameterized parallel adders PAddX are operated with respect to the final summation of the 2D convolutions, as indicated by Eq. (31). The final results of multichannel convolution are stored in the BRAM, and the hardware prepares to compute the convolution for the next layer.
To control the datapath, a conventional unrolling loop is used. The number of unrolling loops depends on the number of PEs in the field. The number of groups for parallel adders also depends on the number of filters according to Eq. (31). Moreover, the address generator block fetches the desire address to acquire the data from the BRAM. This core has two parameters: m, which is the number of 1D convolution GC; and p, which is the number of 2D convolution GC2. The core performs the operation convL(p,m) as shown in Fig. 9. Our hardware blocks are general and able to facilitate the architectures of 2D convolutions and CNNs.

VI. RESULTS AND DISCUSSION
The proposed approach is evaluated in three respects: computation accuracy, time complexity, and resources, measured in both area and power consumption. VOLUME 4, 2016

A. ON ACCURACY
We evaluate the accuracy of the proposed approximated convolution in two steps. In the first step, we consider the errors of the approximated angle,θ, following Eq. (17), compared with the exact angle, θ, following Eq. (10). Then, the accuracy of the approximated convolution results is considered. Both evaluations are performed with the parameters shown in Table II. For the first result, the effect of vector-to-angle approximation is considered using a random vector as h, the impulse response. The percent angle errors for the vector size from 4 to 20 are plotted in Fig. 10. As expected, the percent error decreases for larger vector sizes as the space to which x bin is quantized grows exponentially with respect to N , the vector size. The important finding in this evaluation is that the error is saturated for sizes greater than 18. These results are used to choose an appropriate value for in our proposed method, as the price for a large value of is in the process of predetermining the values of P 1 , P 0 , and B. Based on these results, we choose = 20.
Using the chosen = 20, we evaluate the accuracy of the proposed convolution approximation for 4 different kinds of filters, a uniform random filter, a Gaussian filter, a low-pass filter, and a high-pass filter with vector size N varying from 4 to 100. Using randomly generated samples of an input vector x of size 400, we compute the average absolute error of the approximated convolution using the equation below: where h * x is the precise convolution and approx(h * x) is its approximation using the proposed method. Fig. 11 shows the plot of the average absolute errors for various filter types, with the filter size varying from 4 to 100. In this plot, for each filter size, we average the percent error using 200 samples of h, each of which is evaluated with 1000 samples of x. The plot shows a very good result, as the average error of the proposed method is approximately 5% for specific filter types of size greater than 20. Moreover, the larger the size is, the smaller the error. For the random filter type, the error is larger than that for a specific type due its randomness. Hence, using this as an upper bound error, the proposed approximation achieves less than 5% error for any filter of size greater than 50. To evaluate the effect of the approximation on the quality of the output signal, we feed a sinusoidal signal with the fundamental frequency and its third harmonic and a sawtooth with the 3rd harmonic sinusoidal signal to a Chebychev lowpass filter with the order of 20. Fig. 12 shows the output signals from the filter obtained using the proposed approximated convolution compared with those using the actual convolution. Although the results reveal some local distortions of the approximated output from its precise version, the maximum error is 17% and 14%, while the average error is less than 4.5% and 3.7% for sinusoidal and sawtooth signals, respectively. Furthermore, we also evaluate and report the accuracy in Table III among the various types of approximate multiplication on the same comparable test case. Evaluation metrics include the error rate (ER), the normalized mean error distance (NMED) and the percentage of mean relative error distance (MRED) [44]- [47]. The ER is the rate of h * x − approx(h * x) = 0. The NMED is the average of |h * x − approx(h * x)| divided by (2 n − 1) 2 . The lower these evaluation metrics, the more accurate the approximation method. From the table, the proposed method performs just above average among the other types for both NMED and MRED. This is to confirm that the proposed method is suitable for convolution tasks. Next, we quantify the effect of approximate 2D convolution on image filtering, using the following procedure. We apply two types of filters to images in the benchmark  datasets. The two types of filters are the Gaussian filter, which represents a low-pass filter, and the Laplacian-of-Gaussian filter, which represents a high-pass filter. The benchmark datasets are the MNIST [50], CIFAR [51], and InDoor Scene [52] datasets. Each type of filter is applied twicefirst, using an exact 2D convolution and, second, using the approximate 2D convolution proposed by the paper-to each benchmark image. We use the peak signal-to-noise ratio (PSNR) to measure the difference between a filtered image from an exact convolution and the counterpart from an approximate 2D convolution, where PSNR = 20 log(255) − 10 log(the mean square error) dB. The PSNR of at least 20 dB [53] generally indicates a reasonable agreement between the exact and the approximate filtered images. Table IV shows the average PSNR of each dataset with respect to each filter type. The PSNRs are reasonably large, indicating an agreement between the exact and the approximate convolutions. For a given dataset, the PSNR due to the Gaussian filter is larger than the PSNR due to the Laplacian-of-Gaussian filter. When using the approximate 2D convolution, the Gaussian filter produces images of better quality than the Laplacian-of-Gaussian filter does. Figs. 13  and 14 show examples of filtered images from the exact and approximate 2D convolutions, for the Gaussian filter and the Laplacian-of-Gaussian filter, respectively. In each figure, the PSNR between the exact-convolution and the approximateconvolution images is approximately 25 dB. Visually, the two images in each figure are indistinguishable by eye. The proposed method of approximate 2D convolution attains a reasonable level of accuracy.
We further evaluate performance of the approximate 2D convolution in a neural network. As explained in Section V-C2, we modify the convolutional layer, replacing the exact 2D convolution with the proposed geometry-based approximate convolution. For illustration, the datasets are the ImageNet and MNIST datasets. Table V summarizes classification accuracy of the proposed method, compared with the exact convolution. From the table, classification accuracy of the approximate convolution is smaller than that of the exact convolution, as expected. The difference in classification accuracies of the exact and approximate 2D convolutions is between 5 to 15 percentage points for the MNIST dataset and between 15 to 40 percentage points for the ImageNet dataset. The loss in classification accuracy depends the number of convolutional layers that are being approximated. The MNIST dataset has only 10 classes, which simplifies the tasks of training the AlexCNN model and of classifying a sample. On the other hand, the ImageNet dataset has thousands of classes, which are more complex to classify, hence reducing the classification accuracy when the convolution is approximated. A loss in classification accuracy is offset by a gain from simple hardware and efficient power consumption, as discussed in the next section.

B. ON PERFORMANCE
In this section, we evaluate the number of basic operations, including addition, multiplication, and square root, needed in the proposed convolution implementation, compared with the conventional convolution and other implementations. Furthermore, we evaluate the throughput, which is the amount of output in a given period. To be specific, the complexity comes from the convolution computation only because the complexity of the preprocessing step is fixed for the given, fixed impulse response and diminished in a long run as the size N of the streaming input is large. Another reason to safely omit the complexity of the preprocessing is that preprocessing can be computed before the convolution implementation.
Recall that N and M denote the sizes of vectors h and x, respectively. For the proposed method, to compute M points of h * x, there are three parts of the computation as follows: (1) For the computation of |x| 2 , the proposed method needs N multiplications and N − 1 additions for computing |x 1 | 2 = N j=1 x 2 j . Then, for k = 2, 3, . . . , M , only one multiplication and 2 additions are needed for each |x k | 2 . In total, the proposed method needs N + M multiplications and N +2M −3 additions for computing |x| 2 in all M points of h * x.
(2) For the computation of cos θ k , k = 1, 2, . . . , M , the proposed method eliminates the multiplications by using x bin , the binarized x k . Furthermore, we reduce the number of additions in computing x bin ·h with a specific  [16] Addition [42] Square Root 0 Fastfood and FFT Addition M log 2 M Convolution [19] Multiplication [20] Multiplication N + 3M Square Root 0 circuit, as described in Section V-B. With the direct implementation, x bin · h needs N avg M additions, where N avg is the average number of nonzero bits in x bin . However, with the proposed hardware implementation, the number of additions is a constant, which is equal to the number of bits of the data representation. This reduction is traded off with two additional circuits: (1) a circuit for generating x bin (Fig. 6) and (2) a circuit for determining the total weight of each bit (Fig. 7). (3) Since the proposed method computes the square of y k based on the computation of |h| 2 |x k | 2 cos θ k , it needs 2M multiplications and M square root operations in total.
In total, for each computation of h * x, the proposed method needs N + (2 + w)M additions (w is the number of bits representing h j ), N +3M multiplications, and M square root operations. Table VI also compares the number of basic operations of the proposed method with those of the conventional convolution procedure with constant multiplication enhancement, conventional convolution with approximate multiplication, and the Fastfood approximation method [19]. Given N , the size of vector x, and M , the size of vector x, the proposed method provides linear growth of Θ(N + M ) for both addition and multiplication compared with Θ(M N ) of the conventional and Θ(M log 2 M ) of Fastfood approximations, FFT and Winograd. The convolution with constant multiplication enhancement has only polynomial growth in addition, which is Θ(M N ), and the convolution with approximate multiplication has the same results as the conventional method because it has the same operation. However, the proposed method needs additional square root computations with the growth of Θ(M ).
The proposed method and the Winograd convolution have the same number of the multiplication operations, asymptotically. However, the number of addition operations is smaller in the proposed method. The number of multiplication operations for the Winograd convolution can be reduced to M + N − 1 by storing the transform matrix in memory, an approach that could cause a memory issue in the embedded system. In our design, we store three more values of parameters than the conventional method does, leading to less memory consumption than the Winograd. At the same level of time complexity, our method is easier and simpler to implement than the Winograd convolution.
In addition to time complexity, we evaluate the throughput of the core. Firstly, the throughput and latency of 1D and 2D convolutions are investigated. The throughput depends on the system clock, which is one sample per clock in our hardware architecture and other computational designs. The latency depends on the size of vector h, which generates each sample of the output y according to Eq. (2). From Eq. (2), the hardware needs to wait until the current data x k were completely streamed before it can stream the new window of data x k+1 . Therefore, the throughputs of the proposed method and the conventional method are equal to one sample per clock. The latencies of the proposed method and the conventional method are equal to the system clock interval multiplied by the size of vector h. The proposed method is on par with the conventional method in terms of the throughput and latency. Furthermore, we investigate the effect of our proposed method on the execution time of a CNN. The total execution time, T CNN , of the complete CNN is given by Here, T CL,i is the execution time of the ith convolutional layer, n CL is the total number of convolutional layers, T FL,i is the execution time of the ith fully connected layer, n FL is the total number of fully connected layers, and T MEM is the memory loading time for loading an image or the results from the previous layer. To quantify an effect of approximate convolution, we consider only the total execution time of the convolutional layers, T CL , in our proposed method.
The number CY of cycles for each convolutional layer is determined by the number of 2D convolutions: where nConv is a product between the number k of kernels and the number l of channels, i.e., nConv = k × l. In the above equation, Is is the size of input, s is the stride, Ks is the size of kernel, m is the number of 1-dimensional engines, and p is the number of 2-dimensional engines in the core.
The total execution time of the convolutional layers equals which is a sum of the convolution execution time and the memory consumption time.
To compare the execution times of the CNN, the conventional, exact convolutions are also implemented on the same hardware architecture. The differences between the exact and approximate convolutions appears in the stride configuration. Our proposed method stalls on the number of strides, multiplied with the system clock to acquire the input. In contrast, the conventional convolution does not stall, leading to performance degradation. According to Eq. (34), our method has a constant value of stride equal to one, but the conventional method follows the CNN architecture.
The Fig. 15a show the complexity reduction of the proposed method compared with the conventional method. The proposed method can reduce the complexity of the conventional method by 70% and 18% approximately in the first two layers, highlighted blue in the figure, respectively. In the last three layers, the complexity increases 18% approximately. Overall, the proposed method reduces the total complexity of the conventional method by 22% approximately.
We further test the proposed architecture on a classic benchmark CNN, the AlexNet, and measure how many images the proposed architecture can produce in one second, compared to the conventional convolution. The AlexNet takes 227 × 227-pixel 2 images as input. In the test, the size of kernels, channels, and input for each convolution layers are standard and follow the specifications of the AlexNet CNN model. The maximum size of kernel in AlexNet is 11 × 11. Hence, the choice of m = 15 simplifies the distribution of the data to the core and is suitable in the proposed architecture. Fig. 15b shows the number of images processed in one second for the conventional and proposed methods. The conventional method processes approximately 10-15% more number of images in one second than the proposed method does. A drop in the speed, measured by images/second, is a trade off for better power consumption and complexity, as described in the next section.

C. ON RESOURCES
In this section, we evaluate the cost of achieving the performance level in terms of the operation counts described in the previous section. As the proposed convolution procedure has been targeted as a hardware implementation on either a programmable hardware chip, an FPGA chip, or an ASIC chip, the cost is measured in terms of area and power. For the FPGA implementation, we tested the hardware implementation described in Section V on the Zynq 7z010 FPGA fabric [48]. After verifying the correction of the implementation, we measured its power consumption using the tools provided with the development kit and recorded the implementation's area in terms of resource utilization, as shown in Tables VII and VIII, respectively.   2  30  10  21  9  3  33  20  4  11  35  33  35  15  22  10  3  35  20  4  16  40  38  40  20  22  10  3  35  20  4  20  44 40 42  5  874 153  3  200  82  5  349 108  0  280 108  4  10  901 153  3  220  82  10  458 140  0  332 140  4  15  928 153  3  250  82  15  501 156  0  406 156  4  20  954 153  3  280  82  20  556 179  0  495 179  4 For the power consumption issue, the results show that for a large practical filter size, the proposed method can reduce the total power consumption by approximately 20% compared with that of the conventional method. Because the proposed hardware architecture is designed to handle any size N of h by arranging the computation in multiple blocks of 20, the power needed to operate the proposed hardware is approximately the same for any filter size N ≥ 20. However, the energy consumption will grow with respect to the vector sizes N and M . For the proposed implementation, this growth of energy is linear, following the growth of the operation counts discussed in the previous section.
For resource utilization, the implementations are coded in HDL and synthesized by the Xilinx Synthesis Tool that comes with Vivado 2021.2. The results show that the proposed hardware utilizes more hardware resources than other implementations. These results are as expected due to the tradeoff we make to reduce the power consumption.
However, in ASIC, we can design the proposed hardware architecture with significantly less area because all circuits are specific, while in FPGA, all resources are prebuilt so that they can be programmed. To illustrate this point, we synthesize our proposed method and the conventional version targeting an open-source 45-nm [49] (FreePDK45) PDK standard cell using Synopsys DC compiler for the purpose of evaluating the area and power consumption of the proposed hardware in ASIC, as shown in Table IX. According to Table IX, for the power measurement, the proposed method operates at 602.16 µW compared with the 803.76 µW spent by the conventional counterpart, which is an approximately 25% difference. For the area, the ASIC synthesis results show that the proposed hardware footprint (5337µm 2 ) is approximately 86% the size of that of the conventional method (6157µm 2 ), which means our proposed method needs approximately 14% less area. These results show that we can achieve a lower cost in terms of hard-  ware cost, area and power consumption in ASIC. This is possibly because the proposed hardware, which requires specific circuits in the estimation of cos θ k , is more suitable for specific circuit implementations in ASIC than the programmable hardware in FPGA. Comparing to other approximate methods, the proposed method consumes less power, approximately 10%-15% less, while utilizes more area with the same accuracy, which is the result of operation reduction. In addition, the proposed hardware is designed using the pipeline technique, which allows the streaming of results in every clock cycle. Next, we measure the resource and power utilization for specific applications. All designs are synthesized under the same environment, which is the Synopsys DC compiler with a 45nm FreePDK. Firstly, we begin with the 2D hardware GC2. Its utilization is determined by the number of GC units and the number of FIFOs used to buffer each row of the input image. The number of FIFOs in the proposed method depends on the size of the image filter, in order to simplify the process of data scheduling. We test the proposed method with difference filter sizes, while keeping the input image size fixed at 227 × 227-pixel 2 , which is the same image size in our CNN evaluation. The result is shown in Fig. 16. From the figure, our proposed method reduces the area and power consumption of the conventional method by approximately 5% and 25%, respectively. Finally, we measure the utilization and power consumption of the CNN engine. The AlexNet CNN model is used for testing the architecture, with m = 15 and p = 10. The result is shown in Table X. The proposed design has a smaller area and consumes less power than the conventional method. Power consumption of the proposed method is 13% lower than the conventional method. The proposed method outperforms the conventional method in terms of the area and power consumption.
For three different matrices, the power-accuracy and areaaccuracy tradeoff graphs in Fig. 17a and Fig. 17b show that the proposed design has a good tradeoff for both area and power. For the area-accuracy tradeoff, there are some designs  [34] 5169.77 706.36 Venka [35] 5166.76 702.38 Akbar [38] 5047.18 695.93 Sabetz [39] 4846.58 686.69 Ahma [40] 4927.24 686.46 Yang1 [36] 6134.77 816.09 Yang2 [36] 6085.78 808.06 Yang3 [36] 5802.12 791.88 Lin [37] 6029.05 807.71 Ranjbar1 [41] 5252.12 716.56 Ranjbar2 [41] 5100.72 699.25 Rangbar3 [41] 5136.82 696.91 Kong [42] 5332.80 718.6 Proposed 5337.14 602.16 that use less area and gain more accuracy. Nevertheless, our design is above average among the designs. Additionally, our proposed method has the best tradeoff for power consumption among the various approximate multipliers. Moreover, for applications in image filtering and neural networks, the accuracy of the proposed method is acceptable. Our proposed method can reduce the number of operations while trading off accuracy and throughput with the power consumption. Operation reductions directly affect power consumption. These characteristics make our proposed method a favorable choice for an energy-efficient design.

VII. CONCLUSION
In this article, the implementation of convolution computation to achieve less power consumption with the limited resources of embedded applications along with better performance is proposed. Trading off the computation accuracy, we propose an approximate convolution procedure based on the geometric interpretation of the dot product by approximation of cos θ. All state-of-the-art convolution methods, approximate convolution methods, and the proposed method are implemented on FPGA and the 45 nm FreePDK CMOS process with Vivado 2021.2 and Synopsys DC compiler to investigate the area and power utilization. The experimental results indicate that our design has a 25% power reduction and area reduction compared with exact convolution and an approximately 10%-15% power optimization compared with other approximate methods. We also verify that our method supports 2D convolutions and has an acceptable classification accuracy for CNN models, where we take the AlexNet CNN model as a benchmark example. We also quantify the time complexity of the proposed method. The proposed approximate convolution and the corresponding hardware architecture have applications to inference and signal processing tasks that tolerate some errors in the convolution and require low power consumption.