Approximate Recursive Multipliers Using Low Power Building Blocks

Approximate computing, frequently used in error-tolerant applications, aims to achieve higher circuit performance by allowing the possibility of inaccurate results, rather than guaranteeing a correct outcome. Many contributions target the binary multiplier, aiming to minimize the complexity of this common yet power-hungry circuit. Approximate recursive multipliers are low-power designs that exploit approximate building blocks to scale up to their final size. In this paper, we present two novel 4×4 approximate multipliers obtained by carry manipulation. They are used to compose 8×8 designs with different error-performance trade-offs. The final circuits exhibit a competitive behavior in terms of error while reducing the power dissipation when compared to state-of-the-art proposals. The proposed multipliers and state-of-the-art designs found in the literature have been synthesized targeting a 14nm FinFET technology to determine the electrical characteristics. Compared with an exact 8×8 multiplier, the least dissipative design proposed in this paper reduces power consumption and silicon area by 46%, and minimum delay by 21%. It also consumes 14% less power than the least power-hungry recursive circuit found in the literature, while offering 81% higher accuracy. Image processing applications and a convolutional neural network are shown to demonstrate the effectiveness of the proposed multipliers.


I. INTRODUCTION
The everlasting demand for power and speed improvement has driven researchers towards approximate computing. Approximate computing is a fast-emerging field in digital design that sacrifices the exactness of computations for significant improvements in power dissipation, speed, and circuit area. Such techniques can be utilized in cloud computing, embedded and mobile devices, where high speed and power minimization are important constraints. Approximate computing finds fertile soil in error-resilient applications such as multimedia processing, data mining and recognition, and machine learning [1]-[4].
Concerning approximate computing, a plethora of studies has focused on arithmetic operations, such as binary addition, multiplication, and division [5]-[7]. Binary multipliers constitute a fundamental part of digital processing systems, and unfortunately are characterized by heavy silicon area, power, and timing requirements [8]. Consequently, approximate binary multipliers are nowadays being studied thoroughly. A comprehensive survey of arithmetic circuits, such as approximate adders, multipliers, and more complex circuits such as the binary divider, is reported in [9].
Several techniques providing efficient approximate multipliers have been studied in the literature. One such example is the approximate logarithmic multiplier [10]-[12]. In this case, approximated versions of the logarithms of the input operands are added. The result corresponds to the approximated value of the antilogarithm of the sum. These are low-power and high-speed designs, due to the low complexity of their architecture. However, they tend to be less accurate. Another approach is static segmentation. In this technique, a part of each input operand is given as input to a small multiplier, whose shifted output is the result of the multiplication [13]. Static segmentation has been demonstrated to be useful when very low power is needed and accuracy is not the main issue. In [14] the authors propose an approximate multiplier that can dynamically control accuracy. The circuit can select the length of the carry propagation to effectively satisfy the desired accuracy requirements.
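To make the logarithmic approach concrete, the following Python sketch implements the classic Mitchell approximation, in which log2 of a number 2^k(1 + x) is taken as k + x. It is only an illustration: the designs in [10]-[12] are refinements of this idea, and the function name, the fixed-point width `frac_bits`, and the restriction to operands of at most 16 bits are assumptions made here, not details from those papers.

```python
def mitchell_multiply(a: int, b: int, frac_bits: int = 16) -> int:
    """Approximate a*b for unsigned operands of up to 16 bits (assumed limit).

    Each operand is written as 2**k * (1 + x) with 0 <= x < 1, and its
    logarithm is approximated as k + x. The two approximate logarithms are
    added and the sum is converted back with the same linear approximation.
    """
    if a == 0 or b == 0:
        return 0
    ka = a.bit_length() - 1          # position of the leading one of a
    kb = b.bit_length() - 1          # position of the leading one of b
    one = 1 << frac_bits             # fixed-point representation of 1.0
    xa = ((a - (1 << ka)) << frac_bits) >> ka   # fractional part of log2(a)
    xb = ((b - (1 << kb)) << frac_bits) >> kb   # fractional part of log2(b)
    k = ka + kb
    s = xa + xb
    if s < one:
        mantissa = one + s           # antilog of k + s       -> 2**k * (1 + s)
    else:
        mantissa = s                 # antilog of (k + 1) + (s - 1) -> 2**(k+1) * s
        k += 1
    return (mantissa << k) >> frac_bits


print(mitchell_multiply(3, 3))   # exact product is 9
print(mitchell_multiply(4, 12))  # exact whenever one operand is a power of two
```

The approximation never exceeds the exact product, and its relative error is bounded by 1/9, which is consistent with the remark above that logarithmic multipliers trade accuracy for simplicity.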
Software-based approaches that merge the approximate multiplier design into the design flow of the circuit have also been proposed. They automatically generate synthesizable hardware description code for approximate arithmetic circuits based on the accuracy requirement of the design [15]-[18]. Such techniques can prove useful when the targeted application does not have a uniform input distribution.
The basic binary multiplication process can be divided into three parts: partial product generation, partial product reduction and carry-propagate addition. Approximate computing can be introduced in all these steps. For instance, the first step can be approximated by truncating some of the least significant partial products (PPs) and then employing a compensation strategy [19], [20].
The partial product reduction step is typically the main target for approximations in a binary multiplier. A common approach to reduce the partial product matrix (PPM) relies on the use of approximate compressors. Compressors are logic circuits that aim to minimize the number of operands in the final step, which is the addition of the reduced partial products. They are XOR-rich circuits (thus slow and power-hungry) that count the number of ones in the input. The most basic exact compressor is the Full Adder, which reduces three digits into two, maintaining the original information. Many research contributions have focused on the approximation of the PPM compression phase [21]-[35].
In [21] the authors obtain approximate compressors by truncating outputs of some exact compressors, while in [22] and [25], compressors with only 2-bit outputs are proposed. Lossy compression of the rows in the PPM based on bit significance is investigated in [23]; the compression exploits approximate, OR-based half adders. In [24] simple OR gates serve as approximate compressors and two designs are proposed. The two designs are obtained using encoded partial products and approximate compressors, delivering different accuracy-electrical performance trade-offs. Several solutions employing 3:2 and 4:2 compressors to generate approximate multipliers are presented in [26], [29], [31], [32]. A set of Single-Weight Approximate Compressors (SWACs) is employed in [27] to construct approximate multipliers. Unlike the Full Adder, which produces a sum and a carry, these designs compress input bits derived from a PPM column into fewer output bits, maintaining the same initial weight. This allows a significant reduction of circuit complexity since fewer carry bits are generated and propagated. Maddisetty et al. [28] present the training of a neural network to devise an efficient approximate 4:2 compressor. In [30] two 4:2 compressors are presented: a novel 4:2 architecture, and a modified design obtained by substituting the AND / OR gates with NAND / NOR gates respectively. Although the Boolean expression is changed, when the modified version targets multipliers employing reduction steps in multiples of 2, the difference is nullified. Approximate 4:2 designs implemented in FinFET technology are presented in [33], [35]. In [34] the number of outputs of the approximate 4:2 compressor is reduced to one; three such compressors are proposed, as well as an error-correcting module.
Recursive multipliers are an interesting research area of the approximate computing field that aims to use small elementary approximate multiplier blocks, suitably assembled, to design larger multipliers [26], [36]-[42]. The advantage of the recursive building of larger multipliers is that it avoids a dedicated design for every bit-width and gains in terms of generality of the proposed approaches. As explained in [26], four n×n building blocks can be utilized to scale up to a 2n×2n multiplier. Several authors have used 4×4 approximate multipliers to recursively generate several 8×8 multiplier alternatives. The authors of [26] propose three 4:2 compressors, used to generate two 4×4 multipliers.
Guo et al. [38] propose a 4×4 approximate multiplier module. The corresponding 8×8 multiplier is made up from one 4×4 multiplier featuring OR-based compressors with no carry propagation in the lower part, two of the proposed 4×4 modules in the middle part, and an exact 4×4 multiplier for the most significant part. Differently from the other designs, the four products are summed using an approximate adder.
In [40] the authors consider the probability distribution of the input operands to propose 4×4 multipliers, consisting of approximate NOR-based half adder and full adder designs. These elementary blocks are exploited to build approximate recursive multipliers. In [41] a 4×4 approximate multiplier featuring an error detection and correction system is presented.
Similarly, in [36], [37], [39] and [42] the authors propose 2×2 approximate sub-multipliers, suitably arranged, to form larger size multipliers. Sixteen 2×2 modules are needed to create an 8×8 multiplier. Kulkarni et al. [36] present a 2×2 inexact multiplier with tunable error characteristics. In [37] the authors provide an exploration of the architectural space and propose their 2×2 module. The 2×2 approximate multiplier presented in [39] has an internal self-healing strategy that does not require coupled modules, while the proposed larger multipliers derived from the 2×2 blocks produce near zero mean error. In [42] two elementary multipliers are proposed that exhibit double-sided error distribution, while the resulting 8×8 design has the advantage of error compensation.
In this paper, two novel 4×4 approximate designs with minimal power requirements and competitive error performance are presented. The output is calculated by exploiting carry truncation and compensation techniques. These designs, along with an OR-based and an exact 4×4 multiplier, are used to generate 8×8, 16×16, and 32×32 approximate multipliers, following the strategies presented in [26] and [38].
The circuits proposed in this paper, as well as various previously proposed contributions, have been synthesized using a commercial 14nm FinFET standard cell library. Syntheses show that our circuits, compared to previously proposed designs, provide a good trade-off between error and electrical performance.
We have also investigated the performance of 8×8 approximate multipliers in image filtering applications and in the inference step of a pre-trained convolutional neural network. Obtained results confirm that the proposed circuits are good competitors in error-resilient applications.
The paper is organized as follows. In Section II the approximate OR-based and proposed 4×4 multipliers are presented. The architectures of the recursive 8×8 multipliers are in Section III. Section IV reports the performances of 4×4, 8×8, 16×16, and 32×32 multipliers. Section V shows a comparison with formerly proposed approximate multipliers for image processing applications and for the inference stage of a pre-trained convolutional neural network. The conclusions are drawn in Section VI.

II. 4×4 APPROXIMATE BINARY MULTIPLIERS
Let us consider two 4-bit unsigned numbers $a = \sum_{i=0}^{3} a_i 2^i$ and $b = \sum_{j=0}^{3} b_j 2^j$. The computation of their product, $y = \sum_{k=0}^{7} y_k 2^k$, consists of three steps. Firstly, the partial product matrix (PPM) is generated using AND gates between all the input bits. There are various techniques to carry out the second and third steps that reduce and sum the entire PPM to obtain the final product, e.g., employing full adders, half adders or 4:2 compressors in Wallace or Dadda configurations. Figure 1 shows the Wallace reduction tree for an exact 4×4 multiplier. Three half adders (dashed rectangles) and five full adders (rectangles) are employed to reduce the PPM. The sum and carry outputs produced by half and full adders are indicated in the figure as SN_x, CNM_x, where N and M indicate the origin and destination column, while x indicates the reduction stage. After two stages of reduction we obtain the three least-significant bits of the output Y[2]...Y[0] and two 4-bit values that are summed to obtain the most significant bits of the output, Y[7]...Y[3].
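As a behavioral illustration of the first step, the following Python sketch builds the PPM column by column and recovers the exact product by weighting each column by its power of two. It models the arithmetic only, not the gate-level Wallace tree of Figure 1; the function names are chosen here for illustration.

```python
def partial_product_matrix(a: int, b: int, n: int = 4) -> list[list[int]]:
    """Return the PPM as columns: cols[k] holds every a_i AND b_j with i + j == k."""
    cols = [[] for _ in range(2 * n - 1)]
    for i in range(n):
        for j in range(n):
            a_i = (a >> i) & 1
            b_j = (b >> j) & 1
            cols[i + j].append(a_i & b_j)   # one AND gate per partial product
    return cols


def exact_product_from_ppm(a: int, b: int, n: int = 4) -> int:
    """Sum all partial products, each weighted by 2**k for its column k."""
    cols = partial_product_matrix(a, b, n)
    return sum(sum(col) << k for k, col in enumerate(cols))


print([len(col) for col in partial_product_matrix(0, 0)])  # column heights of the 4x4 PPM
print(exact_product_from_ppm(13, 11))                       # same as 13 * 11
```

The column heights 1, 2, 3, 4, 3, 2, 1 show why the central columns dominate the cost of the reduction step.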
A simple and fast way to approximate the product of two binary numbers is to use an approximate multiplier with OR compressors. In this case all the partial products in each column of the PPM are fed to OR gates as shown in Figure 2. As it can be observed the most significant bit is always zero. This approximated design is a kind of lower bound for circuit complexity but, as shown in Table 3, it exhibits the worst error performance.
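A minimal behavioral sketch of such an OR-based multiplier is given below, assuming that every PPM column is collapsed by a single OR gate and that no carries are generated, as the description above suggests. The function name is illustrative and the code is not derived from the netlist of Figure 2.

```python
def or_based_multiply(a: int, b: int, n: int = 4) -> int:
    """Approximate n x n product: output bit k is the OR of all a_i AND b_j with i + j == k."""
    y = 0
    for k in range(2 * n - 1):          # columns 0 .. 2n-2; bit 2n-1 is never set
        bit = 0
        for i in range(n):
            j = k - i
            if 0 <= j < n:
                bit |= (a >> i) & (b >> j) & 1
        y |= bit << k
    return y


print(or_based_multiply(8, 13))    # one operand is a power of two, so the result is exact
print(or_based_multiply(15, 15))   # every column saturates to 1; the exact product is 225
```

Since an OR never exceeds the sum of its inputs, this block always underestimates or matches the exact product, and the result is exact whenever each column contains at most one '1'.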

A. 4×4 APPROXIMATE MULTIPLIER N1
In the circuit shown in Figure 2, the sums of the partial products are approximated using OR gates. As a more accurate base point, we can assume an approximate multiplier that uses OR gates to sum the lower half of the matrix of partial products, and full or half adders for the higher part, as shown in Figure 3. Note that the approximated multiplier in Figure 3 requires three compression stages, while the OR based in Figure 2 obtains the result with a fast single stage.
The design in Figure 3 contains three XOR gates that are known to be bulky and slow. An attempt has been made to simplify it. The first step is to substitute the XOR gate in column 4 with a simpler OR gate: The next step is the manipulation of the carry of the same full adder: Let us simplify the expression by neglecting the last term: A customized Full Adder is employed in column 5 to add the three terms. The sum is exact and uses an XOR gate: The carry can be significantly simplified: By neglecting the terms that are a product of three literals (they have a lower probability of being '1') we get: The two terms in column 6 are fed into a customized half adder. The sum is the XOR of the two inputs: Finally, the carry of the Half Adder is approximated as: The resulting design is named N1 and is shown in Figure 4. N1 uses three stages to reach the result and uses six OR gates, four AND gates, and one XOR gate. Compared to the exact Wallace 4×4 multiplier, it shows a vast improvement in terms of both power and speed. In fact, in the exact design the third stage consists of cascaded half and full adders, resulting in three sub-stages, all of them containing at least one XOR gate. Namely, 28 AND gates, 8 OR gates and 12 XOR gates are used in the exact design. Obviously, the proposed design provides an inexact result. The error characteristics of the proposed blocks are discussed in Section IV.

B. 4×4 APPROXIMATE MULTIPLIER N2
Let us now start from a less accurate circuit than the one in Figure 3. In this circuit, shown in Figure 5, all the terms from Y[0] to Y[4] are computed as the output of OR gates, while the remaining bits are computed without approximations. As shown in Figure 5, two half adders are needed together with the OR gates to complete the design of the multiplier. The proposed architecture takes the circuit in Figure 5 as a starting point for further simplification.
The first step is to substitute the XOR gate of the half adder in column 5, with an OR gate: The carry of the same half adder is: The sum of the last half adder is: By neglecting the second term: With a final approximation: The carry of the last Half Adder is: The resulting design is named N2 and shown in Figure 6. This rather simple design has only two additional AND gates with respect to the OR-based design shown in Figure 2. However, the performances of the proposed design are considerably better as will be discussed in Section IV and shown in Table 3, making this design useful for higher order multipliers.
As mentioned in the introduction, scaling up to a 2n×2n multiplier can be achieved by exploiting four n×n multipliers. The same technique can be used recursively to design even larger multipliers. For instance, four suitably placed 2×2 multipliers form a 4×4 multiplier, while sixteen 2×2 multipliers can be used to generate an 8×8 design. Note that the building blocks do not need to be the same and different ones can be used, to obtain different electrical performance-accuracy trade-offs. As a rule of thumb, if uniform distribution is expected for the input operands, exact or high precision modules should occupy the most significant portion of the design. Moving towards the least significant part, modules that are less accurate, but also less demanding in terms of resources, might be used.
Consider two 8-bit unsigned numbers $a = \sum_{i=0}^{7} a_i 2^i$ and $b = \sum_{j=0}^{7} b_j 2^j$. In order to exploit recursive 4×4 multipliers to calculate the product $y = \sum_{k=0}^{15} y_k 2^k$, each number is divided into two 4-bit parts, $a = a_H 2^4 + a_L$ and $b = b_H 2^4 + b_L$. The four sub-products $a_L b_L$, $a_H b_L$, $a_L b_H$ and $a_H b_H$ are performed exploiting the corresponding blocks. Finally, the four sub-products need to be added. As shown in Figure 7, the four sub-products are added employing an exact adder. Table 1 shows the circuits considered for comparison that apply this design methodology ([26], [36], [37], [40] and [42]), the corresponding 4×4 building blocks, and how they are used to build larger multipliers. The multiplier names reported in the Table are directly taken from the reference papers. Note that the 4×4 approximate modules used in [42], namely mul2a4 and mul2b4, are also recursive multipliers made up of 2×2 blocks. Table 1 also shows the composition of two of the four 8×8 multipliers proposed in this paper, namely N8-5 and N8-6. They use the proposed N1 and N2 blocks solely in the least significant part of the multiplier and produce fairly accurate results. As will be shown in the following, they outperform the state-of-the-art when compared with other proposals in the same error range.
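The recursive composition with exact sub-product addition can be sketched in Python as follows. Each 4×4 block is passed in as a function, so that different blocks can occupy different positions. The OR-based block is used here only as a stand-in for an approximate module in the least significant position; the logic of N1 and N2 and the block placements of Table 1 are not reproduced, and all names are illustrative.

```python
from typing import Callable

Mul4 = Callable[[int, int], int]   # any 4x4 block: two 4-bit inputs, one product


def exact4(a: int, b: int) -> int:
    """Exact 4x4 building block."""
    return a * b


def or4(a: int, b: int) -> int:
    """OR-based 4x4 block: each PPM column is collapsed by an OR gate."""
    y = 0
    for k in range(7):
        bit = 0
        for i in range(4):
            j = k - i
            if 0 <= j < 4:
                bit |= (a >> i) & (b >> j) & 1
        y |= bit << k
    return y


def recursive_8x8(a: int, b: int, ll: Mul4 = exact4, hl: Mul4 = exact4,
                  lh: Mul4 = exact4, hh: Mul4 = exact4) -> int:
    """8x8 product from four 4x4 blocks, with exact addition of the sub-products."""
    a_h, a_l = a >> 4, a & 0xF
    b_h, b_l = b >> 4, b & 0xF
    return ((hh(a_h, b_h) << 8)                          # weight 2**8
            + ((hl(a_h, b_l) + lh(a_l, b_h)) << 4)       # weight 2**4
            + ll(a_l, b_l))                              # weight 2**0


print(recursive_8x8(200, 100))            # all blocks exact: equals 200 * 100
print(recursive_8x8(200, 100, ll=or4))    # approximate block only in the LSB position
```

With an approximate block only in the `ll` position, the error of the 8×8 result equals the error of that single 4×4 block, which is why placing the least accurate modules in the least significant part limits the overall error.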
The circuit proposed in [39], which is used as a comparison in this paper, is not shown in Table 1, since it exploits 2×2 approximate multipliers to scale directly up to 8×8, without proposing specific 4×4 building blocks.
An alternative way to add the sub-products is proposed in [38] and used also in this paper. The utilized building blocks and their positions are shown in Table 2. Differently from Figure 7, the final product is not the exact addition of the four sub-products, but an approximated version of it. As it can be seen in Figure 8, the seven least significant columns of the sub-products are marked with red color, indicating that they are summed using an approximate adder that uses one OR gate in every column. However, the nine most significant columns are added with an exact adder. Note that the first sub-product has only seven output bits as shown in Figure 2.
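A behavioral sketch of this approximate sub-product addition is given below. It assumes that each of the seven least significant columns is reduced by a plain OR of the aligned sub-product bits, and that no carry is passed from those columns into the exact nine-column adder. The second assumption is made here for simplicity and may differ from the adder actually used in [38]; the function and parameter names are illustrative.

```python
def approx_add_subproducts(p_ll: int, p_hl: int, p_lh: int, p_hh: int,
                           or_columns: int = 7) -> int:
    """Combine four 4x4 sub-products into a 16-bit result.

    Columns 0 .. or_columns-1 are ORed (assumed: no carry generated);
    the remaining columns are added exactly.
    """
    aligned = [p_ll, p_hl << 4, p_lh << 4, p_hh << 8]   # apply the sub-product weights
    low_mask = (1 << or_columns) - 1

    low = 0
    for term in aligned:
        low |= term & low_mask                 # one OR gate per low column

    high = sum(term >> or_columns for term in aligned)   # exact adder on the upper columns
    return ((high << or_columns) | low) & 0xFFFF


# Exact sub-products of 200 * 100, combined with the approximate adder
a, b = 200, 100
a_h, a_l, b_h, b_l = a >> 4, a & 0xF, b >> 4, b & 0xF
print(approx_add_subproducts(a_l * b_l, a_h * b_l, a_l * b_h, a_h * b_h), a * b)
```

Under these assumptions the result never exceeds the exact sum, and with exact sub-products the error stays below 2^8, since only the three low-weight terms can collide in the ORed columns.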

IV. PERFORMANCES
The proposed and reference circuits are all synthesized in a 14nm FinFET technology, using Cadence Genus and imposing proper timing constraints. Power dissipation is computed by simulating the final netlist with random inputs, to obtain the switching activity of each node. The input vector array is identical for all designs with the same input bit width. In the following tables "Min delay" refers to the strictest timing constraint, at which each circuit can be synthesized with non-negative slack and provides information regarding the maximum working speed of each design.
Area, power, and delay are compared against the results of the corresponding (4×4, 8×8, 16×16, or 32×32) exact multiplier. The exact design is obtained by describing the circuit in HDL with the multiplication operator and letting the synthesizer choose the near-optimal topology for the given constraint. Therefore, the electrical performances are sometimes slightly worse than those presented in the literature that compare with a fixed exact design.
Error performance is obtained by an exhaustive simulation, for both 4×4 and 8×8 multiplier designs. For 16×16 and 32×32 designs the error performances are computed using a random set of uniformly distributed test vectors. The numbers of test vectors are $10^5$ and $10^6$ for the 16 and 32 bit multipliers, respectively.
The error metrics used in this paper are listed in the following. Let $Y_{E_i}$ be the exact result of the multiplication between the two n-bit operands $A_i$ and $B_i$, such that $Y_{E_i} = A_i B_i$, and let $Y_{A_i}$ be the approximated output returned by the investigated inexact multiplier. The error $E_i$ of each multiplication is the difference between $Y_{A_i}$ and $Y_{E_i}$, the error distance $ED_i$ is the absolute value of $E_i$, and the relative error distance $RED_i$ is $ED_i$ divided by $Y_{E_i}$.
1. The Normalized Mean Error Distance, NMED, is defined as the average value of ED divided by the maximum possible value returned by the multiplier, which is $(2^n - 1)^2$.
2. The Mean Relative Error Distance, MRED, is given by the average value of RED.
3. The number of effective bits, NoEB, is computed from $E_{ms}$, the mean square error, given by the average value of $E^2$.
4. The error rate, ER, is defined as the number of erroneous multiplications (with $E_i \neq 0$) over the total amount of possible inputs, $2^{2n}$.
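The metrics above can be evaluated by exhaustive simulation with a short Python sketch such as the following. Three points are assumptions made here rather than statements from the text: the sign convention of the error (approximate minus exact, which does not affect any of the metrics), the handling of input pairs with a zero exact product (they contribute zero to MRED), and the expression used for NoEB, taken as $2n - \log_2(1 + \sqrt{E_{ms}})$. The function name `error_metrics` is illustrative.

```python
import math
from typing import Callable


def error_metrics(approx_mul: Callable[[int, int], int], n: int) -> dict[str, float]:
    """Exhaustively evaluate an n x n unsigned approximate multiplier."""
    total = 2 ** (2 * n)                 # number of input pairs
    max_out = (2 ** n - 1) ** 2          # largest exact product
    sum_ed = 0
    sum_red = 0.0
    sum_sq = 0
    wrong = 0
    for a in range(2 ** n):
        for b in range(2 ** n):
            exact = a * b
            e = approx_mul(a, b) - exact     # assumed sign convention
            ed = abs(e)
            sum_ed += ed
            sum_sq += e * e
            if exact != 0:                   # assumption: skip RED when the exact product is 0
                sum_red += ed / exact
            if e != 0:
                wrong += 1
    e_ms = sum_sq / total
    return {
        "NMED": sum_ed / total / max_out,
        "MRED": sum_red / total,
        "NoEB": 2 * n - math.log2(1 + math.sqrt(e_ms)),   # assumed NoEB expression
        "ER": wrong / total,
    }


# Toy approximate multiplier (hypothetical): drops the least significant output bit
print(error_metrics(lambda a, b: (a * b) & ~1, n=4))
```

For an exact multiplier the sketch returns NMED = MRED = ER = 0 and NoEB = 2n, which is a convenient sanity check before evaluating an approximate design.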

A. 4×4 APPROXIMATE MULTIPLIERS
The electrical and error performances of the considered 4×4 approximate multipliers are summarized in Table 3. To ensure a fair comparison between the circuits, avoid biased optimizations by the synthesis tool, and emphasize the low-power performance of the structures, the circuits have been synthesized with a timing constraint of 250ps to obtain the area and power values. The circuits are simulated applying a uniformly distributed random set of $2 \cdot 10^4$ test vectors to gather the switching activity. The total power reported in the table is computed for a clock frequency of 1GHz. It is worth noting that the circuits proposed in this paper are for general purpose applications, thus a uniform distribution of the input is considered. However, automated designs [15]-[18] or dedicated circuits previously presented in the literature could provide better performances for a specific distribution of the input vectors.
As can be observed in Table 3, the proposed circuits are very small and come second only to the OR-based design. The same can be stated also for power dissipation, with N2 having a clear advantage. When it comes to speed, N2 is the fastest design while N1 is among the fastest. The proposed multipliers exhibit competitive NMED, MRED and NoEB with respect to the state-of-the-art. The relative reduction in power dissipation with respect to the exact design vs NoEB is shown in Figure 9. The proposed design N2 dissipates 18% less power than the least energy-hungry architectures to date, M2 and MxA, proposed in [26] and [40] respectively, while still providing a smaller approximation error.

B. 8×8 APPROXIMATE MULTIPLIERS
The results of the 8×8 approximate multipliers are shown in Table 4. Recursive designs are reported at the top part of the table, while selected approximate designs following different methodologies are shown at the bottom part. Power reduction against number of effective bits for all designs is displayed in Figure 10. Non-filled shapes in the figure correspond to non-recursive designs.
All circuits are synthesized for a 1000ps timing constraint and simulated with the same set of $2 \cdot 10^4$ uniformly distributed random vectors. The total power reported in the table is computed for a clock frequency equal to 1GHz.
The designs presented in [13] employ a smaller, segmented multiplier. Specifically, instead of an 8-bit multiplier, a 4-bit multiplier with or without error correction respectively, is used. The product is then shifted accordingly. In this simple circuit, hardware resources and power consumption are kept to significantly low levels, while the error metrics are still competitive.
Note that the entries of [14] and [20] exhibit identical electrical performances respectively, since they refer to the same circuits with different settings (both designs allow for configurable accuracy). While the range of chosen accuracy in [14] is limited and the innate flexibility results in increased area requirements, the circuit is very fast, overcoming all the investigated contributions, including the proposed designs. As it can be observed in Table 4, the minimum accuracy of this design, is still greater than that of the design M8-2 proposed in [26], while power reductions are similar.
The circuit presented in [20] offers dynamic truncation at runtime, by enabling or disabling AND gates that form specific partial products. "DT0" refers to the case where all the AND gates are enabled, resulting in an exact multiplier. However, the additional hardware resources result in a greater power consumption with respect to the exact design (hence the negative power reduction). "DT8" refers to the maximum possible truncation where a 43.62% power reduction is achieved. The numbers in the names indicate the level of truncation.
The authors in [26] offer a number of circuits covering a wide range of accuracy. Designs M8-5 and M8-6 are the most precise ones, using one approximate and three exact 4-bit multipliers. While the synthesized circuits are slightly slower than the exact multiplier, they offer some power reduction at a relatively small expense in accuracy.
Designs Ax8_1 and AxRM1, presented in [40] and [42] respectively, employ three exact modules and one approximate module. While these are the most accurate designs presented in the respective papers, they are still less accurate than M8-6 and M8-5 of [26], and even less accurate than the proposed N8-5 and N8-6. At the same time, the circuits are quite large, and slower than the exact multiplier. This behavior follows the pattern presented in Figure 9 for the 4×4 building blocks. Among the less accurate designs, Ax8_3, with one accurate module, manages to surpass M8-1, which uses no accurate modules, both in accuracy and in power reduction. However, it is slightly larger and slower.
An interesting architecture is proposed in [38]. It uses one exact multiplier, two custom modules, and an OR-based 4×4 approximate multiplier for the least significant part. In terms of accuracy, this relatively small design performs similarly to the proposed design N8-L1, as well as to M8-3 and M8-4. It achieves a significant power reduction with respect to M8-3 and M8-4, but N8-L1 leads. Compared with circuits with a similar power reduction percentage, M8-2 and Yang_7'b1, it exhibits a far more accurate behavior.
As it can be seen in Table 4, among the recursive topologies, the proposed circuits N8-L1 and N8-L2 occupy the smallest area and achieve the biggest reduction in power consumption. Moreover, they are very fast circuits, bested only by multipliers proposed in [13] and [14], that are not recursive, but optimized for a given bit width. At the same time, they exhibit competitive behavior in terms of accuracy. As it can be observed in Figure 10, even though there are more precise circuits in the literature, N8-L1 and N8-L2 provide a certain level of accuracy at a very low cost.
On the other hand, proposals N8-5 and N8-6 are very accurate circuits, exploiting three exact 4×4 multipliers and one proposed 4×4 multiplier. They offer a very high number of effective bits, matched only by the designs M8-5 and M8-6 of [26]. However, by exploiting the proposed designs N1 and N2, N8-5 and N8-6 achieve a greater power reduction, as can be observed in Figure 10 and Table 4.
As demonstrated, we have designed four 8×8 multipliers, two with a NoEB around 8, and two with a NoEB around 12, that to the best of our knowledge exhibit a significant advancement with respect to the state-of-the-art.

C. 16×16 APPROXIMATE MULTIPLIERS
The 8×8 designs and the methodologies described above can be used to scale up to 16×16 multipliers. As already shown in Section III, two different approaches are used to generate 16×16 designs. Table 5 summarizes the architectures of the considered designs. The circuits following the most straightforward approach (exact sub-product addition) are presented at the top part of Table 5, while the ones using the technique presented in [38] (approximate sub-product addition) are at the bottom part.
The performances of 16×16 approximate recursive multipliers are shown in Table 6. All 16×16 designs have been synthesized under the same timing constraint: 1000ps. Furthermore, they have been simulated with the same set of $10^5$ uniformly distributed random vectors, with an input switching frequency equal to 1GHz.
The architecture proposed in [38] results in designs that significantly outperform other contributions. It should be noted that the most straightforward approach from an algorithmic point of view, followed by the circuits at the top part of Table 5, does not result in optimal configurations. In fact, each 16×16 multiplier is composed of four identical approximate 8×8 multipliers, which in turn may be composed of some exact 4×4 multipliers. On the other hand, [38] employs a single exact 8×8 multiplier placed in the most significant part, thus making an important impact on accuracy, despite the final approximate addition.
Moreover, the non-recursive exact and OR-based 8×8 multipliers in the most and least significant parts respectively, as well as the approximation in the final addition, allow this architecture to exploit minimal hardware resources. Therefore, the three designs that follow this approach are the smallest, fastest, and least power-hungry. The circuit proposed in [38] manages to outperform N16-L1 and N16-L2 in accuracy, while N16-L2 slightly outperforms LOAM in terms of power reduction. N16-L2 also occupies the smallest area. Among the strictly recursive designs, N16-5 and N16-6 achieve a higher power reduction than circuits with similar or even lower accuracy.

D. PROPOSED 32×32 APPROXIMATE MULTIPLIERS
The performances of the proposed 32×32 approximate multipliers are shown in Table 7. The circuits have been synthesized under a timing constraint of 1000ps and simulated with $10^6$ uniformly distributed random vectors, and an input switching frequency equal to 1GHz.

V. APPLICATIONS
Image processing is one of the most commonly considered error resilient applications and many papers test the proposed circuits in this scenario. In this paper, two image processing applications are considered: image blurring and image sharpening. The applications provide a more in depth understanding of the applicability range of the proposed designs.

A. IMAGE SMOOTHING
In image processing, low pass filtering results in image smoothing which effectively removes the high spatial frequency noise from the image. The low-pass filter exploits a moving kernel that processes one pixel at a time and modifies it considering the pixels in proximity. The processing of each pixel requires a number of multiplications that depends on the size of the kernel. In fact, the value of the modified pixel is the weighted average of the neighboring pixels. Moreover, image blurring is an error tolerant application, as the human eye is not able to detect trivial details.
The kernel considered for smoothing is a two-dimensional, rotationally symmetric, 3×3 Gaussian low-pass filter, with a standard deviation equal to 1.5, as in [27]. The floating-point numbers of the kernel are multiplied by $2^{10}$ and then rounded. In this way, the kernel's values are appropriate for the considered 8-bit input multipliers. The original and modified kernels are shown in Table 8.
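The kernel scaling can be sketched in Python as follows. The sketch assumes the usual normalized Gaussian weights, Python's built-in rounding, and a final right shift by 10 bits to undo the scaling; the exact rounding and normalization behind Table 8 are not stated in the text, so the integer values produced here are illustrative. The `mul` argument of `smooth_pixel` is a hypothetical hook where an approximate 8×8 multiplier model can be plugged in.

```python
import math


def gaussian_kernel(size: int = 3, sigma: float = 1.5) -> list[list[float]]:
    """Normalized, rotationally symmetric Gaussian kernel."""
    half = size // 2
    weights = [[math.exp(-(x * x + y * y) / (2 * sigma * sigma))
                for x in range(-half, half + 1)]
               for y in range(-half, half + 1)]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


def quantize_kernel(kernel: list[list[float]], frac_bits: int = 10) -> list[list[int]]:
    """Scale by 2**frac_bits and round, so every coefficient is a small integer."""
    return [[round(w * (1 << frac_bits)) for w in row] for row in kernel]


def smooth_pixel(patch: list[list[int]], qkernel: list[list[int]],
                 mul=lambda p, k: p * k, frac_bits: int = 10) -> int:
    """Weighted average of a 3x3 patch; `mul` can be an approximate multiplier."""
    acc = sum(mul(patch[r][c], qkernel[r][c]) for r in range(3) for c in range(3))
    return min(255, acc >> frac_bits)      # undo the 2**frac_bits scaling


qk = quantize_kernel(gaussian_kernel())
print(qk)                                   # all coefficients fit in 8 bits
print(smooth_pixel([[200] * 3] * 3, qk))    # uniform patch stays close to 200
```

Because the rounded coefficients sum to slightly less than 2^10, a uniform patch is reproduced with a small negative bias even with exact multipliers, which is worth keeping in mind when comparing against a floating-point reference.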
Image processing, exploiting the investigated multipliers, has been performed aiming to blur a test image. The obtained images are shown in Figure 11. The same processing has been also performed with exact multipliers to provide an effective comparison for all designs. The structural similarity index (SSIM) and the peak signal to noise ratio (PSNR) provide a numerical indication of each multiplier's performance in image smoothing.
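For reference, PSNR can be computed as in the following Python sketch, which uses the standard definition for 8-bit images (peak value 255) and treats the image processed with exact multipliers as the reference. Images are represented as nested lists of pixel values to keep the example self-contained; SSIM is more involved and is not sketched here.

```python
import math


def psnr(reference: list[list[int]], test: list[list[int]], peak: int = 255) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized grayscale images."""
    squared_error = 0
    count = 0
    for ref_row, test_row in zip(reference, test):
        for r, t in zip(ref_row, test_row):
            squared_error += (r - t) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf                  # identical images
    return 10 * math.log10(peak * peak / mse)


ref = [[10, 20], [30, 40]]
off_by_one = [[11, 21], [31, 41]]
print(psnr(ref, ref))          # identical images give infinite PSNR
print(psnr(ref, off_by_one))   # every pixel differs by one grey level
```

A uniform error of one grey level corresponds to a PSNR of about 48 dB, which gives a sense of scale for the values reported for the compared multipliers.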
The results are shown in Table 9. The recursive designs are in the top part of the table, while the non-recursive ones occupy the bottom part. The proposed circuits N8-5 and N8-6 share the best results with the designs M8-5 and M8-6 proposed in [26] and Ax8_1 proposed in [40]. N8-L1 and N8-L2 follow close behind but still show a competitive behavior while achieving the greatest power reduction, as shown in Figure 10.

B. IMAGE SHARPENING
Sharpening or high pass filtering aims to make fine details more distinct and remove the blurring of a digital image, by enhancing transitions in the spatial intensity of the image. High frequencies are boosted while low frequencies are reduced. It should be noted that over-sharpening might result in unwanted halo artifacts.
The image sharpening process is similar to the smoothing process, but it uses a different kernel for the convolution. The authors in [14], [19], [25], [35] presented an image sharpening application exploiting a 5×5 kernel, Mask. In the expression for the output pixels of the sharpened image, (20), X(i, j) denotes a pixel from the input image, while Y(i, j) denotes the corresponding pixel of the sharpened output.
* Area and power are reported for the circuits synthesized with a timing constraint of 1000 ps. Min delay is the minimum timing at which the circuit can be synthesized with a non-negative slack.
We have used the considered approximate multipliers, as well as an exact multiplier, to sharpen an RGB test image. The results are shown in Figure 12. SSIM and PSNR with respect to the image sharpened by exact multipliers are reported in Table 10. All proposed circuits have a high similarity ratio with the reference image. Even though there are better performing multipliers for this application, the proposed circuits exhibit reasonable behavior for such low-power designs.

C. IMAGE CLASSIFICATION WITH CONVOLUTIONAL NEURAL NETWORKS
Convolutional Neural Networks (CNNs) play an increasingly important role in machine learning, particularly for image recognition, object identification and speech recognition tasks. CNNs are error-tolerant and require a huge number of multiplications; therefore, they are ideal candidates for using approximate multipliers [4]. We performed some image recognition experiments with the investigated approximate multipliers, using a simple CNN composed of 9 layers, not counting the input one. The CNN includes two convolutional layers, each one followed by batch normalization and Rectified Linear Unit (ReLU) layers, a max pooling layer, a fully connected layer and a final softmax layer. Two datasets have been considered: MNIST and SVHN. The former is a dataset of handwritten digits containing 70000 28×28-pixel greyscale images, split into 60000 training images and 10000 testing images [44]. The Street View House Number (SVHN) dataset contains about 100000 32×32 RGB images of house numbers obtained from Google Street View, divided into 73257 training and 26032 test images [45]. In our tests, SVHN images have been converted into greyscale as the color has no significance in the classification [4].
The training of the CNNs has been performed in Matlab, by using floating-point arithmetic. After training, quantization of the convolutional and fully connected layers, requiring the vast majority of calculations, has been performed, to allow the testing of the approximate multipliers. We use test images to exercise the network and collect the dynamic ranges of the inputs of convolutional and fully connected layers. These inputs are positive values, due to the ReLU layers, and are easily quantized as 8-bit unsigned numbers that can directly feed the multipliers. The weights in the convolutional and fully connected layers of the network, on the other hand, are learnt during training and are signed numbers. Therefore, following [43], after quantization we converted the weights in sign-magnitude representation to perform multiplications using the investigated unsigned approximate multipliers.
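The quantization and sign-magnitude handling described above can be sketched as follows. This is a hedged illustration only: the function names, the linear mapping onto the observed dynamic range and the choice of an 8-bit weight magnitude are assumptions, not details confirmed by the text. The `mul` argument is a placeholder for a model of an unsigned approximate multiplier.

```python
def quantize_unsigned(x, max_value, bits=8):
    """Map a non-negative value in [0, max_value] to an unsigned integer code."""
    levels = (1 << bits) - 1
    code = int(round(x / max_value * levels))
    return max(0, min(levels, code))

def to_sign_magnitude(w, max_abs, bits=8):
    """Quantize a signed weight into (sign, magnitude), magnitude unsigned.

    Assumption: the magnitude uses the full `bits` width.
    """
    sign = -1 if w < 0 else 1
    return sign, quantize_unsigned(abs(w), max_abs, bits)

def signed_product(activation_code, weight, mul=lambda a, b: a * b):
    """Multiply with an unsigned multiplier, then reapply the weight's sign."""
    sign, magnitude = weight
    return sign * mul(activation_code, magnitude)
```

Because the sign is applied after the multiplication, the unsigned multiplier only ever sees non-negative operands, which is what allows the investigated designs to be used without modification.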
Classification results are reported in Table 11. Column "Acc. loss" refers to the reduction in classification accuracy (in percentage) compared to the floating-point multiplier.
For the MNIST dataset, the considered CNN in floating-point implementation shows a remarkable accuracy of more than 99%. The accuracy remains almost unchanged when using an exact 8-bit multiplier after network quantization. The majority of the investigated approximate multipliers perform well with this simple dataset, with some exceptions (Yang_7'b0, Ax8_2, Ax8_3, SSM_m4, Kul8, Reh8, AxRM3, ISH1). The proposed N8_L1 gives very good results, showing a mere 0.64% reduction in accuracy, with more than 44% power saving.
For the SVHN dataset the CNN accuracy is about 87%. In this case, network quantization yields a slight accuracy improvement, a phenomenon already observed in the literature [4].
Several approximate multipliers yield a large accuracy reduction in this more demanding application. The multipliers giving an accuracy drop lower than 0.5% are: the proposed N8_5 and N8_6, DT2 of [20], M8_5 and M8_6 of [26], Ax8_1 of [40] and AxRM1 of [42]. Among these, the proposed N8_6 gives the best power reduction of more than 14%. Design Yang_7'b1 of [14] also performs well, with a reduction in accuracy of 3.3% and a power saving of more than 24%.

VI. CONCLUSION
In this paper we introduced two low-energy 4×4 approximate multipliers, obtained by simplifying the sum and carry expressions of the partial product matrix adders. These designs were recursively used to generate 8×8, 16×16 and 32×32 approximate multipliers, following two different architectures. Two 8×8 designs have been proposed with a NoEB around 8, and two with a NoEB around 12. Each 8×8 approximate multiplier consists of exact, proposed and OR-based 4×4 designs. By exploiting the low-power proposed circuits, N1 and N2, our approximate multipliers achieve great power reduction while still exhibiting competitive error performance. The error vs. power trade-off is compared with state-of-the-art approximate multipliers, showing an improvement for both architectures. The proposed 8×8 circuits are tested in image processing and in image classification using a convolutional neural network, demonstrating that they can be used to save power without sacrificing the result in typical error-resilient applications.
FIGURE 12. Image sharpening obtained with different multipliers. The circuits proposed in this paper are highlighted in bold.
EFSTRATIOS ZACHARELOS received the BS degree in physics from Aristotle University, Thessaloniki, Greece, in 2016, and the MS degree in electronic physics from Aristotle University, Thessaloniki, Greece, in 2019. He is currently working toward the PhD degree in electronic engineering with Federico II University, Naples, Italy. He is the co-author of four papers and his main research interests include real-time resampling, signal processing, and approximate computing.
ITALO NUNZIATA received the BS degree in electronic engineering from the University of Napoli Federico II, Italy. He is currently working toward the MS degree in electronic engineering with Federico II University, Italy, and the double degree in electronic and telecommunication with the University of Technology of Lodz, Poland. His current research interests include design and analysis of digital VLSI circuits and approximate computing.
GERARDO SAGGESE received the MSc degree (Hons.) in electronic engineering from the University of Napoli Federico II, Italy, and the double degree in electronic and telecommunication from the University of Technology of Lodz, Poland, in 2020. He is currently working toward the PhD degree in information technology and electrical engineering with the University of Napoli Federico II. His current research interests include signal processing, low power integrated circuit, brain machine interface and power circuits.
ANTONIO G. M. STROLLO (Senior Member, IEEE) received the MS degree (Hons.) and the PhD degree in electronic engineering from the University of Napoli Federico II, Italy. From 2002, he is a full professor with the University of Napoli Federico II, Italy, where he has been the head of the Department of Electronic and Telecommunication Engineering from 2005 to 2008. He has published more than 140 papers on international journals and conferences. His current research interests include design and analysis of digital VLSI circuits. Associate editor for IEEE Transactions on Circuits and Systems-I (2009 to 2012); currently he is associate editor for Integration, the VLSI Journal.
ETTORE NAPOLI (Senior Member, IEEE) received the PhD degree in electronic engineering in 1999, the electronic engineering degree (Hons.) in 1995, and the physics degree (Hons.) in 2009. He was a research associate with the Engineering Department, University of Cambridge, U.K., in 2004. Full professor with the University of Napoli Federico II in 2020. Full professor with the University of Salerno since 2021. He has authored or coauthored more than hundred articles published in international journals and conferences. His research interests include VLSI circuit design and modeling and design of power semiconductor devices.