A Novel Approximate Adder Design Using Error Reduced Carry Prediction and Constant Truncation

This paper proposes a novel approximate adder that exploits an error-reduced carry prediction and constant truncation with error reduction schemes. The proposed adder design techniques significantly improve overall computation accuracy while providing excellent hardware efficiency. Particularly, the proposed carry prediction technique can reduce a prediction error rate by up to 75% compared to existing approximate adders considered in this paper. Furthermore, the error reduction technique also enhances the overall computation accuracy by decreasing the error distance (ED). Our experimental results show that the proposed adder improves the normalized mean ED (NMED) and mean relative ED (MRED) by up to 91.4% and 98.9%, respectively, compared to the other approximate adders. Importantly, an excellent design tradeoff allows the proposed adder to be the most competitive of the adders under consideration. Specifically, the proposed adder achieves up to 95.7%, 91.1%, and 93.2% reductions of the power-NMED, energy-NMED, and area-delay product (ADP)-NMED products, respectively, compared to the other adders. Our adder enhances the power-, energy-, and ADP-MRED products by up to 99.4% compared to the others. In particular, the figure of merit (FoM) considering both hardware and accuracy of the proposed adder is up to 93.05% smaller than that of the other approximate adders considered herein. Furthermore, we confirm that the approximation errors caused by the proposed adder have very little impact on output quality when adopted in practical applications, such as digital image processing and machine learning.


I. INTRODUCTION
With the prevalence of battery-operated mobile and portable devices, power and energy consumption have become key constraints in system design because applications on these devices process vast amounts of computationally intensive information, such as multimedia (i.e., image, video, and audio) processing, deep learning, data mining, and recognition, under a limited power and energy budget [1]-[6]. Many applications do not always require perfect computation accuracy [7]-[9]. For example, multimedia processing that involves human senses is error-tolerant. In other words, humans usually do not perceive the output quality degradation caused by computation errors in these applications, and a certain level of error can be acceptable. The limitation of human perception offers an opportunity for a new computing paradigm, approximate computing, which trades computation accuracy for power and energy [10]-[12]. Because adders are fundamental arithmetic components in computing systems, the design of efficient approximate adders is a practical way to enable approximate computing. Therefore, it has gained remarkable attention from researchers, and a significant number of approximate adder designs have been presented in the technical literature [13]-[35]. We will review some existing approximate adders in Section II.
The associate editor coordinating the review of this manuscript and approving it for publication was Cihun-Siyong Gong.
Approximate adders can be classified as block-based and full adder (FA)-based designs. Block-based approximate adders split an entire adder into multiple smaller sub-adders that perform partial additions concurrently [22]-[29]. The main idea of this approach is to cut a long carry propagation chain to achieve faster additions. However, it requires more area and power than FA-based approximate adders. FA-based adders use approximate 1-bit FAs to add some lower-order input bits approximately by replacing accurate FAs with approximate ones in the corresponding bit positions [13]-[21]. This improves area and power at the expense of computation accuracy.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In this paper, we propose a new approximate adder design based on new approximate FA cells, enhanced carry prediction, and constant truncation with error reduction. The proposed carry prediction scheme reduces the prediction error rate by up to 75% compared to the existing approximate adders considered here. Also, the truncation with error reduction logic enhances the overall computation accuracy while reducing energy and power consumption. When implemented in a 32-nm CMOS technology, the proposed adder is 1.49×, 1.90×, and 3.12× more area-, power-, and energy-efficient, respectively, than a traditional adder. Furthermore, compared to existing approximate adders, our adder improves overall computation accuracy by up to 98.9%. When the adders are analyzed jointly in terms of hardware and accuracy, the proposed design is the most competitive among the adders considered.
In summary, this paper makes the following key contributions in designing approximate adders:
• We present a novel efficient approximate adder design that effectively trades off hardware cost against computation accuracy through systematic analysis, and show that our design outperforms the others by extensively comparing it with 12 approximate adders.
• We propose 1) a new carry prediction scheme that reduces the prediction error rate by up to 75% compared to the others, 2) approximate FA cells that improve accuracy, and 3) a constant truncation with an error reduction scheme that reduces hardware cost while offering good accuracy performance.
The remainder of this paper is organized as follows. Section II provides a brief review of existing approximate adders. In Section III, we present the proposed adder, which consists of our proposed approximate FAs, novel carry prediction, and constant truncation with error reduction. Illustrative examples of the adder operation and a mathematical analysis of the carry prediction error rate and overall error rate are also provided. Then, Section IV explains the experimental results and systematic analysis of the proposed adder as well as an extensive comparison with the 12 existing approximate adders. A joint analysis of the adders in hardware and accuracy aspects is also presented. In Section V, the application of the approximate adders to digital image processing and machine learning is presented. Finally, Section VI concludes the work.
The lower-part OR adder (LOA) consists of two parts: an accurate part and an inaccurate part [13]. The former uses a traditional precise adder, such as the ripple carry adder (RCA) or carry-lookahead adder (CLA), to calculate the most significant bits (MSBs) with no computation error, whereas the latter only uses an OR operation to approximately obtain the sums of the least significant bits (LSBs). Furthermore, the output of an AND operation on the MSB input pair of the inaccurate part is utilized as a carry input to the accurate part to improve overall computation accuracy. Design variants have been proposed to further optimize the LOA, such as the LOA without the AND-based carry prediction (LOAWA), optimized lower-part constant OR adder (OLOCA), hardware optimized and error reduced approximate adder (HOERAA), hardware optimized adder having a near-normal error distribution (HOAANED), and hybrid error reduction lower-part OR adder (HERLOA) [14]-[18]. The LOAWA is identical to the LOA, except that it omits the AND-based carry prediction [14]. In other words, the carry input to the accurate part is fixed to a constant "0," which degrades accuracy but improves the computation speed. The OLOCA is also similar to the LOA in that the OR operation is utilized for the inaccurate part approximation, but it outputs a constant "1" to a few LSBs regardless of the corresponding bit inputs [15]. This also degrades accuracy slightly while reducing hardware cost. Like the OLOCA, the HOERAA uses the OR operation for two MSB input pairs of the inaccurate part and sets the remaining LSB outputs to a constant "1" regardless of the inputs [16]. For the MSB output of the inaccurate part, it uses a 2-to-1 multiplexer to select "0" or an OR operation output of the corresponding input pairs. The multiplexer output is then used in an OR operation with the AND gate output of the second MSB input pair of the inaccurate part.
Also, it includes an AND-based carry prediction for the accurate part, which also serves as the selection input of the multiplexer. The HOAANED is derived from the HOERAA by including one additional OR gate at the MSB of the inaccurate part [17]. This OR gate contributes to the improvement of an error metric, and thus, the HOAANED produces outputs with almost normal error distributions. To enhance overall computation accuracy, the HERLOA combines the basic LOA structure with the hybrid error reduction scheme [18]. Figure 1 shows the architecture of the inaccurate part of the HERLOA. Note that the accurate part is the same as the LOA (i.e., precise adder). When the second MSB input pair of the inaccurate part is both ''1,'' error reduction logic decreases the error distance (ED) by investigating the MSB input pair. The grayed gates in Figure 1 are the hybrid error reduction logic, while the others are the LOA logic. The error rate is reduced by replacing an OR gate at the MSB in the LOA with an XOR gate in the HERLOA.
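As a rough behavioral illustration of the LOA and LOAWA descriptions above, the following Python sketch models the two-part split at the bit level. The function name `loa_add`, its parameters, and the choice to keep the carry-out of the precise adder as bit n of the result are illustrative assumptions rather than details taken from [13] or [14].

```python
def loa_add(a: int, b: int, n: int = 16, k: int = 8,
            predict_carry: bool = True) -> int:
    """Behavioral sketch of a lower-part OR adder (LOA).

    n is the operand width and k the width of the accurate (MSB) part.
    predict_carry=False fixes the carry input to 0, as in the LOAWA.
    Assumption: the carry-out of the precise adder is kept as bit n so
    that the result can be compared directly with a + b.
    """
    m = n - k                          # width of the inaccurate (LSB) part
    low_mask = (1 << m) - 1
    a_lo, b_lo = a & low_mask, b & low_mask
    a_hi, b_hi = a >> m, b >> m

    # Inaccurate part: a bitwise OR replaces the true addition.
    s_lo = a_lo | b_lo

    # AND of the MSB input pair of the inaccurate part predicts the carry.
    c_in = 0
    if predict_carry:
        c_in = (a_lo >> (m - 1)) & (b_lo >> (m - 1)) & 1

    # Accurate part: exact addition of the MSBs plus the predicted carry.
    s_hi = a_hi + b_hi + c_in
    return (s_hi << m) | s_lo


if __name__ == "__main__":
    a, b = 0x12F0, 0x0390
    for name, flag in (("LOA", True), ("LOAWA", False)):
        approx = loa_add(a, b, predict_carry=flag)
        print(name, hex(approx), "exact", hex(a + b), "ED", abs(approx - (a + b)))
```

For operands whose lower parts never have a "1" in the same bit position, the OR result equals the true sum, which is why the model returns the exact value in that case.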
The error-tolerant adder type I (ETAI), like the LOA, divides an adder into two parts [19]. The inaccurate part of the ETAI utilizes its own modified XOR operation instead of the traditional OR operation. Note that the ETAI also uses a precise adder for the MSB additions. Furthermore, unlike the LOA, the ETAI provides no carry prediction for the accurate part. The lack of prediction reduces accuracy while improving the speed. The carry predicting ETA (CPETA) was presented to improve computation accuracy by adding a carry prediction scheme to the accurate part [20]. The CPETA adopts the same AND-based carry prediction as the LOA. Figure 2 shows a block diagram of the simplified ETA (SETA), which optimizes the ETAI's modified XOR operation to reduce hardware costs without a significant accuracy loss [21]. The modified XOR operation of the ETAI scans the input pairs of the inaccurate part from the MSB toward the LSB to detect a pair whose bits are both "1," whereas the SETA only tests one specific input pair for this condition. This reduces hardware costs compared to the ETAI without significant accuracy degradation.
Unlike the LOA, ETAI, and their variants, which split an adder into two parts, other approximate adder structures have also been presented in the literature. The reconfigurable approximate CLA (RAP-CLA) comprises several small blocks (i.e., windows) that overlap one another to predict the carry of each bit position [22]. In other words, each carry is speculated by a sub-block to reduce the critical path delay by cutting the long carry propagation path. The block-based carry speculative approximate adder (BCSA) includes a number of non-overlapping blocks, each of which consists of a sub-adder, a carry predict unit, a select unit, and a multiplexer [23]. Each block's carry is predicted by either the carry predict unit or the sub-adder and is selected by the select unit. Additionally, the BCSA with its own error recover unit (BCSA ERU) can improve the accuracy of the original BCSA without increasing the delay when an error occurs under a certain condition.

III. PROPOSED APPROXIMATE ADDER
This section presents our proposed FA cell-based approximate adder, which exploits a novel carry prediction scheme and a constant truncation technique to reduce the ED and improve overall computation accuracy. We call our adder the error reduced carry prediction approximate adder (ERCPAA). We denote the two n-bit input operands and the n-bit output of the adder as $A_{n-1:0}$, $B_{n-1:0}$, and $S_{n-1:0}$, respectively. Also, $A_i$, $B_i$, and $S_i$ represent the $i$-th bits of $A_{n-1:0}$, $B_{n-1:0}$, and $S_{n-1:0}$, respectively, counted from the LSB.
A. OVERALL ADDER ARCHITECTURE
Figure 3 shows the overall hardware architecture of the proposed approximate adder with n-bit inputs. An n-bit adder is divided into two parts: a k-bit accurate part and an (n − k)-bit inaccurate part, where k < n. The accurate part simply consists of a k-bit precise adder that produces an accurate output (i.e., $S_{n-1:n-k}$) from the k MSB inputs (i.e., $A_{n-1:n-k}$ and $B_{n-1:n-k}$) and a carry input (i.e., $C_{in}$). The inaccurate part uses some of the remaining LSB inputs to generate an approximate output and a carry input to the precise adder. Note that the sizes of the accurate part and inaccurate part do not have to be equal, and the precise adder can be implemented as any type of traditional adder, such as an RCA or a CLA. The inaccurate part is further divided into three parts: an array of the proposed approximate FA cells, a carry prediction logic, and a constant truncation with error reduction logic. The proposed FA cell (see blue-highlighted box) simplifies the conventional single-bit FA cell to produce an approximate sum and an approximate carry, and is placed in some higher-order bit positions of the inaccurate part. The carry prediction logic, which is highlighted in green, generates the carry input to the precise adder. While most FA-based approximate adders employ an AND operation on the MSB inputs of the inaccurate part to produce the carry input, our prediction logic leverages the inputs of the two MSB positions to improve carry prediction accuracy at the cost of two additional logic gates. The constant truncation with error reduction logic, highlighted in red, sets the l LSB outputs (i.e., $S_{l-1:0}$) to either a constant "0" or "1," depending on input conditions, to reduce hardware costs. In other words, the l LSB inputs are not used to generate approximate sums. It also assigns the other output bits of the inaccurate part, except for its MSB, to a constant "0" to reduce the ED under certain input conditions. We describe the condition that determines whether the output bits are fixed to "0" or "1," with illustrative examples, in Section III-D.

B. PROPOSED APPROXIMATE FULL ADDER
An FA is the key building block for carry propagate adders (e.g., RCA). The traditional 1-bit FA adds two inputs, $A_i$ and $B_i$, as well as a carry from the previous bit position $C_{i-1}$, and produces a sum $S_i$ and a carry output $C_i$ using

$$S_i = A_i \oplus B_i \oplus C_{i-1} \tag{1}$$

$$C_i = A_i B_i \lor (A_i \oplus B_i)\, C_{i-1} \tag{2}$$

where $\oplus$, $\lor$, and juxtaposition denote the XOR, OR, and AND operations, respectively. Although the FA requires two XOR gates to generate a sum, we replace the XOR gates with OR counterparts to do the same approximately and reduce hardware overhead in our approximate FA. In addition, the FA generates a carry output $C_i$ from not only the two inputs, $A_i$ and $B_i$, but also the carry from the previous bit position $C_{i-1}$. In other words, the carry of the previous bit position can be propagated to the next bit position through the current FA, resulting in a long critical path delay and degraded hardware performance in carry propagate adders. To reduce the critical path delay and hardware overhead, we remove the dependency on the carry from the previous bit position when generating the carry output in our FA. Thus, the Boolean equations of our approximate FA are given by

$$S_i = A_i \lor B_i \lor C_{i-1} \tag{3}$$

$$C_i = A_i B_i \tag{4}$$

Consequently, the approximate part using the proposed FA cell does not form a carry propagation chain from the lower-order to the higher-order bit positions, and thus the delay of the approximate part remains constant even when the approximate part becomes larger (i.e., k decreases under a given n). Note that the MSB position of the inaccurate part has a different configuration of the FA, which uses an XOR gate instead of the OR gate to generate the sum of the two input operands $A_i$ and $B_i$. This improves the overall computation accuracy since the XOR-based FA gives a more accurate sum, and it also allows the carry prediction logic to produce a more accurate carry input to the precise adder than the OR-based FA. Table 1 depicts the truth table of the   sum $S_i$ when both operands are "1" and the carry of the previous bit position is "1".
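To make the difference between the cells concrete, the sketch below enumerates all eight input combinations of a conventional FA, the OR-based approximate FA, and the XOR-based variant placed at the MSB of the inaccurate part. It is an illustration under two assumptions: the XOR-based sum is read from the text as (A XOR B) OR C_prev, and the previous carry is treated as a free input here, whereas in the adder it is the approximate carry A AND B of the preceding cell. All function names are hypothetical.

```python
from itertools import product


def exact_fa(a, b, c_prev):
    """Conventional 1-bit full adder: returns (sum, carry_out)."""
    return a ^ b ^ c_prev, (a & b) | ((a ^ b) & c_prev)


def or_fa(a, b, c_prev):
    """OR-based approximate FA: both XOR gates replaced by OR gates,
    carry_out no longer depends on the previous carry."""
    return a | b | c_prev, a & b


def xor_fa(a, b, c_prev):
    """Assumed XOR-based variant used at the MSB of the inaccurate part."""
    return (a ^ b) | c_prev, a & b


def sum_mismatches(cell):
    """Input triples (a, b, c_prev) for which the cell's sum bit is wrong."""
    return [x for x in product((0, 1), repeat=3)
            if cell(*x)[0] != exact_fa(*x)[0]]


if __name__ == "__main__":
    print("a b c | exact   OR-FA   XOR-FA   (sum, carry)")
    for a, b, c in product((0, 1), repeat=3):
        print(a, b, c, "|", exact_fa(a, b, c), or_fa(a, b, c), xor_fa(a, b, c))
    print("OR-FA sum errors :", sum_mismatches(or_fa))
    print("XOR-FA sum errors:", sum_mismatches(xor_fa))
```

Under these assumptions the XOR-based cell removes the sum error for the input (1, 1, 0), which is consistent with the statement that the XOR-based FA gives a more accurate sum than the OR-based one.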

C. PROPOSED CARRY PREDICTION TECHNIQUE
The accurate part can take a carry input generated from the inaccurate part to improve overall computation accuracy at the expense of a few logic gates [13], [15]-[18], [20]. The AND-based carry prediction scheme, which has an error rate of approximately 25%, is widely adopted since it is easily implemented by performing an AND operation on the inaccurate part's MSB inputs (i.e., $A_{n-k-1}$ AND $B_{n-k-1}$) to produce the carry input to the precise adder. In our proposed prediction, only two additional gates (i.e., an AND gate and an OR gate in the green-highlighted box in Figure 3) are utilized to produce the carry input with roughly half the prediction error rate of the conventional AND-based one. The inputs of the MSB and its previous bit position of the inaccurate part (i.e., $A_{n-k-1:n-k-2}$ and $B_{n-k-1:n-k-2}$) are exploited to predict the carry input. Let $P_i$ denote the propagate signal of the $i$-th bit position; the carry from the previous bit position $C_{i-1}$ is propagated to the carry output $C_i$ if the propagate signal is "1," defined as

$$P_i = A_i \oplus B_i \tag{5}$$

Since our carry prediction scheme leverages the inputs of two bit positions, a carry can be generated from either the $(n-k-1)$-th or the $(n-k-2)$-th bit position. If a carry is produced in the $(n-k-1)$-th bit position, the carry input $C_{in}$ is simply $C_{n-k-1}$. On the other hand, if a carry is generated in the $(n-k-2)$-th bit position, the carry $C_{n-k-2}$ must be propagated through the $(n-k-1)$-th bit position to reach the accurate part. Therefore, the carry input $C_{in}$ is derived by

$$C_{in} = C_{n-k-1} \lor P_{n-k-1} C_{n-k-2} \tag{6}$$

$$C_{in} = A_{n-k-1} B_{n-k-1} \lor (A_{n-k-1} \oplus B_{n-k-1})\, A_{n-k-2} B_{n-k-2} \tag{7}$$

where $C_i$ is defined in (4). According to Equation (7), one XOR, three AND, and one OR gates are required to generate the carry input $C_{in}$. $C_{n-k-1}$ and $C_{n-k-2}$ can be obtained from the proposed FAs in the corresponding bit positions, and $P_{n-k-1}$ can also be calculated using the XOR gate of the FA in the MSB position of the inaccurate part. It is worth noting that one of the reasons to replace the OR with an XOR in the FA at the MSB is to generate the $P_{n-k-1}$ signal. Therefore, we only need two additional gates (see green box in Figure 3) to implement the proposed carry prediction logic.
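Since our carry prediction is performed using the inputs of the two MSB positions, it is correct when a carry is generated from either of these two bit positions. However, the carry prediction is incorrect when a carry is produced from any lower-order bit position below the $(n-k-2)$-th bit position and this carry is propagated through the $(n-k-2)$-th and $(n-k-1)$-th bit positions. Assuming that the two operands A and B are bitwise independent, the propagate signals and carries are also bitwise independent. We denote the event that a carry is generated from the $(n-k-3)$-th or any of its lower-order bit positions and reaches the $(n-k-2)$-th bit position by $E_{ca}$:

$$E_{ca} = C_{n-k-3} \lor P_{n-k-3} C_{n-k-4} \lor \cdots \lor \left(\prod_{i=1}^{n-k-3} P_i\right) C_0 \tag{8}$$

where $C_i$ and $P_i$ are defined in (4) and (5), respectively, and the probability of this event is given by

$$\Pr(E_{ca}) = \sum_{j=0}^{n-k-3} \frac{1}{4}\left(\frac{1}{2}\right)^{j} = \frac{1}{2}\left(1 - \frac{1}{2^{\,n-k-2}}\right) \tag{9}$$

since $\Pr(C_i) = 1/4$, $\Pr(P_i) = 1/2$, and the terms $C_{n-k-3}$, $P_{n-k-3} C_{n-k-4}$, ..., $\left(\prod_{i=1}^{n-k-3} P_i\right) C_0$ are mutually exclusive. Note that $P_{n-k-1}$, $P_{n-k-2}$, and $E_{ca}$ are independent. Therefore, the error rate of the carry prediction of the proposed adder $ER_{CP}$ is given by

$$ER_{CP} = \Pr(P_{n-k-1})\,\Pr(P_{n-k-2})\,\Pr(E_{ca}) = \frac{1}{8}\left(1 - \frac{1}{2^{\,n-k-2}}\right) \tag{10}$$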
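The carry prediction error rate can be checked numerically. The sketch below exhaustively enumerates all input pairs of an 8-bit inaccurate part (n − k = 8) and compares three predictors against the true carry out of that part: no prediction, the AND-based prediction, and the two-position prediction described above. The closed form used for comparison, (1/8)(1 − 2^−(n−k−2)), follows from the stated independence assumptions with generate probability 1/4 and propagate probability 1/2; function names and the parameter `m` (the inaccurate-part width) are illustrative.

```python
def no_prediction(a_lo: int, b_lo: int, m: int) -> int:
    """Carry input fixed to 0 (as in adders without carry prediction)."""
    return 0


def and_prediction(a_lo: int, b_lo: int, m: int) -> int:
    """AND of the MSB input pair of the inaccurate part."""
    return (a_lo >> (m - 1)) & (b_lo >> (m - 1)) & 1


def two_bit_prediction(a_lo: int, b_lo: int, m: int) -> int:
    """C_in = G[m-1] OR (P[m-1] AND G[m-2]), using the two MSB input pairs."""
    g1 = (a_lo >> (m - 1)) & (b_lo >> (m - 1)) & 1
    p1 = ((a_lo >> (m - 1)) ^ (b_lo >> (m - 1))) & 1
    g2 = (a_lo >> (m - 2)) & (b_lo >> (m - 2)) & 1
    return g1 | (p1 & g2)


def exhaustive_error_rate(predict, m: int) -> float:
    """Fraction of m-bit input pairs whose predicted carry is wrong."""
    wrong = 0
    for a_lo in range(1 << m):
        for b_lo in range(1 << m):
            true_carry = (a_lo + b_lo) >> m   # exact carry out of the lower part
            wrong += predict(a_lo, b_lo, m) != true_carry
    return wrong / (1 << (2 * m))


def closed_form_error_rate(m: int) -> float:
    """(1/8) * (1 - 2^-(m-2)) for an m-bit inaccurate part."""
    return (1.0 - 2.0 ** -(m - 2)) / 8.0


if __name__ == "__main__":
    m = 8  # n = 16, k = 8
    for name, fn in (("none", no_prediction), ("AND", and_prediction),
                     ("two-bit", two_bit_prediction)):
        print(f"{name:8s} {100 * exhaustive_error_rate(fn, m):.3f}%")
    print(f"closed form {100 * closed_form_error_rate(m):.3f}%")
```

For m = 8 the enumeration and the closed form agree at 63/512, about 12.305%, while the AND-based and no-prediction cases come out near 25% and 50%.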

D. CONSTANT TRUNCATION WITH ERROR REDUCTION
The proposed adder outputs a constant to a few LSBs to reduce hardware overhead while sacrificing overall accuracy only slightly, since the lower-order outputs have relatively less impact on the accuracy than the higher-order outputs [15]-[17], [33]-[35]. Figure 4 exhibits an example of constant truncation operations with error reduction using the adder design parameters n = 16, k = 8, and l = 4. As shown in Figure 4(a), our adder sets the l LSB outputs to "1" regardless of the inputs of the corresponding bit positions. When a carry is generated from the $(n-k-2)$-th bit position and then propagated through the $(n-k-1)$-th bit position, our adder performs error reduction. In short, the reduction is performed when $P_{n-k-1} C_{n-k-2} = 1$. Under this input condition, the correct output of the $(n-k-1)$-th bit is "0"; however, our FA produces "1" as the output at this bit position, as shown in Figure 4(b). This means that the approximate sum will be larger than the correct one in this case. Hence, instead of forcing the l LSB outputs to "1," the proposed adder sets all outputs of the inaccurate part except for its MSB position to "0" (i.e., $S_{n-k-2:0} = 0$), making the approximate output closer to the correct sum. Under the input shown in Figure 4(b), the ED, defined by $|S_{approximate} - S_{correct}|$, where $S_{approximate}$ and $S_{correct}$ are the approximate and correct sums, respectively, decreases from 211 to 84. This reduction technique allows up to a $2^{n-k-1} - 1$ decrease in the ED.
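The interplay of the FA cells, carry prediction, constant truncation, and error reduction can be sketched as a bit-level Python model. This is a reading of the prose only, since Figures 3 and 4 are not reproduced here, and it rests on several assumptions: the lowest OR-based cell receives no carry from the truncated bits, the MSB sum of the inaccurate part is (A XOR B) OR C_prev and is kept during error reduction, and the carry-out of the precise adder is retained as bit n. The name `ercpaa_add` and the `error_reduction` switch are hypothetical.

```python
def ercpaa_add(a: int, b: int, n: int = 16, k: int = 8, l: int = 4,
               error_reduction: bool = True) -> int:
    """Behavioral sketch of the proposed adder under the stated assumptions."""
    m = n - k                 # width of the inaccurate part
    top = m - 1               # MSB index of the inaccurate part
    if not 1 <= l <= m - 2:
        raise ValueError("l must satisfy 1 <= l <= n - k - 2")

    def bit(x: int, i: int) -> int:
        return (x >> i) & 1

    def gen(i: int) -> int:
        return bit(a, i) & bit(b, i)          # approximate carry C_i = A_i B_i

    # Carry prediction from the two MSB input pairs of the inaccurate part.
    p_top = bit(a, top) ^ bit(b, top)
    c_in = gen(top) | (p_top & gen(top - 1))

    # Constant truncation: the l LSB outputs are forced to 1.
    low = (1 << l) - 1
    # OR-based FA cells at positions l .. top-1.
    for i in range(l, top):
        c_prev = gen(i - 1) if i > l else 0   # assumption: no carry from truncated bits
        low |= (bit(a, i) | bit(b, i) | c_prev) << i

    # Error reduction: clear everything below the MSB of the inaccurate part.
    if error_reduction and (p_top & gen(top - 1)):
        low = 0

    # XOR-based FA at the MSB of the inaccurate part.
    low |= (p_top | gen(top - 1)) << top

    high = (a >> m) + (b >> m) + c_in         # precise adder, carry-out kept
    return (high << m) | low


if __name__ == "__main__":
    a, b = 0x00C5, 0x0046                     # triggers P[top] * C[top-1] = 1
    exact = a + b
    for flag in (False, True):
        s = ercpaa_add(a, b, error_reduction=flag)
        print(f"error_reduction={flag}: {s:#06x}, ED = {abs(s - exact)}")
```

With these hypothetical operands (not the ones in Figure 4), the model yields an ED of 196 without error reduction and 117 with it, illustrating the direction of the effect described above.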

E. ERROR RATE ANALYSIS
The proposed adder generates an output error when the two input bits $A_i$ and $B_i$ of any bit position from the $(n-k-2)$-th to the $l$-th are both "1." In other words, if the inputs of at least one OR-based FA are both "1," an error occurs. According to Table 1, the OR-based FA produces an incorrect sum $S_i$ when $A_i = 1$ and $B_i = 1$, whereas the XOR-based counterpart does not. This input condition also generates a carry for the next bit position, which results in an output error at the sum of the next bit position. Furthermore, an error occurs when the two inputs of any bit position in the constant truncation part (i.e., the $(l-1)$-th to the 0-th bit positions) are both "0" or both "1" because this part fixes the output to "1." In other words, the constant truncation part output is correct only when exactly one of the two inputs is "1" at every truncated bit position. To simplify the error rate analysis, we first calculate the probability of the input condition that makes the output of the adder correct, and then obtain the error rate as its complement. Since the proposed adder produces correct outputs when $A_i B_i = 0$ for $l \le i \le n-k-2$ and $A_i \ne B_i$ for $0 \le i \le l-1$, we can define the event $E_{co}$ that the adder generates a correct output as follows:

$$E_{co} = \left(\prod_{i=l}^{n-k-2} \overline{A_i B_i}\right)\left(\prod_{i=0}^{l-1} (A_i \oplus B_i)\right) \tag{11}$$

and the probability of this event is given by

$$\Pr(E_{co}) = \left(\frac{3}{4}\right)^{n-k-l-1}\left(\frac{1}{2}\right)^{l} \tag{12}$$

Note that we assumed that the two input operands A and B are bitwise independent. The error rate of the proposed adder is the probability of the complement of the event. Therefore, the error rate $ER_{ERCPAA}$ is given by

$$ER_{ERCPAA} = 1 - \Pr(E_{co}) = 1 - \left(\frac{3}{4}\right)^{n-k-l-1}\left(\frac{1}{2}\right)^{l} \tag{13}$$
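As a quick numerical check of this analysis, the sketch below evaluates the error rate as the complement of the correct-output probability, with each of the n − k − l − 1 OR-based positions contributing a factor 3/4 (inputs not both "1") and each of the l truncated positions a factor 1/2 (inputs differ). It also counts the qualifying input pairs by enumeration. The closed form is derived here from the stated bitwise-independence assumption, and the function names are illustrative.

```python
def ercpaa_error_rate(n: int, k: int, l: int) -> float:
    """1 - Pr(E_co) for uniform, bitwise-independent operands."""
    or_cells = n - k - 1 - l          # OR-based FA positions l .. n-k-2
    return 1.0 - (0.75 ** or_cells) * (0.5 ** l)


def count_correct_event(n: int, k: int, l: int) -> int:
    """Count lower-part input pairs that satisfy the correct-output event."""
    m = n - k
    hits = 0
    for a in range(1 << m):
        for b in range(1 << m):
            # Truncated positions 0 .. l-1: the two input bits must differ.
            trunc_ok = all(((a ^ b) >> i) & 1 for i in range(l))
            # OR-based positions l .. m-2: the inputs must not both be 1.
            or_ok = not any(((a & b) >> i) & 1 for i in range(l, m - 1))
            hits += trunc_ok and or_ok
    return hits


if __name__ == "__main__":
    n, k = 16, 8
    for l in range(1, 7):
        print(f"l = {l}: error rate = {100 * ercpaa_error_rate(n, k, l):.2f}%")
    total = 1 << (2 * (n - k))
    print("enumerated, l = 5:", 1 - count_correct_event(n, k, 5) / total)
```

The enumeration only covers the inaccurate part because, in this analysis, the accurate part adds no error of its own once its carry input is correct.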

IV. EXPERIMENTAL RESULTS AND COMPARISON
In this section, we evaluate the performance of the proposed adder in terms of both hardware costs and computation accuracy through systematic analysis. Also, an extensive comparison with other existing approximate adders is presented to demonstrate the potential benefits of the proposed adder.

A. EXPERIMENT SETUP AND EVALUATION
We designed our adder in Verilog HDL and synthesized it with the Synopsys 32-nm generic library (SAED32) using Synopsys Design Compiler to examine the hardware characteristics of the proposed approximate adder in terms of area, delay, power, and energy [36]-[38]. We implemented a 16-bit adder using an 8-bit RCA-based precise adder (i.e., n = 16 and k = 8). Prior studies suggested that a size of 7 to 9 bits for the inaccurate part would be appropriate to obtain a good tradeoff between output quality and power and energy saving for practical applications, such as video and image processing, and a 16-bit adder was widely adopted in these applications [7], [9], [23], [32], [39]. Therefore, we chose the adder design parameters of n = 16 and k = 8. In addition to the hardware cost, we also analyzed the error characteristics of the proposed adder by developing a software-based simulator. Exhaustively testing a 16-bit adder requires $2^{32}$ distinct input pairs, which is computationally very expensive. Therefore, we applied 10 million (i.e., $10^7$) input pairs, each drawn uniformly at random, to the proposed adder to obtain the error characteristics measured by various error metrics, such as the overall error rate, carry prediction error rate, mean error distance (MED), normalized mean error distance (NMED), and mean relative error distance (MRED).
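A software-based evaluation of this kind can be sketched in a few lines of Python. The code below is not the authors' simulator. It assumes the commonly used definitions ED = |approximate − exact|, MED as the mean ED, NMED as the MED divided by the largest exact sum, and MRED as the mean of ED / exact with samples whose exact sum is zero skipped. The LOA model, the function names, and the default sample count are illustrative choices.

```python
import random


def loa_add(a: int, b: int, n: int = 16, k: int = 8) -> int:
    """Compact LOA model used only as an example adder under test."""
    m = n - k
    mask = (1 << m) - 1
    c_in = (a >> (m - 1)) & (b >> (m - 1)) & 1
    return (((a >> m) + (b >> m) + c_in) << m) | ((a | b) & mask)


def error_metrics(approx_add, n: int = 16, samples: int = 100_000, seed: int = 0):
    """Monte Carlo estimate of error rate, MED, NMED, and MRED."""
    rng = random.Random(seed)
    max_out = 2 * ((1 << n) - 1)       # largest exact sum (assumed NMED normalizer)
    errors = 0
    ed_sum = 0
    red_sum, red_count = 0.0, 0
    for _ in range(samples):
        a, b = rng.getrandbits(n), rng.getrandbits(n)
        exact = a + b
        ed = abs(approx_add(a, b) - exact)
        errors += ed != 0
        ed_sum += ed
        if exact:                      # relative ED is undefined for exact == 0
            red_sum += ed / exact
            red_count += 1
    med = ed_sum / samples
    return {
        "error_rate": errors / samples,
        "MED": med,
        "NMED": med / max_out,
        "MRED": red_sum / max(red_count, 1),
    }


if __name__ == "__main__":
    for name, adder in (("exact", lambda a, b: a + b), ("LOA", loa_add)):
        print(name, error_metrics(adder, samples=20_000))
```

Fixing the seed makes runs repeatable. A larger sample count, such as the 10 million pairs used in the text, narrows the sampling noise at the cost of run time.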

B. TRADEOFF ANALYSIS OF THE PROPOSED ADDER
In our proposed design, the area, power, and energy performance degrade as the design parameter l decreases because a smaller l requires more logic gates to implement the adder. However, as l decreases, the overall computation accuracy performance improves. The power-NMED product was introduced to assess approximate adders considering the power and accuracy performance together [40]. Since this metric does not consider the area aspect, we can consider a new joint metric, the area-NMED product, to analyze the area and accuracy performance collectively. Similarly, the power-MRED and area-MRED products can be employed to jointly analyze the costs and accuracy.
To seek the best tradeoff between the hardware cost and accuracy of our adder, we adjusted design parameter l and obtained the power-NMED/MRED products and area-NMED/MRED products. It is noteworthy that the delay remains constant as l varies, and thus we exclude the delay from the tradeoff analysis. Figure 5 shows the tradeoff between the hardware costs and accuracy for the proposed adder with various values of l. We varied the design parameter l from 1 to 6 because our adder requires at least two FAs, at the $(n-k-1)$-th and $(n-k-2)$-th bit positions, to produce the carry input (i.e., l ≤ 6) and at least one constant truncation bit (i.e., l ≥ 1). As expected, the power and area improve and the NMED worsens as l increases. Specifically, the power dissipations at l = 1 and l = 6 are 40.6 µW and 35.3 µW, respectively, and the area occupations are 150.4 µm² and 126.7 µm², respectively. In contrast, the NMED degrades from 0.864 × 10⁻³ at l = 1 to 0.974 × 10⁻³ at l = 6. To show the tradeoffs clearly, the product values are normalized using the corresponding value of the adder with l = 1. Note that the lower the product value is, the better the tradeoff between the hardware costs and accuracy. According to the power-NMED and area-NMED products in Figure 5, the proposed adder has the best tradeoff performance at l = 5, which means a 5-bit constant truncation. From the perspective of the power-MRED and area-MRED products, however, the best tradeoff of the adder is found at l = 4. While the power-/area-MRED products at l = 4 and l = 5 are almost the same, the difference between the NMED counterparts at l = 4 and l = 5 is relatively larger. Therefore, we use our 16-bit adder design with a parameter of l = 5 for comparison with other adders (i.e., n = 16, k = 8, and l = 5).
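The normalization of the joint metrics can be illustrated with the two operating points quoted in the text (l = 1 and l = 6). The intermediate values plotted in Figure 5 are not given in the text, so they are not included. The helper name `normalized_product` and the dictionary layout are illustrative.

```python
def normalized_product(cost: float, nmed: float,
                       cost_ref: float, nmed_ref: float) -> float:
    """Cost-NMED product normalized by the reference design's product."""
    return (cost * nmed) / (cost_ref * nmed_ref)


# Values quoted in the text for n = 16, k = 8.
power_uw = {1: 40.6, 6: 35.3}          # power in microwatts
area_um2 = {1: 150.4, 6: 126.7}        # area in square micrometers
nmed = {1: 0.864e-3, 6: 0.974e-3}

power_nmed_l6 = normalized_product(power_uw[6], nmed[6], power_uw[1], nmed[1])
area_nmed_l6 = normalized_product(area_um2[6], nmed[6], area_um2[1], nmed[1])

if __name__ == "__main__":
    print(f"power-NMED at l = 6, normalized to l = 1: {power_nmed_l6:.4f}")
    print(f"area-NMED  at l = 6, normalized to l = 1: {area_nmed_l6:.4f}")
```

Both normalized products fall below 1 at l = 6, meaning the hardware savings outweigh the NMED degradation relative to l = 1. Locating the minimum over all l requires the remaining data points from Figure 5.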

C. ACCURACY OF THE PROPOSED ADDER WITH DIFFERENT PARAMETERS
To examine the accuracy performance of the proposed adder under different adder sizes and design parameters, we adjusted parameters k and l of the proposed 32-bit adder (i.e., n = 32). Table 2 lists the error rate, MED, and MRED of the proposed adder at various values of the parameters. Here, we configured the approximate part of the adder to have three non-constant bits according to the previous tradeoff analysis, and thus parameter l was set to n − k − 3. As the parameter k increases at a given n (in other words, as the size of the accurate part increases), the accuracy improves in terms of error rate, MED, and MRED, as expected. The error rate worsens drastically as k decreases and quickly reaches almost 100%. The MED and MRED values increase by more than 15× and 11×, respectively, when the parameter k decreases by 4 at the given n = 32.

D. PERFORMANCE COMPARISON WITH EXISTING ADDERS
We also designed nine existing approximate adders that have similar architectures (LOA, LOAWA, OLOCA, HOERAA, HOAANED, HERLOA, ETAI, CPETA, and SETA) and an accurate adder (RCA) to compare them with our adder. To be fair, the adders were synthesized with the same 32-nm library using Synopsys Design Compiler, and 16-bit adders with an 8-bit RCA-based precise adder were implemented. For the OLOCA and SETA, design parameters of l and i were set to 6 and 7, respectively [15], [21]. Also, the error metrics were extracted under 10 million uniformly generated random input pairs. Furthermore, three more approximate adders that employ different architectures (RAP-CLA, BCSA, and BCSA ERU ) were designed using the identical design methodology and included in the comparison for completeness. The 16-bit adders with an RCA-based sub-adder and its block size of 4 were used for these adders [22], [23].
First, to demonstrate the superiority of our carry prediction technique, we compare the carry prediction error rate of the approximate adders, as shown in Figure 6. Note that the RAP-CLA, BCSA, and BCSA ERU were excluded for this comparison because they do not have the carry to the precise adder due to a different architecture. The absence of the carry prediction in the LOAWA, ETAI, and SETA results in a nearly 50% carry prediction error. Note that the carry input to the precise adder is set to ''0'' in these adders. The LOA, OLOCA, HOERAA, HOAANED, HERLOA, and CPETA include the AND-based carry speculation to the precise adder, which reduces the error rate to approximately 25%. The proposed prediction scheme is the most accurate among all adders and has an error rate of 12.305%, which is identical to the one calculated using Equation (10). Furthermore, our adder achieves error rate reductions of 75.3% and 50.4% on average compared with the adders without carry prediction and with the AND-based carry prediction, respectively. Table 3 summarizes the hardware costs and accuracy performance of various adders. The RCA is the slowest adder because of the long carry propagation chain from the LSB to MSB. The longest delay results in the largest energy (i.e., power-delay product; PDP) consumption although the BCSA ERU dissipates the largest power. The RAP-CLA is the fastest thanks to the relatively shorter carry chain generated by the blocks but occupies the largest area because a significant number of blocks is required to predict the carry for each bit position. Although the RAP-CLA consumes the second largest power, its energy consumption is relatively small and similar to that of the LOA, ETAI, their variants, and the proposed adder ERCPAA. The BCSA and BCSA ERU have almost the same delay but BCSA ERU has slightly larger area, power, and energy consumption than the BCSA while it shows slightly better accuracy performance. 
The adders that fix some LSB outputs to "1" and have AND-based carry prediction (i.e., OLOCA, HOERAA, and HOAANED) show similar hardware characteristics. Furthermore, the HOERAA and HOAANED are almost the same in both hardware and accuracy performance. The lack of a carry prediction to the accurate part allows the corresponding adders (i.e., LOAWA, ETAI, and SETA) to be the fastest among the FA-based adders; however, it results in poor MED and NMED performance compared to other FA-based approximate adders. In terms of area and power, the proposed adder ERCPAA is comparable to the HERLOA. It has the longest delay among the approximate adders due to the proposed carry prediction scheme, which causes a relatively larger energy consumption, but it is still 3.12× more energy-efficient than the RCA. The LOA, LOAWA, ETAI, and SETA have the same error rate of 90.0%, but the LOA has at least 61% better MED and NMED performance than the others. The LOA variants that force a few LSB outputs to "1" (i.e., OLOCA, HOERAA, and HOAANED) have a worse error rate than the LOA, up to 99%, while maintaining a similar MED and NMED performance. The RAP-CLA, BCSA, and BCSA ERU achieve low error rates of less than 22% but relatively poor MED and NMED performance compared with the others. These adders can cause computation errors in the higher-order bit positions, whereas the errors of the FA-based approximate adders are concentrated in the lower-order bit positions (i.e., the approximate part). Although the BCSA ERU has the lowest error rate among the approximate adders, the proposed adder has the best MED and NMED performance. Specifically, the proposed adder shows 4.09× and 4.1× lower MED and NMED than the BCSA ERU, respectively.
For the MRED comparison, the MRED values were normalized using the corresponding values of the LOA. The MREDs of the LOAWA, OLOCA, ETAI, and SETA are slightly greater than that of the LOA.
In particular, the MRED values of the RAP-CLA, BCSA, and BCSA ERU far exceed those of the others, and their values were inserted outside the bars. Specifically, the RAP-CLA exhibits the worst MRED performance, which is 52.02× greater than the LOA. The HOERAA and HOAANED have almost identical MRED values, which are 27.5% less than the LOA on average. In terms of MRED performance, the proposed adder is comparable to the HERLOA and CPETA. Specifically, the proposed adder reduces the MRED by 41.2% and 98.9% compared to the LOA and RAP-CLA, respectively.
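To make the accuracy metrics discussed above concrete, the sketch below computes the error rate, MED, NMED, and MRED by exhaustive simulation. It is an illustration under stated assumptions rather than the evaluation flow used for Table 3: the metric definitions are the commonly used ones (ED as the absolute difference from the exact sum, NMED as MED divided by the maximum exact sum, MRED as the mean of ED divided by the exact sum) and may differ in detail from the equations of this paper. The adder is a textbook lower-part OR adder with AND-based carry speculation, and the small size n = 8, k = 4 is chosen only to keep the enumeration short.

```python
def loa_add(a, b, k):
    """Lower-part OR adder: OR the k low bits, add the high bits exactly."""
    low = (a | b) & ((1 << k) - 1)
    # AND-based carry speculation from the MSBs of the approximate part.
    carry_in = (a >> (k - 1)) & (b >> (k - 1)) & 1
    high = (a >> k) + (b >> k) + carry_in
    return (high << k) | low


def error_metrics(approx_add, n):
    """Exhaustively evaluate an n-bit approximate adder against a + b."""
    size = 1 << n
    max_sum = 2 * (size - 1)  # largest exact sum, used to normalize MED
    errors = 0
    ed_total = 0
    red_total = 0.0
    for a in range(size):
        for b in range(size):
            exact = a + b
            ed = abs(approx_add(a, b) - exact)
            errors += int(ed != 0)
            ed_total += ed
            if exact:  # skip 0 + 0 to avoid dividing by zero
                red_total += ed / exact
    pairs = size * size
    med = ed_total / pairs
    return {
        "ER": errors / pairs,
        "MED": med,
        "NMED": med / max_sum,
        "MRED": red_total / pairs,
    }


if __name__ == "__main__":
    print(error_metrics(lambda a, b: loa_add(a, b, 4), 8))
```

For this adder model an addition is exact only when no bit position of the approximate part has both operand bits set, so the error rate is 1 - (3/4)^k, which is about 90.0% for k = 8 and agrees with the LOA error rate quoted above.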

E. JOINT ANALYSIS BETWEEN HARDWARE AND ACCURACY OF APPROXIMATE ADDERS
The error rate is an important metric to assess the accuracy of approximate adders. Unfortunately, its usefulness in evaluating an adder is limited because it only considers the presence of an error and not the implication (e.g., distance/magnitude) of the error on the additions [40]. Hence, we adopted ED-based metrics, such as NMED and MRED, rather than the error rate to better represent the accuracy of the adders in the joint analysis. The power-NMED product is widely used to evaluate approximate adders in terms of power and accuracy jointly [40]. Similarly, the energy-NMED product was considered to analyze the energy aspect [18]. Unfortunately, neither of these two products includes the area or delay of approximate adders. The area-delay product (ADP) is a widely employed metric to evaluate hardware resources in terms of area and delay [15]. Therefore, we can consider a new joint metric, the ADP-NMED product, to analyze the tradeoff between area, delay, and accuracy. Figure 8 exhibits the power-NMED, energy-NMED, and ADP-NMED products for 13 approximate adders.
To compare these products effectively among the adders, they were normalized by the corresponding LOA values, and the values are annotated outside the bars. The proposed adder outperforms the other approximate adders in all these joint metrics. The RAP-CLA, BCSA, and BCSA ERU show very poor tradeoff performance, and their three product values far exceed those of the other adders because, although they are somewhat faster, they consume a larger area, power, and energy than the other adders, as shown in Table 3. They also exhibit relatively worse accuracy, which further deteriorates the tradeoff performance. Among the FA-based approximate adders, the ETAI has the worst tradeoff performance, and the other approximate adders that exclude the carry prediction (i.e., LOAWA and SETA) have values similar to those of the ETAI. Although the lack of carry prediction allows these adders to be relatively efficient in terms of area, delay, and power, the poor accuracy degrades the overall tradeoff performance so that their three product values are at least 50% higher than those of the LOA. The OLOCA and CPETA have similar power-NMED and energy-NMED products, which are slightly better than those of the LOA. In addition, the HOERAA and HOAANED are comparable in all three products because of their almost identical hardware architecture. The HERLOA has nearly the same power-NMED and energy-NMED products as the HOERAA and HOAANED. However, the larger area occupation stemming from the hybrid error reduction scheme results in a higher ADP-NMED product. In summary, our adder has the best tradeoff performance among the compared approximate adders. Specifically, the power-NMED, energy-NMED, and ADP-NMED products of the proposed adder are 95.7%, 91.1%, and 93.2% lower than those of the RAP-CLA, respectively.
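The joint metrics above are simple products of a hardware cost and NMED, normalized to a baseline adder. The sketch below shows the arithmetic only. The class and function names are hypothetical, energy is taken as the power-delay product and ADP as area times delay as defined in the text, and the numbers in the usage example are made-up placeholders, not values from Table 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdderStats:
    """Hardware and accuracy figures of one adder (units are arbitrary)."""
    power: float
    delay: float
    area: float
    nmed: float


def joint_products(s):
    """Return the power-NMED, energy-NMED, and ADP-NMED products."""
    energy = s.power * s.delay  # power-delay product (PDP)
    adp = s.area * s.delay      # area-delay product
    return {
        "power-NMED": s.power * s.nmed,
        "energy-NMED": energy * s.nmed,
        "ADP-NMED": adp * s.nmed,
    }


def normalize(products, baseline):
    """Divide each product by the same product of the baseline adder."""
    return {name: products[name] / baseline[name] for name in products}


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    baseline = AdderStats(power=2.0, delay=1.0, area=10.0, nmed=0.01)
    candidate = AdderStats(power=1.0, delay=2.0, area=5.0, nmed=0.005)
    print(normalize(joint_products(candidate), joint_products(baseline)))
```

A normalized value below 1 means the candidate offers a better tradeoff than the baseline for that product. The same helper applies to the MRED-based products by substituting MRED for NMED.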
Similar to the joint metrics using NMED, we can take into account the metrics using MRED as well. Figure 9 shows the power-MRED, energy-MRED, and ADP-MRED products for the approximate adders. The values annotated outside the bars were normalized by the corresponding LOA values, and the three products using MRED exhibit trends similar to those using NMED. The RAP-CLA, BCSA, and BCSA ERU have significantly larger product values than the others. Specifically, the power-MRED, energy-MRED, and ADP-MRED products of these adders are at least 30×, 23×, and 25× greater than those of the LOA, respectively. Among the FA-based adders, the LOAWA shows the worst performance in the products using MRED, and the values of the ETAI and SETA are close to those of the LOAWA. The LOA and OLOCA have similar tradeoff performance, with a gap of less than 7% between their product values. The proposed adder demonstrates excellent tradeoff performance and is comparable to the HOERAA, HOAANED, HERLOA, and CPETA in all three products. In particular, our adder reduces the power-MRED, energy-MRED, and ADP-MRED products by 99.4%, 98.9%, and 99.1%, respectively, compared to the RAP-CLA.
Finally, to evaluate the approximate adders in terms of the various hardware costs together with the accuracy performance (i.e., NMED), we define the following product as a figure of merit (FoM) for the approximate adders.
In (14), better energy efficiency, higher speed, and smaller area, together with good accuracy performance in terms of the error distance, result in a smaller value of the FoM. Figure 10 exhibits the FoM of the approximate adders, where the values were normalized by the corresponding value of the LOA. Note that the lower the FoM value is, the better the approximate adder performance. The LOA, HERLOA, and CPETA have similar FoM values, and so do the LOAWA, ETAI, and SETA. Unfortunately, the FoMs of the RAP-CLA, BCSA, and BCSA ERU are much greater than those of the FA-based approximate adders, and the numbers outside the bars indicate their FoM values. Specifically, the normalized FoM of these adders exceeds 10 because the poor NMED performance severely deteriorates the FoM, even though they are relatively faster than the others. The proposed adder ERCPAA is similar to, but shows better FoM performance than, the OLOCA, HOERAA, and HOAANED. The excellent design tradeoff between hardware costs and accuracy (i.e., NMED) allows our design to be the most competitive adder among the approximate adders considered here. In particular, the FoM of our adder is 93.05% smaller than that of the RAP-CLA.

V. APPLICATIONS OF APPROXIMATE ADDERS
Approximate adders can be utilized in many error-tolerant applications. To examine the effectiveness of the proposed adder in practical applications, we adopted our adder and existing approximate adders in a couple of applications and compared their performance.

A. DIGITAL IMAGE PROCESSING
First, the approximate adders were applied to digital image processing. Particularly, we considered Gaussian smoothing filtering, which is achieved by a 2-D convolution of an image and a Gaussian kernel, and used the peak signal-to-noise ratio (PSNR) to measure the output image quality. The following 5 × 5 Gaussian kernel G is used for filtering [39].
$$
G = \begin{bmatrix}
1 & 3 & 6 & 3 & 1 \\
3 & 15 & 25 & 15 & 3 \\
6 & 25 & 41 & 25 & 6 \\
3 & 15 & 25 & 15 & 3 \\
1 & 3 & 6 & 3 & 1
\end{bmatrix}
$$
For the Gaussian smoothing operation, the addition was performed using an accurate adder as well as the proposed and existing approximate adders, whereas multiplication and division were performed accurately. Additionally, since Gaussian smoothing filtering is useful to reduce image noise, we added zero-mean, Gaussian white noise with a variance of 0.01 to the original lena image, which is a grayscale image with a size of 512 × 512, and then performed filtering [41]. We employed an accurate adder (RCA), the proposed adder, and 12 existing approximate adders in the filtering. The PSNR values were calculated against the images obtained by applying Gaussian filtering to the original input image using the accurate adder. First, the approximate adders with design parameters of n = 16 and k = 8 were applied to the filtering, and we found that all approximate adders, except for the RAP-CLA, BCSA, and BCSA ERU whose block sizes were set to 4, produce visually very similar output images, although our adder generates the best image quality with the highest PSNR. Therefore, to make the output images more visually distinguishable, we reduced the size of the accurate part to 3 and halved the block size of the approximate adders. Figure 11 shows the original noisy image and output images of Gaussian smoothing filtering using various adders. The BCSA yields the worst PSNR value of 8.20 dB among the images. The output images processed by the LOAWA, ETAI, and SETA have an identical PSNR value of 8.25 dB. Similarly, the LOA and OLOCA generate the same output image quality. The PSNRs of images with the HOERAA, CPETA, and BCSA ERU range from 9.83 dB to 10.93 dB. In other words, the quality of the images processed by these adders is between those processed by the LOA/OLOCA and LOAWA/ETAI/SETA. The HOAANED, HERLOA, and RAP-CLA yield slightly better output images than the LOA/OLOCA. The proposed adder produces the best image quality, which is clearly perceptible to the human eye, with a PSNR value of 20.84 dB, meaning that the filtered image is the closest to the one generated by the accurate adder. This confirms that the approximation errors of the proposed adder have a negligible impact on the processing quality and thus, it is suitable for digital image processing applications.
FIGURE 12. Original data and clustered data with WCSS by k-means clustering using various adders.
To further examine the approximate adders in the application, we performed the Gaussian smoothing filtering for eight more well-known benchmark images (cameraman, peppers, baboon, F-16, couple, fishing boat, clock, and airplane) obtained from [42]. Note that the same white noise was added to these images. The PSNRs of the filtered output images generated by the approximate adders are listed in Table 4. All images exhibit a PSNR trend similar to that of the lena image. Evidently, our ERCPAA achieves the best PSNR value for all benchmark images among the approximate adders in the Gaussian smoothing filtering application.
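The filtering experiment above can be sketched in a few lines: only the accumulation of the convolution uses the adder under test, while multiplication and division stay exact. This is a minimal illustration and not the authors' setup. It assumes that the division normalizes by the kernel sum (253), that image borders are handled by edge replication, and that a lower-part OR adder with k = 8 stands in for the adder under test. A small synthetic image replaces the noisy benchmark images, and the function names are hypothetical.

```python
import math

# 5 x 5 Gaussian kernel quoted in the text.
G = [
    [1, 3, 6, 3, 1],
    [3, 15, 25, 15, 3],
    [6, 25, 41, 25, 6],
    [3, 15, 25, 15, 3],
    [1, 3, 6, 3, 1],
]
G_SUM = sum(sum(row) for row in G)  # assumed normalization factor


def exact_add(a, b):
    return a + b


def loa_add(a, b, k=8):
    """Stand-in approximate adder: OR the k low bits, add the rest exactly."""
    low = (a | b) & ((1 << k) - 1)
    carry_in = (a >> (k - 1)) & (b >> (k - 1)) & 1
    return (((a >> k) + (b >> k) + carry_in) << k) | low


def smooth(img, add):
    """Gaussian smoothing where every accumulation goes through `add`."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            acc = 0
            for dy in range(-2, 3):
                for dx in range(-2, 3):
                    yy = min(max(y + dy, 0), h - 1)  # replicate borders
                    xx = min(max(x + dx, 0), w - 1)
                    acc = add(acc, G[dy + 2][dx + 2] * img[yy][xx])
            row.append(min(255, acc // G_SUM))  # exact division, clip to 8 bits
        out.append(row)
    return out


def psnr(ref, test, peak=255):
    """PSNR in dB of `test` against the reference image `ref`."""
    n = len(ref) * len(ref[0])
    mse = sum((r - t) ** 2 for rr, tr in zip(ref, test) for r, t in zip(rr, tr)) / n
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    image = [[(7 * x + 13 * y) % 256 for x in range(16)] for y in range(16)]
    reference = smooth(image, exact_add)
    print(psnr(reference, smooth(image, loa_add)))
```

Passing the adder as a function keeps the filter identical across adders, so any PSNR difference comes only from the approximate additions.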

B. MACHINE LEARNING
In addition to the filtering application, we also took machine learning into consideration to explore the efficacy of the proposed adder. Specifically, we examined the performance of the approximate adders in k-means clustering, which is an unsupervised machine learning algorithm that is extensively utilized in data mining [43]. Basically, the algorithm groups a set of unlabeled data points into k different clusters such that each data point belongs to only one cluster. When clustering, it minimizes the sum of distances between the data points and the centroids of the corresponding clusters, which is defined by the within-cluster sum of squares (WCSS). Therefore, it iteratively calculates the distances, for which the subtraction operation is mainly used in this algorithm. We applied the approximate adders to the operation [28]. Note that the subtraction can be done by 2's complement addition. We obtained an unlabeled dataset comprising 1000 data points from [44] and set the number of clusters k to 5. Figure 12 demonstrates the visualized 2-D original dataset and clustered dataset using the accurate adder, the existing approximate adders, and the proposed adder. The WCSS values were extracted to evaluate the quality of the clustering results using the different adders [28]. A value closer to the one obtained by the accurate adder indicates a better clustering result. The LOAWA, ETAI, and SETA show a similar clustering result, and so do the LOA and OLOCA. The LOA/OLOCA produce much better clustering quality than the LOAWA/ETAI/SETA because the latter do not include any carry prediction logic to the precise adder, which degrades computation accuracy. The proposed approximate adder exhibits the best clustering result, closest to the one using the accurate adder. The HOERAA, HOAANED, HERLOA, and CPETA yield slightly worse results than the proposed adder.
Unfortunately, the RAP-CLA, BCSA, and BCSA ERU show poor clustering performance and do not allow the dataset to be partitioned properly. Specifically, the WCSS values of these adders are up to 384% and 378% larger than those of the accurate and proposed adders, respectively. In summary, the proposed adder has the best performance in terms of WCSS in k-means clustering as well.
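The subtraction by 2's complement addition mentioned above can be sketched as follows, together with one assignment step and the resulting WCSS. This is an assumption-laden illustration rather than the authors' flow: the function names, the 16-bit word length, the integer coordinates, and the choice to route only the final a + (-b) addition through the adder under test are all hypothetical, and the centroid update loop of k-means is omitted.

```python
def exact_add(a, b):
    return a + b


def sub_via_adder(a, b, add, n=16):
    """Compute a - b as a + (2's complement of b) using the adder `add`."""
    mask = (1 << n) - 1
    neg_b = (~b + 1) & mask          # 2's complement of b, formed exactly
    r = add(a & mask, neg_b) & mask  # addition under test, carry-out dropped
    # Reinterpret the n-bit result as a signed integer.
    return r - (1 << n) if r >> (n - 1) else r


def assign_and_wcss(points, centroids, add):
    """Assign each point to its nearest centroid and accumulate the WCSS."""
    labels = []
    wcss = 0
    for p in points:
        dists = [
            sum(sub_via_adder(pi, ci, add) ** 2 for pi, ci in zip(p, c))
            for c in centroids
        ]
        nearest = min(range(len(centroids)), key=dists.__getitem__)
        labels.append(nearest)
        wcss += dists[nearest]
    return labels, wcss


if __name__ == "__main__":
    pts = [(1, 1), (2, 1), (10, 10), (11, 12)]
    cents = [(1, 1), (10, 11)]
    print(assign_and_wcss(pts, cents, exact_add))
```

Swapping `exact_add` for an approximate adder model changes the computed differences, and the resulting WCSS can then be compared with the value from the accurate adder as described in the text.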

VI. CONCLUSION
In this paper, we presented a new approximate adder that combines error-reduced carry prediction and constant truncation with error reduction schemes. The proposed carry prediction scheme achieves an error rate reduction of up to 75% compared to the existing approximate adders, and the proposed error reduction technique improves the overall computation accuracy by decreasing the error distance. We systematically analyzed our design and sought the best tradeoff between hardware costs and accuracy by adjusting the adder design parameter. When implemented in the 32-nm CMOS technology, the proposed design has 1.90× and 3.12× greater power and energy efficiency, respectively, than the RCA, with NMED and MRED improvements of up to 91.4% and 98.9%, respectively, compared to the existing approximate adders. Importantly, our design achieves 95.7%, 91.1%, and 93.2% reductions in the power-NMED, energy-NMED, and ADP-NMED products, respectively, compared to the RAP-CLA due to an excellent design tradeoff. Our adder also reduces the power-, energy-, and ADP-MRED products by up to 99.4% compared to the others. Particularly, in terms of the FoM considering hardware resources (i.e., energy, delay, and area) and the accuracy performance (i.e., NMED), the proposed adder is up to 93.05% better than the RAP-CLA. The proposed adder has been adopted in a digital image processing application, where it barely affects the output image quality and produces the image closest to the one obtained with the accurate adder. Additionally, we have demonstrated the performance of our adder in a machine learning application, and the result has shown that the proposed adder outperforms the other approximate adders. Therefore, the proposed adder is well suited to energy-efficient and error-tolerant applications, such as machine learning, neuromorphic computing, and digital signal processing.

ACKNOWLEDGMENT
(Jungwon Lee and Hyoju Seo contributed equally to this work.)