Low Error Efficient Approximate Adders for FPGAs

In this paper, we propose a methodology for designing low error efficient approximate adders for FPGAs. The proposed methodology utilizes FPGA resources efficiently to reduce the error of approximate adders. We propose two approximate adders for FPGAs using our methodology: low error and area efficient approximate adder (LEADx), and area and power efficient approximate adder (APEx). Both approximate adders are composed of an accurate and an approximate part. The approximate parts of these adders are designed in a systematic way to minimize the mean square error (MSE). LEADx has lower MSE than the approximate adders in the literature. The 32-bit LEADx with 16-bit approximation has 20% lower MSE than the approximate adder with the lowest MSE in the literature. The 16-bit APEx with 8-bit approximation has the same area, 60% lower MSE, and 4.5% less power consumption on a Xilinx Virtex 7 FPGA than the smallest and lowest power consuming approximate adder in the literature. APEx has smaller area and lower power consumption than the other approximate adders in the literature. As a case study, the approximate adders are used in a video encoding application. LEADx provided better quality than the other approximate adders for the video encoding application. Therefore, our proposed approximate adders can be used for efficient FPGA implementations of error tolerant applications.


I. INTRODUCTION
Approximate computing trades off accuracy to improve the area, power, and speed of digital hardware. Many computationally intensive applications such as video encoding, video processing, and artificial intelligence are error resilient by nature due to the limitations of human visual perception or nonexistence of a golden answer for the given problem. Therefore, approximate computing can be used to improve the area, power, and speed of digital hardware implementations of these error tolerant applications.
A variety of approximate circuits, ranging from system level designs [1]-[4] to basic arithmetic circuits [5], have been proposed in the literature. Adders are used in most digital hardware, not only for binary addition but also for other binary arithmetic operations such as subtraction, multiplication, and division [6]-[8]. Therefore, many approximate adders have been proposed in the literature [9]-[24]. All approximate adders exploit the fact that the critical path in an adder is seldom used.
The associate editor coordinating the review of this manuscript and approving it for publication was Stavros Souravlas.
Approximate adders can be broadly classified into the following categories: segmented adders [10], which divide an n-bit adder into several r-bit adders operating in parallel; speculative adders [9], which predict the carry using only the few previous bits; and approximate full-adder based adders [12]-[17], which approximate the accurate full-adder at transistor or gate level. Segmented and speculative adders usually have higher speeds and larger areas than accurate adders [5], [13]. Approximate full-adder based approximate n-bit adders use an m-bit approximate adder in the least significant part (LSP) and an (n − m)-bit accurate adder in the most significant part (MSP), as shown in Fig. 1.
Most of the approximate adders in the literature have been designed for ASIC implementations. These approximate adders use gate or transistor level optimizations. Recent studies have shown that the approximate adders designed for ASIC implementations either do not yield the same area, power, and speed improvements when implemented on FPGAs or fail to utilize FPGA resources efficiently to improve the output quality [20], [26]. This is mainly due to the difference in the way logic functions are implemented in ASICs and FPGAs. The basic element of an ASIC implementation is a logic gate, whereas FPGAs use lookup tables (LUTs) to implement logic functions. Therefore, ASIC based optimization techniques cannot be directly mapped to FPGAs.
FPGAs are widely used to implement error-tolerant applications using addition and multiplication operations. The efficiency of FPGA-based implementations of these applications can be improved through approximate computing. Only a few FPGA specific approximate adders have been proposed in the literature [19]- [23]. These approximate adders focus on improving either the efficiency or accuracy. Therefore, the design of low error efficient approximate adders for FPGAs is an important research topic.
In this paper, we propose a methodology to reduce the error of approximate adders by efficiently utilizing FPGA resources, such as unused LUT inputs. We propose two approximate adders for FPGAs using our methodology based on the architecture shown in Fig. 1.
We propose a low error and area efficient approximate adder (LEADx) for FPGAs. It has lower mean square error (MSE) than the approximate adders in the literature. It achieves better quality than the other approximate adders for video encoding application.
We also propose an area and power efficient approximate adder (APEx) for FPGAs. Although its MSE is higher than that of LEADx, it is lower than that of the approximate adders in the literature. It has the same area, lower MSE and less power consumption than the smallest and lowest power consuming approximate adder in the literature. It has smaller area and lower power consumption than the other approximate adders in the literature.
We provide mathematical models to estimate the error rate (ER), MSE, and mean absolute error (MAE) of the proposed approximate adders. We compare the proposed approximate adders with the approximate adders in the literature.
The rest of the paper is organized as follows. Section II provides an overview of related works and the necessary background to understand the proposed approximate adders. Section III presents the proposed approximate adders and the mathematical models to compute their error metrics. Their error analyses and implementation results are given in Section IV. Section IV also presents the results of using approximate adders in video encoding as a case study. Finally, Section V concludes the paper.

A. RELATED WORKS
Bit truncation in the least significant bit positions is a well-known approximation technique. In the truncate adder, the output of the LSP is fixed to zero. Although the truncate adder provides significant improvements in speed, area, and power consumption, it has a high error rate and MSE [12], [13].
Lower-part-OR adder (LOA) is proposed in [14]. Its LSP consists of 2-input OR gates, whereas the MSP is accurate. A carry is sent to the MSP if it is generated at the most significant bit position of the LSP. An approximate adder, OLOCA, is proposed in [15] by optimizing the LOA architecture. OLOCA uses only two OR gates in the LSP to compute the two most significant sum bits. The rest of the LSP is approximated to a fixed value. An approximate adder with near-normal error distribution (HOAANED) is proposed in [16]. HOAANED has an architecture similar to that of OLOCA; however, it uses more resources to compute the two most significant sum bits of the LSP. Therefore, HOAANED has better quality than OLOCA at the expense of a slight increase in area.
Dutt et al. [17] proposed an approximate full adder based multibit adder (AFA). The sum of each bit of LSP is computed accurately whereas its respective carry out is equated to one of the inputs.
In recent years, a few approximate adders have been proposed specifically for FPGAs. A LUT-based approximate adder (LBA) is proposed in [19]. Both the LSP and the MSP perform accurate addition. A carry is passed to the MSP only if it is generated at the most significant bit (MSB) of the LSP. If any other carry that needs to be propagated to the MSP is detected, then all bits of the LSP are set to 1. LBA has high accuracy, but it does not provide a performance improvement compared to the accurate adder synthesized by the FPGA synthesis tool [20].
A methodology to design approximate adders (DeMAS) for FPGAs is presented in [20]. The methodology is based on an optimized truth table of approximate full-adder. Eight different variants of multibit approximate adder are presented using the optimized truth table. All these variants use same number of LUTs but differ in their error metrics.
Quaternary addition based approximate adder using the fast carry chains of FPGAs is presented in [21]. The accurate quaternary adder uses two carry inputs and generates two carry outputs. However, the authors in [21] proposed to use only one carry in the quaternary addition, hence generating an approximate result.
A single exact dual adder (SEDA) is proposed for FPGAs in [22]. The adder can either perform accurate addition of single n-bit input or approximate addition of two n-bit inputs. Carry of 2-bit addition is computed accurately, while the sum bits are equated to inverse of carry out.
High speed segmented approximate adders (xUAV) for FPGAs are proposed in [23]. Segmentation is done in 2, 3, or 5-bit groups for efficient mapping to LUTs. However, the proposed adders use more area and consume more power than accurate adder. These adders also have very large MAE and MSE as the size of adder is increased.

B. LENGTH OF CARRY
The key principle of approximate addition is to shorten the critical path of an adder by breaking the carry chain at one or multiple positions. This technique improves the speed of an adder at the expense of accuracy loss. In this section, we briefly explain the rationale for this technique.
The length of a carry signal in n-bit binary addition is defined as the number of bits it propagates before being killed or regenerated. For example, if a carry signal is generated at ith bit position and killed or regenerated at jth bit position (j > i), the length of that carry signal is defined as j − i bits.
In n-bit binary addition, the outgoing carry signal at any bit position i is determined by the current and previous input bits. Bit position i is said to generate a carry if both the input bits at ith position are 1, propagate the incoming carry if both the input bits at ith position are different, and kill the incoming carry if both the input bits at ith position are 0.
In the worst case, a carry signal is generated in the least significant bit (LSB) and propagated to the most significant bit (MSB). In this case, the length of carry signal is equal to the adder bit width. However, the worst case rarely happens, and the average length of a carry signal is usually much shorter than the adder bit width [9].
We implemented and simulated an n-bit accurate adder using $10^7$ independent random number pairs drawn from a uniformly distributed sample space between 0 and $2^n - 1$. Based on these simulation results, the probability of the length of a carry signal being equal to L bits is given in Table 1. As can be seen from this table, the length of a carry signal is rarely longer than 5 bits. The length of a carry signal is shorter than 5 bits with more than 90% probability.
Since the worst case of carry propagation (length of carry = n-bits) rarely happens, in most cases, the carry can be correctly predicted by considering only a few previous input bits.
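The carry-length definition above can be illustrated with a small functional sketch. The function names, the fixed seed, and the sample count are illustrative choices and are not taken from the paper; a carry that runs past the MSB is assumed to end at position n.

```python
import random

def carry_lengths(a, b, n):
    """Return the length of every carry generated when adding two n-bit values."""
    lengths = []
    for i in range(n):
        if (a >> i) & 1 and (b >> i) & 1:          # both bits are 1: generate at bit i
            j = i + 1
            # The carry keeps moving while the next position propagates (bits differ).
            while j < n and ((a >> j) ^ (b >> j)) & 1:
                j += 1
            # Killed or regenerated at j, or (assumption) it ran past the MSB at j == n.
            lengths.append(j - i)
    return lengths

def length_histogram(n, samples, seed=0):
    """Estimate the probability of each carry length from uniform random operands."""
    rng = random.Random(seed)
    counts, total = {}, 0
    for _ in range(samples):
        a, b = rng.getrandbits(n), rng.getrandbits(n)
        for length in carry_lengths(a, b, n):
            counts[length] = counts.get(length, 0) + 1
            total += 1
    return {length: c / total for length, c in sorted(counts.items())}

if __name__ == "__main__":
    for length, p in length_histogram(32, 100_000).items():
        print(f"L = {length:2d}: {p:.4f}")
```

Under this model, a generated carry has length L when the next L − 1 positions propagate and the following one does not, so short carries dominate for uniformly distributed inputs.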

C. XILINX VIRTEX FPGA
The main logic resource in a Xilinx Virtex FPGA is configurable logic blocks (CLBs) [27]. Each CLB contains two slices. Simplified architecture of a slice in Xilinx Virtex 7 FPGA is shown in Fig. 2. Each slice is composed of 4 such elements with carry-chain cascaded in series. Therefore, each slice has four 6-input LUTs. Each LUT can be used to implement two 5-input combinational logic functions or one 6-input combinational logic function. Furthermore, each slice also contains a 4-bit carry-chain and eight flip-flops.
An efficient FPGA-based implementation should be able to effectively utilize these resources. This is particularly important for implementing arithmetic functions that can utilize the fast carry-chains. Therefore, it is important to understand how the arithmetic operations are implemented on FPGAs. Particularly, we consider the mapping of a full adder to a Xilinx FPGA.
Typically, a full adder is implemented as shown in (1) and (2), where A and B represent the inputs, S is the sum, and $C_{IN}$ and $C_{OUT}$ are carry-in and carry-out, respectively.
However, when implementing a full adder on a Xilinx FPGA, the synthesis tool rewrites (2) as (3).
This simplification allows the reuse of the A ⊕ B logic function for computing both S and $C_{OUT}$ [28]. This term is used as an input to the XOR gate for sum computation and as the select input of the mux that selects the appropriate signal for $C_{OUT}$. However, since only a 4-bit carry chain is available in a slice, an n-bit adder uses n LUTs such that 4 inputs and one output of each LUT are not used. These unused resources can be utilized to implement additional logic with an adder without increasing area.
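Equations (1) to (3) are not reproduced in this text, so the following sketch uses the standard textbook full adder as the reference and a mux-based carry that follows the description above. The function names are illustrative, and the choice of A (rather than B) as the mux data input is an assumption; the two are equal whenever the mux selects them.

```python
from itertools import product

def full_adder_reference(a, b, c_in):
    """Textbook full adder: returns (S, C_OUT)."""
    total = a + b + c_in
    return total & 1, total >> 1

def full_adder_lut_carry_chain(a, b, c_in):
    """Full adder as mapped to a LUT plus carry chain: A xor B is computed once and reused."""
    p = a ^ b                  # propagate term produced by the LUT
    s = p ^ c_in               # XOR gate of the carry chain computes the sum
    c_out = c_in if p else a   # mux: P selects C_IN, otherwise an operand bit (A == B here)
    return s, c_out

if __name__ == "__main__":
    for a, b, c_in in product((0, 1), repeat=3):
        print(a, b, c_in, full_adder_lut_carry_chain(a, b, c_in))
```

Both forms agree on all eight input combinations, which is why the LUT only needs to produce the single term A ⊕ B for each bit.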

III. PROPOSED DESIGN METHODOLOGY
The proposed design methodology uses the approximate full-adder based n-bit adder architecture shown in Fig. 1. The n-bit addition is divided into an m-bit approximate adder in the LSP and an (n − m)-bit accurate adder in the MSP. Breaking the carry chain at bit position m generally introduces an error of $2^m$ in the final sum. The error rate and error magnitude can be reduced by predicting the carry-in to the MSP ($C_{MSP}$) more accurately and by modifying the logic function of the LSP to compensate for the error.
The carry to the accurate part can be predicted using any k-bit input pairs from the approximate part such that k ≤ m. Most of the existing approximate adders use k = 1.
As discussed in Section II, the FPGA implementation of an accurate adder uses only 2 inputs and 1 output of each 6-input LUT. We propose to utilize the remaining 4 available but unused inputs of the first LUT of the MSP to predict $C_{MSP}$. Therefore, we propose to share the most significant 2 bits of both inputs of the LSP with the MSP for carry prediction.
Sharing more bits of the LSP with the MSP will increase the probability of correctly predicting $C_{MSP}$, which will in turn reduce the error rate. However, this will also increase the area and delay of the approximate adder.
To analyze the tradeoff between the accuracy and performance of an FPGA-based approximate adder with different values of k, we performed synthesis and simulation experiments on a Xilinx Virtex 7 FPGA.
The results for a 64-bit adder with a 12-bit LSP using k bits to predict $C_{MSP}$ are shown in Table 2. For k > 2, the error rate reduces slightly at the cost of increased area and delay. On the other hand, for k < 2, the delay improves marginally at the cost of a significant increase in the error rate.
Therefore, we propose using k = 2, as it provides a good balance between accuracy and performance of approximate adders for FPGAs. In the proposed approximate adders, a carry is passed to the MSP if it is generated at bit position m − 1, or generated at bit position m − 2 and propagated at bit position m − 1. $C_{MSP}$ can be described by (4), that is, $C_{MSP} = G_{m-1} + P_{m-1} G_{m-2}$, where $G_i$ and $P_i$ are the generate and propagate signals of the ith bit position, respectively.
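The effect of k on the carry prediction can be explored with a small exhaustive sketch. It assumes that a k-bit predictor computes the carry-out of the k most significant LSP bit pairs with no carry-in, which for k = 2 matches the verbal description of (4); the function names are illustrative and not taken from the paper.

```python
def true_carry(a_lsp, b_lsp, m):
    """Exact carry out of the m-bit LSP."""
    return (a_lsp + b_lsp) >> m

def predicted_carry(a_lsp, b_lsp, m, k=2):
    """Carry predicted from the k most significant LSP bit pairs only (assumed model)."""
    shift = m - k
    return ((a_lsp >> shift) + (b_lsp >> shift)) >> k

def predicted_carry_eq4(a_lsp, b_lsp, m):
    """Generate/propagate form for k = 2: G[m-1] OR (P[m-1] AND G[m-2])."""
    g_hi = (a_lsp >> (m - 1)) & (b_lsp >> (m - 1)) & 1
    p_hi = ((a_lsp >> (m - 1)) ^ (b_lsp >> (m - 1))) & 1
    g_lo = (a_lsp >> (m - 2)) & (b_lsp >> (m - 2)) & 1
    return g_hi | (p_hi & g_lo)

def missed_carries(m, k):
    """Count LSP input pairs for which the k-bit prediction differs from the true carry."""
    size = 1 << m
    return sum(true_carry(a, b, m) != predicted_carry(a, b, m, k)
               for a in range(size) for b in range(size))

if __name__ == "__main__":
    m = 8
    for k in range(1, m + 1):
        print(f"k = {k}: misprediction rate = {missed_carries(m, k) / 4 ** m:.4f}")
```

In this model the misprediction rate roughly halves with each additional shared bit, while each extra bit also adds two inputs to the prediction logic, which is consistent with the diminishing return described above.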
The error in higher bit positions has more impact on the error magnitude of an approximate adder. As described in (4), the carry-in to the MSP is predicted using the two most significant bits of the LSP. These 2 bits effectively implement a 3-output function $\{C_{MSP}\ S_{m-1}\ S_{m-2}\}$. An error occurs in the n-bit addition if a carry ($C_{m-2}$) is generated at bit position i < (m − 2) and that carry should be propagated to the MSP. In this case, the correct result should be $\{C_{MSP}\ S_{m-1}\ S_{m-2}\}$ = 100. However, without any error reduction mechanism, the approximate result will be $\{C_{MSP}\ S_{m-1}\ S_{m-2}\}$ = 000.
To reduce the error magnitude, we propose a 2-bit approximate adder (AAd1) for computing $S_{m-1}$ and $S_{m-2}$. The functionality of AAd1 is described by (5) and (6). AAd1 is implemented using a single LUT as shown in Fig. 3. When $C_{m-2} = 1$, $P_{m-2} = 1$, and $P_{m-1} = 1$, the approximate result will be $\{C_{MSP}\ S_{m-1}\ S_{m-2}\}$ = 011, only 1 less than the accurate result. For all other inputs, it will generate the accurate result.
For uniformly distributed inputs, the carry-in has equal probability of being 1 or 0. The probability of inputs at bit position i propagating a carry is $P_i = 1/2$. Therefore, in the proposed n-bit approximate adders, the probability of $S_{m-2}$ and $S_{m-1}$ generating an error is 0.125 as shown in (7). Throughout this paper, $E_x$ represents the cases when hardware x generates an error.
Architecture of the proposed approximate adders is shown in Fig. 4. It uses the 2 MSBs of the LSP to predict $C_{MSP}$, whereas their respective sum bits are computed using AAd1. AAd1 is only suitable when the $C_{out}$ of 2-bit inputs is predicted accurately. Accurate prediction of $C_{out}$ requires additional resources or unused LUT inputs. Therefore, to design area efficient approximate adders for FPGAs, AAd1 is not used in the least-significant m − 2 bits of the LSP. In this paper, we propose two n-bit approximate adders using the architecture in Fig. 4. The two proposed n-bit approximate adders use different approximate functions for the first m − 2 bits of the LSP.
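A behavioral sketch of AAd1, written from the description above rather than from (5) and (6), which are not reproduced in this text. The function names and the packing of the three outputs into one integer are illustrative assumptions.

```python
from itertools import product

def aad1(a, b, c_in):
    """a, b: 2-bit operands (bits m-2 and m-1 of the LSP); returns (c_msp, s_hi, s_lo)."""
    p_lo, p_hi = (a ^ b) & 1, ((a ^ b) >> 1) & 1    # propagate signals
    g_lo, g_hi = (a & b) & 1, ((a & b) >> 1) & 1    # generate signals
    c_msp = g_hi | (p_hi & g_lo)                    # carry predicted without using c_in
    if c_in and p_lo and p_hi:
        return c_msp, 1, 1                          # compensated output 011 instead of 000
    s_lo = p_lo ^ c_in
    s_hi = p_hi ^ (g_lo | (p_lo & c_in))
    return c_msp, s_hi, s_lo

def aad1_errors():
    """Error (approximate minus accurate) for all 32 input combinations."""
    errors = []
    for a, b, c_in in product(range(4), range(4), range(2)):
        c_msp, s_hi, s_lo = aad1(a, b, c_in)
        errors.append((4 * c_msp + 2 * s_hi + s_lo) - (a + b + c_in))
    return errors

if __name__ == "__main__":
    errors = aad1_errors()
    wrong = [e for e in errors if e != 0]
    print(f"error cases: {len(wrong)} of {len(errors)}, values: {sorted(set(wrong))}")
```

In this sketch, 4 of the 32 input combinations are wrong, each by one unit at bit position m − 2, which agrees with the probability of 0.125 stated above.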

A. PROPOSED LOW ERROR AND AREA EFFICIENT APPROXIMATE ADDER FOR FPGAs
In this section, we propose a low error and area efficient approximate adder (LEADx) for FPGAs.
State-of-the-art FPGAs use 6-input LUTs. These LUTs can be used to implement two 5-input functions. The complexity of the implemented logic function does not affect performance of LUT based implementation. A 2-bit adder has 5 inputs and two outputs. Therefore, a LUT can be used to implement a 2-bit approximate adder.
For an area efficient FPGA implementation, we propose to split the first m − 2 bits of LSP into (m − 2)/2 groups of 2-bit inputs such that each group is mapped to a single LUT. Each group adds two 2-bit inputs with carry-in using an approximate 2-bit adder (AAd2).
To eliminate the carry chain in the LSP, we propose to equate $C_{out}$ of the ith group to one of the inputs of that group ($A_{i+1}$). This results in an error in 8 out of 32 possible cases with an absolute error magnitude of 4 in each erroneous case. To reduce the error magnitude, we propose to compute the $S_i$ and $S_{i+1}$ output bits as follows:
• If the $C_{out}$ is predicted correctly, the sum outputs are also calculated accurately using standard 2-bit addition.
• If the $C_{out}$ is predicted incorrectly and the predicted value of $C_{out}$ is 0, both sum outputs are set to 1.
• If the $C_{out}$ is predicted incorrectly and the predicted value of $C_{out}$ is 1, both sum outputs are set to 0.
This modification reduces the absolute error magnitude to 2 in two cases, and to 1 in the other six cases. The resulting truth table of AAd2 is given in Table 3. The error cases are shown in red. Since AAd2 produces an erroneous result in 8 out of 32 cases, the error probability of AAd2 is 0.25 as shown in (8).
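The three rules above can be turned into a short behavioral sketch that regenerates the error cases of AAd2. This follows the textual rules only; Table 3 is not reproduced here, and the function names are illustrative.

```python
def aad2(a, b, c_in):
    """a, b: 2-bit operands of one group; returns the two approximate sum bits as 0..3."""
    total = a + b + c_in
    predicted = (a >> 1) & 1            # C_out is equated to the upper input bit A[i+1]
    if (total >> 2) == predicted:       # carry predicted correctly: accurate sum bits
        return total & 3
    return 3 if predicted == 0 else 0   # wrong prediction: saturate toward the true value

def aad2_errors():
    """Error (approximate minus accurate) for all 32 input combinations."""
    errors = []
    for a in range(4):
        for b in range(4):
            for c_in in range(2):
                approx = 4 * ((a >> 1) & 1) + aad2(a, b, c_in)
                errors.append(approx - (a + b + c_in))
    return errors

if __name__ == "__main__":
    errors = aad2_errors()
    print("error cases:", sum(e != 0 for e in errors), "of", len(errors))
    print("magnitudes :", sorted(abs(e) for e in errors if e != 0))
```

Running the enumeration gives 8 error cases out of 32, two with magnitude 2 and six with magnitude 1, matching the counts stated above.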
The proposed LEADx approximate adder is shown in Fig. 5. An n-bit LEADx uses (m − 2)/2 copies of the AAd2 adder in the least significant m − 2 bits of the approximate adder architecture shown in Fig. 4. In LEADx, $C_{m-2} = A_{m-3}$. AAd2 implements a 5-to-2 logic function that is mapped to a single LUT. Similarly, AAd1 is also mapped to a single LUT. Therefore, m/2 LUTs are used for the LSP. These LUTs work in parallel. Therefore, the delay of the LSP is equal to the delay of a single LUT ($t_{LUT}$). The critical path of LEADx is from the input $A_{m-2}$ to the output $S_{n-1}$. Fig. 6 shows an example of the functionality of 16-bit LEADx with 8-bit approximation. The outputs of the bits enclosed in dotted lines are computed using AAd1. The outputs of the other bits of the approximate part (LSP) are computed using three copies of AAd2. The carry-in to the accurate part ($C_{MSP}$) is predicted from the two MSBs of the LSP as shown in (4).
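The structure described above can be assembled into a functional model. This is a hedged sketch: the name `leadx_add` is illustrative, the carry-in of the first AAd2 group is assumed to be 0, m is assumed to be even, and the AAd1 and AAd2 behavior is taken from the textual descriptions rather than from the figures or truth tables.

```python
def leadx_add(a, b, n, m):
    """Functional model of an n-bit LEADx with an m-bit LSP; returns an (n+1)-bit sum."""
    result = 0
    c_in = 0                                   # assumption: the first group has no carry-in
    for i in range(0, m - 2, 2):               # (m - 2) / 2 copies of AAd2
        x, y = (a >> i) & 3, (b >> i) & 3
        total = x + y + c_in
        predicted = x >> 1                     # C_out of the group is equated to A[i+1]
        if (total >> 2) == predicted:
            s = total & 3                      # carry predicted correctly: accurate sum
        else:
            s = 3 if predicted == 0 else 0     # wrong prediction: saturate the sum bits
        result |= s << i
        c_in = predicted                       # next group (or AAd1) sees A[i+1] as carry
    # AAd1 on bits m-2 and m-1, with C[m-2] = A[m-3]
    x, y = (a >> (m - 2)) & 3, (b >> (m - 2)) & 3
    p, g = x ^ y, x & y
    c_msp = (g >> 1) | ((p >> 1) & g & 1)      # G[m-1] OR (P[m-1] AND G[m-2])
    s = 3 if (c_in and p == 3) else (x + y + c_in) & 3
    result |= s << (m - 2)
    # Accurate MSP with the predicted carry-in
    msp = (a >> m) + (b >> m) + c_msp
    return result | (msp << m)

if __name__ == "__main__":
    n, m = 6, 4
    errs = [leadx_add(a, b, n, m) - (a + b) for a in range(1 << n) for b in range(1 << n)]
    print("ER  =", sum(e != 0 for e in errs) / len(errs))
    print("MSE =", sum(e * e for e in errs) / len(errs))
```

For n = 6 and m = 4, exhaustive evaluation of this model gives an error rate of 0.328125 and an MSE of 1.9375. These figures describe the sketch under the stated assumptions and are not results reported in the paper.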
The error probability of n-bit LEADx depends on the number of approximate 2-bit adders used in the approximate part. An error in any of these 2-bit adders can contribute to the error in the sum output. Therefore, the error probability of LEADx is given as the probability of the union of the error events of the individual 2-bit approximate adders. Let there be N copies of AAd2 in LEADx, and let $E_{AAd2-i}$ represent the error in the ith copy of AAd2; then the error probability of LEADx can be calculated as shown in (9).
Since errors in two or more of these 2-bit adders can occur concurrently, the error events of these adders are not mutually exclusive. Therefore, (9) can be evaluated using the inclusion-exclusion principle [29]. For example, the error probability of LEADx with a 4-bit LSP, for uniformly distributed inputs, can be calculated as shown in (10).
Similarly, the error probability of LEADx with a 6-bit LSP, for uniformly distributed inputs, can be calculated as shown in (11).

B. PROPOSED AREA AND POWER EFFICIENT APPROXIMATE ADDER FOR FPGAs
In this section, we propose an area and power efficient approximate adder (APEx) for FPGAs. APEx is also based on the approximate adder architecture shown in Fig. 4. For the least significant m − 2 bits of the LSP, the aim is to find an approximate function with no data dependency. Carry should neither be generated nor used for sum computation. A 1-bit input pair at any bit position i ≤ (m − 2) should produce a 1-bit sum output only.
In general, any logic function with 1-bit output can be used as an approximate function to compute the approximate sum of 1-bit inputs at ith bit position. A constant 0 or constant 1 at the output are also valid approximate functions. Fixing the output to 0 or 1 will reduce the area and power consumption of the approximate adder because no hardware will be required for sum computation.
We evaluated error metrics of both constant functions for 1-bit addition, as shown in Table 4. Fixing the output to 0 introduces error in 3 out of 4 cases with an average error (AE) of −1 and MSE of 3/2 for uniformly distributed inputs. Fixing the output to 1 introduces error in 2 out of 4 cases with 0 AE and MSE of 1/2 for uniformly distributed inputs. Therefore, constant 1 provides a better approximation.
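The comparison of the two constant functions can be reproduced by enumerating the four 1-bit input pairs. The sketch below assumes that the carry output is 0 in both cases, so the approximate result is just the constant; the function name is illustrative.

```python
def constant_sum_metrics(value):
    """Error metrics when the 1-bit sum output is fixed to `value` and the carry to 0."""
    errors = [value - (a + b) for a in (0, 1) for b in (0, 1)]
    count = len(errors)
    return {
        "ER": sum(e != 0 for e in errors) / count,   # fraction of erroneous cases
        "AE": sum(errors) / count,                   # average (signed) error
        "MSE": sum(e * e for e in errors) / count,   # mean square error
    }

if __name__ == "__main__":
    for value in (0, 1):
        print(f"constant {value}:", constant_sum_metrics(value))
```

The enumeration gives an error rate of 3/4, AE of −1, and MSE of 3/2 for constant 0, and an error rate of 2/4, AE of 0, and MSE of 1/2 for constant 1.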
We further analyze the error metrics of the n-bit approximate adder architecture shown in Fig. 4. If the least significant m − 2 bits are fixed to 1, the maximum error (ME) occurs when the inputs $A_0$ to $A_{m-3}$ and $B_0$ to $B_{m-3}$ are all 0. With accurate addition, the $S_0$ to $S_{m-3}$ output bits are all 0 and a carry is not propagated to bit position m − 2. Fixing $S_0$ to $S_{m-3}$ to 1 and the carry-in for bit position m − 2 to 0 results in an ME of $2^{m-2} - 1$. The ME of constant 1 is less than the ME of constant 0.
Furthermore, assume that a constant value V is used to approximate the function F = A + B. The resulting absolute error is defined as |F − V |. The aim is to find a constant value V such that MSE is minimized. This is a well-known problem with a well-defined solution: using mean of distribution of F as V minimizes the MSE [30], [31].
Let us consider that A and B have uniform input distribution with values between 0 and $2^n - 1$; then F has a symmetric triangular distribution in the range $[0, 2^{n+1} - 2]$ [18]. In the case of a symmetric distribution, the mean and median are the same and located at the center of the sample space [32]. Therefore, the mean and median of F are located at $2^n - 1$, which is the halfway point of $[0, 2^{n+1} - 2]$. The binary representation of $2^n - 1$, in the (n + 1)-bit sample space, is 0111...1. Therefore, using constant 1 as the sum output and 0 as carry-out minimizes the MSE of the approximate output. If i bits are fixed to 1, the probability of error in the sum output is calculated as shown in (12).
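The claim that the constant $2^n - 1$ minimizes the MSE can be checked by brute force for small n. This is only a numerical illustration of the argument above; the function names are illustrative.

```python
def mse_for_constant(v, n):
    """MSE of approximating F = A + B by the constant v for uniform n-bit A and B."""
    size = 1 << n
    return sum((a + b - v) ** 2 for a in range(size) for b in range(size)) / size ** 2

def best_constant(n):
    """Constant in [0, 2^(n+1) - 2] with the smallest MSE, found by exhaustive search."""
    return min(range((1 << (n + 1)) - 1), key=lambda v: mse_for_constant(v, n))

if __name__ == "__main__":
    for n in (2, 3, 4):
        v = best_constant(n)
        print(f"n = {n}: best constant = {v} = {v:0{n + 1}b} in binary")
```

For each n tried, the search returns $2^n - 1$, whose (n + 1)-bit representation is a 0 followed by n ones, that is, carry-out 0 and all sum bits 1.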
In the proposed APEx, the $S_0$ to $S_{m-3}$ outputs are fixed to 1 and $C_{m-2}$ is 0. This provides a significant reduction in area and power consumption at the expense of a slight quality loss. It is important to note that this is different from the bit truncation technique, which fixes both the sum and carry outputs to 0. The ME of the truncate adder is $2^{m+1} - 2$, which is much higher than the ME of APEx ($2^{m-2} - 1$). The proposed APEx approximate adder is shown in Fig. 7. As in LEADx, the critical path of APEx is from the input $A_{m-2}$ to the output $S_{n-1}$. Similar to (9), the error probability of APEx can be calculated as shown in (13). When $C_{m-2}$ is 0, $E_{AAd1}$ reduces to 0 according to (7). Therefore, the error probability of APEx depends only on the number of output bits fixed to 1.
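Because AAd1 is accurate when its carry-in is 0, APEx can be modeled as an exact addition of the operands with their lowest m − 2 bits dropped, followed by setting those output bits to 1. The sketch below relies on that observation; `apex_add` and `truncate_add` are illustrative names, and the truncate model assumes all m LSP bits and the carry into the MSP are fixed to 0.

```python
def apex_add(a, b, m):
    """Functional model of APEx with an m-bit LSP; the result includes the carry-out bit."""
    k = m - 2                          # number of sum bits fixed to 1
    upper = (a >> k) + (b >> k)        # AAd1 (carry-in 0) plus the accurate MSP
    return (upper << k) | ((1 << k) - 1)

def truncate_add(a, b, m):
    """Truncate adder model: the m LSP sum bits and the carry into the MSP are 0."""
    return ((a >> m) + (b >> m)) << m

def max_abs_error(adder, n, m):
    """Maximum absolute error over all pairs of n-bit inputs."""
    size = 1 << n
    return max(abs(adder(a, b, m) - (a + b)) for a in range(size) for b in range(size))

if __name__ == "__main__":
    n, m = 8, 6
    print("ME of APEx    :", max_abs_error(apex_add, n, m))      # 2^(m-2) - 1
    print("ME of truncate:", max_abs_error(truncate_add, n, m))  # 2^(m+1) - 2
```

For n = 8 and m = 6 this gives an ME of 15 for APEx and 126 for the truncate adder, in line with the expressions above, and the APEx errors sum to zero over all input pairs.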

IV. EXPERIMENTAL RESULTS AND DISCUSSION
In this section, we present experimental results of the proposed approximate adders, LEADx and APEx. We compare LEADx and APEx with other FPGA-specific approximate adders in the literature: LBA [19], DeMAS [20], and SEDA [22]. DeMAS can be built using different configurations. For a given number of approximate bits, each of these configurations has the same area. Therefore, we chose the configuration with the lowest average error for comparison.
We also compare LEADx and APEx with power and area efficient ASIC-based approximate adders in the literature: AFA [17], HOAANED [16], and LOA [14]. Each of these approximate adders is based on the approximate adder architecture shown in Fig. 1, where approximation is done only in the LSP and the MSP is kept accurate. We also compare the proposed approximate adders with the segmented and speculative approximate adders in the literature.

A. ERROR METRICS
The functional models of these approximate adders are implemented in C++. Error metrics of these approximate adders are determined using their functional models for 16, 32, and 64-bit addition, with varying number of approximate bits, using $10^7$ uniform random numbers as inputs.
The error value for each input is calculated by subtracting the accurate result from the approximate result. Error value may be positive, negative, or zero. The average error (AE) is defined as the average of all the error values. MAE, also known as mean error distance [33], is the average of the absolute values of all the error values. MAE is always positive. MSE is the average of the squares of all the error values. RMSE is the square root of MSE. The MAE and MSE of LEADx can be calculated using (14) and (15), respectively. Similarly, the MAE and MSE of APEx can be calculated using (16) and (17), respectively. An empirical approach is used to determine these mathematical models, i.e., these formulas are determined using experimental results.
$$MSE_{LEADx} \approx 2^{2m-7} + 2^{2m-11} - 2^{1.5m-9.2} \quad (15)$$
As can be observed in these equations, error metrics of the proposed approximate adders depend only on the number of approximate bits (m), and they are independent of the bit width (n) of the adder.
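The metric definitions above translate directly into code. The sketch below is in Python rather than the C++ used for the functional models in the paper, the function names are illustrative, and `mse_leadx_estimate` evaluates the empirical expression (15) as read from the text.

```python
import math

def error_metrics(approx, exact):
    """ER, AE, MAE, MSE and RMSE for paired approximate and accurate results."""
    errors = [p - q for p, q in zip(approx, exact)]   # approximate minus accurate
    count = len(errors)
    mse = sum(e * e for e in errors) / count
    return {
        "ER": sum(e != 0 for e in errors) / count,
        "AE": sum(errors) / count,
        "MAE": sum(abs(e) for e in errors) / count,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
    }

def mse_leadx_estimate(m):
    """Empirical MSE estimate of LEADx from (15); depends only on m, not on n."""
    return 2 ** (2 * m - 7) + 2 ** (2 * m - 11) - 2 ** (1.5 * m - 9.2)

if __name__ == "__main__":
    print(error_metrics([3, 5, 8], [4, 5, 6]))
    for m in (4, 8, 12, 16):
        print(f"m = {m:2d}: estimated MSE of LEADx = {mse_leadx_estimate(m):.2f}")
```

Since the estimate grows roughly as $2^{2m}$, each additional pair of approximate bits multiplies the expected MSE by about 16.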
Error metrics and the error distribution of 16-bit approximate adders with 8-bit approximation are shown in Fig. 9. The error distribution is plotted as a function of error value and its respective percentage occurrence. As can be seen in Fig. 9, the maximum errors of the proposed approximate adders are less than those of other approximate adders.
The error distribution of LEADx is skewed to the negative side. This indicates that, in most of the cases, the result of LEADx is less than the accurate result, leading to a negative AE. Whereas, plotting the error distribution of APEx results in a symmetrical triangular shape centered at zero, indicating that APEx has equal probability of negative and positive errors. Therefore, APEx has almost zero AE.
The error distribution of LBA indicates that its erroneous output is always less than the accurate result. All other approximate adders in the literature have almost symmetrical error distribution. However, their error values are spread over a wide range, resulting in much larger MAE and MSE as compared to the proposed approximate adders.
The error metrics of 64-bit adders with 4 to 12-bits of approximation are reported in Table 5. Our proposed approximate adders have the lowest MSE. The MSE of the LEADx is at least 20% less than that of the approximate adders in the literature.
LBA has the lowest MAE. However, it has the worst area and power consumption results, as reported in the next section. The MAE of the proposed approximate adders is second only to that of LBA.
The ERs of LEADx and APEx validate the analytical error probability results given in Section III. All the approximate adders, except LBA and LEADx, have high ER.
These adders follow the fail-small approach [25]. In the fail-small approach, even if ER is high, error magnitudes are small. The rationale behind this approach is that small errors are naturally masked by algorithms, and they have less impact on MSE. Therefore, they slightly degrade the quality of applications.
The error magnitude of our proposed approximate adders is significantly reduced by accurately predicting the carry to the MSP using unused LUT inputs. Both AAd1 and AAd2 fully utilize the LUT inputs to achieve low error. LEADx is designed in a way that not only the error values are reduced but also the number of error cases is reduced. The experimental results show that LEADx has indeed higher accuracy and lower MSE than the other approximate adders. Similarly, the logic function of the approximate part of APEx is determined to reduce the MSE. The experimental results show that the MSE of APEx is indeed less than that of the approximate adders in the literature.
VOLUME 9, 2021

B. IMPLEMENTATION RESULTS
All the approximate adders are implemented using Verilog HDL. The accurate part of all the adders is identical and implemented using addition operator. Verilog RTL codes are synthesized and implemented on a Xilinx Virtex 7 FPGA with speed grade 3 using Vivado 2020.1. AreaOptimized_high strategy is used for synthesis, and default strategy is used for implementation.
The quality metrics are extracted from post-implementation timing simulations using 1 million uniform random numbers. The quality metrics are cross verified with C++ simulations. For power estimation, switching activity interchange format (SAIF) files are also generated from these post-implementation timing simulations at 100 MHz for all adders. The power consumption of each approximate adder FPGA implementation is estimated with Vivado 2020.1 using the corresponding SAIF file.
The implementation results of 16-bit adders with 8-bit approximation are given in Table 6. All the adders are implemented with input and output registers. SEDA and LBA are slower than the accurate adder because of carry propagation in their LSPs. All other 16-bit approximate adders have the same delay as the accurate adder. It is important to note that their delay is limited by the maximum frequency of Virtex 7 FPGA. It does not necessarily mean that the critical path of these adders is the same.
All the approximate 16-bit adders, except LBA, use fewer LUTs than the accurate adder. Since an accurate adder is used in the MSP of all these adders, the reduction in LUTs occurs only in the LSP. Since LEADx performs 2-bit addition in a single LUT, its LSP uses 50% fewer LUTs than the accurate adder.
APEx and HOAANED use the lowest number of LUTs. For these two adders, a significant reduction in number of LUTs occurs because of the use of constant functions in their LSPs. For other approximate adders, the reduction in number of LUTs occurs because of the approximation techniques used, which allow the synthesis tool to merge two sum outputs to a single LUT.
LEADx consumes slightly less power than the accurate adder. APEx consumes the lowest power among all the approximate adders. For the 16-bit adder with 8-bit approximation, the power consumption of APEx is 29% less than that of the accurate adder and 4.5% less than that of the second lowest power consuming adder, HOAANED.
The LUTs vs MSE and power vs MSE graphs of 32-bit approximate adders are given in Fig. 10. These results are plotted for 4-bit to 20-bit approximation in a 32-bit adder. The 32-bit accurate adder uses 32 LUTs and consumes 10.75 mW power.
While the number of LUTs used by most of the approximate adders decreases linearly with the increase in approximation, their respective power reductions do not follow the same trend. However, APEx provides significant power reduction compared to the accurate adder at the cost of a slight loss in accuracy.
The reductions in LUTs, power consumption, and delay achieved by 64-bit approximate adders with 16-bit approximation, relative to the 64-bit accurate adder, are shown in Fig. 11. LEADx reduced the LUTs by 12.5% compared to the accurate adder. APEx reduced the LUTs by 23.4% and power consumption by 21% compared to the accurate adder.
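The 12.5% LUT reduction reported for LEADx is consistent with a simple count. The following is a minimal sketch, assuming that the accurate adder uses one LUT per bit (as the 32-LUT figure for the 32-bit accurate adder suggests) and that the LSP packs two sum bits into each LUT. The function name `leadx_lut_estimate` is hypothetical and is not part of the original implementation.

```python
def leadx_lut_estimate(n_bits, approx_bits):
    """Estimate LUT usage of an n-bit adder whose approx_bits LSBs use 2 bits per LUT.

    Assumptions (not taken from a synthesis report):
      - the accurate adder and the accurate MSP use one LUT per bit;
      - the approximate LSP packs two sum bits into each LUT.
    Returns (total_luts, percent_reduction_vs_accurate).
    """
    if not 0 <= approx_bits <= n_bits or approx_bits % 2:
        raise ValueError("approx_bits must be even and between 0 and n_bits")
    accurate_luts = n_bits
    msp_luts = n_bits - approx_bits   # accurate part: one LUT per bit
    lsp_luts = approx_bits // 2       # approximate part: two bits per LUT
    total = msp_luts + lsp_luts
    reduction = 100.0 * (accurate_luts - total) / accurate_luts
    return total, reduction


if __name__ == "__main__":
    # 64-bit adder with 16-bit approximation, as in Fig. 11
    print(leadx_lut_estimate(64, 16))
```

Under these assumptions, the 64-bit case with 16-bit approximation needs 48 LUTs for the MSP and 8 LUTs for the LSP, which matches the reported 12.5% reduction. Actual counts depend on the synthesis tool and its settings.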
LBA performs worse than the accurate adder in all these metrics. Among the other FPGA specific adders, DeMAS provided no power reduction but reduced the LUTs by 11% compared to the accurate adder. The performance of HOAANED is comparable to that of APEx. However, as discussed earlier, it has lower quality than both LEADx and APEx.
These results show that our proposed LEADx has smaller area, lower power, and better quality than the FPGA specific adders in the literature. The results show that DeMAS is the most efficient FPGA specific approximate adder in the literature. With 8-bit approximation, LEADx has 7% smaller area and 86% lower MSE than DeMAS. LOA is one of the most efficient ASIC-based approximate adders in the literature [5]. LEADx has better quality than LOA at the same cost when implemented on an FPGA. With 8-bit approximation, LEADx has 87% lower MSE than LOA at the same cost. HOAANED is suitable for FPGA implementation. However, APEx has lower power consumption and better quality than HOAANED at the same cost when implemented on an FPGA. APEx has more than 60% lower MSE than HOAANED at the same cost.
The quality and implementation results of 16-bit adders with different approximation amounts are given in Table 7. These adders have the same delay (1.35 ns). These adders are implemented with input and output registers. Therefore, although their critical paths are different, their speed is limited by the maximum frequency supported by the Virtex 7 FPGA.
xUAV is an FPGA-specific segmented adder. Several configurations of xUAV are proposed in [23]. We used the two most efficient configurations: one with the lowest error (m = 5, r = 1) and the other with low error and low area (m = 3, r = 1).
The segmented and speculative adders follow the fail-rare approach [25]. They have low ER, but their error magnitudes are usually large. Therefore, these adders have high MAE and MSE. For example, ACA-I with 8-bit segmentation has only 1.5% ER. However, its ME is $2^{15}$. Most of the errors that occur in ACA-I have large magnitudes, resulting in a significantly high MSE.
Among the segmented and speculative adders, BCSA with 8-bit segmentation has the best quality. ETA-II and ACA-II have similar architectures. Therefore, their error metrics are similar. However, for 8-bit segmentation, ACA-II is more area efficient than ETA-II.
LEADx and APEx have better quality, smaller area, and lower power consumption than the segmented and speculative adders. These results show that, for uniformly distributed inputs, the fail-small approach gives better quality than the fail-rare approach.
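The difference between the two approaches can be illustrated with two toy 8-bit adders evaluated exhaustively over all input pairs, which corresponds to uniformly distributed inputs. This is a minimal sketch and not the implementation of any adder evaluated above. `lower_or_add` is a lower-part-OR style adder standing in for the fail-small approach, `windowed_add` is a simplified windowed-carry adder standing in for the fail-rare approach, and ME is taken here to mean the maximum absolute error; all of these are assumptions made for illustration.

```python
from itertools import product

N = 8  # operand width, kept small so that all input pairs can be enumerated
M = 4  # approximated low bits (fail-small) or window length (fail-rare)


def lower_or_add(a, b, m=M):
    """Fail-small toy: OR the m low bits and add the upper bits exactly.

    The carry into the upper part is the AND of the two most significant
    low-part bits.
    """
    low = (a | b) & ((1 << m) - 1)
    carry_in = (a >> (m - 1)) & (b >> (m - 1)) & 1
    high = (a >> m) + (b >> m) + carry_in
    return (high << m) | low


def windowed_add(a, b, n=N, k=M):
    """Fail-rare toy: each sum bit only sees the k-bit window ending at it."""
    result = 0
    for i in range(n):
        lo = max(0, i - k + 1)
        width = i - lo + 1
        mask = (1 << width) - 1
        window_sum = ((a >> lo) & mask) + ((b >> lo) & mask)
        result |= ((window_sum >> (width - 1)) & 1) << i
    # The carry-out is taken from the top k-bit window only.
    top_sum = (a >> (n - k)) + (b >> (n - k))
    result |= (top_sum >> k) << n
    return result


def error_metrics(adder, n=N):
    """Exhaustive ER, MAE, MSE and maximum absolute error of an n-bit adder."""
    errors = [adder(a, b) - (a + b) for a, b in product(range(1 << n), repeat=2)]
    count = len(errors)
    return {
        "ER": sum(e != 0 for e in errors) / count,
        "MAE": sum(abs(e) for e in errors) / count,
        "MSE": sum(e * e for e in errors) / count,
        "ME": max(abs(e) for e in errors),  # assumed: maximum absolute error
    }


if __name__ == "__main__":
    print("fail-small toy:", error_metrics(lower_or_add))
    print("fail-rare toy: ", error_metrics(windowed_add))
```

In this toy setting, the windowed adder errs less often, but every error it makes is at least as large as the weight of bit k, so its MSE is much higher than that of the lower-part-OR adder, whose errors are frequent but never exceed 8 in magnitude.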

D. CASE STUDY: MOTION ESTIMATION IN VIDEO ENCODING
We also assessed the impact of the proposed approximate adders and the other approximate adders on video encoding quality. C++ implementations of 8-bit adders with 4-bit approximation are integrated into the High Efficiency Video Coding (HEVC) reference software video encoder, HM 16.14.
The approximate adders are used for sum of absolute difference (SAD) computations for motion estimation (ME). ME accounts for approximately 70% of the computational complexity of video encoding [13]. The search strategy is set to fast test zone search (TZ). The quality results are obtained for four video sequences with different spatial resolutions.
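As an illustration of where an approximate adder enters the SAD computation, the sketch below accumulates absolute differences through a pluggable adder function. It is a minimal sketch, not the HM 16.14 integration: the function names `sad` and `lower_or_add` are hypothetical, the lower-part-OR adder is only a stand-in for the evaluated adders, and the assumption that only the accumulation (not the absolute difference) is approximated is ours.

```python
def exact_add(a, b):
    """Reference accurate addition."""
    return a + b


def lower_or_add(a, b, m=4):
    """Stand-in approximate adder: OR the m low bits, add the rest exactly."""
    low = (a | b) & ((1 << m) - 1)
    carry_in = (a >> (m - 1)) & (b >> (m - 1)) & 1
    high = (a >> m) + (b >> m) + carry_in
    return (high << m) | low


def sad(current_block, reference_block, add=exact_add):
    """Sum of absolute differences, accumulated with the given adder."""
    total = 0
    for cur, ref in zip(current_block, reference_block):
        total = add(total, abs(cur - ref))
    return total


if __name__ == "__main__":
    # Hypothetical pixel values, flattened in raster order
    current = [52, 55, 61, 66, 70, 61, 64, 73]
    reference = [50, 58, 60, 69, 68, 65, 60, 75]
    print("exact SAD:      ", sad(current, reference))
    print("approximate SAD:", sad(current, reference, add=lower_or_add))
```

Because motion estimation only compares SAD values of candidate blocks, small errors in each SAD change the chosen motion vector only when two candidates are close in cost, which is why this computation tolerates approximation.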
For each approximate adder, the PSNR in dB and the percentage increase in bitrate (ΔBR) relative to the accurate adder are shown in Table 8. LEADx has the least quality loss, i.e., the lowest PSNR decrease and the lowest bitrate increase, compared to the other approximate adders.
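For reference, the two reported quantities can be computed as follows. This is a minimal sketch assuming the standard PSNR definition for 8-bit video (peak value 255) and a bitrate increase measured relative to the encoder with the accurate adder. The function names are hypothetical and do not come from HM 16.14.

```python
import math


def psnr_db(mse, peak=255):
    """PSNR in dB for a given mean square error between two frames."""
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak * peak / mse)


def bitrate_increase_percent(bitrate_approx, bitrate_accurate):
    """Percentage bitrate increase relative to the accurate-adder encoder."""
    return 100.0 * (bitrate_approx - bitrate_accurate) / bitrate_accurate


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only
    print(psnr_db(6.5))
    print(bitrate_increase_percent(1050.0, 1000.0))
```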

V. CONCLUSION
In this paper, two low error efficient approximate adders for FPGAs are proposed. The first approximate adder, LEADx, has lower MSE than the approximate adders in the literature. It also achieves better quality than the other approximate adders for the video encoding application. The second approximate adder, APEx, has the same area, lower MSE, and lower power consumption than the smallest and lowest power consuming approximate adder in the literature. It has smaller area and lower power consumption than the other approximate adders in the literature. Its MSE is second only to LEADx. Therefore, the proposed approximate adders can be used for FPGA implementations of error tolerant applications.