BDD-Based Error Metric Analysis, Computation and Optimization

Approximate Computing is a design paradigm that trades off computational accuracy for gains in non-functional aspects such as reduced area, increased computation speed, or power reduction. Computing the error of the approximated design is an essential step to determine its quality. The computation time for determining the error can become very large, effectively rendering the entire logic approximation procedure infeasible. In this work we extensively analyze various error metrics and approximation operations. We present methods to accelerate error metric computations by 1) exploiting structural information of the function obtained by applying the analyzed operations and 2) computing estimates of the metrics for multi-output Boolean functions represented as Binary Decision Diagrams (BDDs). We further present a novel greedy, bucket-based BDD minimization framework employing the newly proposed error metric computations to produce Pareto-optimal solutions with respect to BDD size and multiple error metrics. The applicability of the proposed minimization framework is demonstrated by an experimental evaluation. We can report considerable speedups while, at the same time, creating high-quality approximated BDDs. The presented framework is publicly available as open-source software on GitHub.


I. INTRODUCTION
Many applications in the domain of digital signal processing, such as image processing, do not require computations to be exact (see, e.g. [1]). This is due to the fact that the human perception itself is not perfect. In other situations, it might be difficult or even impossible to define what an exact result would be. Sometimes the customer is even willing to accept non-perfect results [2].
Approximate Computing [3], [4] is a design paradigm that exploits this fact and trades off computational accuracy for gains in non-functional aspects such as reduced area, increased computation speed, or power reduction. There are two main approaches to introduce approximations to the design (see, e.g., [5]) in order to achieve gains in, e.g., computation speed: (a) physical changes such as voltage overscaling or over-clocking, and (b) changes in the functionality of the design. In this paper, we pursue the second approach.
Let f be a Boolean function describing some functionality (e.g., realizing some image processing task) that is implemented correctly, i.e., a function that does not produce erroneous outputs. By f̃ we then denote a function that approximates the original function f, i.e., f ≈ f̃ with some notion of similarity.
It turns out that determining the similarity between f and f̃ is surprisingly difficult. This greatly hinders the automated approximate synthesis of functions. This is especially true for design methods, such as evolutionary algorithms, that rely on repeatedly evaluating the similarity. Therefore, there is an obvious need for methods that determine the similarity of a design f and its approximation f̃ quickly or, at least, compute an estimate of the similarity in a reasonable amount of time.
In the domain of approximate computing, instead of computing the similarity between the function f and its approximation f̃, the error that the function f̃ has with respect to f is computed instead. For this, many different error metrics have been proposed in the literature. Each of these error metrics captures a different aspect of f̃ being erroneous, thereby capturing a different notion of similarity.
An increasing amount of research activity (semi-)directly dealing with error metrics clearly shows that understanding the characteristics and computing the values of error metrics in the domain of approximate computing is of paramount importance.
As we understand error metrics as playing the central role in approximate computing, this work aims at an extensive examination of various error metrics. This includes properties of the metrics themselves as well as how they can be computed efficiently and quickly in a synthesis setting representing Boolean functions as Binary Decision Diagrams (BDDs). We present a greedy, bucket-based synthesis algorithm that exploits all findings presented in this publication. It turns out that this synthesis approach is very fast, scales to large Boolean functions, and is, at the same time, very competitive with previous approaches and general genetic algorithms. In many cases, it is even capable of outperforming previous work.
The remainder of the work is organized as follows: First, we give a brief survey of the related work dealing with error metrics and approximate synthesis. Then, Section III introduces the necessary background and notation to follow the paper. Section IV extends the previous findings of [6] by determining the complexity of the computation of further error metrics, further showing the need for efficient means to determine error metrics. Section V briefly examines the unexpected behavior of the rounding operations defined in [7]. Section VI is the theoretical main part of this work; it contains all findings that allow speeding up the process of error metric computation. In the following section, Section VII, all previous results are used to develop a greedy, bucket-based approximate synthesis algorithm. In Section VIII, experiments evaluating the proposed synthesis method are conducted before the work concludes with Section IX.

II. RELATED WORK
A. ERROR METRIC COMPUTATION
The notion of similarity is indirectly captured by error metrics that range from simply counting how often a design yields incorrect results to computing the mean squared error (see [8] for a good overview).
Making use of BDDs for the computation of error metrics is a common approach. In [9] and [10], miter circuits are represented as BDDs in order to quickly determine the error rate between the golden implementation and the approximation. The authors of [7] introduce multiple BDD-based algorithms to compute even more complex error metrics. In [11], the authors make use of Algebraic Decision Diagrams and Gröbner bases in a unified framework that allows the computation of all error metrics introduced in [8]. In [12], the authors of this document already presented initial results for quickly computing error metrics or determining lower and upper bounds for them by using BDDs and exploiting characteristics of the approximation operators.
However, many approaches have in common that sooner or later they reach computational limits. Studies on the complexity of the error metrics have shown that they reside in the complexity classes #P and FP^NP [6], explaining the scalability issues.
In the literature, researchers try to cope with this complexity by using Monte Carlo Simulations, see e.g. [13].
When a specific function, such as an adder or multiplier, is approximated in a constructive fashion, i.e. there is a simple construction rule for the approximated circuit, computing some error metrics becomes simple. The FAU adder [14], for example, is a parameterized approximate adder circuit for unsigned integers. Based on its parameters, the error rate, the average-case error and the worst-case error (see Section III-C for the definitions of these metrics) can be computed using a closed formula.
A more general approach to determine the error probability of adder and multiplier circuits is presented in [15]. Starting with an arbitrary input distribution (i.e. the probabilities of the inputs being 1 or 0), the authors construct a model of the underlying arithmetic circuit and then derive closed formulas for the error rate. However, this only works for circuits using construction schemes as proposed by the authors.
In [16], the authors deal with the problem of not just computing the probability of an error for a single Boolean function but for a compound of functions. They do so by propagating error probabilities through the compound function and explicitly ignore re-convergences (i.e. the result is not correct for such cases). This restriction makes the approach very scalable. As it turns out, the error introduced by ignoring re-convergences is very much negligible. As in [15], the authors allow the inputs to be arbitrarily distributed and are restricted to the error rate as an error metric.

B. APPROXIMATE LOGIC SYNTHESIS
A common approach for approximate logic synthesis is to perform a design space exploration using Cartesian Genetic Programming (CGP) in order to compute Pareto sets of approximated circuits. In these works, either only a simple error metric is used [10] or great efforts are made to speed up the error metric computation by using a formulation specifically tailored to work well with Boolean satisfiability (SAT) solvers [17]. The authors of [8] are capable of computing a high-quality Pareto front by restricting themselves to 8-bit adders and multipliers. Results for multipliers of up to twelve bits with guaranteed bounds on the error have successfully been determined using CGP [18].
The authors of [7], while mainly focusing on how to actually compute different error metrics efficiently by employing BDDs, introduce and formalize new approximation operations. Unfortunately, the paper only experimentally shows the general applicability of these operations and does not present a full synthesis approach. In [19], an evolutionary algorithm working directly on the BDD representation of designs is proposed. It automatically adapts its preference for either the BDD size or the error metric value in order to ensure that no good solution is lost due to optimizing too much for a small error. In [12], a greedy bucket-based algorithm for approximating designs that employs rules for successively applying approximation operations in a greedy fashion is presented.
The authors of [20] use cubes to represent the designs being approximated. The cubes can be used both for generating approximation candidates as well as for directly computing the error rate. While this allows for a rather scalable synthesis (including multi-output functions), the approach can not (directly) be used when other error metrics than the error rate are of interest.

III. BACKGROUND
A. NOTATION AND CONVENTIONS
In this paper, all functions will be of type f : B^n → B^m. The m individual output functions are denoted as f_i. The interpretation of f(x) as a natural number with the usual binary encoding is denoted by val(f(x)). While all the results presented in this paper extend to signed numbers, we omit the details for simplicity.
For a function f with m = 1, i.e., a Boolean function, its ON-set is denoted by ON(f), i.e.,

ON(f) = {x ∈ B^n : f(x) = 1}.

Analogously, the OFF-set OFF(f) denotes the set of input values x for which f(x) = 0 holds. The positive/negative co-factor of f for a variable x, denoted by f_x and f_x̄, respectively, are the functions obtained by setting the input variable to 1 or 0, respectively, i.e., the negative co-factor of f with respect to x_i is given by

f_x̄i(x_1, . . . , x_n) = f(x_1, . . . , x_{i−1}, 0, x_{i+1}, . . . , x_n).

As a shorthand, we define the absolute difference function of f and its approximation f̃ as the multi-output function d = |f − f̃| with

val(d(x)) = |val(f(x)) − val(f̃(x))|.

We make use of Binary Decision Diagrams (BDDs) [21] to represent Boolean functions. For simplicity, a collection of BDDs representing a multi-output function will also be referred to as ''a BDD''. Given a Boolean function f, we use #f to refer to the number of nodes of the BDD representing f.
Example 1: Fig. 1 depicts the BDD for the one-bit full adder function summing up a, b, and c, generating a sum bit s and a carry bit c_out. The nodes are labeled with the variable used for the decomposition; dashed lines indicate the negative co-factor and solid lines indicate the positive co-factor. The number of nodes, i.e., #f, is 8. Note that the terminal nodes are not counted.
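For illustration, the following minimal sketch (our own example, not part of the evaluated framework) builds the shared BDD of this full adder with the CUDD C API, the package also used in Section VIII. Note that CUDD's shared size includes the terminal node, whereas #f does not.

// Minimal sketch (our illustration, not the authors' code): building the
// shared BDD of the one-bit full adder from Example 1 with the CUDD C API.
#include <cstdio>
#include "cudd.h"

int main() {
    DdManager *mgr = Cudd_Init(0, 0, CUDD_UNIQUE_SLOTS, CUDD_CACHE_SLOTS, 0);
    DdNode *a = Cudd_bddIthVar(mgr, 0);
    DdNode *b = Cudd_bddIthVar(mgr, 1);
    DdNode *c = Cudd_bddIthVar(mgr, 2);

    DdNode *ab = Cudd_bddXor(mgr, a, b);       Cudd_Ref(ab);
    DdNode *s  = Cudd_bddXor(mgr, ab, c);      Cudd_Ref(s);    // sum bit
    DdNode *t1 = Cudd_bddAnd(mgr, a, b);       Cudd_Ref(t1);
    DdNode *t2 = Cudd_bddAnd(mgr, ab, c);      Cudd_Ref(t2);
    DdNode *co = Cudd_bddOr(mgr, t1, t2);      Cudd_Ref(co);   // carry-out bit

    DdNode *outputs[2] = { s, co };
    // Shared node count; unlike #f in the text, this includes the terminal node.
    std::printf("shared BDD size: %d\n", Cudd_SharingSize(outputs, 2));

    Cudd_Quit(mgr);
    return 0;
}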

B. APPROXIMATION OPERATIONS ON BDDs
There are multiple methods for approximating functions given as BDDs in order to reduce the BDD's size. In this paper, we restrict ourselves to the approximation operations presented in [7].
The two operations round-down and round-up, written f↓x and f↑x, take each node at the level of the variable x and below and replace the child node with the smaller ON-set by the 0- or 1-terminal, respectively.
The round operation [f]_x replaces all nodes labeled x with either 1 or 0, depending on the relative sizes of the ON- and OFF-set of the function represented by the node. If the size of the ON-set is larger than the size of the OFF-set, the node is replaced by 1; otherwise, 0 is chosen as the replacement.
The positive/negative co-factor approximations replace nodes by their positive or negative co-factors. The round operation and the co-factor approximations ensure that the resulting BDD no longer contains the variable at whose level the operation was performed, i.e., there is a guaranteed reduction in the BDD size.
Example 2: Consider the function f represented by a BDD, shown in Fig. 2(a). Applying the round-down operation at level x_2, the BDD is reduced by two nodes. The affected edges are indicated by the × symbol. The resulting BDD is shown in Fig. 2(b). Note that the nodes for the variable x_2 are not completely removed by the round-down operation. This illustrates that only the general round operation and the co-factor operations are guaranteed to completely remove the nodes at a given level.

C. ERROR METRICS
There are many different error metrics for assessing the quality of an approximated function f̃, each measuring a different aspect of the introduced error. Mrazek et al. give a good overview of commonly used metrics in [8]. Please note that we use a slightly different naming scheme for the error metrics. While our framework is capable of computing all error metrics from [8], we will only discuss a selection of them in detail in this paper. The interested reader is referred to our source code on GitHub [22].
The error rate er counts how often the approximation and the original function differ, without taking into account any semantic meaning of the output. The average-case error (ace, sometimes also called mean error distance, MED) and the worst-case error (wce) compute the average and maximum absolute difference in the output of f and f̃, respectively. The average-case relative error (acre) and the worst-case relative error (wcr) relate the difference to the correct value of the unapproximated function f. The formal definitions of these metrics are as follows:

er(f, f̃) = |{x ∈ B^n : f(x) ≠ f̃(x)}| / 2^n
ace(f, f̃) = (1/2^n) · Σ_{x ∈ B^n} |val(f(x)) − val(f̃(x))|
wce(f, f̃) = max_{x ∈ B^n} |val(f(x)) − val(f̃(x))|
acre(f, f̃) = (1/2^n) · Σ_{x ∈ B^n} |val(f(x)) − val(f̃(x))| / max(1, val(f(x)))
wcr(f, f̃) = max_{x ∈ B^n} |val(f(x)) − val(f̃(x))| / max(1, val(f(x)))

Example 3: Consider again the functions f and f̃ from Example 2. In Fig. 2(c), the corresponding truth table and the numerical values are shown. Using these values, the error metric values defined above can be determined directly.
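To make these definitions concrete, the following self-contained sketch (our illustration; exhaustive enumeration, hence only feasible for small n) evaluates the five metrics directly from the definitions above, with f and f̃ given as functions returning val(f(x)) and val(f̃(x)):

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <functional>

struct Metrics { double er, ace, wce, acre, wcr; };

// Computes er, ace, wce, acre, and wcr by enumerating all 2^n inputs.
Metrics compute_metrics(unsigned n,
                        const std::function<std::uint64_t(std::uint64_t)> &f,
                        const std::function<std::uint64_t(std::uint64_t)> &ft) {
    const std::uint64_t total = 1ull << n;
    std::uint64_t errors = 0;
    double sum_abs = 0.0, sum_rel = 0.0, max_abs = 0.0, max_rel = 0.0;
    for (std::uint64_t x = 0; x < total; ++x) {
        const double d   = std::fabs((double)f(x) - (double)ft(x));
        const double rel = d / (double)std::max<std::uint64_t>(1, f(x)); // max(1, val(f(x)))
        if (d != 0.0) ++errors;                      // input contributes to the error rate
        sum_abs += d;    max_abs = std::max(max_abs, d);
        sum_rel += rel;  max_rel = std::max(max_rel, rel);
    }
    return { (double)errors / (double)total, sum_abs / (double)total, max_abs,
             sum_rel / (double)total, max_rel };
}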

IV. ERROR METRIC COMPLEXITY
It was previously shown that computing the error rate, the average-case error, and the average bit-flip error lies in the complexity class #P, while computing the worst-case error and the worst-case bit-flip error lies in the class FP^NP [6].
In this paper, the additional error metrics average-case relative error and worst-case relative error are investigated. Since they are not explicitly covered in [6], we include a short proof of the complexity for both metrics as well. This can be easily done, as the cited paper provides a framework for proving the complexity of error metrics that can be constructed in a certain way (see below).
The authors of [6] further generalize the common average- and worst-case error metrics to the generalized average- and worst-case error metrics. These generalized versions of the metrics cover the essence of their respective concrete metric. The idea is then to construct these generalized metrics by using so-called rating functions φ covering the specific aspect of the metric. The generalized average error (gae) and the generalized worst-case error (gwe) are defined as

gae(f, f̃) = λ · Σ_{x ∈ B^n} φ(f(x), f̃(x))    and    gwe(f, f̃) = max_{x ∈ B^n} φ(f(x), f̃(x)),

with a scaling factor λ (e.g., λ = 1/2^n for averaging), respectively.
If the function φ is in the class NC^1 (roughly speaking: a circuit realizing the function can be created in polynomial time; see [23] for details), computing the generalized metrics has the complexity of #P or FP^NP, respectively.
In the original rating function definition, the property φ(a, b) = φ(b, a) was required to hold for all a, b ∈ B m . This property is not necessary for the proof. It was introduced to ensure that error metrics, constructed using the rating function, are commutative.
As the wcr and acre error metrics are not commutative to begin with, we drop this requirement for the corresponding rating functions. Both relative error metrics can then be described by a single rating function

φ(a, b) = |val(a) − val(b)| / max(1, val(a)),

i.e., we can write acre and wcr as

acre(f, f̃) = (1/2^n) · Σ_{x ∈ B^n} φ(f(x), f̃(x))    and    wcr(f, f̃) = max_{x ∈ B^n} φ(f(x), f̃(x)).

As φ is in NC^1, this proves that evaluating the average-case relative error has the complexity of #P, while evaluating the worst-case relative error has the complexity FP^NP.
For the sake of completeness, we also determine the complexity of evaluating the mean-squared error (mse), defined as

mse(f, f̃) = (1/2^n) · Σ_{x ∈ B^n} (val(f(x)) − val(f̃(x)))^2.

The rating function φ(a, b) = |val(a) − val(b)|^2 can be used to construct mse and, again, lies within the NC^1 class. This proves that evaluating mse is in the #P complexity class. As this error metric is mainly used in domains such as optimization, it is not investigated further in the rest of this paper.

V. APPROXIMATION OPERATOR ANALYSIS
The operators of [7] presented in Section III-B are suitable choices in practice, but careful inspection reveals an interesting property that has to be kept in mind when using them. The round-up operation behaves differently from what one might initially expect: it sets all references to nodes at or below the approximation threshold x_i to 1.
This can easily be seen by analyzing what the round-up operation does to a node that has the terminal 0 node as one of its children. This child has the smallest possible ON-set, since the other child must have at least one one-path (otherwise it would have been pruned away). Therefore, the terminal 0 child is replaced by the terminal 1 node (see Figure 3(a); triangles represent a whole sub-BDD). This effectively removes all links to the terminal 0 node within the range that is being rounded. As only one-paths remain within the range, the rounded part of the BDD collapses to the terminal 1 node. This behavior often leads to a drastic, and possibly unintended, reduction in the BDD size, resulting in large approximation errors. This somewhat limits the applicability of the round-up operator.
This property does not necessarily hold when only applying the operator to a range of variables from x_i to x_j while leaving all nodes with variables x_r with r < i unapproximated. The variable x_j must not be the last variable in the BDD. Therefore, we typically use the round-down and round-up operators not for all variables at or below a threshold but within the bounds of two variables. For this, we define f↓(x_i, x_j) and f↑(x_i, x_j) with i ≤ j to be the round-down and round-up operators working on levels i through j, respectively. These generalizations include the originally defined operators: for a BDD with a total of k levels, the identity f↓x_i = f↓(x_i, x_k) holds (and analogously for round-up).

Example 4: Consider again the BDD from Example 2. The round-down operation as defined in [7] operates on the levels of x_2 and x_3. Hence the round-up operation can be defined in terms of the generalized round-up operation by explicitly stating these levels, i.e., f↑x_2 = f↑(x_2, x_3). This allows to generalize the above statement: the generalized round-up operation f↑(x_i, x_j) removes all zero-paths within the rounded range, i.e., all paths from a node for variable x_r to the terminal 0 node for i ≤ r ≤ j. Restricting the round-up operation to the x_2 level only, i.e., f↑(x_2, x_2), does yield an approximated BDD that is not entirely pruned, as shown in Figure 3(b). The left-most x_2 node has been removed by the operation. The left-most x_3 node was removed as it no longer was the child of any node, and the remaining x_2 nodes are merged as they become isomorphic.

VI. ACCELERATING ERROR METRIC COMPUTATIONS
A. AVERAGE-CASE ERROR COMPUTATION
The average-case error is, next to the error rate, one of the most commonly used error metrics. In the following, we show how to speed up its computation by exploiting the general structure of the metric as well as the structure of the approximated function generated by certain approximation operations.

1) AVOIDING CONSTRUCTION OF THE CHARACTERISTIC FUNCTION
In [7], the authors construct the BDD for the characteristic function of d = |f − f̃| to compute the average-case error. It can easily be seen in [7, Table 3] that switching to an algorithm constructing the characteristic function greatly increases the computation time. It turns out that this construction can be avoided entirely and ace can simply be computed by determining the size of m ON-sets instead, as shown by the simple computation

ace(f, f̃) = (1/2^n) · Σ_{x ∈ B^n} val(d(x)) = (1/2^n) · Σ_{x ∈ B^n} Σ_{i=1}^{m} 2^{i−1} · d_i(x) = (1/2^n) · Σ_{i=1}^{m} 2^{i−1} · |ON(d_i)|.

Determining the size of an ON-set can be done in linear time with respect to the number of nodes of the BDD representing d.
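A possible realization on top of CUDD (a sketch under the assumption that the bit functions d_i of d have already been constructed, stored least significant bit first) is:

#include <cmath>
#include "cudd.h"

// Sketch: ace(f, f~) from the ON-set sizes of the bits of d = |f - f~|.
// d[i] is the BDD of bit i of d (weight 2^i, i.e. 0-based indexing),
// defined over n input variables.
double ace_from_onsets(DdManager *mgr, DdNode **d, int m, int n) {
    double sum = 0.0;
    for (int i = 0; i < m; ++i) {
        double on = Cudd_CountMinterm(mgr, d[i], n);  // |ON(d_i)|, linear in the BDD size
        sum += std::ldexp(on, i);                     // add 2^i * |ON(d_i)|
    }
    return std::ldexp(sum, -n);                       // divide by 2^n
}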

2) EXPLOITING APPROXIMATION OPERATION PROPERTIES
The time needed to construct d itself already makes up a considerable percentage of the total computation time. Table 1 contains a representative analysis of the partial computation times for the ace error metric (see Section VIII for details on the experimental setup). As can be seen, the construction of d accounts for at least 90% of the overall time. It is therefore desirable not to construct d at all. It turns out that this is possible if the inequality val(f̃(x)) ≤ val(f(x)) holds.
In this case, one does not need to construct any new function but can directly compute the average-case error by re-using the idea from the previous section as

ace(f, f̃) = (1/2^n) · Σ_{x ∈ B^n} (val(f(x)) − val(f̃(x))) = (1/2^n) · Σ_{i=1}^{m} 2^{i−1} · (|ON(f_i)| − |ON(f̃_i)|).

It turns out that the round-down operation from [7] can guarantee the even stronger subset property f̃_i(x) ≤ f_i(x) for 1 ≤ i ≤ m (see also Fig. 2(c)). This property allows us to actually construct the function d in an efficient manner.
Using a full subtractor circuit for the computation of d, the individual output bits are given by d_i = f_i ⊕ f̃_i ⊕ c_i, with the carry bits being computed as

c_{i+1} = (¬f_i ∧ f̃_i) ∨ (¬(f_i ⊕ f̃_i) ∧ c_i).

By induction, it can easily be shown that all c_i are 0: assume that c_i = 0 (which holds for the initial carry bit c_0), then

c_{i+1} = ¬f_i ∧ f̃_i = 0,

since the subset property f̃_i(x) ≤ f_i(x) holds. This allows a quick computation of error metrics such as ace, as the construction of d becomes trivial, i.e., d_i = f_i ⊕ f̃_i. As the average-case error is symmetric, i.e., ace(f, g) = ace(g, f), the cases val(f(x)) ≤ val(f̃(x)) and f_i(x) ≤ f̃_i(x) can be handled trivially by swapping the arguments to ace. This makes the simplification applicable to the round-up operation as well.
As long as the round-up and round-down operations are not mixed in consecutive approximation steps, the proposed construction of d can be used. This optimization is automatically used whenever applicable during the absolute difference computation. It can therefore be exploited by the greedy, bucket-based BDD minimization algorithm presented in Section VII.
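In code, the simplified difference construction amounts to m XOR operations; a sketch (our illustration, assuming the subset property f̃_i ≤ f_i established by the round-down operation) could look as follows:

#include <cstddef>
#include <vector>
#include "cudd.h"

// Sketch: when f~ was produced by round-down (f~_i <= f_i for all i), the
// carry bits vanish and d = |f - f~| is simply the bit-wise XOR of f and f~.
std::vector<DdNode*> difference_bits(DdManager *mgr,
                                     const std::vector<DdNode*> &f,
                                     const std::vector<DdNode*> &ft) {
    std::vector<DdNode*> d(f.size());
    for (std::size_t i = 0; i < f.size(); ++i) {
        d[i] = Cudd_bddXor(mgr, f[i], ft[i]);   // d_i = f_i XOR f~_i
        Cudd_Ref(d[i]);
    }
    return d;
}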

B. ERROR RATE COMPUTATION
It turns out that, unfortunately, for the error rate, as opposed to the average-case error, no specialized computation exists that exploits the structure of the BDDs generated by the approximation operations of [7]. Note that this does not mean that no faster algorithm for computing the error rate can exist, only that a speedup cannot be based on the approximation operation used.
The proof of this statement involves reducing the computation of the error rate in the general case, without any knowledge of the structure of the BDDs, to the case after a specific approximation operation has been applied. The operations presented in [7] have in common that they can eliminate entire branches of a BDD and commonly do so. This is one of the primary mechanisms by which the BDD size is reduced by the approximation. Note that this is not a special property of these operations: removing branches occurs in all reasonable approximation operations on BDDs.
The reduction from the general case to the special case based on the approximation operation is performed as follows. Let f be an arbitrary function and f̃ be the approximation of f obtained by applying the approximation operation op, i.e., f̃ = op(f). For these two functions, er(f, f̃) is to be determined. We now construct a new function g containing f ⊕ f̃ at a specially chosen position within the BDD. Computing er(f, f̃) can then be carried out by computing er(g, ĝ).
The function g is constructed in such a way that applying the approximation operation op to g, i.e., constructing ĝ = op(g), removes all parts of f ⊕ f̃ by replacing them with a terminal node. The construction of g depends on the chosen approximation operation op.

1) ROUND-UP, ROUND, CO-FACTOR APPROXIMATION
First, we introduce a new BDD variable v as the topmost variable. This means that v is in the root node of the corresponding BDDs. Construct g = (g_1, . . . , g_m) by defining the component functions as

g_i = v ∨ (f_i ⊕ f̃_i).    (7)

The function f_i ⊕ f̃_i now fully resides in the low-child branch of g_i, i.e., the branch for which v = 0 holds. The constructed BDD is shown in Figure 4. Applying the round-up, round, or positive co-factor operation at the variable v will always remove this branch and set it to 1, since it is guaranteed to have a smaller (or equal) ON-set than the high-child branch (i.e., v = 1). This completely reduces ĝ to 1 (see also Section V). To achieve the same for the negative co-factor operation, one simply uses ¬v in (7) instead of v.
Computing the error rate for g and ĝ is then given by

er(g, ĝ) = (1 − er(f, f̃)) / 2,

allowing to compute the error rate of f and f̃ as er(f, f̃) = 1 − 2 · er(g, ĝ).
As the domain of g is twice as large as the one of f (the newly introduced variable v doubles the number of possible inputs) the values need to be scaled accordingly.

2) ROUND-DOWN
The construction of g for the round-down operation is again as in Eqn. (7). In this case, the rounded function does not reduce to 1 but becomes the identity in v instead, i.e., ĝ_i = v. Computing er(g, ĝ) as above yields er(g, ĝ) = (1/2) · er(f, f̃), allowing to compute the error rate of f and f̃ as er(f, f̃) = 2 · er(g, ĝ). The factor 2, again, comes from the introduction of the additional variable v.

Now assume that an error rate computation er_special that is highly specialized for the applied approximation operation (e.g., round-up) exists. This computation involving knowledge about the approximation operation is also capable of computing the general error rate between two arbitrary functions via er(f, f̃) = 1 − 2 · er_special(g, ĝ) or via er(f, f̃) = 2 · er_special(g, ĝ), depending on the applied approximation. Therefore, it cannot be significantly faster than the general version. The only possible speedup may result from not explicitly needing to compute the exclusive or, but this already is a very fast operation.
For the sake of completeness, we mention that the above observation also holds for the computation of the worst-case bit-flip error (see [6]). As this metric is rarely used and the construction and proofs are very similar, we omit the details here.

C. IMPROVED SAMPLING OF THE ERROR RATE
An easy way to obtain an approximation to the error rate is sampling. The functions f and f̃ are evaluated for p uniform random inputs. The number w of inputs for which f(x) ≠ f̃(x) holds is counted. The approximated error rate is êr(f, f̃) = w/p. This value, however, is only an estimate that will converge to the error rate with an increasing number of samples p (with the number of necessary samples being large; see, for example, [24]). In this section we present an improved sampling method that works especially well when the actual error rate is low.

1) PROPOSED SAMPLING METHOD
While in regular sampling all samples are drawn from a uniform distribution over all 2^n possible inputs, we propose a sampling method that only samples inputs x for which f_i(x) ≠ f̃_i(x) holds. This approach is based on the observation that the error rate can be computed as

er(f, f̃) = (1/2^n) · Σ_{i=1}^{m} |{x ∈ B^n : f_i(x) ≠ f̃_i(x) and f_j(x) = f̃_j(x) for all j < i}|.

Algorithm 1 depicts the proposed sampling method. For every output bit, p/m samples with f_i ≠ f̃_i are drawn (lines 3-7). If the current output bit is the first of the bits looked at so far in which f and f̃ differ, the error is recorded (lines 8-9). Then, the approximated error rate is increased by the size of the ON-set of the current difference function h = f_i ⊕ f̃_i times the number of hits (line 10) and is finally scaled to reflect the whole input space B^n as well as the partitioning into separate output bits (line 11). In order to ensure that the samples in line 6 of the algorithm are uniformly chosen, the extension of Algorithm C in [25, pp. 76-77] is used.
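The following sketch illustrates one possible implementation of the proposed sampling on top of CUDD (our illustration of the idea, not the exact Algorithm 1; the helper draw_uniform_onset_sample is a hypothetical placeholder that must return an assignment drawn uniformly from ON(h), e.g., via the extension of Knuth's Algorithm C mentioned above):

#include <cmath>
#include <vector>
#include "cudd.h"

// Hypothetical helper: returns an input assignment (one 0/1 entry per
// variable) drawn uniformly at random from ON(h).
std::vector<int> draw_uniform_onset_sample(DdManager *mgr, DdNode *h, int n);

double sample_error_rate(DdManager *mgr,
                         const std::vector<DdNode*> &f,
                         const std::vector<DdNode*> &ft,
                         int n, int p) {
    const int m = (int)f.size();
    double er_hat = 0.0;
    for (int i = 0; i < m; ++i) {
        DdNode *h = Cudd_bddXor(mgr, f[i], ft[i]);         // h = f_i XOR f~_i
        Cudd_Ref(h);
        const double on_h = Cudd_CountMinterm(mgr, h, n);  // |ON(h)|
        if (on_h > 0.0) {
            int hits = 0;
            for (int s = 0; s < p / m; ++s) {
                std::vector<int> x = draw_uniform_onset_sample(mgr, h, n);
                bool counted_before = false;               // already counted for some j < i?
                for (int j = 0; j < i; ++j)
                    if (Cudd_Eval(mgr, f[j], x.data()) != Cudd_Eval(mgr, ft[j], x.data()))
                        counted_before = true;
                if (!counted_before) ++hits;               // record the error (lines 8-9)
            }
            // estimate of |{x : f_i != f~_i and f_j = f~_j for all j < i}|
            er_hat += on_h * (double)hits / (double)(p / m);
        }
        Cudd_RecursiveDeref(mgr, h);
    }
    return std::ldexp(er_hat, -n);                         // scale by 1/2^n
}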

2) PERFORMANCE OF THE PROPOSED SAMPLING
We first analyze the variance of the naïve sampling and of the proposed improved sampling method. Let Ŝ_n and Ŝ_i be the results of the naïve and improved sampling, respectively, scaled by the number of drawn inputs p, i.e., Ŝ = S/p.
The output of the naïve sampling follows a Bernoulli distribution, as all samples are stochastically independent and only have two outcomes. Recall that the variance of evaluating a random variable with a Bernoulli distribution with probability r for k times is k · r · (1 − r) [26, p. 228]. Using the abbreviation e := er(f, f̃), the variance of the naïve sampling is then given by Var(Ŝ_n) = e · (1 − e)/p. Computing the variance for the improved method is more cumbersome and, hence, only an upper bound is used. With this knowledge of the variance of both methods, the number of samples necessary for some desired maximum deviation from the error rate can be calculated. The probability of the sampled error rate being outside an ε-neighborhood of the true error rate e can be estimated using Chebyshev's inequality (see, e.g., [26, p. 233]) as

Pr[|Ŝ − e| ≥ ε] ≤ Var(Ŝ)/ε².

Bounding this probability by r for a fixed ε allows solving for the necessary number of samples that ensures lying within the ε-neighborhood; for the naïve sampling this yields p_n ≥ e · (1 − e)/(r · ε²).

We are not the first to investigate the number of samples necessary to bound the estimate of the error rate. In [27], a lower bound on the number of samples N_f, similar to p_n, is derived. Both numbers, N_f and p_n, depend on the actual value of er(f, f̃), which is not known in advance. While in [27] this issue is addressed by alternatingly approximating the error rate and the number of necessary samples, the solution presented here is to use the improved sampling, yielding a lower bound on the number of samples p_i that is independent of the actual error rate.
It should be noted that both the variance bound and the Chebyshev estimate derived above are very pessimistic. As shown in Figure 5, in practice far fewer samples than estimated by these bounds are needed to reach a better estimation than with the naïve sampling approach.
Still, some general statements can be made by assuming that the general trend of the bounds is representative of the real behavior. Both numbers, p_n as well as N_f, increase with a decreasing error rate. Thus, the improved sampling method is better suited for such situations. Due to various other factors (e.g., the number of output functions m or the distribution of minterms between the functions), no general rule for deciding when to use which sampling method can be derived.
The results for the improved sampling methods are independent of the function representation used, i.e. do not depend on using BDDs. The operations to be performed by Algorithm 1 are well-supported and easy to compute by means of BDDs, making them a natural choice for the implementation.

D. APPROXIMATING THE AVERAGE-CASE RELATIVE ERROR
Computing the average-case relative error is very expensive; in [11], the authors had to report timeouts for some of their benchmarks. Consequently, we propose an approximation method for this error metric, computing an upper and a lower bound on acre. We further define the arithmetic mean of these bounds to be an approximation of acre, i.e., âcre(f, f̃) := (low + up)/2.
Our approach is shown in Algorithm 2. The idea is to iteratively assume that the i-th bit of f evaluates to 1 and the more significant bits evaluate to 0. The variable mask represents this condition (see its creation as f_i ∧ ⋀_{j=i+1}^{m} ¬f_j).
This mask is applied to the computed absolute difference to create an approximated difference d̃, and its average value is computed. We then compute the upper bound for the error of bit i by further assuming that all less significant bits are 0 and dividing by 2^{i−1}. The lower bound is computed analogously. Note that for the computation of |f − f̃| the simplifications discussed in Section VI-A2 are applied if possible, and that for the computation of the average of the approximated difference d̃ the idea from Section VI-A1 is re-used.
As acre lies within the interval [low, up] and âcre is its midpoint, the difference between the correct value of acre and its approximation âcre is bounded by (up − low)/2. The relative width of the interval produced by Algorithm 2 is at most 2, i.e., up/low ≤ 2.

E. EXACT BDD BASED WORST-CASE RELATIVE ERROR COMPUTATION
While the BDD-based algorithm for computing the worst-case relative error presented in [11] computes the correct error value, its long computation time prohibits its application to most functions. We developed an alternative algorithm for computing this error metric efficiently and exactly.
An overview of this method is depicted in Algorithm 3. As a pre-processing step, the maximum of f and 1 is computed to avoid divisions by zero by setting f* ← max(1, f). This can be done by setting f*_1 ← f_1 ∨ ⋀_{i=1}^{m} ¬f_i while leaving all other sub-functions untouched.
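As a small illustration (our own sketch; LSB-first bit order assumed), this pre-processing only touches the least significant output bit:

#include <vector>
#include "cudd.h"

// Sketch: f* = max(1, f).  The least significant bit is forced to 1 whenever
// all output bits of f are 0, so that val(f*(x)) is never zero; all other
// bits are left untouched.
std::vector<DdNode*> build_fstar(DdManager *mgr, const std::vector<DdNode*> &f) {
    DdNode *all_zero = Cudd_ReadOne(mgr);
    Cudd_Ref(all_zero);
    for (DdNode *fi : f) {                                   // all_zero = AND_i NOT f_i
        DdNode *t = Cudd_bddAnd(mgr, all_zero, Cudd_Not(fi));
        Cudd_Ref(t);
        Cudd_RecursiveDeref(mgr, all_zero);
        all_zero = t;
    }
    std::vector<DdNode*> fstar = f;
    fstar[0] = Cudd_bddOr(mgr, f[0], all_zero);              // f*_1 = f_1 OR "val(f) = 0"
    Cudd_Ref(fstar[0]);
    Cudd_RecursiveDeref(mgr, all_zero);
    return fstar;
}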
The largest wcr value found so far is stored as a fraction with separate variables for the numerator and the denominator. These variables are then updated iteratively. In each iteration, the set of all inputs resulting in a higher wcr value than the current one is computed: its characteristic function g is obtained by symbolically comparing the bit-vector functions h = denominator · d and l = numerator · f*, i.e., by building their lexicographic order. From these inputs, a single input is chosen uniformly at random and used to update the current wcr value by evaluating d and f* on it. This process is iterated until no increase in the wcr value is possible anymore. Even though the choice of the satisfying assignment is random, the algorithm is exact, as the process only terminates when no further improvement is possible.
Example 7: We illustrate the algorithm by computing the wcr of the functions f and f̃ from Fig. 2. In this case, we have f* = f (the function d is shown in the table in Fig. 2(c)). In the first iteration, we have h = d and l = 0. This means that g represents the set of all inputs for which d is nonzero (i.e., {010, 100, 110}). We choose i = 100 as the satisfying input and update the numerator and denominator to become 2 and 3, respectively. In the next iteration, the function g represents the inputs for which 3 · d > 2 · f*, which is {010, 110}. Choosing i = 110 yields 1 for both numerator and denominator. In the next iteration step, g becomes 0 and the algorithm terminates with the correct value (see Section III-C) of wcr(f, f̃) = 1/1 = 1.
On average, it is expected that half of all satisfying inputs in g have a lower or equal worst-case relative error than the selected one, since it is chosen uniformly at random. We therefore expect to prune half of all input candidates in each iteration, and the average number of iterations is bounded by n. The computationally expensive parts of this algorithm are the two multiplications and the subsequent comparison. To reduce the necessary computations, three different optimizations can be applied. Multiplication with a power of two is equivalent to a bit-shift and does not require any computation. Therefore, the algorithm is prepended by computing the largest power of two for which there still exists a greater wcr value. This value is then used to initialize the fraction used during the iteration.
The second optimization aims to reduce the computation time of the multiplication operations. Without modification, the operation multiplies all values generated by the functions with the given constant. This is not necessary, since only the function values for the inputs that can still result in a higher wcr are needed. To achieve this, the restrictions d ∧ g and f* ∧ g (conjoining every output bit with g) are computed, with g taken from the previous iteration, and used instead of d and f* for the multiplications.
The last optimization is to reduce the expected number of iterations of the algorithm. Choosing a random satisfying input to a BDD is a very cheap operation. It is therefore reasonable to not use just one random sample, but many. With 1000 samples, for example, the expected number of iterations of the outer loop drops to log_1000(2^n) = n · log(2)/log(1000) ≈ n/10. This can drastically improve the performance if a reasonable number of samples is chosen.

Algorithm 3 Randomized wcr Computation. Input: functions f, f̃; output: wcr(f, f̃) as a fraction.
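To illustrate the iteration principle of Algorithm 3, the following runnable toy sketch (our own illustration) enumerates the candidate pairs (val(d(x)), val(f*(x))) explicitly, whereas the real algorithm keeps the candidate set symbolic as a BDD and uses many samples per round:

#include <cstdio>
#include <cstdlib>
#include <utility>
#include <vector>

int main() {
    // toy data: one (val(d(x)), val(f*(x))) pair per input x of a small function
    std::vector<std::pair<long long, long long>> vals =
        { {0, 1}, {2, 3}, {1, 1}, {0, 5}, {3, 4} };
    long long num = 0, den = 1;                       // current wcr candidate as a fraction
    int rounds = 0;
    for (;;) {
        // candidate set g: inputs whose relative error exceeds num/den
        std::vector<std::pair<long long, long long>> g;
        for (const auto &v : vals)
            if (den * v.first > num * v.second) g.push_back(v);
        if (g.empty()) break;                         // no improvement possible: exact wcr found
        const auto pick = g[std::rand() % g.size()];  // jump to a random better candidate
        num = pick.first;
        den = pick.second;
        ++rounds;
    }
    std::printf("wcr = %lld/%lld after %d rounds\n", num, den, rounds);
    return 0;
}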

VII. GREEDY BUCKET-BASED BDD MINIMIZATION
For BDD minimization, in general as well as in approximate settings, using genetic algorithms is a common approach, see, e.g., [10], [19], [29]. The optimization problem inherently has a multi-dimensional target space, minimizing both the BDD size and the error introduced through the approximation. While this makes genetic algorithms an appropriate choice, properly designed greedy algorithms can compete as well. By quantizing the space of possible error metric values, a single smallest function approximation can be stored for each possible error metric value tuple. For simplicity, the following description will only apply to exactly one error metric. The method can easily be extended to multiple error metrics.
Algorithm 4 depicts the proposed method. Given a maximum acceptable error metric value m_e, the algorithm splits the range of error values into n buckets of equal width, each holding one approximated function with an error in that range. Initially, the first bucket is initialized with the original function f; all other buckets remain empty.
The algorithm works by iteratively applying approximation operations from a set of allowed approximation operations A, e.g. rounding-down on all levels, to the function f . We support all approximation operations defined in [7].
For each of these operations, the error metric is calculated and the matching bucket with index i for the error rate is determined. This computation is an implicit check whether the threshold m e is reached: if i is larger than the number of allocated buckets, the error is too large.
If the error is in the accepted range and the newly created function is smaller than the one currently stored in the bucket for this error range, the function in the bucket is replaced. If the bucket was empty, it is unconditionally filled with the newly created function. Then, the bucket storing this function is inserted into a queue so that it is re-visited later to apply further approximations.
This process is repeated until all operations have been applied. The algorithm terminates as soon as the queue of buckets to visit is empty.
The proposed method implicitly applies all ideas presented in Section VI to reduce the run time. We omit the details for didactic reasons.
After the algorithm has finished, empty buckets are discarded. The remaining buckets then form the computed Pareto front.
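A compact sketch of the procedure (our paraphrase of the description above; F stands for any function representation, e.g., a BDD wrapper, and the operation set, error metric, and size measure are passed in as callables):

#include <functional>
#include <optional>
#include <queue>
#include <vector>

template <typename F>
struct Candidate { F function; double error; int size; };

// Greedy bucket-based minimization: one smallest candidate per error bucket.
template <typename F>
std::vector<std::optional<Candidate<F>>>
greedy_buckets(const F &f,
               const std::vector<std::function<F(const F&)>> &ops,   // allowed approximations A
               double max_error, int n_buckets,
               const std::function<double(const F&, const F&)> &error_of,
               const std::function<int(const F&)> &size_of) {
    const double width = max_error / n_buckets;
    std::vector<std::optional<Candidate<F>>> bucket(n_buckets);
    bucket[0] = Candidate<F>{f, 0.0, size_of(f)};
    std::queue<int> todo;
    todo.push(0);
    while (!todo.empty()) {
        const int b = todo.front(); todo.pop();
        if (!bucket[b]) continue;
        const F g = bucket[b]->function;
        for (const auto &op : ops) {                  // apply every allowed operation
            F h = op(g);
            const double e = error_of(f, h);          // error w.r.t. the original function
            const int i = (e <= 0.0) ? 0 : (int)(e / width);
            if (i >= n_buckets) continue;             // error exceeds the threshold m_e
            const int s = size_of(h);
            if (!bucket[i] || s < bucket[i]->size) {  // keep only the smallest function per bucket
                bucket[i] = Candidate<F>{h, e, s};
                todo.push(i);                         // re-visit this bucket later
            }
        }
    }
    return bucket;                                    // non-empty buckets form the Pareto front
}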
Example 8: We illustrate the method using the data in Table 3 with the error rate as the used metric. The upper bound for the error rate is 5%. The range from 0% to 5% error rate is divided into the five buckets [k%, (k+1)%) for k = 0, . . . , 4. The upper bound of these ranges is exclusive, i.e., an approximated function f̃ that has an error rate of 2% will be stored in bucket number 3, i.e., [2%, 3%). Next to the function, the corresponding BDD size is shown. The first selection step (1(s)) selects the function stored in bucket 1 (highlighted in bold). To this function, all operations from the set A are applied (lines 3-13 in Algorithm 4), yielding two new functions f̃_1 and f̃_2 that are within the error threshold and smaller in size than f (see step 1(a)). They are placed in buckets 1 and 2, respectively.
The decisions in lines 11 and 13 of Algorithm 4 pick bucket 1 for the next step (step 2(s)). After applying all operations from A, the new functions f̃_3 and f̃_4 are added to the corresponding buckets (step 2(a)). The algorithm then continues with bucket 2 (step 3(s)), even though the function stored in it was not added in the previous step. This process continues until there are no buckets left containing functions to which the operations have not yet been applied.
In this example, this situation is reached after ten iterations. The Pareto front for the objectives BDD size and error rate then consists exactly of the functions in the buckets in the last line of Table 3.
One advantage of this optimization procedure is that an upper bound on the necessary computation time can be given. During the algorithm's execution, the function size in each bucket decreases monotonically. As the maximum function size in each bucket is the initial function size, each bucket can only be updated at most as many times as there are BDD nodes in the initial function. This leads to a maximum number of bucket updates of n · #f. In each update, all approximation operators are applied once and the error metrics are computed. With the complexity of the error metric computation c_e, the resulting upper bound on the optimization complexity is O(n · #f · |A| · c_e). Note that this bound will not be reached in most cases. The complexity can be further improved by only updating a bucket if the function size decreases by a minimum amount s.
The presented approach can trivially be extended to compute a Pareto front containing multiple error metrics. Instead of a list of buckets, a multi-dimensional grid of buckets can be used. The computation of the next index i is then performed individually for each error metric.

VIII. EXPERIMENTAL RESULTS
We implemented all methods and algorithms presented in this paper in C++ using the CUDD framework [30] for manipulating BDDs. To facilitate reproduction and further work, we publish our framework on GitHub [22]. The experiments were performed on an Intel Core i5-4200U CPU @ 1.60 GHz running Linux (Ubuntu 18.04). Proper measures were taken to avoid contaminating timing measurements with the caches employed by CUDD. To provide accurate millisecond timings, the Google Benchmark library was used.
As examples of real-world functions, the ISCAS'85 dataset [28] and approximate adders from the KIT open-source Approximate Adder Library [31] were used.

A. AVERAGE-CASE ERROR COMPUTATION
We evaluated the computation of the average-case error when using the new improved ON-set based algorithm. The characteristic function based computation from [7] is used as a baseline. Table 4 shows the computation time of these two algorithms when applied to approximated circuits from the ISCAS'85 dataset. It can be seen that the proposed method is faster than the previous one based on characteristic functions.
While for smaller functions the speed improvement is around a factor of ten, it increases with increasing number of nodes in a function. This result is expected as the proposed average value computation examines every node in the BDD exactly once and thus runs in linear time. In comparison, the characteristic function based implementation may construct an exponential number of new nodes.

B. APPROXIMATING THE AVERAGE-CASE RELATIVE ERROR
To justify approximating the average-case relative error, we compare the computation time and accuracy of the approximate versions with exact computation methods. For computing the average-case relative error exactly, so far only the ADD based computation introduced in [11] is known. Table 5 shows the results of our approximate computation when compared to the reference implementation [11]. Both approaches were run on a selection of approximate adders taken from the open-source KIT library [31]. The table presents the circuit name, the exact error and our approximated error, the relative error, the runtimes and speedup achieved by our approximation method.
The speedup ranges from around five to ten for small 8-bit adders to roughly 23,000 for 16-bit adders. At the same time, the relative error never exceeds 0.05, which is within an acceptable range. For larger circuits, the relative error cannot be calculated as the reference implementation times out after 60 minutes and, therefore, no values to compare against could be obtained.

TABLE 5. Comparing the values for the average-case relative error and the runtime of the approach from [11] to our proposed average-case relative error approximation. The values for acre, âcre, and the runtimes are rounded for readability.

C. SPEEDING UP THE WORST-CASE RELATIVE ERROR
To show the effectiveness of Algorithm 3, we ran it on a set of approximations of ISCAS'85 functions [28]. As a baseline comparison, we used the ADD-based computation proposed in [11] and a BDD-based symbolic division. Algorithm 3 was used with all of the proposed optimizations and with 10 random samples in each iteration.
The results can be seen in Table 6. The ADD-based method and the symbolic division are clearly inferior, since they can only evaluate the metric for the small function c432. For the larger functions, constructing the ADD or computing the symbolic division had to be stopped due to memory limitations on the testing system. The computation time of our proposed algorithm is largely dominated by the absolute difference computation (column |f − f̃| for reference). This step is a basic necessity and cannot be elided.
The results clearly show that the proposed wcr algorithm is fast enough to be applied during BDD based function minimization. Its computation duration is similar to other, easier to compute error metrics like the average-case error.

D. GREEDY BUCKET-BASED BDD MINIMIZATION
1) SINGLE-ERROR-METRIC OPTIMIZATION
To demonstrate the effectiveness of the proposed greedy BDD minimization procedure, we ran it on different functions from the ISCAS'85 dataset. As a baseline comparison, the genetic optimization framework pagmo2 [32] was used, running NSGA-II [33] to jointly minimize BDD size and error metrics. Both algorithms were given the same set of approximation operations to select from and were allowed the same amount of computation time. Table 7 shows the results of running the NSGA-II-based algorithm and our proposed approach. As the benchmarks are not arithmetic functions, the error rate was chosen as the error metric and the maximum accepted value for the error rate was set to 5%. A total of 15 buckets was used. We report the smallest function found below the threshold of 5% error rate. All operations introduced in this paper were used as the set A of possible approximation operations. They were applied to every variable independently for every sub-function f_i. While using only the round-down or round-up operations would speed up the error metric computation as discussed in Section VI-A2, the minimization results would be of very poor quality; this would be a very bad runtime-quality trade-off. The automatic application of the optimization still saves computation time in the initial stages of the algorithm, when only few operations have been applied that all round to the same terminal.
It can be seen that the proposed method performs well, with respect to the runtime as well as to the objectives error rate and BDD size. Except for the c3540 function, the proposed method is faster than the NSGA-II approach. The speedup is within the range of 1 to 10 with comparable error rates. As can be seen, the BDD sizes produced by the proposed method never exceed the NSGA-II results, while improvements down to only 70% of the NSGA-II size are achieved.

2) MULTI-ERROR-METRIC OPTIMIZATION
To further demonstrate the applicability of the proposed BDD minimization framework, we present results for optimizing the size as well as multiple error metrics in two different settings.
In the first setting, the size of the approximated function as well as the two error metrics error rate and average-case error are optimized at once. In this experiment, we aim to compare ourselves against the adders from [31]. For each error metric, the number of buckets is 10 and the threshold for the error rate is 5% in all experiments. The threshold on the average-case error is set to 7, 35, and 9000 for 8, 16, and 32 bit, respectively. Table 8 depicts the results. In comparison to the NSGA-II-based approach, the proposed method still is ≈ 5.5 times faster on average.

TABLE 6. Comparing the proposed exact worst-case relative error metric computation algorithm with previous work. The selection of the functions and operators is chosen as in [7]. The functions are from the ISCAS'85 benchmark set. Columns with n/a indicate that the computation could not finish due to memory limitations.

TABLE 7. Comparing the proposed greedy, bucket-based BDD minimization method to a default implementation of an evolutionary algorithm using NSGA-II [32] optimizing for error rate and BDD size.

Figure 6 shows the 100 buckets used during the optimization for the 8-bit adder. Note that not all of the buckets were filled with approximated adders. This means that the approximation operations used were not capable of producing smaller adders that stayed within the defined error bounds. The figure shows the expected behavior: larger values in one or both error metrics allow for a greater reduction in function size.
The resulting approximated adders are comparable to the results from [31]. For the case of 8-bit adders, in fact, the approximations produced by our approach are exactly the same approximations as produced by [31] with the configuration R = 1 and P = 4.
As our proposed method is capable of reproducing adders previously presented in the literature, we explicitly compare the adders of Table 8 against related work, namely GeAr [31], EvoApproxLib [8], and the FAU Adder [14].
In [34], an approach was proposed to make fair comparisons between arithmetic circuits with different bit widths. The idea is to scale the average-case error by the number of output bits m. We adopt this in the comparisons presented in Table 9.
For the 8 and 16-bit case, we compare against three other designs from the literature. For the 32-bit case, we restrict our comparison to a single design from [31] and [14], as EvoApproxLib does not contain any 32-bit circuits. The er and ace values for GeAr have been determined by implementing these adders in the proposed framework and then computing the error metrics. For the FAU Adder, the formulas 1/2 · (2^{−p} − 2^{−m}) and 1/3 · (2^{m−2p−1} − 2^{−m−1}) have been used, respectively.

FIGURE 6. Buckets used during the optimization of the 8-bit adder; the adder reported in Table 8 is highlighted by a white star. Note that the displayed data is the Pareto front of the adder for the criteria BDD size, error rate, and average-case error.
The proposed method is capable of finding approximated adders of similar quality as those of [31] and [14]. This does, of course, depend on the choice of the parameters R, P, m and p. The proposed method does not require any prior knowledge about how an approximation is to be carried out, i.e. there are no parameters to tune. The designer only has to specify the acceptable upper bound on the error metric(s) and the granularity of the buckets.
The circuits taken from EvoApproxLib have a rather large error rate. This is, most likely, due to the fact that the library has been created by performing a design space exploration with respect to area and power, taking into account four different error metrics at the same time. For example, the circuit add8u_5SY only requires ≈ 40% of the power and area of an exact adder. The designs from [31] and [14] were manually developed with concrete hardware in mind. The methodology proposed in this manuscript is not tailored to any specific technology, as it operates on BDDs. Consequently, Table 9 does not present data for energy consumption or area. One can see that the proposed algorithm is capable of producing approximations similar to hand-crafted designs with respect to the error rate and the average-case error.
In the second setting, we compare our greedy bucket-based approach against the genetic algorithm presented in [19]. The optimization objectives are the size of the function, the error rate, and the worst-case error. For each error metric, 5 buckets are used. The error thresholds are 3% for the error rate and 5% for the worst-case error. While [19] used 10% for both, their approach did not always reach this limit; the average error metric values over all circuits are roughly comparable. Before starting the optimization procedure, the function's BDD size is reduced by applying sifting to the input function f.
The results are shown in Table 10. The authors of [19] used parameters for sifting that we were not able to reproduce, even though we also used CUDD [30]. Therefore, we list not only the resulting size of f̃ but also the initial size of the sifted f to allow for a fair(er) comparison. Each row of the table shows the size of the resulting approximation as well as the error rate and the worst-case error. The best values for #sift(f) and #f̃ in each row are highlighted in bold. None of the optimization runs took more than one second of computation time. In fact, most computations finished in less than 100 ms. Unfortunately, the related work does not provide any runtimes, prohibiting a comparison for that measure. For future reference, allowing others to compare themselves to our work, our computation times are included in the column Time.
It can be seen that in most cases, our approach performs better. Unfortunately though, this result is inconclusive: in all but three cases, the proposed approach only yields better results than the previous work when the sifted function, i.e. the input, is already smaller. Interestingly, in many cases, the worst-case error is also smaller when the proposed approach outperforms the previous approach. Having runtimes to compare against would allow for a more detailed assessment of the proposed approach.

IX. CONCLUSION
In this paper we extensively dealt with error metrics, the most crucial part of approximate computing, and their concrete computation. We extended the work of [6] by explicitly determining the complexity class of three more error metrics used in the domain of approximate computing.
We analyzed the approximation operators defined in [7] and found interesting properties for the rounding operations. The main result here is that consecutive applications of these operations can greatly reduce the time necessary to compute the average-case error.
In general, multiple means for the acceleration of error metric computations have been proposed. These accelerations, on the one hand, work by computing only the absolutely necessary parts of the metrics and, on the other hand, exploit the structure of the approximation operations themselves. We could also prove that for the error rate and the worst-case bit-flip error, no specialized computation taking fewer steps exists.
Furthermore, a greedy, bucket-based BDD minimization approach exploiting these accelerations has been presented. The experiments clearly show that significant speedups are achieved while, at the same time, the Pareto fronts for multiple error metrics are close to the results of state-of-the-art approximate logic synthesis approaches.
To allow for further research by interested scientists, we publish our framework as open-source software on GitHub [22].
The next goal we are pursuing is to analyze how well our results carry over to other graph representations of Boolean functions. We will investigate whether the work of [11] can benefit from our results (or vice versa), and we will look at other decision diagrams such as word-level descriptions using K*BMDs [35]. Another direction in which we would like to extend our work is the efficient computation of various error metrics for networks of Boolean functions, similar to [16].

ACKNOWLEDGMENT
The author would like to thank the authors of [11] for providing a binary copy of their tool to run the comparisons.