Optimal Metastability-Containing Sorting via Parallel Prefix Computation

Friedrichs et al. (TC 2018) showed that metastability can be contained when sorting inputs arising from time-to-digital converters, i.e., measurement values can be correctly sorted without resolving metastability using synchronizers first. However, this work left open whether this can be done by small circuits. We show that this is indeed possible, by providing a circuit that sorts Gray code inputs (possibly containing a metastable bit) and has asymptotically optimal depth and size. Our solution utilizes the parallel prefix computation (PPC) framework (JACM 1980). We improve this construction by bounding its fan-out by an arbitrary $f \geq 3$, without affecting depth and increasing circuit size by a small constant factor only. Thus, we obtain the first PPC circuits with asymptotically optimal size, constant fan-out, and optimal depth. To show that applying the PPC framework to the sorting task is feasible, we prove that the latter can, despite potential metastability, be decomposed such that the core operation is associative. We obtain asymptotically optimal metastability-containing sorting networks. We complement these results with simulations, independently verifying the correctness as well as small size and delay of our circuits.


INTRODUCTION
Metastability is a fundamental obstacle when crossing clock domains, potentially resulting in soft errors with critical consequences [14]. As it has been shown that metastability cannot be avoided deterministically [25], synchronizers [19] are employed to reduce the error probability to tolerable levels. This approach trades precious time for reliability: the more time is allocated for metastability resolution, the smaller the probability of metastability-induced faults.
Recently, a different approach has been proposed, coined metastability-containing (MC) circuits [10]. It accepts a limited amount of metastability in the input to a digital circuit and ensures limited metastability of its output, so that the result is still useful. In a series of works [24], [3], [4], we applied this approach to a fundamental primitive: sorting. The circuit given in [4] is asymptotically optimal in depth and size.
Our Contribution: In this article, we present the machinery used to obtain the circuit from [4] in detail. We prove that CMOS implementations of basic gates realize Kleene logic (cf. [20, §64]), justifying the computational model introduced in [10] and used in this article.
The task of sorting an arbitrary number of inputs can be reduced to sorting two inputs by using sorting networks [21]. The 0-1-principle (cf. Section 2) shows that plugging an MC 2-sort(B) circuit (for B-bit inputs) into a sorting network (for n values) readily yields an MC circuit that is capable of sorting n inputs. Hence, we need to design a 2-sort(B) circuit sorting two inputs in an MC way.
As the choice of the encoding matters a lot for MC circuits, we characterize the set of input strings we want to sort ("valid strings").* A valid string is either a (standard) Gray code string or a string obtained from a Gray code string by replacing the unique bit that would change on the up-count to the "next" codeword by M for metastability (the third logic value in Kleene logic). When using nonredundant codes, the use of Gray codes is mandatory: when converting an analog value to a digital one, continuously changing the input can force any circuit (that uses the value in a non-trivial way) into metastability [25]. Moreover, for combinational circuits in the abstraction of Kleene logic, all output bits that change when flipping a given input bit must become unstable when the input bit is unstable, cf. [10]. For instance, encoding a value unknown to be 11 or 12 in standard binary code would result in a string that, once metastability has been resolved, may represent any number in the interval from 8 to 15, cf. Section 3. Valid strings arise naturally when stopping a Gray code counter asynchronously [12] or, more generally, whenever performing analog-to-digital conversion; respective circuits may risk multiple metastable bits to achieve better average-case precision, but for the best worst-case precision one can stick to guaranteeing valid strings as output. Exploiting the structure of Gray code and the restriction to valid strings, we show how to reliably sort all inputs despite the uncertainty about the represented value arising from metastability.
* This article generalizes and extends work presented at DATE 2018 [4].
We formally specify the 2-sort(B) circuit and then prove that the task of comparing two valid strings can be decomposed into first performing a four-valued comparison on each prefix pair of the two valid input strings, and then inferring the corresponding output bits. This reduces the design of 2-sort(B) to a parallel prefix computation (PPC) problem, which for our purposes can be phrased as follows.
Definition 1.1 (PPC). Given an associative operator ⊕ : D × D → D and inputs d_1, …, d_B ∈ D, a PPC_⊕(B) circuit computes all prefixes π_i := d_1 ⊕ d_2 ⊕ … ⊕ d_i for i ∈ [1, B].
Fast PPC circuits that are simultaneously (asymptotically) optimal in depth and size are known due to a celebrated result by Ladner and Fischer [23]. Going beyond [4], we present the full range of solutions that can be derived using their framework, which allows for a trade-off between depth and size of the 2-sort circuit. Most prominently, optimizing for depth reduces the depth of the circuit by a factor of 2 compared to [4], to the optimal ⌈log B⌉, at the expense of increasing the size by a factor of up to 2.
However, relying on the construction from [23] as-is results in a very large fan-out. We present a modification reducing the fan-out to any number f ≥ 3 without affecting depth, increasing the size by a factor of only 1 + O(1/f) (plus at most 3B/2 buffers). In particular, our results imply that the depth of an MC sorting circuit can match the delay of a non-containing circuit, while maintaining constant fan-out and a constant-factor size overhead. As PPC circuits lie at the heart of fast adders [27], we consider this result of independent interest.
We complement our theoretical findings by simulations confirming the correctness and small size of the devised circuits. Post-layout area and delay of the designed circuits compare favorably with a baseline provided by a straightforward non-containing implementation.
Organization of this Article: We discuss related work in Section 2. Some preliminaries, the computational model and its justification, as well as the problem specification are given in Section 3. Next, in Section 4, we break the task of designing a 2-sort(B) circuit down into comparing prefixes and subsequently generating the output bits out of the computed comparison values and the respective pair of input bits. The comparison can be further decomposed into sequential application of an associative operator, which enables application of the PPC framework to compute all prefixes efficiently in parallel with (asymptotically) optimal depth. In order to keep this article self-contained, we compactly review the PPC framework in Section 5. The section then proceeds to show how to modify the construction for bounded fan-out and to bound the size of the resulting circuits. In Section 6, we implement the base operators by subcircuits and plug the pieces together to obtain complete circuits. We then simulate them up to an input width of B = 16 to independently verify their correctness, and provide delay and area of the laid-out circuits. We compare to a non-containing version as baseline, demonstrating the controlled increase in size of the circuit. We conclude the article in Section 7, where we also briefly discuss follow-up work that generalizes our results, demonstrating that higher-level concepts of this work, like sorting networks and parallel prefix computation, are applicable to further MC circuits.

RELATED WORK
Sorting Networks: Sorting networks (see, e.g., [21]) sort n inputs from a totally ordered universe by feeding them into n parallel wires that are connected by 2-sort elements, i.e., subcircuits sorting two inputs; these can act in parallel whenever they do not depend on each other's output. A correct sorting network sorts all possible inputs, i.e., the wires are labeled 1 to n such that the i-th wire outputs the i-th element of the sorted list of inputs. The size of a sorting network is its number of 2-sort elements, and its depth is the maximum number of 2-sort elements an input may pass through until reaching the output.
The 0-1 principle [21] states that a sorting network (assuming the 2-sort circuits are correct) is correct if and only if it sorts 0-1 inputs correctly. Thus, we obtain sorting networks for inputs that may suffer from metastability by constructing 2-sort circuits (w.r.t. a suitable order on such inputs) and plugging them into existing sorting networks.
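To illustrate how a generic 2-sort element composes into a sorting network, and how the 0-1 principle is applied in practice, consider the following Python sketch (ours, for exposition only; the 5-comparator network for n = 4 is a standard one):

    from itertools import product

    def sort4(xs, two_sort):
        # Standard 5-comparator sorting network for n = 4; two_sort returns
        # its two inputs as an ordered (min, max) pair.
        x = list(xs)
        for i, j in [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]:
            x[i], x[j] = two_sort(x[i], x[j])
        return x

    comparator = lambda a, b: (min(a, b), max(a, b))

    # By the 0-1 principle, verifying all 2^4 Boolean inputs proves
    # correctness for arbitrary totally ordered inputs.
    assert all(sort4(v, comparator) == sorted(v)
               for v in map(list, product([0, 1], repeat=4)))

Exchanging the comparator for an MC 2-sort element (w.r.t. the order on valid strings defined in Section 3) is exactly the construction used in this article.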
Sorting networks have been extensively studied. Tight lower bounds of depth Ω(log n) (trivial) and size Ω(n log n) (see, e.g., [8]) are known and can be simultaneously asymptotically matched [1]. More practically, for small values of n, optimal depth and/or size networks are known [6], [7], [21]. Accordingly, our task boils down to finding optimal (or close to optimal) metastability-containing 2-sort circuits. For B-bit inputs, our 2-sort circuits have depth and size O(log B) and O(B), respectively, which is (trivially) optimal up to constants; as size and depth of our circuits are close to non-containing 2-sort circuits (cf. Table 12), we conclude that our approach yields MC sorting networks that are optimal up to small constant factors in both depth and size.
Prior Work on MC Circuits: Recent work [10] shows that for any Boolean function a combinational MC circuit implementing its metastable closure (see Definition 3.8) exists. The metastable closure can be seen as a best effort to contain metastability: when for an input with (some) metastable bits the stable input bits already determine a given output bit of the original Boolean function, the closure attains the respective value on this output bit; otherwise it is metastable.
Unfortunately, the proof from [10], which uses a construction dating back to Huffman [16], yields circuits of exponential size in the number of input bits B. The same is true for speculative computing [28]. Unconditional lower bounds on MC circuits [17] show that this cannot be avoided in general, even if the implemented function admits a small non-containing circuit. The same work provides, assuming that at most k input bits can be metastable, a construction with multiplicative B^{O(k)} and additive O(k log B) overheads in size and depth, respectively. For the 2-sort element, k = 2 (each Gray code string may contain one metastable bit), but the resulting circuits are still far from optimal.
In [10], an alternative construction relying on non-combinational logic is given, achieving (up to lower-order terms) a factor 2k + 1 increase in size and an additive Θ(log k) increase in depth of the resulting circuit; for a 2-sort circuit, k = 2, so these overheads are constant. Rule-of-thumb calculations suggest that optimized versions of the circuits presented here and those derived by this method would have comparable performance. A fair and detailed comparison would require fully-fledged designs of both approaches, which is beyond the scope of this article. Note, however, that our design has the advantage of being purely combinational.
Parallel Prefix Computation: Ladner and Fischer [23] studied the parallel application of an associative operator to all prefixes of an input string of length ℓ (over an arbitrary alphabet). They give parallel prefix computation (PPC) circuits of depth O(log ℓ) and size O(ℓ) (where the circuit implementing the operator is assumed to have size and depth 1). However, when requiring optimal depth of ⌈log ℓ⌉, their corresponding solution suffers from fan-out larger than ℓ/2. An earlier construction by Kogge and Stone [22] simultaneously achieves optimal depth and fan-out of 2. This yields the fastest adder circuits to date (cf. [27]), but at the expense of a large size of ℓ(log ℓ − 1) + 1. A number of additional constructions have been developed for adders, including special cases ([2], [26]) of the one by Ladner and Fischer, cf. [31]. However, no other construction achieves asymptotically optimal depth and size.

MODEL AND PROBLEM
In this section, we discuss how to model metastability in a worst-case fashion and formally specify the input/output behavior of our circuits. Our model is a simplified version of the one from [10] for combinational circuits (cf. [9, Chap. 7]). This means to represent metastable "bits" by M and extend truth tables as in Kleene's 3-valued logic [20, §64]. We denote by B := {0, 1} the Boolean values and by B_M := {0, 1, M} their extension by the metastable value M. For g ∈ B_M^B and i ∈ [1, B], denote by g_i its i-th bit, i.e., g = g_1 g_2 … g_B. We use the shorthand g_{i,j} := g_i … g_j, where i, j ∈ [1, B] and i ≤ j. Let par(g) denote the parity of g ∈ B^B, i.e., par(g) := g_1 ⊕ g_2 ⊕ … ⊕ g_B. For a function f and a set A, we abbreviate f(A) := {f(y) | y ∈ A}.

Binary Reflected Gray Code
A standard binary representation of inputs is unsuitable: uncertainty of the input values may be arbitrarily amplified by the encoding. E.g., representing a value known only to be 11 or 12, which are encoded as 1011 resp. 1100, would result in the bit string 1MMM, i.e., a string that is metastable in every position in which the two strings differ. However, 1MMM may represent any number in the interval from 8 to 15, amplifying the initial uncertainty of being in the interval from 11 to 12. An encoding that does not lose precision for consecutive values is Gray code.
We use B-bit binary reflected Gray code, rg_B : [N] → B^B, which is defined recursively. For simplicity (and without loss of generality) we set N := 2^B. A 1-bit code is given by rg_1(0) = 0 and rg_1(1) = 1. For B > 1, we start with the first bit fixed to 0 and count with rg_{B−1}(·) (for the first 2^{B−1} codewords), then toggle the first bit to 1, and finally "count down" rg_{B−1}(·) while keeping the first bit fixed.
As each B-bit string is a codeword, the code is a bijection and the encoding function also defines the decoding function. Denote by ⟨·⟩ : B^B → [N] the decoding function of a Gray code string, i.e., ⟨rg_B(x)⟩ = x for all x ∈ [N].
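For concreteness, the following Python sketch (ours, for illustration) encodes and decodes binary reflected Gray code, using the standard closed form x XOR (x >> 1), which coincides with the recursive definition above:

    def rg(x, B):
        # Binary reflected Gray code: closed form x ^ (x >> 1).
        return format(x ^ (x >> 1), '0{}b'.format(B))

    def dec(g):
        # Decoding: the i-th decoded bit is the XOR of code bits 1..i.
        x = 0
        for c in g:
            x = (x << 1) | ((x & 1) ^ int(c))
        return x

    # The 3-bit code: 000, 001, 011, 010, 110, 111, 101, 100.
    assert [rg(x, 3) for x in range(8)] == ['000', '001', '011', '010',
                                            '110', '111', '101', '100']
    assert all(dec(rg(x, 4)) == x for x in range(16))

Note that consecutive codewords indeed differ in exactly one bit, which is the property exploited by valid strings below.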
For two binary reflected Gray code strings g, h ∈ B^B, we define their maximum and minimum as max_rg{g, h} := rg_B(max{⟨g⟩, ⟨h⟩}) and min_rg{g, h} := rg_B(min{⟨g⟩, ⟨h⟩}). For example, for B = 2 we have max_rg{01, 10} = 10 and min_rg{01, 10} = 01, as ⟨01⟩ = 1 < 3 = ⟨10⟩.

Valid Strings
The inputs to the sorting circuit may have some metastable bits, which means that the respective signals behave out-of-spec from the perspective of Boolean logic. Such inputs, referred to as valid strings, are introduced with the help of the following operators.
Definition 3.3 (Superposition). For a non-empty set S ⊆ B_M^B, the superposition *S ∈ B_M^B is given by (*S)_i := x_i if x_i = y_i for all x, y ∈ S, and (*S)_i := M otherwise. For two strings g, h, we write g * h := *{g, h}.
Definition 3.4 (Resolution). For x ∈ B_M^B, res(x) := {y ∈ B^B | ∀i ∈ [1, B] : x_i ≠ M ⇒ y_i = x_i} is the set of all stabilizations of x.
The set of valid strings is S^B_rg := {rg_B(x) | x ∈ [N]} ∪ {rg_B(x) * rg_B(x + 1) | x ∈ [N − 1]}. Valid strings have at most one metastable bit, as consecutive codewords differ in exactly one position. If this bit resolves to either 0 or 1, the resulting string encodes either x or x + 1 for some x, cf. Table 2.
Observation 3.5. For any x ∈ B_M^B, *res(x) = x.
Proof. Let x ∈ B_M^B and let I be the set of indices where x is stable, i.e., i ∈ I iff x_i ≠ M. From Definition 3.4, we get that ∀i ∈ [1, B] : i ∈ I ⇔ {x_i} = res(x_i). Hence, all strings in res(x) agree with x exactly on the positions in I, while for each i ∉ I both bit values occur, implying *res(x) = x.
Observation 3.6. For any non-empty S ⊆ B^B, S ⊆ res(*S).
Proof. Each s ∈ S agrees with *S on all positions where *S is stable. Since by Definition 3.4 each combination of replacing Ms by 0s and 1s occurs in res(*S), we conclude that each s ∈ S is contained in res(*S). This proves the claim S ⊆ res(*S).
We observe that in general the reverse direction does not hold, i.e., res(*S) ⊆ S may fail. For example, consider S = {01, 10} and thus *S = MM, such that res(*S) = {00, 01, 10, 11} = B². Hence, S ⊆ res(*S), but not res(*S) ⊆ S. In contrast, for |res(*S)| ≤ 2, the reverse direction does hold.
Lemma 3.7. If |res(*S)| ≤ 2 for non-empty S ⊆ B^B, then res(*S) = S.
Proof. Since *S can contain at most one M bit, we know that S can contain at most two strings, which differ in at most one position. It is then straightforward to verify that every string in res(*S) is an element of S. Together with Observation 3.6, this shows the equality.
The metastable closure of an operator on binary inputs extends it to inputs that may contain metastable bits. This is done by considering all resolutions of the inputs, applying the operator, and taking the superposition of the results.
Definition 3.8 (Metastable Closure). For f : B^B → B^C, its metastable closure f_M : B_M^B → B_M^C is given by f_M(x) := * f(res(x)).
The closure is the best one can achieve w.r.t. containing metastability with clocked logic using standard registers [10], i.e., when f_M(x)_i = M, no such implementation can guarantee that the i-th output bit stabilizes in a timely fashion.
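The following Python sketch (ours, for illustration) makes Definitions 3.3, 3.4, and 3.8 concrete; strings are over the alphabet {'0', '1', 'M'}:

    from itertools import product

    def res(x):
        # Resolution (Definition 3.4): replace each M by 0 and 1 in all ways.
        opts = [('0', '1') if c == 'M' else (c,) for c in x]
        return [''.join(t) for t in product(*opts)]

    def sup(strings):
        # Superposition (Definition 3.3): agreeing bits survive, others become M.
        return ''.join(cs[0] if len(set(cs)) == 1 else 'M' for cs in zip(*strings))

    def closure(f, *args):
        # Metastable closure (Definition 3.8): superpose f over all resolutions.
        return sup([f(*xs) for xs in product(*(res(a) for a in args))])

    AND = lambda a, b: str(int(a == '1' and b == '1'))
    assert closure(AND, 'M', '0') == '0'  # the stable 0 determines the output
    assert closure(AND, 'M', '1') == 'M'  # the metastable input propagates

This brute-force computation of the closure is exponential in the number of M bits; the point of this article is that, for sorting valid strings, small circuits implement the closure directly.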

Output Specification
We want to construct a circuit computing the maximum and minimum of two valid strings, enabling us to build sorting networks for valid strings. First, however, we need to answer the question of what it means to ask for the maximum or minimum of valid strings. To this end, suppose a valid string is rg_B(x) * rg_B(x + 1) for some x ∈ [N − 1], i.e., the string contains a metastable bit that makes it uncertain whether the represented value is x or x + 1. If we wait for metastability to resolve, the string will stabilize to either rg_B(x) or rg_B(x + 1). Accordingly, it makes sense to consider rg_B(x) * rg_B(x + 1) "in between" rg_B(x) and rg_B(x + 1), resulting in the following total order on valid strings (cf. Table 2).
Definition 3.9 (≺). For each x ∈ [N − 1], set
rg_B(x) ≺ rg_B(x) * rg_B(x + 1) ≺ rg_B(x + 1).
We extend the resulting relation on S^B_rg × S^B_rg to a total order by taking the transitive closure. Note that this also defines ⪯, via g ⪯ h :⇔ g ≺ h or g = h. We intend to sort with respect to this order. It turns out that implementing a 2-sort circuit w.r.t. this order amounts to implementing the metastable closure of max_rg and min_rg.
If g ≺ h, Definitions 3.3 and 3.9 imply for all g′ ∈ res(g) and all h′ ∈ res(h) that ⟨g′⟩ ≤ ⟨h′⟩ (cf. Table 2). Observation 3.5 shows that *res(g) = g for any g ∈ S^B_rg. From Definition 3.8, we can thus conclude that (max_rg)_M{g, h} = *res(h) = h and (min_rg)_M{g, h} = *res(g) = g.

Computational Model and CMOS Logic
We seek to use standard components and combinational logic only. We use the model of [10], which specifies the behavior of basic gates on metastable inputs via the metastable closure of their behavior on binary inputs, cf. Table 3. We use the standard notational convention that a + b = OR_M(a, b) and ab = AND_M(a, b). Note that in this logic, most familiar identities hold: AND and OR are associative, commutative, and distributive, and DeMorgan's laws hold. However, naturally, the law of the excluded middle becomes void: in general, OR(x, x̄) ≠ 1, as OR(M, M̄) = OR(M, M) = M.
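As a quick sanity check of the model (a sketch of ours, not the authors' code), the Kleene truth tables of Table 3 can be written down directly; note how the law of the excluded middle fails while DeMorgan's laws survive:

    def not_m(a):
        return {'0': '1', '1': '0', 'M': 'M'}[a]

    def and_m(a, b):
        # A stable 0 forces the output to 0 even if the other input is M.
        return '0' if '0' in (a, b) else ('M' if 'M' in (a, b) else '1')

    def or_m(a, b):
        # Dually, a stable 1 forces the output to 1.
        return '1' if '1' in (a, b) else ('M' if 'M' in (a, b) else '0')

    # OR(x, NOT(x)) = 1 fails for x = M:
    assert or_m('M', not_m('M')) == 'M'
    # DeMorgan's law holds over all 9 input combinations:
    assert all(not_m(and_m(a, b)) == or_m(not_m(a), not_m(b))
               for a in '01M' for b in '01M')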
We now argue that basic CMOS gates behave according to this logic, justifying the model. For the sake of an intuitive notation, we apply some slightly unusual conventions. In the following, let R_1 be a wildcard that can refer to any resistance that is "low", i.e., close to being negligible, as e.g. that of a transistor in its stable conducting state (i.e., any PMOS transistor subjected to a low gate voltage or any NMOS transistor subjected to a high gate voltage). Similarly, denote by R_0 any resistance that is "high", i.e., large compared to R_1, such as the resistance of a transistor in its stable non-conducting state. Thus, with a stable input b ∈ B (where we identify 0 with low and 1 with high voltage), an NMOS transistor attains resistance R_b, while a PMOS transistor attains resistance R_{b̄}. We can extend this to unstable inputs M by making the conservative assumption that R_M is an arbitrary (possibly time-dependent) resistance.
With this notation, we can see that parallel and serial composition of transistors implement OR and AND in Kleene logic, respectively.
Lemma 3.12. For k ∈ N sufficiently small so that kR_1 ≪ R_0, let a_1, …, a_k ∈ B_M be input signals fed to k NMOS transistors interconnected (i) in parallel or (ii) serially. Set σ := a_1 + … + a_k and π := a_1 ⋯ a_k, i.e., the OR resp. AND over all inputs. Then the resistance between input and output of the resulting subcircuit is (roughly) (i) R_σ resp. (ii) R_π.
Proof. Denote by R the resistance between the input and output of the subcircuit. Suppose first that σ = 0, i.e., a_i = 0 for all i. Then, for parallel composition, we get that 1/R = Σ_{i=1}^{k} 1/R_0, i.e., R = R_0/k, which is still large compared to R_1 by the assumption that kR_1 ≪ R_0. If σ = 1, some a_i = 1, and R is at most the low resistance R_1 of the corresponding branch. If σ = M, resolving the metastable inputs differently results in either of the above outcomes, so R behaves like an arbitrary resistance R_M. Now consider serial composition and suppose first that π = 1, i.e., a_i = 1 for all i. Then R = Σ_{i=1}^{k} R_1 = kR_1 ≪ R_0, i.e., R is low. If π = 0, some a_i = 0, and R ≥ R_0 is high; if π = M, again R behaves like R_M.
The same arguments apply to PMOS transistors.

Corollary 3.13. For k ∈ N sufficiently small so that kR_1 ≪ R_0, let a_1, …, a_k ∈ B_M be input signals fed to k PMOS transistors interconnected (i) in parallel or (ii) serially. Set σ := ā_1 + … + ā_k and π := ā_1 ⋯ ā_k, i.e., the OR resp. AND over all negated inputs. Then the resistance between input and output of the resulting subcircuit is (roughly) (i) R_σ resp. (ii) R_π.
We remark that the factor-of-k reduction in the gap between R_1 and R_0 may imply that a gate's output signal needs to be regenerated using a buffer. However, this is the same behavior as for logic that assumes stable signals only, so standard CMOS design techniques account for this.
From the above observations, we can readily infer that standard CMOS gate implementations behave according to Kleene logic in face of potentially metastable signals, justifying the model from [10].
Theorem 3.14. The CMOS gates depicted in Figure 1 implement the truth tables given in Table 3.
Proof. The output of the gates is 1 (high voltage) if the resistances from V_DD and V_SS to the output are low (i.e., roughly R_1) and high (R_0), respectively. Similarly, it is 0 if the roles are reversed. Thus, Lemma 3.12 and Corollary 3.13 show the claim for the stable entries of the truth tables. For the unstable ones, setting R_M (which is a wildcard for an arbitrary resistance) to R_0 or R_1, respectively, leads to different outcomes. Thus, the output voltage may attain almost any value between V_DD and V_SS, i.e., the output is M.
Similar reasoning applies to many gates, e.g., NAND and NOR gates. We stress, however, that the property of implementing the closure of the function computed by the gate on stable values is not universal for CMOS logic. For instance, standard transistor-level multiplexer implementations do not handle metastability well, cf. [11].

DECOMPOSITION OF THE TASK
In this section, we show that computing (max_rg)_M{g, h} and (min_rg)_M{g, h} for valid strings g, h ∈ S^B_rg can be broken down into composing simple operators.

Comparing Stable Gray Codes via an FSM
Figure 2 depicts a finite state machine performing a four-valued comparison of two Gray code strings. In each step of processing inputs g, h ∈ B^B, it is fed the pair of i-th input bits g_i h_i. In the following, we denote by s^(i)(g, h) the state of the machine after i steps, where s^(0)(g, h) := 00 is the starting state. For ease of notation, we will omit the arguments g and h of s^(i) whenever they are clear from context. Table 4 shows an example of a run of the finite state machine. Because the parity keeps track of whether the remaining bits are to be compared w.r.t. the standard or "reflected" order, the state machine performs the comparison correctly w.r.t. the meaning of the states indicated in Figure 2.
Lemma 4.1. For i ∈ [0, B], s^(i) = 00 iff g_{1,i} = h_{1,i} with par(g_{1,i}) = 0; s^(i) = 11 iff g_{1,i} = h_{1,i} with par(g_{1,i}) = 1; s^(i) = 01 iff the prefixes already determine that g ≺ h; and s^(i) = 10 iff they already determine that g ≻ h.
Proof. We show the claim by induction on i. It holds for i = 0, as s^(0) = 00, g_{1,0} = h_{1,0} is the empty string (of even parity), and g ≺ h if and only if g_{1,B} = g ≺ h = h_{1,B}. For the step from i − 1 ∈ [B] to i, we make a case distinction based on s^(i−1).
s^(i−1) = 00: By the induction hypothesis, g_{1,i−1} = h_{1,i−1} with even parity, i.e., the remaining bits are compared w.r.t. the standard order. If g_i = h_i, the prefixes remain equal and the parity is updated according to the transition function. If g_i h_i = 01, the definition implies that g ≺ h regardless of further bits, and if g_i h_i = 10, g ≻ h regardless of further bits.
s^(i−1) = 11: Analogously to the previous case, noting that reflecting a second time results in the original order.
s^(i−1) = 01: By the induction hypothesis, g ≺ h. As 01 is an absorbing state, also s^(i) = 01.
s^(i−1) = 10: By the induction hypothesis, g ≻ h. As 10 is an absorbing state, also s^(i) = 10.
This lemma gives rise to a sequential implementation of 2-sort(B) based on the given state machine, for input strings in B^B. Table 5 lists the i-th output bit as a function of s^(i−1) and the pair g_i h_i. Correctness of this computation follows immediately from Lemma 4.1.
We can express the transition function of the state machine as an operator ⋄ taking the current state and the input g_i h_i as arguments and returning the new state; it is easily verified to be associative. Then s^(i) = s^(i−1) ⋄ g_i h_i, where ⋄ is given in Table 6a and s^(0) = 00. The out operator is derived from Table 5 by evaluating max_rg{g, h}_i and min_rg{g, h}_i for all possible values of g_i h_i ∈ B². Noting that s^(0) ⋄ x = 00 ⋄ x = x for all x ∈ B², we arrive at the following corollary.
Corollary 4.2. For all g, h ∈ B^B and i ∈ [1, B], max_rg{g, h}_i min_rg{g, h}_i = out(g_1h_1 ⋄ g_2h_2 ⋄ … ⋄ g_{i−1}h_{i−1}, g_ih_i).
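To make the sequential decomposition concrete, here is a behavioral Python sketch (ours) of the FSM-based 2-sort for stable inputs; the closed forms of ⋄ and out are our reading of Figure 2 and Tables 5 and 6a:

    def step(s, x):
        # Transition operator ⋄ (Table 6a): 01 and 10 are absorbing,
        # state 00 passes the input pair through, state 11 complements it
        # (the comparison order is reflected at odd parity).
        if s in ('01', '10'):
            return s
        return x if s == '00' else x.translate(str.maketrans('01', '10'))

    def out(s, b):
        # Output operator (Table 5): i-th bits of (max, min).
        g, h = b[0], b[1]
        if s == '00': return max(g, h) + min(g, h)
        if s == '11': return min(g, h) + max(g, h)
        if s == '01': return h + g   # g < h decided: max = h, min = g
        return g + h                 # s == '10': max = g, min = h

    def two_sort(g, h):
        s, mx, mn = '00', '', ''
        for gi, hi in zip(g, h):
            o = out(s, gi + hi)
            mx, mn = mx + o[0], mn + o[1]
            s = step(s, gi + hi)
        return mx, mn

    # rg(6) = 101 and rg(4) = 110: the maximum is 101, the minimum 110.
    assert two_sort('101', '110') == ('101', '110')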
Our goal in this section is to extend this approach to potentially metastable inputs.

Dealing with Metastable Inputs
Our strategy is to replace all involved operators by their metastable closure: (i) compute the states using ⋄_M in place of ⋄, (ii) compute the output bits using the closure out_M of the operator from Table 5, and finally (iii) exploit associativity of the operator computing the states to evaluate all prefixes in parallel.
The reader may ask why we compute every intermediate state s^(i)_M rather than only the final state s^(B)_M with a simple tree of ⋄_M elements, which would yield a smaller circuit. Since s^(B)_M is the result of the comparison of the entire strings, it could be used to compute all outputs, i.e., we could compute the i-th output as out_M(s^(B)_M, g_ih_i). However, in case of metastability, this may lead to incorrect results. This can be seen in the example run of the FSM given in Table 7: there, using the final state yields out_M(1M, M0) = *{00, 01, 10} = MM as second output, but out_M(00, M0) = *{00, 10} = M0 is correct. We thus compute every intermediate state s^(i)_M.
Unfortunately, even with this modification it is not obvious that our approach yields correct outputs. There are three hurdles to overcome: (P1) show that ⋄_M is associative; (P2) show that repeated application of ⋄_M to the input pairs indeed yields the states s^(i)_M; and (P3) show that applying out_M to s^(i−1)_M and g_ih_i yields the correct output bits.
Theorem 4.3. The operator ⋄_M is associative.
While it is tractable to manually verify all 3^6 = 729 cases (exploiting various symmetries and other properties of the operator), it is tedious and prone to errors. Instead, we verified that both evaluation orders result in the same outcome by a short computer program. Apart from being essential for our construction, this theorem simplifies notation; in the following, we may write expressions like g_1h_1 ⋄_M g_2h_2 ⋄_M … ⋄_M g_ih_i without parentheses, as the order of evaluation does not affect the result.
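A short computer program of the kind mentioned above might look as follows (our Python sketch; ⋄ is written in the closed form we read off Table 6a, and the closure is computed by brute force over all resolutions):

    from itertools import product

    def res(x):
        opts = [('0', '1') if c == 'M' else (c,) for c in x]
        return [''.join(t) for t in product(*opts)]

    def sup(strings):
        return ''.join(cs[0] if len(set(cs)) == 1 else 'M' for cs in zip(*strings))

    def step(s, x):  # stable ⋄: 01/10 absorb, 00 is the identity, 11 complements
        if s in ('01', '10'):
            return s
        return x if s == '00' else x.translate(str.maketrans('01', '10'))

    def diamond_m(a, b):  # metastable closure of ⋄
        return sup([step(s, x) for s in res(a) for x in res(b)])

    trits = [a + b for a, b in product('01M', repeat=2)]
    # All 3^6 = 729 argument combinations:
    assert all(diamond_m(diamond_m(a, b), c) == diamond_m(a, diamond_m(b, c))
               for a, b, c in product(trits, repeat=3))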
We stress that in general the closure of an associative operator need not be associative. A counter-example is given by binary addition modulo 4 on 2-bit strings: denoting its closure by ⊞_M, we have (0M ⊞_M 01) ⊞_M 01 = MM ⊞_M 01 = MM, whereas 0M ⊞_M (01 ⊞_M 01) = 0M ⊞_M 10 = 1M.

Determining s^(i)_M
For convenience of the reader, Table 8 gives the truth table of ⋄_M. We need to show that repeated application of this operator to the input pairs g_jh_j, j ∈ [1, i], actually results in s^(i)_M (property (P2)). The key ingredient is the following structural property of the code.
Lemma 4.4. List the codewords rg_B(0), …, rg_B(N − 1) in order. For m ∈ [1, B], the m-th bit toggles from one codeword to the next exactly when the (B − m)-bit codeword obtained by removing the first m bits equals the last codeword of the (B − m)-bit code, 10^{B−m−1}.
Proof. List the codewords in order. By the recursive definition of the code, removing the first m − 1 bits of the code leaves us with 2^{m−1} repetitions of the (B − m + 1)-bit code, alternating between listing it in order and in reverse ("reflected") order. Also by the recursive definition, the m-th bit toggles only when the (B − m)-bit code resulting from removing it is at its last codeword, 10^{B−m−1}.
Lemma 4.6. Let g, h ∈ S^B_rg. If s^(i)_M(g, h) = MM for some i ∈ [1, B], then g = h.
Theorem 4.7. For valid strings g, h ∈ S^B_rg and i ∈ [1, B], g_1h_1 ⋄_M g_2h_2 ⋄_M … ⋄_M g_ih_i = s^(i)_M := *{s^(i)(g′, h′) | g′ ∈ res(g), h′ ∈ res(h)}.
Our reasoning is based on distinguishing two main cases: one is that |res(s^(i−1)_M)| ≤ 2, the other that s^(i−1)_M = MM (cf. Table 6a). For the inductive step, suppose that the claim has been established for i − 1 ∈ [1, B − 1] and consider index i. If |res(s^(i−1)_M)| ≤ 2, Observation 4.5 and the induction hypothesis yield that applying ⋄_M to s^(i−1)_M and g_ih_i results in s^(i)_M. It remains to consider the case that s^(i−1)_M = MM; then Lemma 4.6 shows that g = h, and the claim follows by checking the truth table of ⋄_M (Table 8).
Recall that out : B² × B² → B² is the operator given in Table 5, computing max_rg{g, h}_i min_rg{g, h}_i out of s^(i−1) and g_ih_i. For convenience of the reader, we provide the truth table of its closure out_M in Table 9. We conclude by deriving the third property (P3).
Theorem 4.8. For valid strings g, h ∈ S^B_rg and i ∈ [1, B], out_M(s^(i−1)_M, g_ih_i) = (max_rg)_M{g, h}_i (min_rg)_M{g, h}_i.
Proof. Assume first that |res(s^(i−1)_M)| ≤ 2. Then s^(i−1)_M is the superposition of the (at most two) states reached on the resolutions of the inputs, and out_M yields the superposition of the corresponding stable outputs, which is (max_rg)_M{g, h}_i (min_rg)_M{g, h}_i by definition of the closure. Now assume that s^(i−1)_M = MM. Then, by Lemma 4.6, g = h. In particular, g_i = h_i. Checking Table 9, we see that for all b ∈ B_M, it holds that out_M(MM, bb) = bb. Therefore, out_M(s^(i−1)_M, g_ih_i) = (max_rg)_M{g, h}_i (min_rg)_M{g, h}_i in this case as well.

THE PPC FRAMEWORK
In order to derive a small circuit from the results of Section 4, a straightforward approach would be to unroll the FSM. We could design a circuit implementing the transition function ⋄_M and apply it B times to the starting state s^(0) and each input g_ih_i. However, computing the sequence of states step by step yields a (non-optimal) linear depth of at least B.
Hence, we make use of the PPC framework by Ladner and Fischer [23]. They describe a generic method that is applicable to any finite state machine translating a sequence of B input symbols to B output symbols, to obtain circuits of size O(B) and depth O(log B). They observe that each input symbol defines a restricted transition function.
Compositions of these functions evaluated on the starting state yield the state of the machine after receiving the corresponding inputs. The major advantage of the technique is that compositions of restricted transition functions can be computed in parallel due to associativity, yielding a depth of O(log B). This matches our needs, as we need to determine s^(i)_M for all i ∈ [1, B − 1].
Fig. 3: Run of a 2-sort(9) circuit derived from our construction (FSM of Figure 2) on inputs g = 101010110 and h = 101M10000; cf. Table 10 for s^(i)_M(g, h) and the output. Each ⋄_M element is labeled by its output. Buffers and duplicated gates (here the one computing 0M) reduce fan-out, but do not affect the computation. Grey boxes indicate recursive steps of the PPC construction; see also Figure 7 for a larger PPC circuit using the one here in its "right" top-level recursion. For better readability, wires not taking part in a recursive step are dashed or dotted. We drop s^(9)_M, as it is not needed to compute the outputs for g_9h_9.
The part of the framework relevant to us is the set of templates for computing the compositions of the restricted transition functions for all prefixes. Accordingly, we discuss these templates only. During the discussion of the basic construction, we show a minor improvement on their results.
Before proceeding, the reader may want to take a look at the example given in Figure 3, which shows how a 2-sort(9) derived from our construction processes an input pair.

The Basic Construction
We revisit the templates for parallel computation of all prefixes, i.e., the part of the framework relevant to our construction. To this end, recall Definition 1.1. In our case, ⊕ = ⋄_M and D = B²_M. [23] provides a family of recursive constructions of PPC_⊕ circuits. They are obtained by combining two different recursive patterns. The first pattern, which optimizes for the size of the resulting circuits, is depicted in Figure 4a. We distinguish between an even and an odd number of inputs: if B is even, we discard the rightmost gray wire and set B̃ := B; if B is odd, we set B̃ := B − 1 and include the rightmost wire. In the following, denote by |C| the size of a circuit C and by d(C) its depth.
Lemma 5.1. Suppose that C and P are circuits implementing ⊕ and PPC_⊕(⌈B/2⌉) for some B ∈ N, respectively. Then applying the recursive pattern given in Figure 4a yields a PPC_⊕(B) circuit. It has depth 2d(C) + d(P) and size at most (B − 1)|C| + |P|. Moreover, the last output is at depth at most d(C) + d(P) of the circuit.
Proof. Observe that P receives as inputs d_{2i−1} ⊕ d_{2i} for i ∈ [1, ⌊B/2⌋], and in addition d_B in case B is odd. Thus, it outputs π′_i = d_1 ⊕ … ⊕ d_{2i} for i ∈ [1, ⌊B/2⌋], and also π′_{⌈B/2⌉} = d_1 ⊕ … ⊕ d_B if B is odd. Hence, the circuit outputs π_i = π′_{i/2} if i is even and π_i = π′_{(i−1)/2} ⊕ d_i if i is odd (where π_1 = d_1), showing correctness. The depth of the circuit is immediate from the construction, and the size follows from the fact that there is exactly one instance of C for each even i ∈ [1, B] before P and one for each odd i ∈ [1, B] \ {1, B} after P. Output π_B has a depth that is smaller by d(C), as it is an output of P.
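The recursion of Lemma 5.1 is perhaps easiest to see in executable form. The following Python sketch (ours, for illustration only) applies the size-optimizing pattern of Figure 4a at every level of recursion; it is not the depth-optimal variant, but it shows how the odd/even wiring yields all prefixes:

    def ppc(d, op):
        # Pair up neighboring inputs, recurse on the ~B/2 combined values,
        # then fill in the odd-indexed prefixes with one extra op each.
        B = len(d)
        if B == 1:
            return d[:]
        pairs = [op(d[2 * i], d[2 * i + 1]) for i in range(B // 2)]
        if B % 2 == 1:
            pairs.append(d[-1])
        p = ppc(pairs, op)
        out = []
        for i in range(B):
            if i == 0:
                out.append(d[0])
            elif i % 2 == 1:
                out.append(p[i // 2])
            else:
                out.append(op(p[i // 2 - 1], d[i]))
        return out

    assert ppc(list(range(1, 9)), lambda a, b: a + b) == [1, 3, 6, 10, 15, 21, 28, 36]

Replacing the integer addition by any associative operator, e.g., ⋄_M from Section 4, yields the corresponding prefix states.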
The second recursive pattern, shown in Figure 4c, avoids increasing the depth of the circuit beyond the necessary d(C) per level of recursion. Assume for now that B is a power of 2. We represent the recursion as a tree T_b, where b := log B, given in the center of Figure 4. It has depth b with all leaves (filled white) at this depth, and there are two types of non-leaf nodes: right nodes (filled black) have two children, a left and a right node, whereas left nodes (filled gray) have a single child, which is a right node. T_b is essentially a Fibonacci tree in disguise.
Definition 5.2. T_0 is a single leaf. T_1 consists of the (right) root and two attached leaves. For b ≥ 2, T_b can be constructed from T_{b−1} and T_{b−2} by taking a (right) root r, attaching the root of T_{b−1} as its right child, adding a new left node ℓ as the left child of r, and then attaching the root of T_{b−2} as the (only) child of ℓ.
The recursive construction is now defined as follows. A right node applies the pattern given in Figure 4c: R_ℓ is the circuit (recursively) defined by the subtree rooted at its left child, and R_r is the circuit (recursively) defined by the subtree rooted at its right child. A left node applies the pattern of Figure 4a: R_c is (recursively) defined by the subtree rooted at its (only) child. In both cases, the number of inputs is determined by the depth of the node: a node at depth d of T_b corresponds to a subcircuit on 2^{b−d} inputs.
The base case for a single input and output is simply a wire connecting the input to the output, for both patterns. As b = log B and each recursive step cuts the number of inputs and outputs in half, the base case applies if and only if the node is a leaf. Note that the figure shows the recursive patterns at the root and its left child, where B̃ = 2^{b−1} is always even (i.e., in this recursive pattern, the gray wire with index B̃ + 1 is never present); when applying the patterns to nodes further down the tree, B̃ and B are scaled down by a factor of 2 for every step towards the leaves.
In the following, denote by PPC(C, T_b) the circuit that results from applying the recursive construction described above to the base circuit C implementing ⊕. Moreover, we refer to the i-th input and output of the subcircuit corresponding to node v ∈ T_b as d^v_i and π^v_i, respectively.
Lemma 5.3. PPC(C, T_b) is a PPC_⊕(2^b) circuit.
Proof. We show the claim by induction on b. For b = 0, the circuit correctly wires the input to the output, as we have only one leaf. For b = 1, the first output equals the first input and the second output is the result of feeding both inputs into a copy of C.
For b ≥ 2, by the induction hypothesis the circuit R_c used in the construction at the left child of the root is a PPC_⊕(2^{b−2}) circuit. By Lemma 5.1, the circuit R_ℓ in the construction at the root is thus a PPC_⊕(2^{b−1}) circuit, showing that it outputs π^ℓ_i = d_1 ⊕ … ⊕ d_i = π_i for all i ∈ [1, 2^{b−1}]. From the induction hypothesis for b − 1, we get that the circuit R_r used in the construction at the root is a PPC_⊕(2^{b−1}) circuit. By construction of the right recursion pattern, we conclude that for i ∈ [2^{b−1} + 1, 2^b], we get the outputs π_i = π^ℓ_{2^{b−1}} ⊕ π^r_{i−2^{b−1}} = d_1 ⊕ … ⊕ d_i.
Lemma 5.4. PPC(C, T_b) has depth b · d(C).
Proof. We prove the claim by induction on b; it trivially holds for b = 0, as we have only one leaf. For b = 1, T_b is a right node with two leaves. The two leaves have depth 0; clearly, applying the right pattern from Figure 4c then results in depth d(C). For b ≥ 2, the subcircuit R_r at the root has depth (b − 1) · d(C) by the induction hypothesis. For the subcircuit R_ℓ at the root, consider its subcircuit R_c. By the induction hypothesis, it has depth (b − 2) · d(C). Hence, by Lemma 5.1, R_ℓ has depth b · d(C), but its rightmost output π_{2^{b−1}} has depth only (b − 1) · d(C). Thus, by construction, the root's circuit has depth b · d(C).
It remains to bound the size of the circuit. Denote by F_i, i ∈ N, the i-th Fibonacci number, i.e., F_1 = F_2 = 1 and F_{i+1} = F_i + F_{i−1} for all i ≥ 2.
Lemma 5.5. PPC(C, T_b) has size (2^{b+2} − F_{b+5} + 1)|C|.
Proof. Denote by s_b the number of copies of C in the circuit PPC(C, T_b). We show by induction that s_b = 2^{b+2} − F_{b+5} + 1 for all b ∈ N_0. We have that s_0 = 0 = 2^2 − F_5 + 1 and that s_1 = 1 = 2^3 − F_6 + 1. For b ≥ 2, we have that s_b = s_{b−1} + s_{b−2} + s_r + s_ℓ, where s_r and s_ℓ denote the size contributions of the recursive steps at the root and its left child, respectively. Checking the recursive patterns in Figure 4, we see that s_r = 2^{b−1} and s_ℓ = B̃ − 1 = 2^{b−1} − 1. Thus, s_b = s_{b−1} + s_{b−2} + 2^b − 1, which with the induction hypothesis yields
s_b = (2^{b+1} − F_{b+4} + 1) + (2^b − F_{b+3} + 1) + 2^b − 1 = 2^{b+2} − F_{b+5} + 1.
Asymptotically, the subtractive term of F_{b+5} is negligible, as F_{b+5} ∈ (1/√5 + o(1))((1 + √5)/2)^{b+5} ⊆ O(1.62^b); however, unless B is large, the difference is substantial. We also get a simple upper bound for arbitrary values of B. To this end, we "split" in the recursion such that the left branch is "complete" (i.e., the number of inputs is a power of 2), while applying the same splitting strategy on the right. This is where our construction differs from and improves on [23], which performs a balanced split and obtains an upper bound of 4B on the circuit size.
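The closed form of Lemma 5.5 is easily cross-checked numerically; the following Python sketch (ours) replays the recursion s_b = s_{b−1} + s_{b−2} + 2^b − 1 from the proof:

    def fib(n):
        # F_1 = F_2 = 1
        a, b = 0, 1
        for _ in range(n):
            a, b = b, a + b
        return a

    def size(b):
        # Number of operator copies in PPC(C, T_b), per the proof's recursion.
        if b < 2:
            return b  # s_0 = 0, s_1 = 1
        return size(b - 1) + size(b - 2) + 2**b - 1

    assert all(size(b) == 2**(b + 2) - fib(b + 5) + 1 for b in range(25))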
Corollary 5.6. For B ∈ N and circuit C implementing ⊕, set b := ⌈log B⌉. Then there is a PPC_⊕(B) circuit of depth ⌈log B⌉ · d(C) and size smaller than (4B − F_{b+3} + 1)|C|.
Proof. If B is a power of 2, the claim follows from Lemmas 5.3, 5.4, and 5.5. In particular, for B = 1 and B = 2, respectively, PPC(C, T_0) and PPC(C, T_1) meet the requirements. For B > 2 that is not a power of 2, set b := ⌈log B⌉ and perform the same construction as for PPC(C, T_b), but replace R_r at the root by the (recursively given) PPC_⊕(B − 2^{b−1}) circuit. Correctness is immediate from the recursive construction and Lemma 5.3. Similarly, the depth bound follows from Lemma 5.4 and the recursive construction. Regarding size, we show by induction that s_B, the number of copies of C required for a circuit for B inputs, satisfies the claimed bound. This is already established for the base cases of B = 1 and B = 2. For larger B, we apply Lemma 5.1 to R_ℓ in the root's circuit and Lemma 5.5 to its subcircuit R_c, while applying the induction hypothesis to the subcircuit R_r in the root's circuit. We get that
s_B = (2^b − F_{b+3} + 1) + s_{B−2^{b−1}} + (2^{b−1} − 1) + (B − 2^{b−1}) ≤ 5B − 2^b − F_{b+3} − 1 ≤ 4B − F_{b+3} − 1,
using the induction hypothesis and B ≤ 2^b. We remark that one can give more precise bounds by making case distinctions regarding the right recursion, which for the sake of brevity we omit here. Instead, we computed the exact numbers for B ≤ 70, see Figure 5.
The construction derived from iterative application of Lemma 5.1 can be combined with PPC(C, T_b), achieving the following trade-off; note that if B = 2^b for b ∈ N, then F_{⌈log B⌉−k+3} can be replaced by F_{b−k+5}.
Theorem 5.7 (improving on [23]). Suppose C implements ⊕. For all k ∈ [0, ⌈log B⌉] and B ∈ N, there is a PPC_⊕(B) circuit of depth (⌈log B⌉ + k) · d(C) and size at most ((2 + 2^{1−k})B − F_{⌈log B⌉−k+3} + 5)|C|.
Proof. For k steps, we apply Lemma 5.1, where in the final recursive step we use the circuit from Corollary 5.6. Correctness is immediate from Lemma 5.1 and Corollary 5.6.
Denote by B_i, i ∈ [0, k], the number of in- and outputs of the (sub)circuit at depth i of the recursion, i.e., B_0 = B and B_{i+1} = ⌈B_i/2⌉ for all i ∈ [0, k − 1]. We have that B_i ≤ B/2^i + Σ_{j=1}^{i} 2^{−j} < B/2^i + 1, which follows inductively via B_{i+1} = ⌈B_i/2⌉ ≤ B_i/2 + 1/2. Observe that ⌈log B_{i+1}⌉ = ⌈log B_i⌉ − 1 for all i. By Lemma 5.1 and Corollary 5.6, the size of the resulting circuit is thus bounded by
Σ_{i=0}^{k−1} (B_i − 1)|C| + (4B_k − F_{⌈log B⌉−k+3} + 1)|C| < ((2 − 2^{1−k})B + 2^{2−k}B − F_{⌈log B⌉−k+3} + 5)|C| = ((2 + 2^{1−k})B − F_{⌈log B⌉−k+3} + 5)|C|.
Fig. 5: Sizes of the PPC circuits obtained from the construction of [23] and ours. The curves for unbounded fan-out are the exact sizes obtained, whereas "upper bound" refers to the bound from Corollary 5.6; the fan-out 3 curves show that the unbalanced strategy also performs better for the construction from Theorem 5.17 (for f = 3 and k = 0) we derive next.

Constant Fan-out at Optimal Depth
The optimal-depth construction incurs an excessively large fan-out of Θ(B), as the last output of left recursive calls needs to drive all the copies of C that combine it with each of the corresponding right call's outputs. This entails that, despite its lower depth, it will not result in circuits of smaller physical delay than simply recursively applying the construction from Figure 4a. Naturally, one can insert buffer trees to ensure a constant fan-out (and thus a constantly bounded ratio between delay and depth), but this increases the depth to Θ(log² B + d(C) log B).
We now modify the recursive construction to ensure a constant fan-out, at the expense of a limited increase in size of the circuit. The result is the first construction that has size O(B), optimal depth, and constant fan-out.
In the following, we denote by f ≥ 3 the maximum fan-out we are trying to achieve, where we assume that gates or memory cells providing the input to the circuit do not need to drive any other components. For simplicity, we consider C to be a single gate, i.e., a gate driving two C components has fan-out exactly 2.
We proceed in two steps. First, we insert 2B buffers into the circuit, ensuring that the fan-out is bounded by 2 everywhere except at the gate providing the last output of each subcircuit corresponding to a left node. In the second step, we will resolve this by duplicating these gates sufficiently often, recursively propagating the changes down the tree. Neither of these changes will affect the output (i.e. the correctness) of the circuit or its depth, so the main challenges are to show our claim on the fan-out and bounding the size of the final circuit.

5.2.1 Step 1: Almost Bounding Fan-out by 2
Before proceeding to the construction in detail, we need some structural insight on the circuit.
Definition 5.8. For node v ∈ T_b, define its range R_v and left-count α_v recursively as follows.
• If v is the root, then R_v = [1, 2^b] and α_v = 0.
• If v is the left child of right node p with R_p = [i, i + j], then R_v = [i, i + (j − 1)/2] and α_v = α_p; if v is the right child of right node p, then R_v = [i + (j + 1)/2, i + j] and α_v = α_p.
• If v is the (right-node) child of left node p, then R_v = R_p and α_v = α_p + 1.
Hence, the left-count α_v tells us for every node v ∈ T_b the number of left recursion steps preceding v, whereas R_v gives us information about the range of inputs used at node v. We observe that each recursion step halves the number of inputs, and that the range is only cut in half if α_v does not increase. Combining these observations with structural insights on the recursion patterns in Figures 4a and 4c, we state the following four properties of PPC(C, T_b).
Lemma 5.9. Let v ∈ T_b be a node at depth d_v with R_v = [i, i + j]. Then (i) the subcircuit corresponding to v has 2^{b−d_v} inputs and outputs, (ii) |R_v| = j + 1 = 2^{b−d_v+α_v}, (iii) if v is a right node, all its inputs are outputs of its children's subcircuits, and (iv) if v is a left node or leaf, only its even inputs are provided by its child (if it has one), and for odd k ∈ [1, 2^{b−d_v}], its k-th input is d^v_k = ⊕_{k′=i+(k−1)2^{α_v}}^{i+k·2^{α_v}−1} d_{k′}.
Proof. Property (i) is immediate from the fact that with each step of the recursion, the number of input and output wires is cut in half. Checking the above definition, we see that the range stays the same whenever α_v increases, and otherwise is halved on each recursive step; Property (ii) follows. Property (iii) can be readily verified from Figure 4c. The final property is shown by induction on b. It is immediate for b = 0 and b = 1. For b ≥ 2, the subcircuit of the left child v of the root has 2^{b−1} inputs, the odd ones of which are inputs to the overall circuit (cf. Figures 4a and 4c). As we have α_v = 0, we get that d^v_k = ⊕_{k′=i+(k−1)2^{α_v}}^{i+k·2^{α_v}−1} d_{k′} = d_{i+k−1}, and the node satisfies the claim. For the subcircuit R_r corresponding to the subtree rooted at the right child of the root, the claim holds by the induction hypothesis applied to b − 1. For the subcircuit R_ℓ of the left child, we see from Figure 4a that the subcircuit R_c corresponding to the subtree rooted at its child c receives inputs d^c_k = d_{2k−1} ⊕ d_{2k}, k ∈ [1, 2^{b−2}]. Combining this with the induction hypothesis for b − 2, the induction step is completed also in this case.
Lemma 5.9 leads to an alternative representation of the circuit PPC(C, T_b), see Figure 6, in which we separate the gates in the recursive pattern from Figure 4a that occur before the subcircuit R_c. Adding the buffers we need in our construction, this results in the modified patterns given in Figure 6b. The separated gates appear at the bottom of Figure 6a: for each leaf v of T_b, there is a tree of depth α_v aggregating all of the circuit's inputs from its range. Each non-root node in an aggregation tree provides its output to its parent. In addition, one of the two children of an inner node in the tree must provide its output as an input to one of the subcircuits corresponding to a node of T_b, cf. Property (iv) of Lemma 5.9.
From this representation, we will derive that the following modifications of PPC(C, T_b) result in a PPC_⊕(2^b) circuit PPC(C, T_b)′, for which fan-out larger than 2 exclusively occurs on the last outputs of subcircuits corresponding to nodes of T_b.
Fig. 6: Construction of PPC(C, T_4)′. (a) Recursion tree T_4 with separated aggregation trees and added buffers; inputs are depicted as black triangles. (b) Recursive patterns R_r and R_ℓ, i.e., the application of the recursive patterns at the children of the root. Parts marked blue will be duplicated in the second step of the construction that achieves constant fan-out; this will also necessitate duplicating some gates in the aggregation trees.
1) Add a buffer on each wire connecting a non-root node of any of the aggregation trees to its corresponding subcircuit (see Figure 6a).
2) For the subcircuit corresponding to left node ℓ with range R_ℓ = [i, i + j], add for each even k ≤ j (i.e., each even k but the maximum, j + 1) a buffer before output π^ℓ_k (see the bottom of Figure 6b).
3) For each right node r with range [i, i + j], add a buffer before output π^r_{(j+1)/2} (see the top of Figure 6b).
Lemma 5.10. With the exception of gates providing the last output of subcircuits corresponding to nodes of T_b (blue in Figure 6b), the fan-out of PPC(C, T_b)′ is at most 2. Buffers or gates driving an output of the circuit drive nothing else.
Proof. First, we prove the following invariant: If each input to a subcircuit of PPC(C, T b ) corresponding to a node of T b is driven by a gate or buffer driving no other wires, the same holds true for the outputs of the subcircuit. Suppose for contradiction that this invariant is violated and consider a minimal subcircuit doing so. There are three cases.
• The subcircuit corresponds to a leaf. This is a contradiction, as then the subcircuit simply is a wire connecting its sole input to its output. • The subcircuit corresponds to a right node r (cf. top of Figure 6b) with range [i, i + j]. As the invariant applies to the subcircuit corresponding to its left child, its outputs π^r_1, …, π^r_{(j−1)/2} satisfy the invariant. Its output π^r_{(j+1)/2} satisfies the invariant due to the inserted buffer. As the remaining outputs are driven by gates that drive nothing else, this case also leads to a contradiction. • The subcircuit corresponds to a left node ℓ (cf. bottom of Figure 6b) with range [i, i + j]. As d_1 is simply wired to π_1 (and nothing else), it satisfies the invariant. The last output π_{j+1} satisfies the invariant, because the recursively used subcircuit does. The remaining outputs are driven by gates or buffers driving nothing else, again resulting in a contradiction.
As all cases result in a contradiction, the invariant holds.
Next, observe that, by construction, the aggregation trees have fan-out 2 after buffer insertion. Each buffer or gate from this part of the circuit drives exactly one wire connecting it to the remainder of the circuit. Thus, the above invariant shows that all subcircuits corresponding to nodes of T_b satisfy that each of their outputs is driven by a gate or buffer driving nothing else. Checking Figure 6b, we can thus conclude that indeed no gate or buffer drives more than two others, unless it provides the last output to one of the recursively used subcircuits in the construction at a right node; gates or buffers driving an output of PPC(C, T_b)′ drive only this output.
It remains to count the inserted buffers. We do so by computing a closed-form expression from the linear recurrence that describes the number of nodes of a given type (left, right, leaf) at a given depth as a function of the previous one. The following helper statement will be useful for this, but also later on.
Lemma 5.11. Denote by L_b the set of leaves of T_b. Then |L_b| = F_{b+2} and Σ_{v∈L_b} 2^{α_v} = 2^b.
Proof. We have that |L_0| = 1 = F_2, |L_1| = 2 = F_3, and, by the recursive construction of T_b, |L_b| = |L_{b−1}| + |L_{b−2}|, implying |L_b| = F_{b+2}. Next, consider the recurrence given by L_0 = 1, L_1 = 2, and L_b = L_{b−1} + 2L_{b−2} for b ≥ 2; the factor of 2 assigns twice the weight to the subtree rooted at the child of the root's left child, thereby ensuring that each leaf is accounted for with weight 2^{α_v}. This recurrence has solution 2^b.
Lemma 5.12. Denote by s the size of a buffer. Then |PPC(C, T_b)′| ≤ |PPC(C, T_b)| + 2^{b+1} · s.
Proof. To count the number of buffers attached to the aggregation trees, recall that at each depth d > 0 of each tree, exactly half of the nodes provide their output to a buffer connected to some node in T_b (this follows from Lemma 5.9 and the fact that the ranges of nodes at the same depth of T_b form a partition of [1, 2^b]). Thus, per leaf v, this number is below 2^{α_v}. By Lemma 5.11, together with the buffers added at the outputs of the subcircuits corresponding to nodes of T_b, the total number of buffers is bounded by 2^{b+1}.
Similar arguments serve later as well. The main reason why we will define the function a(v) in the next section without rounding is to ensure that we again obtain linear recurrences, which can be solved using standard techniques from linear algebra. As a downside, this results in slightly overestimating the size of the circuits, as we may ask for more copies of gates from children than are actually needed.

5.2.2 Step 2: Bounding Fan-out by f
In the second step, we need to resolve the issue of the high fan-out of the last output of each recursively used subcircuit in PPC(C, T_b)′. Our approach is straightforward. Starting at the root of T_b and progressing downwards, we label each node v with a value a(v) that specifies a sufficient number of additional copies of the last output of the subcircuit represented by v to avoid fan-out larger than f. At right nodes, this is achieved by duplicating the gate computing this output sufficiently often, marked blue in Figure 6b (top). For left nodes, we simply require the same number of duplicates to be provided by the subcircuit represented by their child (i.e., we duplicate the blue wire in the bottom recursive pattern shown in Figure 6b). Finally, for leaves, we will require a sufficient number of duplicates of the root of their aggregation tree; this, in turn, may require making duplicates of their descendants in the aggregation tree.
We define a(v) and then utilize it to describe our fan-out f circuit. Afterwards, we will analyze the increase in size of the circuit compared to PPC(C, T b ) .
Definition 5.13. For v ∈ T_b at depth d_v, define a(v) recursively by a(v) := 0 if v is the root, a(v) := (2^{b−d_v} + a(p))/f if v is the left child of right node p, a(v) := a(p)/f if v is the right child of right node p, and a(v) := a(p) if v is the (only) child of left node p.
Lemma 5.14. Suppose that for each leaf v ∈ T_b, there are a(v) additional copies of the root of its aggregation tree, and that for each right node v ∈ T_b, we add a(v) gates that compute (copies of) the last output of the corresponding subcircuit of PPC(C, T_b)′. Then we can wire the circuit such that all gates that are not in aggregation trees have fan-out at most f, and each output of the circuit is driven by a gate or buffer driving only this output.
The number of added gates is governed by linear recurrences: denoting by x_d the total number of copies requested at depth d of T_b and setting y_d = x_{d+1} − x_d/f, the asymptotic bound on x_d uses that for f ≥ 3 and all i ∈ [1, 4], λ_1 ≥ |λ_i|, where the λ_i are the eigenvalues of the recurrence; moreover, for f ≥ 3, |λ_i| < 1 for all i. Summation of the individual terms and multiplication by |C| then bounds the total size ∆ of the duplicated gates (Lemma 5.16) by O(2^b/f) · |C|.
As an example of the overall resulting construction, we show PPC^(3)(C, T_4) in Figure 7. We summarize our findings in the following theorem.
Theorem 5.17. Let f ≥ 3, b ∈ N, and k ∈ [0, b], and suppose C implements ⊕. Then there is a PPC_⊕(2^b) circuit of depth (b + k) · d(C) and fan-out at most f, whose size exceeds the bound of Theorem 5.7 by a factor of 1 + O(1/f), plus O(2^b) buffers.
Proof. We argue as for Theorem 5.7, but replace PPC(C, T_b) by PPC^(f)(C, T_b), and need to make sure that we modify the first k steps of the recursion, where we apply the construction from Figure 4a, such that the fan-out is at most f. In fact, we will ensure a fan-out of 2 for this part of the construction. To this end, we simply add a buffer before each output that is not directly driven by a copy of C, as already indicated in the figure. This guarantees the invariant that all inputs to and outputs of subcircuits are driven by an element not driving anything else; for the PPC^(f)(C, T_b) subcircuit, this invariant holds by Lemma 5.10. This adds in total 2^b − 2^{b−k} buffers to the circuit: one for each output wire minus one for each output wire of PPC^(f)(C, T_b). Thus, by Theorem 5.7, the size of the circuit is bounded by the bound given there plus the overhead ∆ of the duplicated gates and the inserted buffers. By Lemmas 5.12 and 5.16, ∆ is bounded as stated. Summation yields the claimed bound on the size of the circuit. The depth bound and that we indeed get a PPC_⊕(2^b) circuit follow as in Theorem 5.7, as the modifications to the construction affected neither its depth nor its output.
We refrain from analyzing the size of the construction for values of B that are not powers of 2. However, in Figure 8 we plot the exact bounds (without buffers) for k = 0 and selected values of f against B.

SIMULATION
Separate from and in addition to the proofs from the previous sections, we verify the correctness of our circuits by VHDL simulation. To this end, we first need to specify implementations of the subcircuits computing ⋄_M and out_M.

Gate-Level Implementation of Operators
From Tables 6a and 6b, for s, b ∈ B² we can extract the Boolean formulas
(s ⋄ b)_1 = s_1 s̄_2 + s̄_2 b_1 + s_1 b̄_1,
(s ⋄ b)_2 = s̄_1 s_2 + s̄_1 b_2 + s_2 b̄_2,
out(s, b)_1 = b_1 b_2 + s̄_2 b_1 + s̄_1 b_2, and
out(s, b)_2 = b_1 b_2 + s_2 b_1 + s_1 b_2.
In general, realizing a Boolean formula f by replacing negation, multiplication, and addition by inverters, AND, and OR gates, respectively, does not result in a circuit implementing f_M.¹ However, we can easily verify that the above formulas are disjunctions of all prime implicants of their respective functions. As shown in [10] (see also [16]),² in this special case the resulting circuits do implement the closure, provided the gates behave as in Table 3, which the implementations given in Figure 1 do by Theorem 3.14.
Using distributive laws (recall that these also hold in Kleene logic), the above formulas can be rewritten as
(s ⋄ b)_1 = s̄_2(s_1 + b_1) + s_1 b̄_1,
(s ⋄ b)_2 = s̄_1(s_2 + b_2) + s_2 b̄_2,
out(s, b)_1 = b_1(b_2 + s̄_2) + b_2 s̄_1, and
out(s, b)_2 = b_1(b_2 + s_2) + b_2 s_1.
We see that, in fact, a single circuit with suitably wired (and possibly negated) inputs can implement all four operations. As for sel_1 = ¬sel_2 the circuit implements a multiplexer with select bit sel_1, we refer to it as an extended multiplexer, or XMUX for short. Its functionality is specified by XMUX(sel_1, sel_2, x, y) := y(x + sel_2) + x · sel_1. Figure 9 shows the resulting circuit, and Table 11 lists how to map inputs to compute ⋄_M and out_M.
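Under our reading of the formulas above, the XMUX wiring can be checked against a brute-force computation of the closure. The following Python sketch (ours) does so for the first state bit; the other three output bits are analogous:

    from itertools import product

    def not_m(a): return {'0': '1', '1': '0', 'M': 'M'}[a]
    def and_m(a, b): return '0' if '0' in (a, b) else ('M' if 'M' in (a, b) else '1')
    def or_m(a, b): return '1' if '1' in (a, b) else ('M' if 'M' in (a, b) else '0')

    def xmux(sel1, sel2, x, y):
        # XMUX(sel1, sel2, x, y) = y(x + sel2) + x*sel1 in Kleene logic.
        return or_m(and_m(y, or_m(x, sel2)), and_m(x, sel1))

    def res(v):
        opts = [('0', '1') if c == 'M' else (c,) for c in v]
        return [''.join(t) for t in product(*opts)]

    def step1(s, x):  # first bit of the stable operator ⋄
        if s in ('01', '10'):
            return s[0]
        return x[0] if s == '00' else not_m(x[0])

    def closure1(s, b):  # first bit of ⋄_M, by brute force over all resolutions
        vals = {step1(ss, bb) for ss in res(s) for bb in res(b)}
        return vals.pop() if len(vals) == 1 else 'M'

    pairs = [a + b for a, b in product('01M', repeat=2)]
    assert all(xmux(not_m(b[0]), b[0], s[0], not_m(s[1])) == closure1(s, b)
               for s in pairs for b in pairs)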
We note that this circuit is not a particularly efficient XMUX implementation; a transistor-level implementation would be much smaller. However, our goal here is to verify correctness and give some initial indication of the size of the resulting circuits; a fully optimized ASIC circuit is beyond the scope of this article.
1. For instance, (s ⋄ b)_1 = s_1 b̄_1 + s̄_2 b_1 as a Boolean formula, but the two expressions differ when evaluated on s_1 = s̄_2 = 1 and b_1 = M. The circuits resulting from the two formulas are implementations of a multiplexer (with select bit b_1) and its closure, respectively.
2. Alternatively, one can manually verify that these formulas evaluate to the truth tables given in Tables 8 and 9.
Fig. 9: XMUX circuit (inputs x, y, sel_1, sel_2), used to implement ⋄_M and out_M.
In [4], the size of the implementation is slightly reduced by moving negations. Due to space limitations, we refrain from detailing this modification here, but note that Figure 12 and Table 12 take it into account.

Putting it All Together
We now have all the pieces in place to assemble a metastability-containing 2-sort(B) circuit. By Theorem 4.3, ⋄_M is associative. Thus, from a given implementation of ⋄_M (e.g., two copies of the circuit from Figure 9 with appropriate wiring and negation, cf. Table 11) we can construct PPC_{⋄_M}(B − 1) circuits of small depth and size, as shown in Section 5. We can combine such a circuit with an out_M implementation (again, two XMUXes with appropriate wiring and negation will do) as shown in Figure 10 to obtain our 2-sort(B) circuit.

Simulation Setup
We implemented the design given in Figure 10 at register-transfer level, using the PPC_{⋄_M}(B − 1) circuit given by Theorem 5.7 for k = 0. Quartus by Altera is used for design entry, which in our case mainly consists of checking correct implementation. After design entry, we use ModelSim by Altera for behavioral simulation. Note that we must not simulate the preprocessed Quartus output, because processing may compromise metastability-containing behavior. Instead, we simulate pure VHDL. Metastable signals are simulated using the VHDL signal value X, because its behavior matches the worst-case behavior assumed for M. The correctness of this construction follows from Theorems 4.7 and 4.8, where we can plug in any PPC_{⋄_M}(B − 1) circuit, cf. Section 5. For the circuits derived by relying on the XMUX circuit from Figure 9, we independently confirmed this via simulation.
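Independently of VHDL, the output specification itself can be checked exhaustively in software. The following Python sketch (ours; a brute-force golden model, not a simulation of the circuit) computes (max_rg)_M and (min_rg)_M per Definition 3.8 and verifies that, on all pairs of valid strings, the results are again valid strings, i.e., metastability is contained rather than amplified:

    from itertools import product

    def rg(x, B):
        return format(x ^ (x >> 1), '0{}b'.format(B))

    def dec(g):
        x = 0
        for c in g:
            x = (x << 1) | ((x & 1) ^ int(c))
        return x

    def res(v):
        opts = [('0', '1') if c == 'M' else (c,) for c in v]
        return [''.join(t) for t in product(*opts)]

    def sup(ss):
        return ''.join(cs[0] if len(set(cs)) == 1 else 'M' for cs in zip(*ss))

    def closure_sort(g, h, B):
        # (max_rg)_M and (min_rg)_M by brute force over all resolutions.
        vals = [sorted((dec(a), dec(b))) for a in res(g) for b in res(h)]
        mx = sup([rg(v[1], B) for v in vals])
        mn = sup([rg(v[0], B) for v in vals])
        return mx, mn

    B = 4
    valid = [rg(x, B) for x in range(2**B)]
    valid += [sup([rg(x, B), rg(x + 1, B)]) for x in range(2**B - 1)]
    for g, h in product(valid, repeat=2):
        mx, mn = closure_sort(g, h, B)
        assert mx in valid and mn in valid  # outputs are again valid strings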

Results
For the implementation of PPC_{⋄_M}(B − 1) we used the circuits from Theorem 5.7, i.e., we did not make use of the extension to constant fan-out. (For k > 0, fan-out becomes an issue, requiring the more involved constructions provided by Theorem 5.17. However, the resulting numbers would be inaccurate, and a detailed comparison based on optimized ASIC implementations is beyond the scope of this work.) We exhaustively checked the design from Figure 10 for B up to 12 (and all feasible k). Simulation shows that the design works correctly for several levels of recursion, e.g., when regarding B = 1 and B = 2 as simple base cases, B = 12 implies 3 levels of recursion for both patterns. We refrained from simulating the constant fan-out construction, because it simply replicates intermediate results without changing functionality.

Comparison to Baseline
After behavioral simulation, we continue with a comparison of our design to a standard sorting approach, Bin-comp(B). As mentioned earlier, the 2-sort(B) implementation given in Figure 10 is slightly optimized by pulling out a negation from the operators in every recursive step [4]. After design entry as described above, we use Encounter RTL Compiler for synthesis and Encounter for place and route. Both tools are part of the Cadence tool set, and in both steps we use the NanGate 45 nm Open Cell Library as standard cell library.
Since metastability-containing circuits may include additional gates that are not required in traditional Boolean logic, Boolean optimization may compromise metastability-containing properties [3]. Accordingly, we were forced to disable optimization during synthesis of the circuits.
Fig. 12: Comparison of our solution PPC Sort to a standard non-containing one (delay and area over input width B). For the latter, the unexpected delay reduction at B = 16 is the result of automatic optimization with more powerful gates, which our solution does not use.
Binary Benchmark Bin-comp: In short, Bin-comp consists of a simple VHDL statement comparing two binary encoded inputs and outputting the maximum and the minimum accordingly. It follows the same design process as 2-sort, but then undergoes optimization using a more powerful set of basic gates. For example, the standard cell library provides prebuilt multiplexers. These multiplexers are used by Bin-comp, but not by 2-sort, as they are not metastability-containing. We stress that these more powerful gates provide optimized implementations of multiple Boolean functions, yet each of them is still counted as a single gate. Thus, comparing our design to the binary design in terms of gate count, area, and delay disfavors our solution. Moreover, we noticed that the optimization routine switches to employing more powerful gates when going from B = 8 to B = 16 (cf. Figure 12), resulting in a decrease of the delay of the Bin-comp implementation.
Nonetheless, our design performs comparably to the non-containing binary design in terms of delay, cf. Figure 12 and Table 12. This is quite notable, as further optimization is possible by optimizing our design on the transistor level, with significant expected gains. The same applies to gate count and area, where a notable gap remains. Recall, however, that the Bin-comp design hides complexity by using more advanced gates and does not contain metastability.
We emphasize that we refrained from optimizing the design by making use of all available gates or devising transistor-level implementations, as such an approach is tied to the utilized library or requires design of standard cells.

CONCLUSIONS
In this work, we demonstrated that efficient metastability-containing sorting circuits are possible. Our results indicate that optimized implementations can achieve the same delay as non-containing solutions, without a dramatic increase in circuit size. This is of high interest to an intended application motivating us to design MC sorting circuits: fault-tolerant high-frequency clock synchronization. Sorting is a key step in envisioned implementations (cf. [10], [15]) of the Lynch-Welch algorithm [30] with improved precision of synchronization. The complete elimination of synchronizer delay is possible due to the efficient MC sorting networks presented in this article, enabling an increase in the rate at which clock corrections are applied and significantly reducing the negative impact of the phase drift of local clock sources on the precision of the algorithm (cf. [18]). This goal will necessitate devising optimized ASIC implementations of our circuits. The novel PPC circuits we devised in Section 5 are an important contribution towards this end. Note that it is crucial to take into account both depth and fan-out for devising low-delay circuits. Hence, follow-up work needs to compare the existing and our novel designs based on suitable metrics that take both into account, to reliably predict the achieved trade-offs between delay, area, and energy consumption of circuits. Note that this is of relevance beyond the specific application of MC sorting: PPC circuits lie at the heart of adder designs, implying that even a minor improvement can have significant impact on the overall performance of computing devices!
MC Control Loops: More generally speaking, MC circuits like those presented here are of interest in mixed-signal control loops whose performance depends on very short response times. When analog control is not desirable, traditional solutions incur synchronizer delay before being able to react to any input change. Using MC logic saves the time for synchronization, while metastability of the output corresponds to the initial uncertainty of the measurement; thus, the same quality of the computational result can be achieved in shorter time. Note that our circuits are purely combinational, so they can be used in both clocked and asynchronous control logic.
Obvious examples of such control loops are clock synchronization circuits, but MC has been shown to be useful for adaptive voltage control [13] and fast routing with an acceptably low probability of data corruption [29] as well. This type of application suggests exploring whether efficient circuits exist for a wider range of arithmetic operations, such as addition or (possibly approximate) multiplication.
Redundant Encoding and Addition: On the theoretical side, our results are to be contrasted with the exponential gap between the size of non-containing and MC circuits shown in [17]. This work raised the question for which classes of functions small MC circuits exist. Given that Ladner and Fischer proved that the PPC task can be solved efficiently for any constant-sized state machine [23], it was natural to ask whether this result can be extended to MC computations. In follow-up work, we show that indeed this holds true for any constant-sized FSM [5]. However, when applying this result to addition, unlike for sorting (where the underlying operations are max and min) uncertainty of inputs adds up. This means that Gray code can support meaningful computations only if the total uncertainty of all addends is at most 1.
Accordingly, in [5] we also consider redundant encodings, showing that using k (roughly) redundant bits, an uncertainty of (k + 1)/2 can be tolerated without loss of precision. Combined with the above result on transducers, this yields a meaningful notion of MC addition that allows for efficient circuits. As, essentially, the redundant bits are used as a unary code, it should be straightforward to apply the techniques from this article to obtain efficient sorting circuits for the encoding from [5]. We remark that the encoding from [5] turns out to be identical to that of the output of suitable time-to-digital converters [12], so relaxing their output constraints to achieve better average-case performance would provide valid input for sorting circuits that accept inputs encoded in this manner.
Table 12: Simulation results for metastability-containing sorting networks with n ∈ {4, 7, 10} for B-bit inputs. 10-sort# optimizes gate count [7], 10-sort_d optimizes depth [6]; for n ∈ {4, 7}, the sorting networks are optimal w.r.t. both measures. Reported are (i) the number of gates, (ii) the post-layout area [µm²], and (iii) the pre-layout delay [ps].
We believe that these results suggest applicability of our techniques to a wide range of mixed-signal control loops, and call for future work further exploring to what extent basic arithmetic can be realized by efficient MC circuits.