Self-Repairing Hybrid Adder With Hot-Standby Topology Using Fault-Localization

Effective self-repair is possible when a fault and its exact location can both be determined. In this paper, a self-repairing hybrid adder with fault localization is proposed. It combines the advantages of the ripple carry adder and the carry-select adder to reduce delay and area overhead. The proposed adder reduces the transistor count by 76.76% to 115% compared with existing self-checking carry-select adders. Moreover, the proposed design can detect and localize multiple faults. Fault recovery is achieved with a hot-standby approach in which the faulty module is replaced by a functioning module at run-time. For 3 consecutive faults, the probability of fault recovery is found to be 96.1% for a 64-bit adder with 8 blocks, where each block has 9 full adders.


I. INTRODUCTION
The likelihood of a single-event upset (SEU) in digital systems has risen as a result of increasing on-chip system complexity and shorter clock cycles [1], [2]. Radiation and other environmental conditions further raise the probability of SEU [3], [4]. To handle SEU, the concept of "totally self-checking" was introduced. A system is characterized as totally self-checking if it remains unaffected by a fault, or produces a non-coded output for every generated fault [5]. In addition to fault detection, fault recovery should also be considered to ensure hardware reliability [6]. This is why built-in self-repair is becoming increasingly pertinent to current digital systems [7]. Fault recovery, however, becomes challenging in an interconnected hardware design because of fault propagation. Therefore, fault localization becomes necessary for this type of hardware design.
The adder is an essential element of almost all digital systems, so introducing built-in self-repair in adders can play a vital role in digital designs [8], [9]. Moreover, the carry propagation chain makes the adder an ideal case for studying how faults between interconnected modules can be handled. To achieve this, both fault detection and localization should be performed.
The associate editor coordinating the review of this manuscript and approving it for publication was Heng Wang.
The ripple carry adder (RCA) and the carry select adder (CSeA) are among the most commonly adopted adder topologies, hence many of the reported reliable adders are based on them. In [9], a self-checking CSeA using a 2-pair-2-rail checker encoding approach was proposed. It exploits the parallel RCA rails present in a CSeA, where each RCA rail produces an output for one of the initial carry inputs, i.e. C_in = 0 or 1. The outputs of the two parallel blocks are compared to detect the presence of a fault. This design is valid only for 2 bits, and an improved n-bit CSeA design was later proposed in [10]. The relationship between the two parallel RCA rails in [9] was further utilized in [11] to design a single-RCA-based self-checking CSeA with fault localization. The reported RCA performs addition for C_in = 0, and the resulting sum bits are used to generate the sum bits for C_in = 1. The design in [11] requires 12% fewer transistors than the self-checking CSeA in [9]. In [12], a self-checking CSeA is proposed using a parity prediction approach in which the operands are provided to the adder along with their respective parities. However, it cannot perform fault localization and has limited fault coverage, because it can only indicate a fault that occurs in an odd number of bits. Although these approaches were shown to be effective for SEUs, they cannot perform fault recovery with minimum area overhead.
To boost the reliability of adders, the most conventional method is triple modular redundancy (TMR), in which two redundant modules are employed to produce additional outputs and the final output is selected via a voter circuit [13], [14]. However, fault propagation may cause a common mode failure, which cannot be handled by TMR. To address this issue, a shifted operand approach is used in [15] for a TMR-based self-checking ALU design. A similar concept of shifted and rotated operands is also used in [16] to minimize the required diversity in the ALU architecture for a self-checking TMR design. Another problem of TMR is its large area overhead, which is at least twice that of a normal design. To overcome this limitation, a partial adoption of TMR is utilized in [17]. In this approach, only the most significant bits (MSB) block is triplicated, which however increases the possibility of system failure to more than 50%. Despite its simplicity, TMR cannot be applied in systems where area is the major concern.
Some non-conventional techniques have also been adopted for reliable adder designs. One such technique is the self-repairing signed digit adder proposed in [18], in which fault localization is achieved because carry chains propagate only to the neighboring adder block. It uses both hardware and time redundancy for self-checking and self-repair. However, this design can only detect faults when an odd number of bits are faulty, and it is sensitive to faults in the parity predictor and error indicator. In [19], a self-repairing conditional sum adder (CoSA) with a single-spare hot-standby approach is presented. However, it only provides self-checking and repair for the conditional selection cell module, which is the building block of the CoSA. This limited fault coverage makes the design less robust. These techniques incur less area overhead than TMR, but their fault coverage is limited and fault recovery is not always possible.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In this paper, a self-checking and self-repairing hybrid adder (HA) design with reduced area and time overhead is proposed. The proposed adder combines the low complexity of the RCA with the high speed of the CSeA. Fault detection and localization are realized by using a self-checking full adder (SFA), in which fault detection is independent of the propagated carry. To minimize the area, a single-RCA-based CSeA approach is adopted together with a hardware-friendly implementation using pass transistor logic. Moreover, a square-root topology is used to reduce the delay of the proposed design. The proposed self-checking HA, with fault localization and multiple fault detection, requires on average 50% more transistors than a traditional CSeA design. A distributed fault-recovery mechanism using a hot-standby approach is further proposed to reduce the probability of system failure.
The remainder of this paper is organized as follows. Section II describes the proposed self-repairing HA design. Comparative analysis with previous approaches is presented in Section III. Finally, concluding remarks are presented in Section IV.

II. PROPOSED SELF-REPAIRING HYBRID ADDER DESIGN
The self-repairing HA design is proposed by considering the area overhead, delay and the fault coverage.

A. HYBRID-ADDER DESIGN
The time required for a CSeA to compute the lowest bits is greater than the time required for an RCA. This additional delay is caused by the MUX. Therefore, if a simple RCA is employed for the initial bits, the design becomes more efficient in terms of hardware and time delay. The complexity is also reduced by using an RCA as the first block. This is why, in the proposed HA design, the least significant bits are computed using an RCA, while a single-RCA-based CSeA is used for computing the higher bits, as shown in Fig. 1(a) and (b). In addition, the proposed HA design follows the square-root topology, because a linear CSeA design has a time delay similar to that of a simple RCA. A sub-linear delay approach is therefore used to balance the delay path by dividing the adder into blocks whose sizes increase linearly as m, m + 1, ..., m + l.
It should be noted that the RCA block (RBL) is the fundamental building block of the RCA, shown in Fig. 1(a), whereas the CSeA consists of two fundamental blocks, namely the initial block (INL) and the adder block (ABL), as shown in Fig. 1(b). Two fundamental blocks are needed to construct the CSeA because of the basic principle of the single-RCA-based CSeA design, which states that: except for the least significant Sum bits, which are always complements of each other, a Sum bit computed for C_in = 1 is the complement of the corresponding Sum bit computed for C_in = 0 if all the lower Sum bits computed for C_in = 0 are equal to logic 1.
The initial block (INL) is therefore responsible for generating the least significant Sum bit for C_in = 1 by complementing the least significant Sum bit generated for C_in = 0. All the other Sum bits are generated by the adder blocks (ABL), in which the AND gate determines the status of the previous Sum bits computed for C_in = 0, while the XOR gate generates the corresponding Sum bit for C_in = 1 by considering that status. The number of ABLs used in a CSeA block is equal to (size_of_the_CSeA_block − 1).
In Fig. 1(b), the partial Sum and C_out bits are represented by S_i^j and C_i^j, respectively, where j indicates the initial C_in and i indicates the bit number. A fault is indicated by the error signal E_f. The final C_out is generated by the Module of Final C_out (MOFC). The C_out generated by the MOFC after each CSeA block is treated as the actual C_in for the next CSeA block, whereas the C_out of the RCA block is used as the actual C_in for the first CSeA block.
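The single-RCA principle described above can be checked with a small behavioral model. The following Python sketch is only an illustration, not the transistor-level design of Fig. 1: the function names are hypothetical, bits are stored least significant first, and the final carry for C_in = 1 is assumed to be the C_in = 0 carry ORed with the AND of all Sum bits, which is one way an MOFC could behave.

```python
def to_bits(value, width):
    """Little-endian list of bits (index 0 is the least significant bit)."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def rca_block(a_bits, b_bits, c_in):
    """Plain ripple-carry addition; returns (sum_bits, c_out)."""
    sums, carry = [], c_in
    for a, b in zip(a_bits, b_bits):
        sums.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return sums, carry


def csea_block(a_bits, b_bits, c_in):
    """Carry-select block built from a single RCA rail (behavioral sketch)."""
    s0, c0 = rca_block(a_bits, b_bits, 0)  # the only RCA rail, C_in = 0
    s1 = [s0[0] ^ 1]                       # INL: complement the least significant bit
    x = s0[0]                              # X_i: AND of all lower Sum bits for C_in = 0
    for s in s0[1:]:
        s1.append(s ^ x)                   # ABL XOR: flip only if all lower bits are 1
        x &= s                             # ABL AND: update X_i for the next ABL
    c1 = c0 | x                            # assumed MOFC behavior for C_in = 1
    return (s1, c1) if c_in else (s0, c0)
```

For example, `csea_block(to_bits(7, 4), to_bits(8, 4), 1)` yields all-zero Sum bits with a carry out of 1, because every Sum bit computed for C_in = 0 is 1 and each one is therefore complemented.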

B. FAULT DETECTION AND LOCALIZATION
Fault localization is achieved by using a self-checking approach that is independent of the propagated carry. In [11], a self-checking full adder was presented which detects a fault based on its internal functionality and is independent of the propagated carry. The relationship between the input and output bits of the full adder is utilized for self-checking. Consider a full adder with inputs A, B, C_in and outputs Sum, C_out, as shown in Fig. 1(c). No fault is indicated as long as Property 1 remains valid for that full adder. It can be observed from Fig. 1(c) that the self-checking full adder can be designed at the expense of an extra Equivalence Tester (E_qt) bit, which indicates the relationship between the input bits. E_qt is equal to 1 if all input bits are equal, and 0 otherwise. Hence, the three equations Eq. (1) to (3) need to be implemented to design a self-checking and fault-localized adder.
Since the goal of this design is to reduce the area overhead without compromising reliability, Eq. (1) to (3) need to be implemented with a minimum transistor count. A high-speed and area-efficient full adder design is found in [20]. However, this approach cannot be adopted completely because of the logic sharing between Sum and C_out, which increases the probability of common mode failure. Therefore, only the equation and transistor-level implementation of C_out have been adopted from that design.
The final implementation of Eq. (1) to (3) using pass transistor-based approach is shown in Fig. 2.
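The self-checking idea can be illustrated at the behavioral level. The sketch below does not reproduce Eq. (1) to (3) or the circuit of Fig. 2; it relies on the standard full-adder identity that Sum equals C_out exactly when all three inputs are equal, and assumes a checker that raises E_f whenever this agreement differs from E_qt. The function names and the `flip_sum`/`flip_cout` fault-injection parameters are hypothetical.

```python
from itertools import product


def self_checking_full_adder(a, b, c_in, flip_sum=0, flip_cout=0):
    """Return (s, c_out, e_qt, e_f); flip_* inject a fault on one output."""
    s = (a ^ b ^ c_in) ^ flip_sum
    c_out = ((a & b) | (c_in & (a ^ b))) ^ flip_cout
    e_qt = int(a == b == c_in)              # 1 only when all inputs are equal
    e_f = int((s == c_out) != (e_qt == 1))  # assumed checker: outputs agree iff e_qt = 1
    return s, c_out, e_qt, e_f


def check_all_inputs():
    """Exhaustively confirm fault-free silence and single-output fault detection."""
    for a, b, c_in in product((0, 1), repeat=3):
        if self_checking_full_adder(a, b, c_in)[3] != 0:
            return False
        if self_checking_full_adder(a, b, c_in, flip_sum=1)[3] != 1:
            return False
        if self_checking_full_adder(a, b, c_in, flip_cout=1)[3] != 1:
            return False
    return True
```

In this model the check uses only the local inputs and outputs of one full adder, so an incorrect incoming carry does not raise E_f in a healthy cell. A simultaneous fault on both Sum and C_out of the same cell is not detected by this simple check.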

C. SELF-REPAIRING APPROACH
A hot-standby approach has been adopted for fault recovery. In this approach, if a fault is detected in any of the full adders, the generated error signal shifts the input bits so that the faulty adder is not used for computation. The main challenges in performing this shift are the carry chain that links consecutive full adders, and the X_i bit that indicates the status of all previous Sum bits in each CSeA block, as shown in Fig. 1(b). The carry-chain problem is resolved by making C_out dependent on the error signal E_f of the SFA. In case of a fault, C_out (i.e. C_i) is made equal to C_in (i.e. C_{i−1}). X_i indicates the status of all previous Sum bits computed for initial C_in = 0: if any previous Sum bit is equal to 0 then X_i is 0, otherwise it is 1. The Sum bit of each ABL depends on the previous value of X_i, therefore when a fault is detected the value of X_i should not be updated for the next ABL. To achieve this, the error signal E_f is used to replace the Sum bit when a fault is detected: because X_i is produced by an AND gate, setting the current Sum bit to logic 1 causes the previous value of X_i (i.e. X_{i−1}) to be propagated.
Note that X_i is only propagated to the ABLs within each CSeA block and to the following MOFC block; it is not propagated to the next CSeA block because each CSeA block is independent of the previous block. To accommodate all these changes, the fundamental blocks used to extend the CSeA to an n-bit CSeA, i.e. the ABL and INL in Fig. 1(b), have been modified as shown in Fig. 3(a) and (b), respectively.
Since a carry chain exists in the RCA block as well, the fundamental block of the RCA (i.e. the RBL) has also been modified, as shown in Fig. 3(c). However, the OR gate present in the modified RBL is not applicable to the first full adder of the RCA because there is no previous error signal. The final Sum bits generated by the adder also need to be shifted to accommodate the shifted operands. Therefore, additional multiplexers are used to perform the shift operation on the Sum bits, as shown in Fig. 3(d).
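The effect of the carry bypass and operand shift can be sketched with a behavioral model of one ripple-carry block that contains a single spare cell. This is a simplified illustration under stated assumptions, not the circuit of Fig. 3: the function name is hypothetical, the spare is assumed to sit at the most significant end of the block, the index of the faulty cell is assumed to be already known from E_f, and the X_i handling of the CSeA blocks is not modeled.

```python
def repaired_ripple_add(a_bits, b_bits, c_in, faulty=None):
    """Add little-endian bit lists over len(a_bits) + 1 cells (one hot-standby spare).

    `faulty` is the index of a cell whose E_f is raised, or None if all cells work.
    """
    cells = len(a_bits) + 1
    sums, carry, bit = [], c_in, 0
    for cell in range(cells):
        if cell == faulty:
            continue              # bypass: C_out = C_in, and no Sum bit is taken here
        if bit == len(a_bits):
            break                 # every operand bit is placed; the spare stays idle
        a, b = a_bits[bit], b_bits[bit]  # operand bit shifted onto the next healthy cell
        sums.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
        bit += 1
    return sums, carry
```

With any single faulty cell the model still returns the correct sum, which mirrors the single-fault recovery that one spare provides; a second fault in the same block would leave too few healthy cells.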
The self-repairing capability is limited by the number of spare SFAs. However, the self-checking property of the adder block remains active even when fault recovery is no longer possible: if another SFA becomes faulty after a replacement, the fault is still indicated but cannot be repaired. To improve the recovery rate for adders larger than 8 bits, the number of spares is increased so that each block has one spare module, which means that each block can recover from a single fault at a time. A single spare is kept in each block because the probability of multiple faults occurring in a smaller block is lower than in a larger block. To illustrate this idea, let an n-bit adder be divided into N blocks, each with t full adders. Let r random faults be introduced into the system; the probability of having x faults in the same block, without replacement, can then be computed by Eq. (4), where 0 < x <= t. The system cannot recover from the faults in a block if x > 1, but recovery remains possible in all other blocks where x <= 1. Therefore, the probability of system failure when any single block receives two or more faults is given by Eq. (5). The probability of fault recovery in a block can be computed by Eq. (6).
The probability of fault recovery when 2 out of 3 faults occur in a single block of the adder is shown in Table 1. To analyze the impact of block size on fault recovery, three adder sizes are considered, each constructed using two different block sizes. It can be observed from Table 1 that the number of full adders in each block decreases as the number of blocks increases, whereas the number of spare modules increases with the number of blocks because each block has a single spare module for recovery. The overall size of the adder can be determined by Eq. (7). It can also be observed from Table 1 that increasing the number of full adders in a block increases the probability of failure of that block. For example, a 32-bit adder can be built using either 2 blocks or 4 blocks, with 17 and 9 full adders per block, respectively. As the number of blocks increases from 2 to 4, the probability of block failure decreases from 38.6% to 13.6%. However, the area overhead of the adder also increases with the number of blocks.
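The block-failure figures quoted above can be approximated with a short calculation. Eq. (4) is not reproduced in this text, so the sketch below assumes a hypergeometric model in which r faults are placed uniformly, without replacement, over N·t full adders; the function name and parameter names are hypothetical.

```python
from math import comb


def prob_faults_in_block(num_blocks, block_size, total_faults, faults_in_block):
    """Probability that a given block receives exactly `faults_in_block` faults.

    Assumes faults land uniformly without replacement (hypergeometric model).
    """
    total_cells = num_blocks * block_size
    return (
        comb(block_size, faults_in_block)
        * comb(total_cells - block_size, total_faults - faults_in_block)
        / comb(total_cells, total_faults)
    )


# 32-bit adder, 3 faults, 2 of them in one block (spare included in block size)
p_two_blocks = prob_faults_in_block(2, 17, 3, 2)   # about 0.386
p_four_blocks = prob_faults_in_block(4, 9, 3, 2)   # about 0.136
# 64-bit adder with 8 blocks of 9 full adders
p_recover_64 = 1 - prob_faults_in_block(8, 9, 3, 2)  # about 0.962
```

Under this assumption the model gives roughly 38.6% and 13.6% for the two 32-bit configurations, and a recovery probability of about 96.2% for the 64-bit adder with 8 blocks, close to the 96.1% reported for that case.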

III. RESULTS AND BENCHMARK
The proposed design with the self-checking property is compared in terms of area overhead and fault coverage with the self-checking CSeA designs reported in [10] and [12]. The proposed self-repairing HA is also compared with the self-repairing CoSA approach [19] and the reduced precision TMR [17]. The transistor counts of each module for the self-checking and self-repairing HA are presented in Table 2. It should be noted that the Sum and Carry bypass modules are required only for the self-repairing design; therefore, they are not considered when comparing the area overhead of the self-checking design. The required number of logic gates and other modules, along with the total transistor overhead, is shown in Table 3, where n is the size of the adder, m is the size of the RCA block and k is the total number of CSeA blocks used in the design. The value of k varies with adder size; in this work, k is selected to be 1, 2, 3, 5 and 8 for the 4-, 8-, 16-, 32- and 64-bit adders, respectively. In the standard CSeA design without self-checking, the transistor counts for the full adder and MUX are reduced to 12 and 4, respectively, because the E_qt and checker modules are not required.

A. COMPARISON WITH SELF-CHECKING CSeA
The area overhead of the proposed self-checking HA without recovery is compared with the previously reported self-checking CSeA design. It should be noted that a uniform complementary pass transistor logic design approach has been adopted while comparing the transistor counts, such that an inverter has been used after every stage of pass transistor. The implementation of sub modules with and without self-checking is shown in Fig. 4.
It can be observed from Table 4 that the proposed design requires on average 50% more transistors than the standard CSeA design without self-checking, whereas the required number of transistors is reduced by 76.76% and 115% as compared to [12] and [10], respectively. It should be noted that the proposed approach also requires 68.68% fewer transistors than our previously proposed self-checking CSeA [11]. The increase in transistor count for different adder sizes is shown in Fig. 5. It can be observed that the proposed approach has the least overhead among the compared approaches.
In addition to the reduced area overhead, the proposed design provides fault localization and can detect multiple faults, on the condition that a single module does not have multiple faults, while [10] can only detect a single fault at a time and [12] can only detect faults in an odd number of bits, without fault localization. Besides the problem associated with an odd number of erroneous bits, the approach in [12] is not totally self-checking because of the logic sharing between the Sum and the propagated carry block. Any fault in the shared logic is easily masked and cannot be detected with the parity prediction approach.
The power estimation is done using the Cadence tool. The traditional 32-bit CSeA design requires 4.51 mW, which increases to 6.90 mW for the proposed self-checking HA, i.e. an increase in power consumption of 52.9%. The delays for computing the final C_out using the traditional CSeA design and the proposed self-checking CSeA are given in Eq. (8) and (9), respectively, where h is the number of full adders in the final CSeA block. It can be observed that the delay increases by only two logic gates.

B. COMPARISON WITH SELF-REPAIRING APPROACHES
The proposed self-repairing adder with a single spare module requires an average area overhead of 186% as compared to the traditional CSeA without self-checking, as shown in Table 4. The power consumption of the 32-bit HA design also increases by 184.2% as compared to the traditional CSeA. In terms of time overhead, the latency is given by Eq. (10). The overhead is mainly caused by the MUXs controlling the carry propagation chain.
The proposed self-repairing HA design is compared with the previously reported self-repairing CoSA [19] and reduced precision redundancy adders (RPRA) [17]. It should be noted that both the CoSA and RPRA approaches rely on graceful degradation, in which only some portions of the circuitry are considered for fault detection and recovery. In the case of the CoSA, the design cannot detect a fault during the actual addition operation. Moreover, only a single conditional selection cell (CSC) module can be tested at a time with a given test pattern. The self-checking property is not considered for modules other than the CSC, such as the chain of MUXs and the shift registers. Furthermore, the self-repairing process is expensive because the whole CSC module, which is responsible for 2-bit addition, has to be replaced with the spare one. In addition, the CoSA loses its fault diagnosis ability once no further spare module is available.
The RPRA [17] approach, on the other hand, can only correct errors in the MSBs, while the least significant bits (LSBs) are fed directly to the output. Hence, a fault in either the LSBs or the voter for the MSBs is undetectable. Also, a fault propagated from the LSBs to the MSBs via C_out cannot be detected.
In contrast to the previous approaches, the proposed design can perform run-time fault detection during the actual addition process and can detect multiple faults at a time, on the condition that each module does not have more than one fault. Fault recovery depends on the number of spare modules, but the self-checking property of the design remains valid whether or not recovery is possible.

IV. CONCLUSION
A self-checking and self-repairing HA design has been presented with reduced area overhead and increased fault coverage as compared to the previously presented design approaches. The HA design follows the architecture of the single-RCA-based CSeA, with the only difference being the initial bits, which are computed using an RCA. A run-time self-repairing approach has been adopted by using a hot-standby topology. The proposed design can easily be extended to any size by using the fundamental block designs presented in the paper.
The proposed design with self-checking has been compared with the previously reported self-checking CSeA designs in terms of area and fault coverage. It has been observed that the proposed self-checking HA approach, with a delay overhead of only two logic gates, requires 50% more transistors than the traditional CSeA without self-checking, whereas the required overhead is 76.76% and 115% less than that of the previously proposed self-checking CSeA approaches. Moreover, due to the distributed self-checking mechanism, the proposed approach can detect and localize multiple faults, on the condition that a single module has only a single fault at a time.
A hot-standby approach has been adopted for fault recovery. The area overhead increases to 186% as compared to the standard CSeA approach. However, the probability of recovering from multiple faults is increased as compared to the previous self-repairing approaches. A 64-bit adder with 8 equally sized blocks can handle 3 consecutive faults with 96.1% probability, on the condition that each block has a single spare module. It should be noted that the self-checking property remains valid irrespective of whether recovery is possible, which was not the case in previous approaches.