A Thermal-Aware On-Line Fault Tolerance Method for TSV Lifetime Reliability in 3D-NoC Systems

Through-Silicon-Via (TSV) based 3D Integrated Circuits (3D-ICs) are among the most advanced architectures, providing low power consumption, shorter wire length, and a smaller footprint. However, 3D-ICs face lifetime reliability challenges due to high operating temperatures and interconnect reliability, especially of the TSVs, which can significantly affect the accuracy of applications. In this paper, we present an online method, named IaSiG, that supports the detection and correction of lifetime TSV failures. By reusing the conventional recovery method and analyzing the output syndromes, IaSiG can determine and correct the defective TSVs. Results show that within a group, $R$ redundant TSVs can fully localize and correct $R$ defects and support the detection of $R+1$ defects. Moreover, by using $G$ groups, it can localize up to $G\times R$ and detect up to $G\times (R+1)$ defects. An implementation of IaSiG for 32-bit data in eight groups with two redundancies has a worst-case execution time (WCET) of 5,152 cycles while supporting at most 16 defective TSVs (50% localization). By integrating IaSiG into a 3D Network-on-Chip, we also apply a grid-search-based empirical method to insert suitable numbers of redundancies into TSV groups. The empirical method takes the operating temperature as the fault acceleration factor, since temperature is one of the major issues of 3D-ICs. 
The results show that the proposed method can reduce the number of redundancies from the uniform method while still maintaining the required Mean Time to Failure.


I. INTRODUCTION
Serving as vertical wires between two adjacent layers in Three Dimensional Integrated Circuits (3D-ICs), Through-Silicon-Vias (TSVs) offer extremely short lengths and low latency, which could bring high speeds and lower power to inter-chip communication [1]- [3]. TSV-based 3D-ICs also have smaller footprints despite the TSV's overheads [4] due to the three dimensional structure.
However, one of the major concerns of TSVs is reliability due to their low yield rates [5], vulnerability to thermal and stress effects, and the crosstalk issues of parallel TSVs [6], [7]. A single TSV that is defective after the manufacturing phase can corrupt the connection between two layers or even split the system into parts. Therefore, identifying and correcting faulty TSVs is necessary to improve the overall yield rate. On the other hand, because 3D-ICs have higher operating temperatures and large temperature differences between layers [8], the thermal and stress impacts on their reliability are also critical and can shorten the expected lifetime. Consequently, there is an imperative need to improve not only the yield rate at the manufacturing phase but also the lifetime reliability, given the vulnerability of TSVs to thermal and stress effects. To address these issues, researchers have dealt with TSV faults in several phases (pre-bond, post-bond, and post-manufacture) and from different aspects: detection, recovery, online, or offline.
The associate editor coordinating the review of this manuscript and approving it for publication was Fanbiao Li.
To localize defective positions, most systems use built-in self-test (BIST) [9], external testing [10], [11], and online testing [12], [13]. These works mostly focus on issuing test patterns and capturing them at the other terminal of a TSV to assess its health. To tolerate defects, there are three main approaches: (i) hardware fault-tolerance circuits [14], redundancies [15]-[17], or reliability mapping [7], [18]; (ii) information redundancy such as coding techniques [19], [20] or modular redundancies [21]; or (iii) algorithm-based fault-tolerance [22], [23]. While the hardware fault-tolerance circuit [14] uses a specific design to analyze the output of a TSV, identify its status, and correct it (i.e., by raising or lowering the output voltage), the redundancy-based approaches [15]-[17] use spare TSVs to take over the tasks of the faulty TSVs. On the other hand, the reliability mapping approach analyzes the potential critical issues and optimizes the design for a certain requirement. The information redundancy methods treat defective TSVs as flipped bits and use coding techniques to detect and correct the corruptions in transmitted bits. Both hardware and information redundancies keep the faulty TSV group and provide a correction method; however, there are cases where the fault rates exceed the limits of these approaches. Therefore, algorithm-based fault-tolerance methods [22], [23], which provide an alternative way to execute the system, can be useful. This type of approach can help the system maintain its reliability under a reasonable tradeoff in performance. Although commercial CAD tools and existing solutions have become mature for defect localization and detection, having an online, non-blocking, and low-cost solution helps prevent the expensive consequences of operating systems under faults.
Although numerous methods address the reliability issues of TSVs, they mostly focus on offline testing and recovery in three phases: pre-bond, post-bond, and post-production. However, the high operating temperature is one of the critical issues of 3D-ICs [8]. In most academic and industry models [24], [25], the fault rates are expected to accelerate exponentially with the operating temperature. TSVs made of copper also have a higher activation energy than general CMOS, which makes their fault rate accelerate even further with temperature. With higher temperatures than conventional 2D-ICs and large temperature differences between layers, lifetime reliability is one of the critical issues for 3D-ICs.
From the design perspective, a lightweight localization and recovery method with graceful degradation is necessary for TSV-based 3D-ICs. However, several issues motivate our method:
1) To ensure real-time operation, fault detection and recovery need to be completed in a timely manner with an acceptable extension of the execution time [26]. In other words, the system must respond to new faults (as an event) before a certain deadline. The existing works that periodically use BIST [9] or external testing [10], [11] can offer a high coverage; however, they cannot satisfy these requirements, as they need an enormous number of cycles to complete and can also block the communication, which degrades the overall performance.
2) A possible approach that can help the system operate in real time while ensuring the quality of connections is to use ECCs [19]. However, ECCs are usually limited in the number of detectable and correctable faults, which makes them inefficient against clustered defects [15] in TSVs.
3) The integration of recovery is not well considered in most TSV localization works, especially for online and lifetime reliability. Most works deal with defect localization and recovery separately. Meanwhile, safety-critical systems require high availability, which needs self-correction on the fly.
4) While adding redundancies [16], [17] for manufacturing defects treats the fault rate as a uniform distribution, the lifetime reliability of TSVs is heavily affected by the operating temperature [24], [25]. Since the temperature also has a non-uniform distribution in 3D-ICs [8], designers must consider how TSVs are grouped to improve system reliability.
Motivated by the above problems, TSV-based 3D-ICs need a method to monitor, detect, localize, and recover from TSV defects. In [27], an on-communication test (OCT) method was previously presented, which tests the communication medium along with its operations. The main idea of that work is to use an Error Correction/Detection Code as the baseline and improve its detectability and correctability using augmented algorithms. On-communication testing offers not only no degradation but also a short response time for TSVs. Compared to periodic testing [15], OCT executes and finishes after a new fault has occurred, which guarantees the real-time requirements. In this paper, we propose a novel method named Isolation and Shift in Group (IaSiG), which is not only an OCT method but also provides recovery right after its execution. This work is based on our preliminary work in [27], with the following new contributions:
• A low-response-time on-communication test (OCT) algorithm to isolate and check the possible defects in a group of TSVs. A double-check mechanism is also integrated to reduce the probability of hidden defects.
• The proposed OCT reuses the unused spare TSVs for the testing purpose. Therefore, the proposal utilizes all the redundancies for both detection and recovery.
• A grouping algorithm to help reduce the test time and increase the detectability and localizability of the isolation and check algorithm.
• Integration of the proposed method onto a 3D-Network-on-Chip (3D-NoC) and performance evaluation. An empirical redundancy insertion also helps to reduce the number of redundancies while maintaining the required reliability.
In summary, this paper aims to provide an online and non-blocking method for detecting and correcting failures in TSVs. We also investigate the impact of operating temperatures and adapt the number of redundancies to that. Along with mathematical analyses, this paper provides a comprehensive platform for dealing with the lifetime reliability of TSVs and could be widely used. We also demonstrate the efficiency of the proposed methodology using a 3D-NoC under PARSEC benchmarks. The organization of this paper is as follows: Section II presents the related work. In Section III, we introduce the proposal and Section IV is dedicated to illustrate the evaluation experiments and findings. Finally, Section V concludes the paper.

II. RELATED WORKS
In this section, we present the related works on ECC methods, TSV testing, TSV recovery and scheduling approach.
Since parity calculation requires only XOR gates, it has been the backbone of most low-cost error detection and correction codes. Hamming code and its extension SECDED (Single Error Correction Double Error Detection) [19] are also two common methods based on parity-check. The delay and area complexity of these methods are only $O(n)$ and $O(\log_2 n)$, which make them more suitable for the encoding and decoding scheme for high-speed TSV links. To correct more faults, Orthogonal Latin Square Code is also another option for TSV with low cost and modular design [28]. In [29], the authors presented a method named SEC-DAEC (Single Error Correction Double Adjacent Error Correction) to correct not only a single flipped bit but also two adjacent flipped bits.
In this proposal, we adopt the parity check, which can detect one fault. Therefore, the proposed technique can be integrated into any ECC scheme that is based on the parity check.
To correct faulty TSVs, there are two major approaches: (1) correction circuits [14], such as using a voltage comparator to detect and correct open defects, and (2) double [30] or shared spare TSVs [16], [17] to replace the faulty ones. While a correction circuit is low cost, it is limited in terms of correctability. Therefore, recent research tends to focus on spare TSVs instead. There are four major methods of shared spare TSV recovery: (a) shifting [31], (b) switching [16], (c) crossbar [17] and (d) network [15]. To redirect the TSV signal to the spare ones, a multiplexer/de-multiplexer [16], [17], [31] or tri-state gate [23] could be used. Double-TSV [30] is another method using redundancy, but it is not cost efficient. In [32], the authors approached the redundancy placement using cobweb-like shapes. The redundancies are placed at the border of the cobweb and the signals of failed TSVs are shifted outward in chains. Another shape of the TSV group could be a honeycomb [33], which has a lower effect on area than the conventional 2D Mesh while still maintaining the scalability feature. By combining one or several honeycomb-shaped groups of TSVs with the time-division multiplexing technique, the approach in [33] can lower the cost of the design even further. Although alternating different shapes of TSV groups can end up with a better result, the 2D-Mesh-like placement still best fits the ASIC design. Park et al. [34] introduce the TSV set structure and redundancy re-usage to not only detect and correct TSV faults but also deal with the intervention of soft errors during online testing/repairing. One thing we can easily notice is that most of the above methods provide detection/recovery using spare TSVs regardless of the TSV position or the repairing structure. By treating these as an optimization problem, the work in [18] uses Integer Linear Programming to find the optimal spare and functional TSV positions and the group structure.
The built-in Error Detection/Correction Code [19] (EDC/ECC) could help detect and locate faults in TSVs as in normal wires. However, the inconsistent behavior between TSVs and wires was discussed in [35], revealing that the flipped bit is not consistent in TSVs. On the other hand, despite its immediate response time, ECC/EDC can localize and detect only a limited number of defects. Also, additional bits require extra TSVs, which are costly since the size of a TSV is significant.
Another common method is to use testing circuits or BISTs. Works in [11], [12], [14] depicted a more fine-grained method, which could detect open defects using a simple circuit. Depending on the level of defects, they can even provide the recovery method. The grouping method allowing column and row check is presented in [36]. This method supports open, short and bridge defects and could reduce the testing time from $O(n)$ to $O(\sqrt{n})$. For online testing, injected test patterns to the TSV can be captured at the output and analyzed to find open defects using a NAND gate with logic threshold voltage [12]. Serafy et al. [13] presented a lifetime reliability using a resistance tracking method and BIST to overcome the aging in TSVs. In [17], the authors proposed a test access point for injecting and collecting test vectors while Van der Plas et al. [6] used a test pattern generator to test open TSV defects. In [9], other methods of TSV's BIST for pin-hole and void defects are also presented.
One thing in common between these types of tests (BISTs and dedicated circuits) is that they require system interruption (partly or totally) when they detach tested devices/modules from the system. This is not affordable in real-time and safety-critical applications, as we previously discussed.

A. TEST SCHEDULING
Since naively running BISTs is not preferable because of their costly execution, scheduling this type of test periodically is more suitable. This method, called Periodic BIST (P-BIST), activates the BIST periodically. Here, we focus mainly on Network-on-Chip testing as the scope of this work. The works in [37] and [38] periodically activate their testing circuits; however, they only execute during free time slots to avoid conflicts between data traffic and test traffic. The tested routers still maintain their functions as usual. For the detached core, they also provide alternative connections during the test time. Huang et al. [39] also presented another non-blocking test for Networks-on-Chip. A test for NoC fabrics, which can be applied to 3D-NoCs and uses dedicated test data and structure, is presented in [40]. The common idea of these methods is to schedule tests smartly in order to avoid creating congestion/conflicts in the system and thereby reduce the performance degradation. Because their experiments are limited in terms of size, more complicated systems might lead to costly degradation.

III. PROPOSED ARCHITECTURE AND ALGORITHM
This section first shows the TSV organization. Then, we overview the Isolation and Shift (IaS) algorithm, which we enhance to a group-based method named Isolation and Shift in Group (IaSiG). Furthermore, we provide mathematical analyses for the proposal. Later, we present the integration of the method onto the 3D Network-on-Chip to support the detection and correction of faults. In this work, we consider the online monitoring and correction of TSVs. We assume that the manufacturing test has already been performed and the system can correct TSVs if needed.

A. ISOLATION AND SHIFT (IaS) OVERVIEW
This section provides an overview about Isolation and Shift (IaS), which is our preliminary work in [27]. We later discuss the advantages and drawbacks of the IaS technique.

1) TSV ORGANIZATION
In this work, we organize the TSVs as one-dimensional arrays and adopt the shifting mechanism for recovery [31]. We also note that our technique is general and can be applied to other organizations and recovery methods. Here, we assume a TSV group of $M$ original TSVs and $R$ spare TSVs. We denote each TSV as $t_i$, where $i$ is the index of the TSV. The input signals for the TSVs are likewise organized as a one-dimensional array and denoted as $s_i$.
For a set of M bits data, we use parity to check correctness of the data. This leads to M + 1 bit codeword and the TSV set now has a size of M + R + 1. Depending on the ECC/EDC, the number of TSVs may vary.
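The codeword layout described above can be sketched as follows. This is a minimal illustration, assuming even parity and bits stored as Python lists; the function names `encode_parity`, `parity_syndrome`, and `tsv_set_size` are hypothetical and do not come from the original design.

```python
from typing import List


def encode_parity(data_bits: List[int]) -> List[int]:
    """Append one even-parity bit to M data bits, giving an (M + 1)-bit codeword."""
    parity = 0
    for bit in data_bits:
        parity ^= bit  # XOR accumulates the parity of all data bits
    return data_bits + [parity]


def parity_syndrome(codeword: List[int]) -> int:
    """Return 0 when the received codeword passes the parity check, 1 otherwise."""
    syndrome = 0
    for bit in codeword:
        syndrome ^= bit
    return syndrome


def tsv_set_size(m: int, r: int) -> int:
    """Size of a TSV set: M data TSVs, one parity TSV, and R spare TSVs."""
    return m + r + 1
```

With a different ECC/EDC, only the number of check bits in `tsv_set_size` would change.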

2) TSV MODEL
Despite having a parity check, there is no guarantee that we can detect an odd number of defects. The main reason is the inconsistent behavior of defective TSVs [35]. For instance, with short-to-substrate defects, the defect in a TSV could be hidden when it transmits the value '0', which is at 0 V. Here, we adopt the TSV defect models from [35] as follows:
• Short-to-substrate: the value of the TSV is stuck at '0'.
• Open: a certain latency, which is added to slow down the transition of TSV, delays the value of TSV by one clock cycle.
• Bridge: Two or more TSVs are shorted, which prevents them from having different values. If all of the bridged TSVs send the same logic value, the output is correct. However, if they are different, the output follows the majority voting. If the numbers of '0' and '1' are equal, the output is randomly assigned as a post-metastability result.
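The three defect models listed above can be expressed as a small behavioral sketch. It is only an illustration under stated assumptions: the open defect is assumed to start from a stored '0', the tie in a bridge is resolved with a pseudo-random generator, and all names (`short_to_substrate`, `OpenTsv`, `bridge`) are hypothetical.

```python
import random
from typing import List, Optional


def short_to_substrate(bit: int) -> int:
    """Short-to-substrate: the receiver always sees '0', whatever was sent."""
    return 0


class OpenTsv:
    """Open defect: the output lags the input by one clock cycle."""

    def __init__(self, initial: int = 0) -> None:
        self.previous = initial  # assumed power-up value, not specified in the text

    def transmit(self, bit: int) -> int:
        out, self.previous = self.previous, bit
        return out


def bridge(bits: List[int], rng: Optional[random.Random] = None) -> List[int]:
    """Bridge: all shorted TSVs output the majority value; a tie is resolved randomly."""
    rng = rng or random.Random()
    ones = sum(bits)
    zeros = len(bits) - ones
    if ones > zeros:
        value = 1
    elif zeros > ones:
        value = 0
    else:
        value = rng.randint(0, 1)  # models the post-metastability result
    return [value] * len(bits)
```

Note that a short-to-substrate TSV is silent whenever it carries '0', which is why a single parity check cannot be trusted on its own.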

3) MECHANISM
Isolation and Shift is based on two phases. The isolation phase is performed by considering the isolated TSV as unusable. Then, the system shifts its signals to utilize the spare TSV to support communication. For instance, if the TSV with index $f$ is isolated, the signals $s_i$ with $i \geq f$ are routed to TSVs $t_{i+1}$. By isolating each TSV in the group in turn, the decoder can remove each TSV from the communication. It can then decide whether TSVs have defects based on the syndrome of the parity check. If the system finds that isolating $t_f$ gives a non-faulty output while using $t_f$ gives faulty outputs, $t_f$ is determined to be faulty. Let $S^{(f=i)}$ be the output (syndrome) of the decoder while isolating $t_i$, and $S^{(f=\mathrm{NULL})}$ the syndrome when no TSV is isolated; $t_i$ is then determined to be faulty when $S^{(f=\mathrm{NULL})} = 1$ and $S^{(f=i)} = 0$.
The detection process is based on statistics that collect the faulty behavior (failed parity checks) over a certain number of cycles. Because the behavior of a TSV defect is inconsistent, it can take several cycles to detect the faulty cases. We define a healthy set of TSVs as one that produces no more than $T$ faulty outputs after $K$ transmissions ($K > 2$). To distinguish defects from soft errors, the threshold $T$ could be set to 1; otherwise, $T = 0$. Accordingly, the syndrome is $S = 0$ for a healthy set and $S = 1$ otherwise.
A demonstration of multiple-fault detection is shown in Fig. 1, which illustrates how shifting can help to detect and correct two faults by using the parity check. The proof of using the parity check to detect multiple faults is shown in Lemma 1 in Section III-D.
Table 1 shows the Monte-Carlo simulation with 10,000 cases to verify our solution with the TSV model. We can observe that with $K$ smaller than 32, a certain number of defects remain hidden, whereas under the random data case the defects are detected with $K = 32$. Nevertheless, as analyzed in Lemma 1, there is still a chance of having hidden defects. Therefore, in Algorithm 1, we use double-check to improve the correctness of the decision.
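The shift routing and the threshold-based syndrome can be sketched as below. This is a simplified software model, not the hardware multiplexer network: `route` and `syndrome` are hypothetical helper names, and the syndrome is assumed to be 1 when more than `threshold` of the observed transmissions fail the parity check.

```python
from typing import List, Optional, Sequence


def route(signals: Sequence[int], isolated: Sequence[int],
          n_tsv: int) -> List[Optional[int]]:
    """Place signals on TSVs in index order while skipping isolated TSVs.

    With a single isolated index f, signal s_i stays on t_i for i < f and
    moves to t_(i+1) for i >= f. Unused TSVs are reported as None.
    """
    skipped = set(isolated)
    usable = [t for t in range(n_tsv) if t not in skipped]
    if len(signals) > len(usable):
        raise ValueError("not enough usable TSVs for the codeword")
    lanes: List[Optional[int]] = [None] * n_tsv
    for value, tsv in zip(signals, usable):
        lanes[tsv] = value
    return lanes


def syndrome(parity_failures: Sequence[int], threshold: int = 0) -> int:
    """Return 1 (faulty) if more than `threshold` transmissions failed parity."""
    return 1 if sum(parity_failures) > threshold else 0
```

For example, `route([1, 0, 1, 1], [1], 5)` leaves TSV 1 unused and shifts the last three signals by one position.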

4) ALGORITHM
Algorithm 1 shows the IaS (Isolation and Shift) algorithm for detecting and localizing faults in a TSV group. It starts without isolating any TSVs and keeps calculating the parity check of incoming flits. Once an error occurs, the system starts isolating and checking. Without that, it keeps checking until meeting errors (see line 1-2 of Algorithm 1). Because of this, redundant TSVs are not used frequently, which could avoid aging in those TSVs. Note that we assume that the redundant TSVs are always healthy.
In comparison to the previous work [27], the major improvement of this algorithm is the use of double-check. As shown in Table 1 and Lemma 1, there is a chance of hidden defects. To avoid this scenario, we use double-check to effectively increase the value of $K$. When IaS detects a potential case, it runs $K$ additional transmissions instead of concluding immediately.
Once an error is hit, the system isolates one TSV and uses the redundant ones to handle the communication. The parity check determines whether there is a faulty output. If the non-isolated case is faulty ($S^{(f=\mathrm{NULL})} = 1$) and the isolated case is non-faulty ($S^{(f=i)} = 0$), the system indicates that the faulty positions are the isolated ones. To avoid silent errors, the system performs a double check: once a satisfying case is met after $K$ transmissions, it keeps checking for $K$ extra (double-check) transmissions to ensure correctness (lines 6-12 of Algorithm 1). Note that the system scans through all TSVs in order to isolate them. At the end of the scan, the system checks whether it has found the faulty positions. If not, it concludes that more than $R$ defects have occurred and marks the whole group of TSVs as faulty.
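The scan described above can be approximated in software as follows. This is a sketch under several assumptions rather than a transcription of Algorithm 1: only stuck-at-0 defects are modeled, the traffic is random data with even parity, candidates are enumerated in increasing index order with `itertools.combinations`, and the names `transmit_fails`, `observe`, and `ias_localize` are hypothetical. As in the analysis of hidden defects, a wrong decision remains possible with a very small probability.

```python
import itertools
import random
from typing import Optional, Set, Tuple


def transmit_fails(codeword, isolated, stuck_at_zero, n_tsv) -> int:
    """Send one codeword over the non-isolated TSVs; return 1 if parity fails."""
    usable = [t for t in range(n_tsv) if t not in isolated]
    received = [0 if t in stuck_at_zero else bit
                for bit, t in zip(codeword, usable)]
    return sum(received) % 2  # even parity: a non-zero result is a failure


def observe(isolated, stuck_at_zero, m, n_tsv, k, rng, threshold=0) -> int:
    """Syndrome S over K random transmissions (1 = faulty, 0 = healthy)."""
    fails = 0
    for _ in range(k):
        data = [rng.randint(0, 1) for _ in range(m)]
        codeword = data + [sum(data) % 2]
        fails += transmit_fails(codeword, isolated, stuck_at_zero, n_tsv)
    return 1 if fails > threshold else 0


def ias_localize(stuck_at_zero: Set[int], m: int, r: int, k: int = 32,
                 seed: int = 0) -> Optional[Tuple[int, ...]]:
    """Return the isolated TSV indices, () if no error is seen, None if > R defects."""
    rng = random.Random(seed)
    n_tsv = m + 1 + r  # data TSVs, one parity TSV, R spares
    if observe((), stuck_at_zero, m, n_tsv, k, rng) == 0:
        return ()
    for size in range(1, r + 1):
        for candidate in itertools.combinations(range(n_tsv), size):
            first = observe(candidate, stuck_at_zero, m, n_tsv, k, rng)
            # Double check: K more transmissions before accepting a healthy case.
            if first == 0 and observe(candidate, stuck_at_zero, m, n_tsv, k, rng) == 0:
                return candidate
    return None  # no isolation pattern is healthy: mark the whole group faulty
```

In this model, `ias_localize({2, 5}, m=8, r=2)` is expected to report TSVs 2 and 5, while three defects with `r=2` lead to the whole group being marked faulty.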
To support real-time applications, a fully hardware architecture is used and no connections are blocked during testing. In terms of checking time, the proposal needs a bounded number of transmissions (cycles) to complete its loops. This bound is also the maximum response time to newly occurred faults, i.e., the worst-case execution time (WCET) of the algorithm. In [31], the authors proposed testing TSVs by transmitting the two values '0' and '1'. A test generator is also used in [12], [17]. Here, once TSVs are isolated, they could be tested using a dedicated tester.

5) ARCHITECTURE
Figure 2 shows the simplified architecture for IaS with [M = 4, R = 1] TSVs. The input data width is M − 1 and the encoded data width is M for the parity check. Note that the system can be adopted with another ECC, which has a different coding rate. The encoded data is shifted with the configuration from ''TSV-Fuse'', as represented in Figure 2. This box receives the isolated TSV value (isol_TSV) from the controller.
At the bottom layer, the data output from the TSVs is unshifted using the corresponding configuration from ''TSV-Fuse''. The unshifted data is checked with the ECC to find possible data corruption. The parity check result is sent to the controller, which monitors each case. After looping through all cases, the controller decides on the proper configuration, which also identifies the faulty indexes.
To communicate between two layers, we use a synchronization TSV (s_TSV) for matching the operations of the two controllers. The data is transferred via functional TSVs (f_TSV) with the help of redundant TSVs (r_TSV). Note that, in order to perform our algorithm, s_TSV must be healthy. To protect it, several methods could be used, such as majority voting (3 TSVs for voting) and Double-TSVs. Note that our previous work [27] does not provide the double-checking feature; therefore, s_TSV is not needed there.

6) DISCUSSION
Despite having a certain WCET and providing non-blocking testing, IaS still has two major drawbacks:
1) The testing time of IaS does not scale well. With higher $M$ and $R$ values, the testing time can become enormous, increasing the WCET. For instance, with $M = 32$ and $R = 2$, the WCET of IaS is over 17,000 cycles (see Figure 4).
2) With large $M$ values, when there are more than $R$ defects in a group, marking the whole group as faulty is not efficient for either detection or recovery. Here, providing a certain set of defective TSVs can offer more efficiency.
In order to solve the above two issues, we present hereafter the group-based test, named IaSiG.

B. PROPOSED ISOLATION AND SHIFT IN GROUP
This section presents the Isolation and Shift in Group (IaSiG) algorithm (see Algorithm 2), which builds on IaS. Assume the system has $M$ functional TSVs and $R$ spare TSVs, where the functional TSVs are divided into $G$ groups (clusters) and group $i$ contains $C_i$ TSVs.
Note that we can even use heterogeneous clustering (groups with different $C_i$ values). The cluster currently under test has access to the $R$ spare TSVs, and IaSiG uses IaS to test each group.
Note that, by dividing into groups, the IaS run of each group can locate $R$ defects and detect more than $R$ defects in that group. Therefore, IaSiG can locate $G \times R$ and detect more than $G \times R$ defects. If a whole group is defective, IaSiG can mark it as faulty while still considering the other groups healthy. This solves the second issue of IaS. As shown in Algorithm 2, the system first runs the parity check over all $M$ TSVs until faults are detected ($S = 1$). Then, it runs a one-by-one clustering check by attaching the $R$ redundant TSVs to $C_i$ functional TSVs. Once a clustering test hits a fault ($S = 1$), Isolation and Shift (IaS) is used to find the faulty TSVs in that cluster. For the WCET, two cases are distinguished:
• case-1: There are more than $R$ defects in every group, which makes Algorithm 2 jump into a group every time.
• case-2: There are more than $R$ defects in one group, which makes Algorithm 2 jump into a group only once. This is similar to the number of faults in IaS.
The WCETs are substantially smaller than those of normal IaS, as shown in Fig. 4. As we can observe, with $R = 2$, the WCET of IaS becomes higher than 20,000 cycles for 32 bits and more; however, IaSiG stays under 20,000 cycles even with 128 bits. By dividing into groups, we can now tackle the first issue of IaS: testing time scalability.
In this case, the testing time is increased; however, the system obtains the best granularity, where no false positive cases occur. In terms of detection rate, the success rate of IaSiG and its proof can be found in Lemma 1 of Section III-D. Clearly, increasing the value of $K$ can help increase the detection rate; however, the WCET is also increased. Therefore, designers should consider this trade-off in the design phase. If we consider the detection success rate to be sufficient, the proposed method can guarantee the detection of $R+1$ and the correction of $R$ faults. The upper bounds of detection and localization are $G(R + 1)$ and $GR$, respectively. The proof of these bounds can be found in Lemma 2 of Section III-D.
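The effect of grouping on localization can be illustrated with an idealized sketch. It assumes that the per-group IaS run always succeeds (no hidden defects), that groups are contiguous and of near-equal size, and that the helper names `split_groups` and `iasig_outcome` are hypothetical; heterogeneous group sizes would only change `split_groups`.

```python
from typing import List, Set, Tuple


def split_groups(m: int, g: int) -> List[List[int]]:
    """Split TSV indices 0..m-1 into g contiguous groups of near-equal size."""
    base, extra = divmod(m, g)
    groups, start = [], 0
    for i in range(g):
        size = base + (1 if i < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups


def iasig_outcome(faulty: Set[int], m: int, g: int,
                  r: int) -> List[Tuple[str, List[int]]]:
    """Idealized per-group result: healthy, localized defects, or group marked faulty."""
    result = []
    for group in split_groups(m, g):
        hits = [t for t in group if t in faulty]
        if not hits:
            result.append(("healthy", []))
        elif len(hits) <= r:
            result.append(("localized", hits))  # up to R defects can be isolated
        else:
            result.append(("group_faulty", []))  # more than R: only detection
    return result
```

In this model, a group with more than `r` defects is lost, but the remaining groups stay usable, which is the granularity advantage over a single IaS group.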
C. ARCHITECTURE
Figure 3 illustrates the design for IaSiG, where Figure 3 (a) and (b) show the configurations for one and two redundant TSVs. For more redundant TSVs, we simply add multiplexers/demultiplexers for connecting selections. Here, we support shifting using multiplexers, which is similar to the TSV recovery method in [31]. The architecture of IaSiG is similar to IaS, with which it shares the ECC (parity check), the ''TSV-Fuse'' blocks, and the synchronization TSV (s_TSV), as in Figure 3. Because the redundant TSVs (r_TSVs) are shared between groups, we add multiplexers and demultiplexers to support switching groups. The controller block gives instructions for switching between group/redundant TSVs and for synchronization.

1) NETWORK-ON-CHIP INTEGRATION
In order to understand the cost of the design, we integrate the IaSiG into our previously designed 3D-NoC router [23] as shown in Figure 5. Please note that the proposed approach is totally independent from our opted router architecture and could be implemented into any TSV-based architecture. The IaSiG is integrated as an ECC module for only two vertical ports (UP and DOWN) to monitor and detect faults of TSVs. IaSiG-TX and IaSiG-RX are two additional modules, which help handling the TSV fault detection: the data from TSV is brought to the Multiplexer for selecting the proper connection then sent to the decoder. Parity check information of the decoder is sent to the controller of IaSiG-TX. The controller manages the ''TSV-Fuse'' (for multiplexer) and synchronizes between layers via s_TSV .
Previously in [23], we used two SECDED (16,22) codes to handle potential soft errors in the data. In this work, we use one parity check (32,33) for simplicity, since soft-error tolerance is not the main focus. We add only two spare TSVs (r_TSVs) and one synchronization TSV (s_TSV). Consequently, IaSiG uses 36 TSVs for each vertical connection. Also, we adopt the four-clustering structure of [23], where each connection is divided into four clusters of 8 TSVs (s_TSV is independent) and a cluster of 4 TSVs that consists of the parity check, spare TSVs and synchronization TSV. IaSiG also performs checking for four groups of 8 TSVs.
For each vertical connection, if a cluster is defected, the router can choose one of its four neighboring clusters as a replacement without the need for redundancy. To satisfy the timing constraints, the router chooses the closest TSV-cluster among its neighbor clusters.

D. ANALYTICAL ANALYSIS
Lemma 1: Assuming the error probabilities of TSVs are independent, the probability of having a silent error after $K$ transmissions is less than or equal to $(1/2)^{2K}$.
Proof: The probabilities of a silent error for an open and a short-to-substrate defect are:
$$P^{silent}_{1\text{-bit open}} = P_{0 \to 0} + P_{1 \to 1} \approx 1/2 \quad (6)$$
$$P^{silent}_{1\text{-bit short-to-substrate}} = P_{0} \approx 1/2 \quad (7)$$
where $P_i$ is the probability of transmitting a logic value $i$ in a TSV and $P_{i \to j}$ is the probability of a transition from logic value $i$ to logic value $j$. Here, we use $P^{silent}_{1\text{-bit}} = 1/2$ as the probability of having a silent 1-bit. The probability of having a silent error while having $f$ faulty TSVs ($0 \leq f \leq M$) is given by Equation 8, which can be proven using the binomial theorem: expanding $(x + y)^f$ with $x = 1$ and $y = -1$ shows that the number of odd cases is equal to the number of even cases. Meanwhile, the total number of cases is $2^f$.
Because the error probability of each TSV is independent, the probability of having a fully silent error after $K$ transmissions is the product of the per-transmission silent probabilities. Since Algorithm 1 uses double-check when it finds a healthy case, it performs $K$ more transmissions. As a result, the silent probability is at most $(1/2)^{2K}$. Note that Algorithm 1 iterates from the least significant TSV index to the most significant one. Based on the resulting success rate of the model (Eq. 16), we can conclude that:
• Having a longer transaction length $K$ can reduce the probability of silent faults.
• Having a higher number of redundancies could enhance the reliability; however, it increases the testing time.
• In real applications, there is a chance that the system sends all-zeros or all-ones during the test (even though the chance is relatively small); we think this issue could be fixed with dedicated data. As shown in Algorithm 1, we use double-check to reduce the chance of hidden defects. By performing $K$ more transmissions, we reduce the probability of silent defects to $(1/2)^{2K}$, which is the square of the corresponding probability without double-check in [27].
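The counting argument of Lemma 1 can be checked by exact enumeration. The sketch below assumes that each faulty TSV corrupts its bit with probability 1/2, independently of the others, and that a parity check stays silent exactly when an even number of bits are corrupted; `silent_probability` and `silent_after` are hypothetical names used only for this illustration.

```python
from fractions import Fraction
from itertools import product


def silent_probability(f: int) -> Fraction:
    """Fraction of the 2^f corruption patterns of f faulty TSVs that parity misses."""
    patterns = list(product((0, 1), repeat=f))  # 1 = this faulty TSV flips its bit
    silent = sum(1 for p in patterns if sum(p) % 2 == 0)
    return Fraction(silent, len(patterns))


def silent_after(k: int, double_check: bool = True) -> Fraction:
    """Probability that every one of K (or 2K with double-check) transmissions is silent."""
    transmissions = 2 * k if double_check else k
    return Fraction(1, 2) ** transmissions
```

For any `f >= 1` the enumeration yields exactly one half, matching the statement that odd and even cases are equally numerous, and `silent_after(32)` evaluates to $(1/2)^{64}$.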

Lemma 2:
The method can guarantee the detection of R + 1 faults and the correction of R faults. Furthermore, the detection and localization bounds are G(R + 1) and GR, respectively.
Proof: First, we assume F is the number of faults. Regardless of F being odd or even, silent errors after K transmissions make the decoder detect the failed case. If F ≤ R, any case of F faults in M TSVs is covered by the iterations from 0 to $2^{M}$ in Algorithm 1. Therefore, after a heuristic search through all possible cases, the system can match the F-fault pattern once.
Given R redundancies, the maximum number of isolated faults is R since the system needs at least M − R TSVs to work. If, after reducing to M TSVs, the output of the decoder (S) is equal to "1", there is one fault left. Therefore, R + 1 is the maximum number of detectable faults. Now consider G groups in IaSiG, where each group is tested separately. If the faults occur in all groups, IaSiG can localize up to G × R defects and detect the case of G(R + 1) defects.
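The isolate-and-check idea behind Lemma 2 can be illustrated with a toy model. This is a minimal sketch and not Algorithm 1 itself: it assumes a stuck-at-0 fault model, random test data, a single parity bit as the syndrome, and a search over candidate isolation sets in order of increasing size. All names (`looks_healthy`, `localize`, the seed, and K = 64) are hypothetical.

```python
import random
from itertools import combinations


def parity(bits):
    """Even/odd parity of an iterable of 0/1 values."""
    return sum(bits) % 2


def looks_healthy(n_tsv, faulty, isolated, k, rng):
    """Send k random words over the non-isolated TSVs and compare parity."""
    active = [i for i in range(n_tsv) if i not in isolated]
    for _ in range(k):
        sent = {i: rng.randint(0, 1) for i in active}
        # Assumed fault model: a faulty TSV is stuck at 0.
        received = {i: 0 if i in faulty else b for i, b in sent.items()}
        if parity(sent.values()) != parity(received.values()):
            return False  # non-zero syndrome: a fault is still active
    return True


def localize(n_tsv, faulty, r, k=64, seed=0):
    """Return the smallest isolation set (size <= r) that passes the check.

    Returns None when no set of at most r TSVs passes, i.e. the faults are
    detected but cannot be localized with r redundancies.
    """
    rng = random.Random(seed)
    for size in range(r + 1):
        for candidate in combinations(range(n_tsv), size):
            if looks_healthy(n_tsv, faulty, set(candidate), k, rng):
                return set(candidate)
    return None


if __name__ == "__main__":
    print(localize(6, {1, 4}, r=2))     # two faults, two redundancies
    print(localize(6, {0, 2, 5}, r=2))  # three faults: detection only
```

With F ≤ R faults the search ends on the faulty set, and with R + 1 faults every candidate still produces a non-zero syndrome, so the failure is detected but not localized. A wrong candidate can pass only if all K transmissions are silent, which in this model happens with probability $(1/2)^{K}$.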

E. EMPIRICAL THERMAL AWARE REDUNDANCY INSERTION
As we previously mentioned, most existing works insert redundancies uniformly into TSV groups to help recover from manufacturing defects. However, lifetime reliability is heavily affected by the operating temperature, as in Black's model [25], where the fault rate at the temperature $T$ (Kelvin) is accelerated by a factor $\pi_T$ determined by $T$, the current density $J$, the pre-exponential factor $A$, the activation energy $E_a$, and the Boltzmann constant $k_B$. Since the temperature map of a 3D-IC is not uniform, the fault rates vary between the TSV groups. Therefore, with the ability to predict the critical area, we can efficiently insert more redundancies into the less reliable area. To find the vulnerable area, we normalize the acceleration factor $\pi_T$ to a reference temperature $T_{ref}$ to obtain the normalized fault rate ($NFR_T$) of each region of the 3D-IC (Equation 18).

To predict the number of needed redundancies for lifetime reliability recovery, we propose an empirical approach as follows:
• Step 1: Analyze the 3D-NoC system with the desired applications. Here, we use PARSEC benchmarks as examples.
• Step 2: Simulate the temperature of each TSV group and extract the predicted fault rate with Black's model [25] by using Equation 18. We choose the coolest area of the 3D-IC as the reference temperature.
• Step 3: Select the target Mean Time to Failure (MTTF) or fault rate. Then, we perform a search to find the optimal redundancies for the target MTTF.

Figure 6 shows the flow to predict the number of needed redundancies in our 3D-NoC. Designers can choose a different fault model, application, reference temperature, and target MTTF ($MTTF_{target}$) to obtain suitable results. In Step 1, we first use commercial CAD tools to design the system and estimate its power consumption. Then, we can extract the energy-per-bit value. To estimate the power consumption of each router, we multiply the energy-per-bit value by the packet switching activities from the desired applications. In Step 2, we use the power consumption map and floor-plan to estimate the temperature of each router and its nearby TSV groups in the NoC. We further estimate the normalized fault rate by using Equation 18. The redundant TSV map for each router is used for the final design of the 3D-NoC in Step 3.

To minimize the WCET, we split groups with a larger number of redundancies into several sub-groups with at most two redundancies (R ≤ 2). We decide the number of redundancies based on two thresholds of fault rates ($\theta_1$ and $\theta_2$), as given by Equation 19. To find these thresholds, we use a heuristic search: the two values are tuned via a grid search to find their optimal point.
In summary, this method helps estimate the needed redundancies for correcting TSVs under thermal awareness. By estimating the number of failed TSVs, we can reduce the area cost by removing the redundancies in low fault rate areas.
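The normalization and threshold steps can be sketched as follows. This is an illustration under stated assumptions rather than the exact Equations 18 and 19: it assumes the temperature dependence of Black's model is the Arrhenius term $\exp(-E_a / (k_B T))$ and that the current density is the same in every group, so the normalized fault rate reduces to $\exp\left(\frac{E_a}{k_B}\left(\frac{1}{T_{ref}} - \frac{1}{T}\right)\right)$. The activation energy of 0.9 eV, the temperatures, the thresholds, and the handling of boundary values are all placeholders.

```python
from math import exp

K_B_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def normalized_fault_rate(t_kelvin, t_ref_kelvin, e_a_ev=0.9):
    """Fault rate at t_kelvin relative to the reference temperature.

    Assumption: only the Arrhenius term of Black's model differs between
    groups; e_a_ev = 0.9 eV is a placeholder activation energy.
    """
    return exp((e_a_ev / K_B_EV) * (1.0 / t_ref_kelvin - 1.0 / t_kelvin))


def redundancies_for(nfr, theta1, theta2):
    """Pick 0, 1, or 2 redundant TSVs from two fault-rate thresholds.

    Assumption: boundary values fall into the higher-redundancy bin.
    """
    if nfr < theta1:
        return 0
    if nfr < theta2:
        return 1
    return 2


if __name__ == "__main__":
    # Hypothetical per-group temperatures in Kelvin.
    temperatures = {"group_a": 330.0, "group_b": 338.0, "group_c": 350.0}
    t_ref = min(temperatures.values())  # coolest area is the reference
    for name, t in temperatures.items():
        nfr = normalized_fault_rate(t, t_ref)
        print(name, round(nfr, 3), redundancies_for(nfr, 1.5, 4.0))
```

Taking the coolest group as the reference makes every normalized fault rate at least 1, so the thresholds can be read as multiples of the most reliable group's fault rate.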

A. METHODOLOGY
The proposed architecture is designed in Verilog HDL using NANGATE 45nm library [41] and NCSU FreePDK TSV [42]. The design is implemented using Synopsys CAD tools. We evaluate the hardware complexity of the proposed design in comparison to other TSV recovery methods. We also perform the execution time analysis and compare this work with existing ones. Last, we discuss the possible optimizations for this proposal.

B. HARDWARE COMPLEXITY
The implementation results with different configurations of the proposed method are shown in Figure 7. Compared to our preliminary work IaS [27], we observe a lower area cost due to the reduction in the number of registers needed for configuration. Consequently, the power consumption is also smaller. When we increase the number of redundancies, we observe a significant increase in both area and power of IaS. However, the increase in IaSiG is insignificant. In terms of power, we even observe lower consumption of our design thanks to the fewer registers and lower complexity.
Among the different configurations of IaSiG, we can observe a slight drop in area cost with R = 1 when we increase the value of C. This is due to the lower number of registers needed for each group. However, with R = 2, we observe an increase because of the complexity added for shifting two TSVs. On the other hand, since the switching between clusters incurs additional power consumption, we still observe higher power consumption with a higher number of groups. We also would like to note that with low defect rates, the design could reduce the power consumption by skipping groups.
The hardware complexity results and comparison for 32-bit implementations between this work and Error Correction Codes are shown in Table 2. We can observe that the proposed IaSiG has higher area cost and power consumption than most low-complexity ECCs (Hamming, SECDED). Compared with more complex ECCs (SEC-DAEC and TAEC), the total area cost of the encoder and decoder of IaSiG is much smaller. Note that IaSiG uses the fewest codeword bits (35), which also leads to a smaller TSV area cost.

Table 3 shows the comparison between our method and existing testing frameworks for TSVs. We only compare the TSV testing, while the system testing is shown in Table 5. As we can see in Table 3, this work at its best can capture the fault within 32 cycles, which is 1 cycle per TSV. However, on the longer end, it can require on average 49 and 101 cycles per TSV with R = 1 and R = 2, respectively. These values are higher than those of other works; however, our work targets online and non-blocking testing. Therefore, the long testing time can be justified since the system can continue to operate as usual. In terms of area cost, our method also has a higher area than others due to its need to store the status of TSVs. However, its area cost is still reasonable since the overhead of a TSV is not overwhelming.

One notable observation is that our approach is suitable for a small group of TSVs. It is certainly possible to have a huge number of TSVs per group; however, as we analyzed, the testing time is not scalable. To handle this problem, we divide the TSVs into groups and support parallel testing. Therefore, the testing time remains constant as the number of TSVs scales. On the other hand, collecting test results in the pre-bond and post-bond tests requires a scan chain, which leads to testing time and area complexity of O(n) (n is the number of TSVs). Meanwhile, our method only requires O(1) testing time and could maintain the same testing time while up-scaling the system complexity (more cores or layers).

C. COMPARISON OF TSV TESTING AND RECOVERY MECHANISM
In comparison to alternative shapes of TSV groups such as cobweb [32], sharing spares [34], or honeycomb [33], our area overhead is also higher. For instance, in the honeycomb and cobweb designs (without TSVs), the area per TSV is 17.61 µm² and 4.85 µm², respectively, while our value is 42.4 µm². However, their hardware complexity does not include the control logic for detecting and localizing TSV defects. Moreover, our design adopts the shifting mechanism from [31] and can be compatible with the other recovery designs.

D. WORST CASE EXECUTION TIME ANALYSIS
In order to understand the performance of IaSiG under different numbers of faults, we perform the test under different scenarios as shown in Figure 8. The numbers of data bits (M) are 8, 16, 32, and 64. Here, we vary the cluster size (C) between 2 and 4 TSVs. The number of redundancies (R) ranges from 1 to 6. Note that the value of R is also the number of injected defects. For real-time applications, the worst-case execution time (WCET) is one of the most important criteria; therefore, we focus on measuring the WCET in this section. Although normal behavior could be faster, the WCET bounds the time needed to obtain the test results. It is worth mentioning that with all designs (IaS, TSV-OCT, IaSiG), we use K = 32, 64 and 128 since the experiment we conducted in IaS [27] shows that 32 cycles is the best value that can guarantee no missing faults under random data. In this analysis, we first conduct Monte-Carlo simulation over 10,000 cases to find the WCET. Then, WCET analyses (see Section III-B) are conducted and verified with simulations. If there is an unmatched case, we simulate that specific case to verify the results.
In comparison to IaS, we observe a lower WCET for IaSiG with a high number of redundancies. Notably, with a large number of data TSVs, the WCET of IaS grows significantly while IaSiG maintains a reasonable one. For instance, with 32-bit TSVs, the WCET values of IaSiG are less than 20,000 cycles in all cases while IaS is even higher than $10^{8}$ cycles. Even with the highest value configuration (R = 6, K = 128, and M = 64), IaSiG is still less than 30,000 cycles while IaS is above $10^{10}$ cycles. In summary, we observe that with a large number of redundancies IaSiG dominates IaS in terms of WCET. For smaller numbers of redundancies, IaS is better; however, the WCET of IaSiG is still smaller than 20,000 cycles, which is a reasonable test time for on-chip communications [37], [38].
In comparison to TSV-OCT [35], which is an error correction code with an augmented algorithm, we can observe that TSV-OCT has a faster WCET under lower defect rates. This is due to the fact that the built-in ECC in TSV-OCT already supports localization of one fault; therefore, the execution time is much faster. However, with higher numbers of defects, we can observe the increase of its WCET. With 4 defects and K > 32, TSV-OCT has a longer WCET than our proposal. While TSV-OCT is limited to 5-6 fault localization depending on the value of K, our method is only limited by the value of R and can support defect detection in multiple groups. For instance, with 32 bits (M = 32), cluster size C = 4 (G = 8 groups), and redundancy R = 4, IaSiG can detect at least R = 4 defects and at most G × R = 32 defective TSVs. Also, while TSV-OCT does not support recovery, IaSiG can support the recovery of R defects. The number of additional TSVs is also lower: IaSiG uses R = 4 redundant TSVs while TSV-OCT requires 9 × 5 TSVs for 32-bit data (13 extra TSVs).
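The group arithmetic used in this comparison can be written out as a small helper. This is a sketch of the bounds stated in Lemma 2 and in the abstract (G × R localizable and G × (R + 1) detectable defects), assuming G = M / C; the function name and the returned keys are illustrative.

```python
def iasig_bounds(m_bits, cluster_size, r):
    """Group count and fault bounds for M data TSVs split into clusters of size C.

    Assumption: the number of groups is G = M / C, and the bounds are
    G * R (localizable) and G * (R + 1) (detectable), as stated in Lemma 2.
    """
    if m_bits % cluster_size != 0:
        raise ValueError("m_bits must be a multiple of cluster_size")
    groups = m_bits // cluster_size
    return {
        "groups": groups,
        "localizable": groups * r,
        "detectable": groups * (r + 1),
    }


if __name__ == "__main__":
    # Configuration from the comparison: M = 32, C = 4, R = 4.
    print(iasig_bounds(32, 4, 4))
    # Configuration from the abstract: 32 bits, eight groups, two redundancies.
    print(iasig_bounds(32, 4, 2))
```

For M = 32, C = 4, and R = 4 this gives 8 groups and 32 localizable defects, and for R = 2 it gives the 16 defective TSVs (50% localization) quoted in the abstract.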
As discussed in Section III-B, IaSiG under case-2 (R defects in all TSVs) has a lower WCET than under case-1, where IaSiG encounters R defects in each group. A significant reduction in WCET can be seen in Figure 8, where case-2 has half of the execution time in comparison to case-1. However, we would like to note that with case-1, IaSiG can detect up to G × R defects. Since the WCET only scales with the values of G, R, and K, the WCET of the system is identical to that of a group. To maintain a reasonable WCET, designers can choose proper values to work with. However, if the WCET should be smaller than 1000 cycles, the choice is limited to 8 TSVs and 1 redundant one. We would like to note that reducing the number of cycles per test (K) can scale down the testing time; however, as we illustrated in Table 1, it can reduce the accuracy, which leads to hidden faults.
In summary, among the on-communication test techniques, IaSiG offers better scalability since its WCET increases only slightly with the number of redundant TSVs. For the other works (IaS and TSV-OCT), we observe a rapid increase of the WCET.

Table 4 shows the hardware complexity of the router design. The current work is based on the baseline router in [23]. Here, we compare with our previous works in [23] and [35] that use SECDED and PPC coding techniques, respectively. In comparison to these works, we observe that our proposal has a lower area cost than the TSV-OCT router [35] thanks to the lower complexity of the encoder and decoder and the smaller number of TSVs in a single connection (36 instead of 45 per connection). The area cost is similar to that of the router using the SECDED code [23]. Note that both previous works use ECC, so they can deal with soft errors, while this work needs re-transmission.

Table 5 shows the testing time for Networks-on-Chip. Here, we observe that our method is the fastest, where the worst case (5,152 cycles) is smaller than that of the other methods. The work in [35] can have a fast execution time; however, the worst case can reach up to 16,000 cycles, which is 5× the value obtained with the proposed approach. We would like to note that in [35] the baseline test requires 64 cycles, while in this work we confirm that 32 cycles provide the best results. With the double-check mechanism, our test time doubles to 64 cycles. For a 32-cycle baseline test, the work in [35] requires from 32 to 4128 cycles, which is still larger than this work. The proposed method localizes only a limited number of faults because it has only two redundancies; however, it is the only method providing online recovery among all the compared works.

F. EMPIRICAL THERMAL AWARE REDUNDANCIES INSERTION
In this part, we perform an empirical method to insert a suitable number of redundant TSVs in each group. Using the flow in Figure 6, we first use our baseline 3D-NoC to perform energy simulation with the Synopsys CAD tools. Here, we extract the average energy per bit of the proposed design. Then, by extracting the switching activities of the PARSEC benchmarks [47] under the gem5 [48] simulator with a 3D mesh topology of the Garnet NoC (64 cores: 4 × 4 × 4), we obtain the power consumption for each router inside the NoC. We then use HotSpot 6.0 [49] to predict the temperature of each router. As we use the coolest area as the reference temperature, we can extract the normalized fault rate of each TSV group using Equation 18. Based on the fault rate, we insert proper redundant TSVs and arbitration modules to help detect and correct defects.

Since the number of redundancies depends on the target MTTF ($MTTF_{target}$), designers can select a target value for their design. In this evaluation, we choose 1.5×, 2× and 2.5× the corresponding MTTF of the system at $T_{ref}$ ($MTTF_{T_{ref}}$). Figure 9(b) illustrates the result of our empirical thermal aware insertion with $MTTF_{target} = 2 \times MTTF_{T_{ref}}$. As we can observe, in the non-redundancy case (10:0), the MTTF of the system significantly drops to less than 50% of the target MTTF. Meanwhile, inserting one redundancy per group nearly reaches the target MTTF while inserting two redundancies exceeds the target MTTF. Here, our empirical method tries to optimize the number of redundancies per TSV group. As a result, the empirical approach approximately reaches the target MTTF. Meanwhile, the number of redundancies with the empirical approach is lower than 11:2 and higher than 11:1. This can be explained by Equation 19, where the proposed method chooses between 0, 1, or 2 redundancies.

Figures 9(a) and 9(c) show the results with the target MTTF as 1.5× and 2.5× the $MTTF_{T_{ref}}$. In both cases, the method approaches the target MTTF by using a grid search. With the target MTTF as 1.5× of $MTTF_{T_{ref}}$, it is natural that we need fewer TSVs than in Figure 9(b). We can observe that constantly inserting either 1 or 2 redundant TSVs leads to a higher MTTF than desired. Although this can be more resilient, it needs extra area cost. With the 1.5× case, the number of redundancies is lower than R = 1. Here, we can observe that grid search can provide a better area cost by reducing the number of TSVs. The distribution of redundancy is no longer uniform; however, the overall reliability is still approximately as desired. On the other hand, the 2.5× case still keeps the number of redundancies less than R = 2 and larger than R = 1, as can be seen in Figure 9(c). By selecting the optimal values of R for each group, we can have fewer TSVs while still maintaining the desired MTTF.
In summary, by heuristically searching for the optimal threshold values, the proposed approach can match the target MTTF while reducing the number of redundant TSVs. We would like to note that the WCET is equivalent to the WCET of the group with the highest R value. Therefore, grid search can optimize the number of TSVs; however, it cannot ensure the optimal WCET. On the other hand, we still cannot vary the value of M with grid search.
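The threshold tuning can be sketched as a plain grid search. This is a minimal illustration, not the evaluated flow: the reliability model `toy_mttf` is a placeholder that is NOT taken from the paper (each group's fault rate is assumed to be divided by R + 1, with the worst group setting the system MTTF), and the fault rates, grid, and target below are hypothetical. The search keeps the threshold pair that meets the target with the fewest redundant TSVs.

```python
from itertools import product


def assign(nfrs, theta1, theta2):
    """Map each normalized fault rate to 0, 1, or 2 redundant TSVs."""
    return [0 if x < theta1 else 1 if x < theta2 else 2 for x in nfrs]


def toy_mttf(nfrs):
    """Build a placeholder reliability model (an assumption, not the paper's)."""
    def mttf_of(plan):
        # Assumed: R redundancies divide a group's fault rate by R + 1,
        # and the worst group limits the system MTTF.
        return 1.0 / max(x / (r + 1) for x, r in zip(nfrs, plan))
    return mttf_of


def grid_search(nfrs, mttf_of, mttf_target, grid):
    """Return (cost, theta1, theta2, plan) with the fewest redundancies."""
    best = None
    for theta1, theta2 in product(grid, repeat=2):
        if theta1 > theta2:
            continue  # thresholds must be ordered
        plan = assign(nfrs, theta1, theta2)
        if mttf_of(plan) < mttf_target:
            continue  # target MTTF not met
        cost = sum(plan)
        if best is None or cost < best[0]:
            best = (cost, theta1, theta2, plan)
    return best


if __name__ == "__main__":
    nfrs = [1.0, 1.5, 2.5, 4.0]  # hypothetical normalized fault rates
    grid = [1.0, 1.25, 1.5, 2.0, 3.0, 4.0, 5.0]
    print(grid_search(nfrs, toy_mttf(nfrs), 0.7, grid))
```

In this toy setup the search returns a non-uniform plan with four redundant TSVs, whereas a uniform R = 2 insertion would use eight. Because the WCET follows the group with the highest R, the plan still contains an R = 2 group and therefore keeps the R = 2 WCET.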

G. DISCUSSION
In the previous evaluations, we have presented the efficiency of IaSiG. Despite the obtained advantages, there are some challenges that should be addressed in order to further enhance the detection ability of IaSiG, as discussed hereafter.
First, this work does not take into account the occurrence of metastability, which could be solved by an immune circuit or a voltage comparator [14]. Also, the metastability phenomenon could be stabilized using several flip-flops and samplings.
Second, IaSiG is based on parity-check. Unlike TSV-OCT [35], which is based on ECC, parity-check cannot localize faults but can only detect them. As a result, IaSiG is not resilient to soft errors, which is out of the scope of this work. However, we can wrap IaSiG on top of an ECC to protect the data without encountering any incompatibility.
Third, the impact of real-chip implementation and process-voltage-temperature (PVT) variations has not been taken into account in this work. However, the efficiency of our algorithm is independent of the mentioned variations. Note that our target is lifetime reliability; therefore, we here assume the manufacturing defects are well recovered. Our execution times are analyzed in cycles, so the absolute times vary with the frequency of the design.
Fourth, in Section III-E, we show an empirical method to decide the number of redundancies in each TSV group. The selection is constrained by several parameters such as the design, the reference temperature, and the target MTTF. We also want to note that inserting one or two TSVs near a router can slightly reduce its temperature. While the reduction is negligible, designers can modify the configuration of HotSpot 6.0 to obtain the updated temperature. Moreover, MTTF represents the average time to failure and may not reflect the worst-case scenario.
Fifth, the selection of two threshold values is obtained via a grid search approach in our empirical method. This can help the designer to find an approximate solution after a certain execution time. However, we would like to note that the selection process is still an open problem and out of the scope of this work. Our approach only demonstrates the ability to reduce the number of redundancies while maintaining the target MTTF. Since the examples in our evaluation still require up to two redundancies per group, the WCET is still the same as that of R = 2. However, by lowering the target MTTF, the system can reduce the number of redundancies and can lower the WCET to that of R = 1.
Sixth, the main hypothesis of this work is that the redundant TSVs (r_TSV) and the feedback TSV (s_TSV) are healthy. This is ensured by post-manufacturing test and recovery, which guarantee correctness after production. Furthermore, we avoid using the redundant and feedback TSVs during operation to avoid wear-out. Nevertheless, defects on these TSVs could lead to incorrect behaviors. While defective redundant TSVs could be detected, defects on feedback TSVs could lead to synchronization issues.
Seventh, although we only demonstrate the usage of our approach on a 3D-NoC under the PARSEC benchmarks, this work can be applied to TSVs in 3D-ICs in general. Depending on the number of needed TSVs and the target reliability (or MTTF), designers can build their detection and recovery using either IaS or IaSiG. The empirical thermal aware insertion can also estimate, with a certain confidence, the number of redundancies needed for obtaining a target MTTF. Also, designers can adopt reliability models other than Black's model. The impact of different parameters (i.e., voltage, frequency, feature size) can also be integrated into the reliability estimation.
Finally, since our design has been completed in commercial CAD tools, the implementation of our approach is ready to be integrated onto any TSV-based 3D-ICs for production. Designers must carefully choose the parameters (R, K, or G) to satisfy their specifications. The computation of the controlling mechanism is fully implemented in hardware; thus, there is no extra computation needed.
Despite the above-mentioned limitations, we believe that IaSiG still provides extra defect localization while maintaining a short execution time. The exhibited overhead in a 3D-NoC implementation is also reasonable, which makes IaSiG a promising solution for integration into highly reliable 3D-ICs.

V. CONCLUSION
This work has presented a light-weight method to enhance the online detectability and provide online localization and recovery for TSV faults. The proposal isolates the possible fault position and checks the output syndrome to indicate the fault-free situation. To reduce the testing time and area cost, the proposed method divides TSVs into groups, so that fewer TSVs are involved in each test. The results show that the proposal has a reasonable area cost while guaranteeing localization and recovery up to the number of redundancies. An implementation of the 32-bit version with a 3D-NoC has less than 6000 cycles of testing time, which is significantly lower than the state-of-the-art works. Thermal adaptation for spare TSV insertion is also discussed in our proposal to help reduce the number of redundancies while still guaranteeing the desired MTTF. In the future, applying the proposed method to neuromorphic applications will be investigated. Different TSV group shapes and repairing mechanisms such as cobweb or honeycomb are another approach worth considering.
VOLUME 8, 2020