Syndrome-based Min-sum vs OSD-0 decoders: FPGA Implementation and Analysis for Quantum LDPC codes

Quantum processors need to improve their reliability to scale up the number of qubits and increase the number of algorithms that they can execute. To reduce the logical error rate of quantum systems, the use of error correction codes and decoders has been established as a low-cost and feasible approach, with good results from a theoretical perspective, for mid- and long-term architectures. Most authors focus on algorithms that improve the correction capability of quantum computers without taking into account a fundamental aspect of their deployment in a real system: the decoding latency must be bounded to avoid qubit decoherence. Only a few works propose hardware architectures, and they just include time estimations of the decoding latency; a real implementation has not been shown yet. In this work, we analyze from the point of view of hardware implementation two algorithmic options based on quantum low-density parity-check (QLDPC) codes: a) belief propagation min-sum decoders combined with codes with good error-floor behavior and b) belief propagation min-sum decoders concatenated with ordered statistics decoders (OSDs) for codes with an early error-floor. The bounds for the maximum clock frequency required by the decoders to decode within the qubit coherence time are established as a parameter to show whether a practical implementation is possible with present or near-future FPGA technology. Furthermore, real implementation results for a Xilinx FPGA device are provided, showing that some solutions can meet the timing constraints set by state-of-the-art quantum processors.


I. INTRODUCTION
QUANTUM error correction codes have been studied in depth for more than three decades [1], [2], [3]. Several solutions based on different codes, such as Calderbank-Shor-Steane (CSS) codes [4], Shor codes [5], 2D color codes, 3D color codes [6], surface codes [7] and quantum low-density parity-check (QLDPC) codes [8], have been proposed to correct errors in quantum processors. In addition, different decoding algorithms based on techniques like the Blossom decoder [9], message-passing [10] and artificial intelligence (neural networks) [11] have been described, showing the benefits and the error-correction capacity of each method.
Even though, from theoretical analysis, these codes and decoders show several orders of magnitude of improvement in the output error rate of quantum processors, most of the solutions do not provide physical implementations. Hence, it remains unclear whether in a real system these solutions are fast enough to cope with qubit decoherence (the perturbation of the superposition states of the qubits by their interaction with the environment) [12]. This implies that the time available to perform the error correction is bounded. For this reason, the implementation of any quantum error correction system has to consider the time budget of a single error correction round, which is set between hundreds of nanoseconds [13] and several microseconds [14], to ensure that quantum information remains stable.
Another fundamental parameter is scalability of the number of qubits [15]. The error correction hardware needs to be feasible for not only tens or hundreds, but for thousands of qubits. Only with a large number of qubits, improvements of the quantum over classical systems will be highlighted [16].
Among the error correction codes that can overcome these challenges, sparse quantum stabilizer codes, i.e. QLDPC codes, are gaining prominence [8], [17]. Since the theoretical result that QLDPC codes promise scalable quantum computation with a finite overhead, asymptotically good codes and their efficient decoders have been proposed [18]. Compared to surface codes and color codes of similar lengths, QLDPC codes offer better code rates with fault-tolerant thresholds supported by low-complexity iterative decoding algorithms [19]. Recent breakthroughs in the scaling of the minimum distance of QLDPC codes [20] push for further improvement of iterative decoders and, more importantly, for efficient decoder implementations that are scalable.

A. CHALLENGES FOR THE IMPLEMENTATION OF QLDPC DECODERS
In terms of decoder implementation, numerous works have addressed the classical counterparts of QLDPC codes [21]-[22], obtaining good hardware results in areas with demanding specifications in throughput, area and power consumption, such as optical communications, storage systems or post-quantum cryptography applications [23].
However, from the hardware point of view, there is one important difference that prevents most classical architectures from being directly applicable to quantum systems: most of the existing solutions focus on improving the speed/throughput of the decoder and do not consider latency, or at least their latency restrictions are not critical. This problem can be understood by comparing the equations of throughput, $T = (N \cdot f_{max})/(It \cdot N_{clk})$, and latency, $L = (It \cdot N_{clk})/f_{max}$, for parallel decoders, where $N$ is the length of the codeword in bits, $It$ is the number of iterations, $N_{clk}$ is the number of clock cycles per iteration and $f_{max}$ is the maximum frequency of the device. Classical high-speed LDPC decoders try to maximize $f_{max}$ by introducing a large number of pipeline registers to reduce the critical path, which increases $N_{clk}$, but they compensate for this increase of $N_{clk}$ by using long LDPC codes with a large $N$ (which also have very good error correction properties for classical decoders). For quantum systems, the only inputs that the decoder receives are the quantum syndromes, which are sent in parallel, so $N$ does not have any effect on the input latency in terms of clock cycles. On the other hand, the decoder needs to be very fast to avoid decoherence effects, so only parallel architectures meet the latency requirements in terms of clock cycles. Unfortunately, for parallel decoders, $N$ does not have any effect on the number of clock cycles. The number of pipeline registers is what determines the number of clock cycles per iteration, $N_{clk}$, and contributes to improving $f_{max}$. The parameter $N$ may only have a negligible effect on routing, and hence on the maximum frequency, but for quantum decoders the critical parameter is $L$. So, if the number of pipeline registers is increased, $N_{clk}$ also increases, and it increases faster than $f_{max}$. In other words, tripling the pipeline registers does not triple $f_{max}$ because of routing problems. Hence, latency, which is the main constraint of quantum decoders, is not optimal in classical decoders.
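As a short worked check of the argument above, the throughput and latency expressions can be multiplied together. The symbol names used here ($T$ for throughput, $L$ for latency, $N$ for the codeword length, $It$ for the number of iterations, $N_{clk}$ for the clock cycles per iteration and $f_{max}$ for the maximum frequency) are notation assumed for this sketch.

```latex
T \cdot L
  = \frac{N \cdot f_{max}}{It \cdot N_{clk}} \cdot \frac{It \cdot N_{clk}}{f_{max}}
  = N
\quad\Longrightarrow\quad
T = \frac{N}{L}
```

Under these definitions, a longer code raises $T$ for a given $L$, which is how a classical decoder can absorb deep pipelining in throughput terms, while $L$ itself depends only on $It$, $N_{clk}$ and $f_{max}$.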
Finally, it is crucial to implement these decoders on the selected hardware platforms, as the timing limitations cannot be derived only from synthesis results or from a delay estimation based on counting the number of gates in the critical path. As we show in the following sections of this paper, the most limiting component of $f_{max}$ is routing, which can only be calculated by the development tools from the post-place-and-route results.

B. RELATED WORKS
QLDPC codes, similarly to classical LDPC codes, apply message-passing decoding based on belief-propagation (BP) algorithms, obtaining good performance [24], [25]. However, some QLDPC codes fail by introducing an early degradation, the error-floor. The error-floor is the effect by which the logical error rate of the quantum system with an error correction decoder does not improve even when the physical error rate of the quantum processor improves. To solve this problem, the most studied solution, proposed in [19] and [26], is the concatenation of a classical algorithm called ordered statistics decoding (OSD) [27] with a BP-based decoder, i.e. OSD is applied when BP fails. This is a different approach from other classical works where OSD is used offline to optimize some parameters of the BP decoder [28]. The concatenation improves the error correction in most QLDPC codes, but its final effect differs greatly depending on the code. For some codes there is a slight improvement of the total error correction capacity, of less than one order of magnitude in the output error rate; for others, OSD completely removes the error floor.
The main problem of a derived implementation of OSD is that the number of rounds grows exponentially with the number of data bits of the code. This problem was reported in [19], where OSD, although interesting to set a boundary in terms of error correction capacity, is labeled as impractical for real-time implementations. As an alternative, OSD order 0 (OSD-0), which is a one-round OSD algorithm, is proposed. Although it is evident that exponential complexity is avoided, to the best of our knowledge no implementation of the OSD-0 algorithm for quantum processors has been reported yet. Therefore, the question that remains open is whether OSD-0 can meet the timing constraints of quantum processors or not, as it involves costly steps such as matrix inversion.
To the best of the authors' knowledge, a practical hardware architecture of a quantum error correction decoder has not been published yet. Most of the published papers focus on the algorithms to achieve the target correction capability, and only a few, [29], [30], consider the implementation aspects needed to meet the time budget that avoids qubit decoherence, and they include only an estimation of the latency of their potential implementation.

C. PAPER ORGANIZATION
In this paper, two decoding algorithms are evaluated from a VLSI perspective, comparing their performance and calculating the expected latency and clock frequency in real hardware implementations. The first one is the syndrome-based min-sum (SB-MS) decoder [31], which is an iterative BP-based solution with a good tradeoff between correction capability and physical constraints. However, SB-MS is not applicable to all classes of stabilizer codes due to error-floor problems. The second one is SB-MS concatenated with OSD-0, which improves the waterfall performance and alleviates the error-floor problems. However, this performance improvement comes at the cost of a higher total latency due to the computationally intensive operations involved. The objective of the paper is to obtain timing, area and power results to evaluate whether it is possible to apply these algorithms to a quantum processor with present technology while meeting the constraints of qubit decoherence.
The rest of the document has the following structure: in Section II, the background concepts about complexity and correction capacity are summarized; in Section III, different architectures for SB-MS and OSD-0 will be analyzed. The results of these architectures are shown in Section IV, and in Section V the main conclusions of the paper are highlighted.

A. QUANTUM ERROR CORRECTION AND CHANNEL MODEL
Unlike classical error correction decoders, quantum error correction (QEC) decoders do not take as initial information the corrupted message from the channel [16], $\mathbf{c} + \mathbf{e}$, where $\mathbf{c}$ is the codeword vector and $\mathbf{e}$ the error introduced by the channel. Due to the nature of qubits, data cannot be read without altering it. For this reason, the main solution to introduce error correction is to use the information that comes from the syndromes, $\mathbf{s}$, calculated at the quantum processor [32]. In that sense, instead of looking for a correct codeword ($\tilde{\mathbf{c}}$) through the search of a candidate that satisfies the parity-check equations, the objective of the error-correction system is to search for an error pattern ($\tilde{\mathbf{e}}$) that generates the same syndromes as the quantum system. By using the syndromes, the qubits that contain data are not altered, and the correct error pattern can simply be added to the distorted codeword to recover the information. The main differences between classical and quantum decoders can be seen in the block diagrams of Fig. 1.
Another important change is the error model of the quantum system. The errors that modify the codeword in a quantum processor can be defined by a depolarizing channel model that inserts bit-flips, phase-flips or a combination of bit and phase flips [16]. The bit-flips are defined by Pauli errors of type $X$, with a probability of $p_X$. The phase-flips are modeled as Pauli errors of type $Z$, with a probability of error of $p_Z$. A combination of bit-flips and phase-flips is described by a Pauli error of type $Y$, with probability $p_Y$. If the error model assumes a symmetric depolarizing error channel, $p_X = p_Y = p_Z = p/3$ and the probability of error of the model is $p = p/3 + p/3 + p/3$.
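The symmetric depolarizing model described above can be sketched in a few lines of Python. This is only an illustrative software model: the function names `sample_depolarizing` and `to_binary` and the I/X/Y/Z string representation are assumptions made for this example, not part of the decoder described in the paper.

```python
import random


def sample_depolarizing(n, p, rng=None):
    """Draw n Pauli errors: X, Z or Y with probability p/3 each, I otherwise."""
    rng = rng or random.Random()
    paulis = []
    for _ in range(n):
        u = rng.random()
        if u < p / 3:
            paulis.append("X")   # bit-flip
        elif u < 2 * p / 3:
            paulis.append("Z")   # phase-flip
        elif u < p:
            paulis.append("Y")   # bit-flip and phase-flip combined
        else:
            paulis.append("I")   # no error
    return paulis


def to_binary(paulis):
    """Split a Pauli error into its bit-flip and phase-flip binary components."""
    x_part = [1 if q in ("X", "Y") else 0 for q in paulis]
    z_part = [1 if q in ("Z", "Y") else 0 for q in paulis]
    return x_part, z_part


if __name__ == "__main__":
    errors = sample_depolarizing(10, 0.1, random.Random(1))
    print(errors, to_binary(errors))
```

Since a $Y$ error contributes to both components, each binary component is nonzero with probability $2p/3$ under this model.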

B. QUANTUM LOW-DENSITY PARITY-CHECK DECODERS
Unlike classical LDPC codes, which are defined by just one $M \times N$ parity-check matrix $\mathbf{H}$, QLDPC codes are defined by two different matrices: $\mathbf{H}_X$ to correct the $X$-errors (or bit-flips) and $\mathbf{H}_Z$ to correct the $Z$-errors (or phase-flips) from the depolarizing channel. The constructions of these matrices need to satisfy the symplectic inner product constraint arising from the stabilizer formalism based on the Pauli-to-binary isomorphism (for more details about code construction we refer to [16] and [24]). In this document, for simplicity, we will refer to $\mathbf{H}_X$ and $\mathbf{H}_Z$ as $\mathbf{H}$ indistinctly, without loss of generality. However, in Section IV, the architecture will be affected depending on whether the code is dual containing ($\mathbf{H}_X = \mathbf{H}_Z$) or non dual containing ($\mathbf{H}_X \neq \mathbf{H}_Z$). QLDPC codes are usually named as ($N$, $N - \text{rank}(\mathbf{H})$) QLDPC. The code can be graphically represented as a bipartite Tanner graph with two types of nodes: check-nodes, which correspond to the rows of the parity-check matrix (the parity-check equations), and variable-nodes, which correspond to the columns, i.e. the information of each of the bits in the codeword. Each check-node (variable-node) of the code has a check-node degree $d_c$ (variable-node degree $d_v$), which is equal to the number of nonzero elements of a row (column).
SB-MS is a message-passing algorithm derived from BP that exchanges two types of messages: the ones generated in the check-node, $\mu_{i,j}$, and the ones computed in the variable-node, $\nu_{i,j}$. The algorithm tries to estimate the error pattern in the variable-nodes to match the input syndromes. In terms of hardware, there are two types of nodes that exchange information, and the complexity of the derived decoder will be limited by the code parameters, i.e. the degree of the check-node $d_c$, the degree of the variable-node $d_v$, the number of parity-check equations $M$ and the number of total bits in the code $N$ [33]. The operations computed by the check nodes and the variable nodes are described in equations (1) and (2), where $s_i$ is the syndrome for the parity-check equation $i$, $\mathcal{N}(i)$ and $\mathcal{M}(j)$ are defined as the sets consisting of all the non-zero elements of a row (check-node) and a column (variable-node), respectively, $\oplus$ is the mod-2 addition (one-bit XOR operation) and $\alpha$ is a scaling factor to improve the convergence of the algorithm. The hard decision operation (HD) included in equations (1) and (3) is defined as the conversion from a soft-decision message to a logical one. Usually, it is equivalent to the bit that represents the sign, considering the positive sign as a logic 0 and the negative sign as a logic 1. The scaling factor $\alpha$ is obtained by performing Monte Carlo simulations, as is done for classical LDPC decoders, to optimize the logical error rate of the processor under a depolarizing channel. The value of the factor is constrained between 0 and 1 and is usually selected as a sum of two negative powers of two. In this way, the products can be performed by just shifting the bits and adding two shifted words. The shifting process is implemented with wires, so no additional hardware is required. In the worst case, only one adder is implemented, avoiding the use of multipliers that increase the hardware resources and the critical path.
$\Lambda_j$ is the reliability of $e_j$, which is the value of the error in the bit $j$. As $e_j$ is unknown, $\Lambda_j$ is initialized to a constant, usually to $1 - 2p/3$, but as the value is the same for all the indexes $j$, it can be initialized to +1 to keep messages small without any performance loss.
The algorithm will iterate applying recursively these equations until one solution is found or a maximum number of computations is reached. The solution has to satisfy equations (3) and (4).

$\mathbf{s} = \mathbf{H} \cdot \tilde{\mathbf{e}}$ (4)
As will be shown in the next section, the solution with the lowest latency consists of a parallel implementation of one iteration, mapping (1), (2) and (3) onto all the $M$ check nodes and $N$ variable nodes. At the same time, this solution has the problem of routing between processing units, which makes it difficult to ensure a latency of hundreds of nanoseconds.
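The message-passing procedure described in this subsection can be illustrated with a small floating-point software model. This is a minimal sketch under stated assumptions, not the hardware architecture of the paper: the function name `sb_min_sum`, the dictionary-based message storage, the default `alpha=0.75` and the flooding schedule are choices made for this example, and no quantization or saturation is modeled.

```python
def hard_decision(x):
    """Sign bit of a soft value: positive -> 0, negative -> 1."""
    return 1 if x < 0 else 0


def sb_min_sum(H, syndrome, max_iter=20, alpha=0.75, prior=1.0):
    """Syndrome-based min-sum sketch. H is a list of 0/1 rows.

    Returns (estimated error pattern, True if the syndrome was matched).
    """
    m, n = len(H), len(H[0])
    rows = [[j for j in range(n) if H[i][j]] for i in range(m)]  # N(i)
    cols = [[i for i in range(m) if H[i][j]] for j in range(n)]  # M(j)
    # Variable-to-check messages, initialized to the (constant) prior.
    nu = {(i, j): prior for i in range(m) for j in rows[i]}
    mu = {}
    e_hat = [0] * n
    for _ in range(max_iter):
        # Check-node update: XOR of signs with the syndrome, scaled minimum.
        for i in range(m):
            parity = syndrome[i]
            for j in rows[i]:
                parity ^= hard_decision(nu[(i, j)])
            for j in rows[i]:
                sign_bit = parity ^ hard_decision(nu[(i, j)])  # drop own sign
                mag = min((abs(nu[(i, k)]) for k in rows[i] if k != j),
                          default=0.0)
                mu[(i, j)] = alpha * mag * (-1 if sign_bit else 1)
        # Variable-node update: full sum, then remove own contribution.
        for j in range(n):
            total = prior + sum(mu[(i, j)] for i in cols[j])
            e_hat[j] = hard_decision(total)
            for i in cols[j]:
                nu[(i, j)] = total - mu[(i, j)]
        # Stop as soon as the estimated pattern reproduces the syndrome.
        if all(sum(e_hat[j] for j in rows[i]) % 2 == syndrome[i]
               for i in range(m)):
            return list(e_hat), True
    return list(e_hat), False


if __name__ == "__main__":
    H = [[1, 1, 0], [0, 1, 1]]          # toy 3-bit example, not a QLDPC code
    print(sb_min_sum(H, [1, 0]))
```

For the toy matrix in the usage example, the syndrome `[1, 0]` is matched by the weight-one pattern `[1, 0, 0]` after two iterations.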

C. OSD IMPACT ON QLDPC ERROR CORRECTION PERFORMANCE
Recent works [19] and [26] conclude that decoding algorithms derived from BP (such as SB-MS) do not provide a good enough correction capacity for all QLDPC codes. Some codes, such as the (882, 24) QLDPC or the (1024, 30) QLDPC, exhibit an early error-floor when BP-based algorithms are applied: the error rate of the quantum processor without QEC is almost the same as the one obtained after introducing the decoder. To overcome this limitation, OSD is concatenated, alleviating the problem of the early error-floor.
However, not all codes need the support of OSD to obtain a good error correction performance. Other codes, such as the (254,28), (7938, 578), (882,48) or (126,28) QLDPC,¹ show a good error correction capacity with BP-based algorithms like SB-MS, and the inclusion of the OSD algorithm does not entail a significant improvement. In Fig. 2, some codes are included to show the gap between SB-MS and SB-MS with OSD concatenated. As can be seen, the difference between SB-MS and SB-MS with OSD increases progressively as the physical error rate (PER) decreases. For example, for the (126,28) QLDPC the output error rate is about $1.3 \cdot 10^{-5}$ @ PER = $10^{-2}$ including OSD and $2.6 \cdot 10^{-5}$ @ PER = $10^{-2}$ without OSD, less than one order of magnitude. For the (255,32) QLDPC the difference is more significant, $1.6 \cdot 10^{-4}$ @ PER = $2 \cdot 10^{-2}$ without OSD vs $3.6 \cdot 10^{-5}$ @ PER = $2 \cdot 10^{-2}$ with OSD, almost one order of magnitude. The largest difference can be found for codes like the (625,25) and (900,36) QLDPC, where the gap is almost two orders of magnitude [19]. For these codes, the impact of OSD is more significant due to the poor decoding performance of the SB-MS algorithm. However, there are alternatives with a similar number of qubits, i.e. the (126,28) and (255,32) QLDPC codes as alternatives to the (625,25) and (900,36) QLDPC codes, respectively, that with SB-MS alone behave better than SB-MS+OSD. The key questions are: i) what is the cost of OSD? and ii) is it feasible to implement this algorithm within the time window set by qubit decoherence?
Finally, it is also important to take into account that the output of the OSD algorithm only has two possible options, a logical error or the correct error pattern. As will be shown in the next section, the output obtained is just binary, so OSD does not provide any soft information. For this reason, the output of the OSD algorithm cannot be improved in terms of correction by concatenating any other method, so all the decoding failures turn into logical errors. In contrast, the output of the SB-MS decoder, if it does not find the error pattern, is just a decoder failure that includes soft information from the messages, which can be processed by other concatenated decoders.
¹ All these codes were simulated on an Intel Core-i7-6700-HQ CPU @ 2.6 GHz using a software model developed in Matlab R2015b. The results of some of these simulations are included in Fig. 2 and 5.

D. OSD AND OSD-0 COMPLEXITY
OSD was originally introduced in [27] for classical codes to perform post-processing based on an exhaustive search, which requires matrix inversions in an iterative process that scales with $2^{N-\text{rank}(\mathbf{H})}$. This same algorithm, as said before, was evaluated to set a boundary for QLDPC codes and to check whether concatenation can improve BP-based decoders. Nevertheless, a recent study also focused on the combination of BP and OSD [19] reported that searching over all the configurations required by the original OSD soon becomes intractable for large codes, concluding that a more realistic approach is to apply just the simplest version of the OSD decoder, named OSD-0. Note that for the (255,32) and (126,28) QLDPC codes, whose performance is shown in Fig. 2, $2^{32}$ and $2^{28}$ rounds, respectively, are required by the original OSD in the worst case. In principle, this complexity does not justify the improvement in the error rate because, among other things, it is impossible to implement an OSD decoder that reaches an acceptable decoding latency. Although [19] does not include a hardware analysis or implementation, it gives a clear demonstration of why OSD-0 is the only plausible version of OSD that can be integrated into a real-time system. For this reason, it is the only one that we consider in this document.
The objective of OSD and OSD-0 is to take the reliability values of $\tilde{\mathbf{e}}$ from equation (3) when BP-based decoders have failed to converge and to look for the error pattern with the highest probability that satisfies equation (4). To do so, the steps of OSD-0 are:
1) Sort the reliability values, from the indexes of $\tilde{\mathbf{e}}$ most likely to have an error ($\tilde{e}_j = 1$) to the least likely ($\tilde{e}_j = 0$). Keep the sorted set in $\mathcal{J}_{OSD}$.
2) Reorder the columns of $\mathbf{H}$ according to $\mathcal{J}_{OSD}$, obtaining $\mathbf{H}_{OSD}$.
3) Build a square matrix with the first rank($\mathbf{H}$) linearly independent columns of the reordered matrix and rank($\mathbf{H}$) independent rows, to look for an invertible matrix $\mathbf{H}'_{OSD}$.
4) Compute the inverse of $\mathbf{H}'_{OSD}$: $\mathbf{H}'^{-1}_{OSD}$.
5) Calculate an error pattern that satisfies the syndromes: $\tilde{\mathbf{e}}'_{OSD} = \mathbf{H}'^{-1}_{OSD} \cdot \mathbf{s}$.
6) Build the final error pattern $\tilde{\mathbf{e}}_{OSD}$, in which the locations that correspond to columns not included in $\mathbf{H}'_{OSD}$ are equal to 0 and the locations included in $\mathbf{H}'_{OSD}$ are equal to $\tilde{\mathbf{e}}'_{OSD}$.
7) Undo the reordering of step 2), to correct the right indexes of the codeword.
It is easy to see that OSD-0 is not an iterative algorithm, but it entails complex operations like the sorting of elements in step 1) or a dynamic inversion of the matrix in step 4).
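The OSD-0 steps listed above can be sketched in software as follows. This is a minimal illustration under stated assumptions rather than the procedure as implemented in any cited work: the function name `osd0` is hypothetical, soft values are assumed to be signed so that the most negative entries are the most likely errors, and the explicit matrix inversion of steps 3) to 5) is replaced by an equivalent Gauss-Jordan elimination over GF(2) on the matrix augmented with the syndrome.

```python
def osd0(H, syndrome, soft):
    """OSD-0 sketch over GF(2). H: list of 0/1 rows, soft: one value per bit.

    Assumes the syndrome is consistent (it lies in the column space of H).
    """
    m, n = len(H), len(H[0])
    # Step 1: indexes sorted from most to least likely to be in error.
    order = sorted(range(n), key=lambda j: soft[j])
    # Step 2: reorder the columns and append the syndrome as an extra column.
    A = [[H[i][j] for j in order] + [syndrome[i]] for i in range(m)]
    # Steps 3-5: Gauss-Jordan elimination, taking the first linearly
    # independent columns (in sorted order) as pivots.
    pivots = []
    r = 0
    for c in range(n):
        if r == m:
            break
        p = next((i for i in range(r, m) if A[i][c]), None)
        if p is None:
            continue                      # column depends on earlier pivots
        A[r], A[p] = A[p], A[r]
        for i in range(m):
            if i != r and A[i][c]:
                A[i] = [a ^ b for a, b in zip(A[i], A[r])]
        pivots.append((r, c))
        r += 1
    # Step 6: pivot positions take the solved values, all others are 0.
    e_sorted = [0] * n
    for row, c in pivots:
        e_sorted[c] = A[row][n]
    # Step 7: undo the reordering.
    e = [0] * n
    for pos, j in enumerate(order):
        e[j] = e_sorted[pos]
    return e


if __name__ == "__main__":
    H = [[1, 1, 0], [0, 1, 1]]            # toy example, not a QLDPC code
    print(osd0(H, [1, 0], [-0.3125, 1.0, 1.1875]))
```

The elimination loop is the part that corresponds to the costly matrix inversion discussed in the text; the sorting of step 1) is a single call here but is a separate hardware cost.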

III. ARCHITECTURES FOR SB-MS AND OSD-0 ALGORITHMS AND LATENCY ANALYSIS
In this section, the hardware architectures for both SB-MS and OSD algorithms will be introduced to evaluate if the timing constraints of the qubits decoherence can be accomplished in different scenarios.

A. ARCHITECTURE FOR A SB-MS DECODER
In order to meet the latency requirements, we propose a parallel architecture, which consists of the implementation of one iteration of the SB-MS algorithm. In addition, parallel broadcasting techniques such as the ones in [34] are applied, as described next.
The complete decoder implements check-node units (CNUs) such as the ones in Fig. 3, and variable-node units (VNUs) like the one in Fig. 4. All the cells are interconnected according to the parity-check matrix of the code. If the code is dual containing, two decoders with exactly the same architecture are required; if the code is non dual containing, the two decoders have different interconnections between CNUs and VNUs for $\mathbf{H}_X$ and $\mathbf{H}_Z$. In both cases, however, their complexity and number of wires will be the same.
Each CNU (the $i$-th one) receives as input $d_c$ values of $\nu_{i,j}$, with $j \in \mathcal{N}(i)$, and the corresponding syndrome $s_i$. This unit has two different parts:
• The computation of the parity-check equation with the syndrome $s_i$, which is implemented by a tree of XOR gates that combines all the HD($\nu_{i,j}$) (sg($\nu_{i,j}$) in Fig. 3 and 4) and the syndrome. After the tree of XOR gates, the hard decision for each $j \in \mathcal{N}(i)$ is calculated by removing its own contribution from the total. This sub-unit implements $\bigoplus_{j' \in \mathcal{N}(i) \setminus j} \text{HD}(\nu_{i,j'}) \oplus s_i$ from equation (1), generating $d_c$ hard-decisions (1 bit per hard-decision) that are sent to the corresponding VNUs (sg($\mu_{i,j}$) in Fig. 3 and 4). Moreover, $d_c$ signals of one bit are also exchanged to indicate whether the connected VNU contains the first minimum or not (sl($\mu_{i,j}$) in Fig. 3 and 4).
• The computation of the magnitude of the exchanged messages, which is obtained by looking for the first and the second minimum of the received $|\nu_{i,j}|$ values. Once these minimums are obtained, they are scaled by the factor $\alpha$. This factor improves the convergence speed of the algorithm. In order to simplify the implementation, the set of possible $\alpha$ values is reduced to $\alpha = 2^{-a} + 2^{-b}$ with $a, b \in \{1, 2, 3\}$, so a hardwired multiplication based on one shifter and one adder is implemented instead of a complete multiplier. In order to reduce the routing congestion of the architecture, instead of sending $d_c - 1$ messages with the first minimum and one message with the second one, a broadcasting technique is adopted.
Each VNU (the $j$-th one) has $2 \times d_v$ inputs of $q$ bits, where $q$ is the message width, as each of the connected CNUs sends two minimums, plus $d_v$ hard-decisions ($d_v$ bits) and $d_v$ bits to select between the first and the second minimum. In the first step, the architecture selects whether it needs the first or the second minimum. After this, the next stage, which is initialized to zero in the first iteration, combines the hard-decision and the magnitude to compute, with a tree of adders, $\Lambda_j + \sum_{i \in \mathcal{M}(j)} \mu_{i,j}$ from equation (2).
As explained in the previous section, $\Lambda_j$ is set to a constant for all the values of $j$, i.e., $\Lambda_j = 1$, because the reliability of the error message is unknown in the first iteration and all the bits have the same reliability.
After the tree, each message eliminates its own contribution from the total, via a subtracter, to implement the $\nu_{i,j}$ values in equation (2). During the last stage, the values are saturated to control the growth of the data path. The messages $\nu_{i,j}$ are split into hard-decision and magnitude and sent to the connected CNUs. The outputs of both sub-units, CNUs and VNUs, are registered to limit the critical path of the circuit.
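The magnitude path of the CNU described above (first and second minimum, shift-and-add scaling, one select bit per connected VNU) can be modeled in software as below. This is only a behavioral sketch: the function names, the use of non-negative integer magnitudes and the default exponents `a=1`, `b=2` (i.e. a factor of 0.75) are assumptions made for this example.

```python
def scale_shift_add(magnitude, a, b):
    """Multiply a non-negative integer by 2**-a + 2**-b with two shifts and one add."""
    if a not in (1, 2, 3) or b not in (1, 2, 3):
        raise ValueError("a and b must be in {1, 2, 3}")
    return (magnitude >> a) + (magnitude >> b)


def cnu_magnitudes(mags, a=1, b=2):
    """Return (scaled first minimum, scaled second minimum, select bits).

    mags holds the magnitudes received by one check node (at least two).
    select[k] is 1 for the input that holds the first minimum, which
    therefore has to use the second minimum instead.
    """
    idx = min(range(len(mags)), key=mags.__getitem__)
    min1 = mags[idx]
    min2 = min(v for k, v in enumerate(mags) if k != idx)
    select = [1 if k == idx else 0 for k in range(len(mags))]
    return scale_shift_add(min1, a, b), scale_shift_add(min2, a, b), select


if __name__ == "__main__":
    print(cnu_magnitudes([24, 8, 16]))
```

Only two magnitudes and one bit per connection leave the unit, which mirrors the broadcasting idea in the text. Note that each right shift truncates, so the result can differ slightly from an exact multiplication when the magnitude is not a multiple of the corresponding power of two.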
According to the previous descriptions, it is easy to derive that the total latency per iteration is equal to two clock cycles, one to reach the outputs of the CNU and another to complete the operations in the VNU. So, assuming that the syndromes are available in parallel at the input, the global delay is equal to $L_{max} = 2 \times It_{max} / f_{max}$, where $It_{max}$ is the maximum number of iterations and $f_{max}$ is the maximum frequency of the architecture. It should be noted that $f_{max}$ is limited by the delay of the CNU and that of the routing wires, which are quite congested due to the large number of cells to be connected.
With this information, a range of required frequencies for the hardware implementation can be computed taking as reference the time budget of real quantum processors. Although this range can be modified by the technology of the processor itself, the number of qubits and the encoding of the information, it gives hardware designers an order of magnitude to evaluate whether the architectures are feasible for real systems or not. For example, taking as starting point the most restrictive time budget found in the literature [14], [13], that is 400 ns,² the maximum frequency of the circuit needs to be $f_{max}$ = 2×20/400 ns = 100 MHz, assuming 20 iterations. If both $\mathbf{H}_X$ and $\mathbf{H}_Z$ are decoded with the same device, i.e. for dual containing codes, twice this speed needs to be reached, so it would be higher than 200 MHz.
This maximum frequency could seem low and easy to obtain with current FPGA technology. However, although it is plausible, it is not easy to reach under every circumstance or set of code parameters, as there is a long critical path, with just two pipeline registers in the whole decoder, which depends on $d_c$ and $d_v$. Besides, the routing congestion increases with $d_c$, $d_v$, $M$ and $N$, and limits the maximum frequency by more than 50% compared to the limitation of logic depth, as will be shown in Section IV.
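The frequency bound used in this subsection can be reproduced with a few lines of Python. This is a small sketch of the arithmetic only; the function name `required_fmax_mhz` and the choice of units (nanoseconds in, MHz out) are assumptions made for the example.

```python
def required_fmax_mhz(cycles_per_iteration, max_iterations, time_budget_ns):
    """Minimum clock frequency (MHz) needed to finish within the time budget."""
    total_cycles = cycles_per_iteration * max_iterations
    return total_cycles * 1000 / time_budget_ns   # cycles per ns -> MHz


if __name__ == "__main__":
    # Two cycles per iteration, 20 iterations, 400 ns budget.
    single = required_fmax_mhz(2, 20, 400)
    print(single)        # one decoder instance per device
    print(2 * single)    # same device time-shared between both matrices
```

With two clock cycles per iteration, 20 iterations and a 400 ns budget, the function returns 100 MHz, and sharing one device between both matrices doubles the requirement to 200 MHz, matching the figures quoted in the text.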

B. ARCHITECTURES FOR OSD-0 ALGORITHM
Several architectures for OSD-0 have been proposed for classical error correction [35]-[36], and as the algorithm is exactly the same for QEC, the classical hardware designs should be evaluated for the quantum scenario.
Among the steps described in Section II, the most critical one is the inversion of the OSD matrix (step 4). The operations involved in this step establish the bottleneck of the OSD-0 architecture.
The inversion of the matrix does not have any low-complexity or suboptimal approach, as happens, for example, with the sorting process of the incoming messages (step 1). To perform it, Gaussian elimination over the Galois field GF(2) needs to be implemented. The two main architectures to perform Gaussian elimination in an efficient way are introduced in [37] and [38]. The first one [37] is made of a network that connects all the elements of the matrix in parallel, so for a matrix of dimension $n$ it is a network with $n \times n$ nodes, and it pivots columns and rows to choose the columns and rows to eliminate. Although it is a parallel implementation, the algorithm needs at least $2n$ rounds and at most $(n^2 + n/2)$ to compute the inverse of the OSD matrix. So, in the best scenario, the latency of the Gaussian elimination implementation is determined by $L_{max} = 2n / f_{max}$. In Table 1, we summarize the maximum frequency for different codes and for the latency constraint used as a reference. As can be seen, the value of $f_{max}$ required by the (882,24) QLDPC code is not reachable with current FPGA technology. Furthermore, it is important to notice that we only consider the delay introduced by the Gaussian elimination required by OSD-0. The time budget is even smaller, as the delay of the QLDPC decoder and of the rest of the steps of OSD-0 also has to be considered; here we assume them to be negligible just to establish an optimistic boundary.³ As mentioned in Section II, this code needs OSD-0 to avoid the error-floor; however, as analyzed here, a real-time implementation is not possible in this case, so it is discarded as a real solution. In [38], an efficient architecture for Gaussian elimination with fewer resources and without routing congestion is proposed. This architecture implements a systolic array in which each processor is connected only to a small number of processors that are neighbour nodes in the matrix.
This reduces the total number of wires and global connections, increasing the maximum frequency, but at the cost of increasing the number of clock cycles required to 3 + −2. Repeating the analysis performed for the parallel implementation, the equation that links the maximum frequency with the delay is max =(3 + − 2)/ max . As can be seen in Table 2, the results are even more restrictive with this approach, making it unrealistic to expect a real implementation that fits within the time budget imposed by decoherence. According to the previous analysis, it can be claimed that OSD-0 derived architectures are too complex or too slow due to the high number of operations and iterative calculations involved, and they are not a realistic approach for QEC decoding. The frequencies that FPGA devices should reach are not feasible with existing technology, and hence it is not possible to meet the latency constraints, because within this same time budget of 400 ns other operations, such as the sorting of the indexes according to the probability values (step 1 of OSD-0) or the reordering of the indexes (step 7), should also be performed. Besides, before applying OSD-0, an SB-MS decoder is applied which, due to its iterative nature, already consumes 80% of the time budget, as will be shown in the next section. For this reason, it is more reasonable to focus the efforts on optimizing BP-based decoders for QLDPC codes that show good error correction behavior. Moreover, it is important to remark that OSD-0 provides a binary error pattern without any soft information, turning a decoding failure into a logical error, while BP-based decoders keep the soft information from the messages after a decoding failure, so further processing can be added to improve the final error correction capacity. In the next section, real implementation results are included for an SB-MS decoder and a code that does not show any error degradation without OSD-0.

IV. IMPLEMENTATION RESULTS
In this section, hardware results for the SB-MS decoder are included. To perform the implementation and set a lower boundary, the (255,32) QLDPC code was selected, as: i) it is the one with the most restrictive parameters from the hardware point of view (it has high $d_c$ and $d_v$) and ii) it is one of the codes with the best error correction capacity. These codes are designed following the methods described in [39].
The code parameters, =112, =10 and =5, are representative for evaluating the routing congestion of the design and computing a realistic $f_{max}$, limited not only by the logic in the critical path but also by the wiring, which can serve as a reference for the implementation of other, simpler QLDPC codes.
Regarding performance, Fig. 2 shows that the error correction capacity is good enough without OSD-0, so the decoder can work as a standalone solution.
The decoder was implemented on an xcvu095ffva2104-2 FPGA device using a hardware description language (VHDL) and Xilinx's Vivado Design Suite. A finite-precision model was developed in Matlab. As shown in Fig. 5, the differences in error correction capacity between the quantized version with 6-bit messages and the full-precision one are almost negligible. The decoder was verified with ModelSim by comparing the outputs of the 6-bit Matlab golden model with the outputs of the hardware architecture. The area results for the implementation are included in Table 3 and the layout can be found in Fig. 6, where it can be seen that the decoder does not occupy the whole chip. This allows the logic and the wires between the different computational units to be distributed so as to reduce congestion and the critical path, increasing speed. As evidence, the decoder achieves $f_{max}$ = 125 MHz (clock period of 8 ns, with 78.7% of the delay due to routing and 21.3% due to logic), which is equivalent to a total latency of 320 ns with a maximum of 20 iterations. This latency is 80 ns below the time budget reported in [13], making this parallel architecture a promising candidate for real implementations of QLDPC codes for the QEC step: with two cores of this architecture, X and Z errors can be corrected for dual-containing and non-dual-containing codes, just changing the wiring between processing units in the latter case. It is important to highlight that serial implementations that process one CNU at a time would allow better convergence of the decoding but would increase the total latency, failing to meet the time constraints of the quantum system. From the latency results, the throughput can be obtained as the number of bits per frame times $f_{max}$, divided by the total number of clock cycles (iterations times clock cycles per iteration): (256 · 125 MHz)/(20 · 2) = 800 Mbps. Note that other implementations with a larger number of pipeline stages would obtain a higher throughput; however, for these systems the critical parameter is latency, not throughput.
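The latency and throughput figures reported above follow from simple arithmetic on the clock frequency and the cycle count. The Python sketch below reproduces them as a check. The function and variable names are illustrative; the value of two clock cycles per iteration is inferred from the reported figures (320 ns over 20 iterations at an 8 ns clock period, and the factor 2 in the throughput denominator) and is not stated explicitly as such in the text.

```python
def decoding_latency_ns(iterations: int, cycles_per_iteration: int,
                        fmax_mhz: float) -> float:
    """Total decoding latency in ns for a fixed number of iterations."""
    clock_period_ns = 1e3 / fmax_mhz  # 125 MHz gives 8 ns
    return iterations * cycles_per_iteration * clock_period_ns


def throughput_mbps(bits_per_frame: int, iterations: int,
                    cycles_per_iteration: int, fmax_mhz: float) -> float:
    """Throughput in Mbps when one frame is decoded at a time."""
    total_cycles = iterations * cycles_per_iteration
    return bits_per_frame * fmax_mhz / total_cycles


FMAX_MHZ = 125.0          # reported maximum clock frequency
ITERATIONS = 20           # reported maximum number of iterations
CYCLES_PER_ITERATION = 2  # inferred from the reported figures (assumption)
BITS_PER_FRAME = 256      # numerator used in the reported throughput
TIME_BUDGET_NS = 400.0    # time budget quoted from [13]

latency = decoding_latency_ns(ITERATIONS, CYCLES_PER_ITERATION, FMAX_MHZ)
rate = throughput_mbps(BITS_PER_FRAME, ITERATIONS, CYCLES_PER_ITERATION,
                       FMAX_MHZ)
margin = TIME_BUDGET_NS - latency

print(f"latency = {latency:.0f} ns, margin = {margin:.0f} ns")
print(f"throughput = {rate:.0f} Mbps")
```

Writing the calculation this way makes the trade-off explicit: any extra iteration or extra cycle per iteration adds directly to the latency, which is the binding constraint here, whereas pipelining would only raise the throughput.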
Concerning the dynamic power consumption, 0.63 W is spent on the signals between cells and 0.48 W on logic, supporting the claim that about half of the complexity lies in the wiring and the rest in the logic processors (CNUs and VNUs). The design has a static power consumption of 0.918 W. These results show that the proposed solution, with a total power consumption of 2.1 W, can be considered a low-power co-processor to increase the reliability of the quantum system.
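As a quick check of the "about half" statement, the split of the itemized dynamic power between signals and logic can be computed directly from the reported values. This is only an arithmetic sketch with illustrative variable names; it uses the three itemized figures given in the text and does not model any contribution that the text does not break down.

```python
SIGNALS_W = 0.63  # dynamic power spent on signals between cells
LOGIC_W = 0.48    # dynamic power spent on logic
STATIC_W = 0.918  # static power of the design

itemized_dynamic_w = SIGNALS_W + LOGIC_W
signals_share = SIGNALS_W / itemized_dynamic_w
logic_share = LOGIC_W / itemized_dynamic_w
itemized_total_w = itemized_dynamic_w + STATIC_W

print(f"signals share of itemized dynamic power: {signals_share:.1%}")
print(f"logic share of itemized dynamic power:   {logic_share:.1%}")
print(f"sum of itemized figures: {itemized_total_w:.3f} W")
```

The signals account for roughly 57% of the itemized dynamic power and the logic for roughly 43%. The three itemized figures sum to about 2.03 W, slightly below the reported 2.1 W total.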

V. CONCLUSION
In this work, we have evaluated whether hardware implementations of the SB-MS algorithm, alone and concatenated with OSD-0, can meet the timing constraints required to avoid qubit decoherence. The main conclusions are that a) it is not possible to implement the OSD-0 algorithm with the current or near-future VLSI technology, mainly due to its sequential nature, and b) the SB-MS algorithm can be implemented while meeting the timing requirement of QEC and is a solution with the potential to scale to a higher number of qubits. To reinforce these conclusions, an SB-MS decoder has been implemented on a Virtex FPGA device for the (255,32) QLDPC code, achieving a latency of 320 ns with 20 iterations, which is within the required time budget. To the best of the authors' knowledge, this is the first real implementation reported for an iterative QEC syndrome decoder. Finally, given that OSD-0 is not a practical solution for QLDPC decoders, future research will focus on improving the performance of message-passing decoding algorithms to solve problems such as the error floor, rather than on concatenating post-processing steps of higher complexity such as OSD.
Future work will try to further reduce the maximum latency to obtain extra timing margin and thus guard against further limitations that may appear during integration in a quantum system. Although the final objective of this research line is to integrate the error-correction step into a real system, current devices do not operate with the number of qubits handled by the decoders in this paper, so it will be necessary to wait for the next generation of quantum computers. In the meantime, more steps in this direction will be required to design hardware implementations of error-correction solutions that meet both the latency and logical error rate requirements of future quantum computers.