Design of Leading Zero Counters on FPGAs

This letter presents a novel leading zero counter (LZC) that efficiently exploits the hardware resources available within state-of-the-art FPGA devices to achieve high-speed performance with limited energy consumption. Post-implementation results, obtained for operand bit-widths varying between 4 and 64 bits, demonstrate that the new design outperforms its direct competitors in terms of occupied lookup tables (LUTs), power consumption, and computational speed. As an example, when implemented using the Xilinx Artix-7 xc7a100tcsg324 device, the new 64-bit LZC utilizes up to 36% fewer LUTs, dissipates up to 2.8 times lower power, and is up to 20% faster than state-of-the-art counterparts.


I. INTRODUCTION
EFFICIENT hardware implementations of leading zero counters (LZCs) are required in several applications, such as floating-point arithmetic computations [1], [2], the conversion of floating-point data to other formats [3], the design of mixed-precision computational units [4], the quantization of deep neural networks (DNNs) [5], and probabilistic approximate computing [6], just to cite some representative examples.
In recent years, field-programmable gate arrays (FPGAs) have evolved into hardware implementation platforms adequate to support the computational demands of the above-cited applications. On the one hand, many researchers focus their efforts on the design of complex computational data paths [7], [8], while on the other hand it is of interest to design basic computational modules, such as adders [9], multipliers [10], and LZCs [11], [12]. Typically, novel complex data paths are designed to make the best possible use of the advanced on-chip resources available within an FPGA device, such as digital signal processors (DSPs) and intellectual property (IP) cores. Conversely, to make innovative gate-level designs effective, basic computational modules are designed to use the logic resources based on lookup tables (LUTs), fast carry-chains (FCs), and flip-flops (FFs) as efficiently as possible.
This letter presents a new FPGA-based design for LZCs. The architecture described here utilizes LUTs more efficiently than the previous designs demonstrated in [11] and [12] and exhibits significantly reduced hardware resource requirements, power consumption, and computational delay. This is a valuable result, given that, as a part of the critical computational path, the LZC can contribute up to 30% to the worst-case delay of a floating-point unit [13] and up to 15% to the resource utilization [14].
The new LZCs have been implemented and evaluated using the Xilinx Artix 7-series xc7a100tcsg324 [15] and the Altera Cyclone 10 LP 10CL006YE144A7G [16] devices. In both cases, the results obtained clearly show the benefit of the proposed approach over its competitors.

II. BACKGROUND AND RELATED WORKS
An LZC is a basic computational module able to count the number of consecutive zeros (or ones) within a binary input, starting from its most significant bit (MSB). When an n-bit binary number A(n−1:0) is processed, the LZC provides log₂(n) + 1 output bits, one of which (typically called V) flags that all the n input bits are equal to zero, while the remaining bits [usually named Z(log₂(n)−1:0)] represent the number of counted zeros. As an example, in the case of the 8-bit input A = 00000011, an LZC furnishes V = 0 with Z(2:0) = 110.
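The input/output behavior just described can be captured by a small behavioral reference model. The following Python sketch is only an illustration of the definition, not a hardware description; the function name `lzc_reference` is hypothetical, and the value returned for Z when V = 1 is an arbitrary choice, since the text above does not fix it.

```python
def lzc_reference(a: int, n: int) -> tuple[int, int]:
    """Behavioral model of an n-bit LZC: returns (V, Z).

    V = 1 flags that all n input bits are zero.
    Z is the number of zeros counted from the MSB.
    """
    if not 0 <= a < (1 << n):
        raise ValueError("input does not fit in n bits")
    if a == 0:
        # Assumption: Z is treated as a don't-care here; 0 is an arbitrary pick.
        return 1, 0
    # bit_length() is the position of the leading one, counted from 1.
    return 0, n - a.bit_length()


# The 8-bit example from the text: A = 00000011.
v, z = lzc_reference(0b00000011, 8)
print(v, format(z, "03b"))  # prints "0 110"
```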
Several methods exist to determine the leading zero count. In the following, we refer to the FPGA-based implementations [11], [12] and to the approach presented in [17], which was originally developed for ASIC designs and has been replicated on FPGA to extend the comparison with the new method. The basic logic exploited in [11], [12], and [17] for the design of 8-bit LZCs is summarized in the truth table of Fig. 1. Wider leading zero counters are implemented by combining several instances of the 8-bit LZC into a hierarchical structure, as shown in Fig. 2(a) and (b) for the 32-bit LZCs presented in [11] and [17], and in [12], respectively. It is worth noting that [11] and [17] use the same hierarchical structure but, as depicted in the insets of Fig. 2(a), their 8-bit LZCs employ quite different logic, which leads to different hardware characteristics. From Figs. 1 and 2(b) it can be observed that the logic implemented in [12] is completely different from that of [11] and [17] and has been purposely tailored to the Xilinx FPGA fabric available in the 7-series devices [15].
As an alternative to the above architectures, the FCs available within modern FPGA devices may be exploited as shown in [18]. For comparison, FC-based LZCs are also characterized in the following.
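The general idea of building a wider LZC from 8-bit blocks can be illustrated behaviorally. The sketch below is a generic illustration under our own assumptions: it does not reproduce the gate-level logic of [11], [12], or [17], and the names `lzc8` and `lzc32_from_lzc8` are hypothetical. Each 8-bit group produces its own flag and count; the first group (from the MSB side) that is not all-zero selects the upper bits of the result, and its local count supplies the lower three bits.

```python
def lzc8(a: int) -> tuple[int, int]:
    """Behavioral 8-bit LZC: returns (V, Z); Z is arbitrary (0) when V = 1."""
    if a == 0:
        return 1, 0
    return 0, 8 - a.bit_length()


def lzc32_from_lzc8(a: int) -> tuple[int, int]:
    """32-bit LZC composed hierarchically from four 8-bit LZC results."""
    # Split the operand into four bytes, most significant byte first.
    groups = [(a >> shift) & 0xFF for shift in (24, 16, 8, 0)]
    results = [lzc8(g) for g in groups]
    for index, (v, z) in enumerate(results):
        if v == 0:
            # index supplies Z(4:3); the local count supplies Z(2:0).
            return 0, 8 * index + z
    return 1, 0  # all 32 bits are zero


print(lzc32_from_lzc8(0x00010000))  # prints "(0, 15)"
```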

III. PROPOSED DESIGN
This section introduces the new approach proposed here to design LZCs on FPGAs. It treats the condition in which all the input bits are equal to zero differently, based on the consideration that when V = 1 the count value Z does not matter at all. However, if a specific value of Z is required in such a case, the additional logic needed by the proposed approach is not much different from that required by traditional approaches. With respect to the previously described designs, the proposed method exploits a different granularity: it uses the 2-bit LZC, instead of the 8-bit one, as the basic block. Consequently, it requires deeper hierarchical architectures to construct wider LZCs. As shown in the following, these choices lead to a more efficient LUT utilization than [11], [12], and [17], even without applying any optimization process that takes a specific device structure into account.
Fig. 3(a) shows the hierarchical structure of the new 8-bit LZC based on two instances of the 4-bit LZC, each being in turn constructed using two instances of the basic 2-bit block whose outputs are combined by four auxiliary gates. The same auxiliary logic is utilized to construct the 8-bit LZC by combining the results obtained from two 4-bit LZCs, and so on for even wider operands. In this letter, n-bit LZCs have been designed and characterized, with n varying from 4 to 64. The n-bit LZC consists of log₂(n) hierarchical levels. The first one is composed of (n/2) instances of the 2-bit LZC and implements (1).
Fig. 2. LZC architectures used in (a) [11] and [17] and (b) [12].
It is worth underlining that the proposed 8-bit LZC complies with the third column of the truth table shown in Fig. 1. In fact, in the proposed logic, the case of all zero bits causes the flag V and the output bits Z(log₂(n)−1:1) to be asserted, while Z₀ is zeroed. This behavior allows the overall logic to be simplified.
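The hierarchical composition from 2-bit blocks, together with the all-zero behavior described above, can be sketched behaviorally as follows. This is a minimal illustration under our own assumptions and is not the letter's equation set or gate-level netlist: the 2-bit block equations and the merge rule below are a standard formulation chosen because it reproduces the stated all-zero outputs (V and the upper Z bits asserted, Z₀ zeroed). All function names are hypothetical.

```python
def lzc2(a1: int, a0: int) -> tuple[int, list[int]]:
    """Assumed 2-bit base block: returns (V, [Z0])."""
    v = (1 - a1) & (1 - a0)   # both bits are zero
    z0 = (1 - a1) & a0        # exactly one leading zero; 0 when all-zero
    return v, [z0]


def combine(left, right):
    """Merge two k-bit results into a 2k-bit result (Z bits are LSB-first)."""
    vl, zl = left
    vr, zr = right
    v = vl & vr
    # Lower bits: take the right half's count if the left half is all-zero.
    low = [(vl & r) | ((1 - vl) & l) for l, r in zip(zl, zr)]
    # New most significant count bit: the left half is all-zero.
    return v, low + [vl]


def lzc_tree(bits: list[int]) -> tuple[int, list[int]]:
    """bits are MSB-first; len(bits) must be a power of two >= 2."""
    if len(bits) == 2:
        return lzc2(bits[0], bits[1])
    half = len(bits) // 2
    return combine(lzc_tree(bits[:half]), lzc_tree(bits[half:]))


def z_to_int(z: list[int]) -> int:
    return sum(bit << i for i, bit in enumerate(z))


v, z = lzc_tree([0, 0, 0, 0, 0, 0, 1, 1])
print(v, z_to_int(z))       # prints "0 6"
print(lzc_tree([0] * 8))    # prints "(1, [0, 1, 1])", i.e. V = 1, Z = 110
```

With these assumed equations, an n-bit counter uses n/2 base blocks and log₂(n) levels, matching the structure described in the text, and the all-zero input needs no dedicated correction logic.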
The proposed LZCs have been described using the very high-speed integrated circuit hardware description language (VHDL) and then synthesized and implemented within an FPGA device. Fig. 3(b) shows how the VHDL description of the proposed 8-bit LZC can be implemented within only three 6-input LUTs. The LUT L0 is configured to perform in parallel the 5-input and the 4-input logic functions producing V 2 0 and Z 1 0. Analogously, L1 computes the signals Z 2 0 and Z 2 2 by means of the 5- and the 4-input LUTs, respectively. Finally, L2 is configured as one 6-input LUT to compute Z 2 1. The logic functions implemented in each LUT are obtained as follows.

IV. IMPLEMENTATION RESULTS
The new LZCs have been implemented using the Xilinx Artix 7-series 28-nm xc7a100tcsg324 FPGA device [15]. They have been characterized in terms of occupied LUTs, FCs (Carry4), computational delay (D), and dynamic energy consumption (E). Table I summarizes the post-implementation results obtained in comparison with other LZCs, including the built-in high-level-synthesis (HLS) leading zero counting function characterized in [12]. The new designs utilize fewer LUTs, are faster, and dissipate lower dynamic energy than the designs [11], [12], [17], and HLS. These benefits come from the deeper hierarchy exploited in wider LZCs and the simplification introduced to treat the case in which all the input bits are equal to zero. As an example, when n = 64, the new LZC uses 37.3%, 21.7%, 33.8%, and 35.6% fewer LUTs than the designs [11], [12], [17], and HLS, respectively. Moreover, it is 16.2%, 23%, 13.2%, and 19.4% faster and dissipates 3.37, 2.89, 3.1, and 3.2 times lower energy.
In comparison with the FC-based designs, the new LZCs always save significant amounts of resources and dissipate up to ∼6.5 times lower energy at the expense of a delay that is at most ∼13% worse.
TABLE I. POST-IMPLEMENTATION RESULTS RELATED TO THE XC7A100TCSG324 DEVICE
TABLE II. EVALUATIONS FOR THE 10CL006YE144A7G DEVICE
The proposed LZC architecture also outperforms its direct competitors in terms of the cost function EnDeLUC defined in (10). It can be seen that, with n = 8, higher improvements are achieved with respect to [11] and [17]. Then, as n increases to 16 or greater, the advantage over [12], HLS, and the FC-based designs also becomes quite evident. Note that the HLS built-in function occupies fewer LUTs and dissipates less energy than [11]. Indeed, while HLS designs are mapped on cascaded LUTs, the LZCs of [11] exploit parallel logic that leads to slightly lower delays. Results obtained from a leading zero detector, used as a simple toy circuit, have shown that out of 359 LUTs, 7.29 ns of maximum delay, and 31 pJ of energy consumption, the contribution of our 32-bit LZC is 6.7%, 31%, and 2.9%, respectively.
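For a rough sense of scale, the percentage contributions quoted for the toy circuit can be converted into absolute figures. The short Python sketch below only performs this arithmetic on the numbers stated above, assuming that each percentage is taken relative to the corresponding quoted total; the variable names are hypothetical and the resulting values are derived estimates, not measured data.

```python
# Totals reported for the toy circuit and the share attributed to the 32-bit LZC.
totals = {"LUTs": 359, "delay_ns": 7.29, "energy_pJ": 31.0}
shares = {"LUTs": 0.067, "delay_ns": 0.31, "energy_pJ": 0.029}

# Assumption: each share is relative to the corresponding total above.
absolute = {key: totals[key] * shares[key] for key in totals}

for key, value in absolute.items():
    print(f"{key}: {value:.2f}")
```

Under this assumption, the 32-bit LZC accounts for roughly 24 LUTs, about 2.26 ns of the maximum delay, and about 0.9 pJ of the energy.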
LUT requirements and computational delays have also been evaluated for the Altera Cyclone 10 LP series 60-nm 10CL006YE144A7G device [16] using the 1.2-V Slow Corner Model at 85 °C. This device has been chosen since it provides 4-input LUTs. Table II shows that, in comparison with the new designs, the best implementations among [11], [12], and [17] are up to 13.5% slower and utilize up to 17% more LUTs.

V. CONCLUSION
This letter presented new LZC designs that use the 2-bit LZC as the basic block and adopt a different way of dealing with the case in which all the input bits are zero. In comparison with state-of-the-art competitors, the new LZCs are cheaper, faster, and consume significantly less energy. As a further result, the efficiency of the proposed designs has been demonstrated on different FPGA devices.