A Latency-Effective Pipelined Divider for Double-Precision Floating-Point Numbers

In this article, we propose an effective pipelined divider algorithm for double-precision floating-point numbers. The algorithm reduces the latency of previous pipelined dividers without increasing the lookup table size. Experimental results show that the proposed divider reduces the overall latency by up to 16% compared to the previous dividers.


I. INTRODUCTION
Generally, the latency of division is much longer than that of other arithmetic operations. Division operations have a significant effect on total execution times despite their low frequency of use [1]. Thus, the design of high-performance dividers has become an important issue in high-speed computing. Efficient pipelined dividers are required for 3D computer graphics, signal processing, and scientific computing applications whose performance depends heavily on division operations [2]-[6]. Such dividers are also widely used to reduce power consumption and latency in multimedia and signal processing applications [7].
In modern microprocessors, improving performance by reducing pipeline latency has become more difficult than improving it by increasing hardware size. With recent semiconductor technology, the available hardware budget keeps growing while clock speeds have largely stagnated. Moreover, reducing pipeline latency can allow several hardware components to be removed. For example, if an application-specific processor such as a GPU requires a wide register between pipeline stages, reducing the pipeline depth by one step eliminates an entire stage's worth of those register bits.
The associate editor coordinating the review of this manuscript and approving it for publication was Seok-Bum Ko .
In addition, the choice of pipeline latency may affect the overall architecture, so it may be necessary to fix the pipeline latency of a specific unit. For example, every 3DNow! instruction in [8] has a fixed two-cycle latency to simplify the overall architecture. A high-radix pipelinable division algorithm based on Taylor series expansion was proposed in [9]; it takes the first two terms of the Taylor series. Although this algorithm had a smaller lookup table (LUT) than other approaches, the table was still large. To reduce the LUT size significantly, a modification of this pipelinable division algorithm was introduced in [4]. Compared to [9], this algorithm reduces the LUT size from 13 KB to 208 B for single-precision and from 470 MB to 56 KB for double-precision. A further modification for double-precision floating-point numbers was proposed in [5]; it significantly reduces the chip area compared to [4], mainly by reducing the LUT size from 56 KB to 2.5 KB. Similar to [5], [10] takes the first six terms of the Taylor series expansion for the approximation in order to reduce the LUT size.
The size of the LUT and the computational latency are the most important elements in the design of a high-performance divider. The divider architecture in [9] guarantees low latency with a relatively large LUT; for double-precision in particular, [9] requires a LUT of about 470 MB, which is impractically large. In [4] and [5], the LUT size is reduced enough to make double-precision implementable, at the cost of increased computational latency.
In this article, a novel high-performance divider architecture is proposed that reduces computational latency with a LUT size similar to [4] and [5]. In [4] and [5], the multiplication step is performed after the LUT is accessed, whereas in the proposed architecture the LUT access and the multiplication are processed in parallel using additional multipliers. In this way, the proposed architecture reduces the pipeline depth by one step compared to the previous schemes. As a result, it can be expected to be the fastest double-precision pipelined divider with a reasonable area among the high-performance dividers introduced so far, giving users an additional performance-oriented option when selecting a divider.
Because all pipelinable division algorithms permit a 1 ulp (unit in the last place) error, they do not fully support the IEEE floating-point standard [12]. This means that our pipelined divider performs neither the rounding nor the remainder calculation required by the standard.
To determine the accuracy required for double-precision in the proposed architecture, we analyzed the types of error that can occur. From this analysis, the optimal size of each block is obtained and applied to the proposed architecture. To verify the proposed hardware divider, we compared it with other pipelinable divider algorithms and with traditional multiplicative algorithms.
Similar to [4] and [5], we compare the proposed algorithm with other algorithms by using the delays and the area cost estimation. According to the comparison results, the latency of the proposed division algorithm is about 16% and 7% faster, with an area of about 21% and 45% larger than the algorithms in [4] and [5], respectively.
In the remainder of this article, Section 2 reviews related work and Section 3 explains the proposed algorithm and architecture. In Section 4, we perform an error analysis to determine the bit-widths of all blocks in the divider. We compare the proposed algorithm with others in Section 5, and conclusions are provided in Section 6.

II. RELATED WORK
In [9], Y is decomposed into two groups: the higher-order bits (Y_h) and the lower-order bits (Y_l). The operands X and Y are normalized m-bit fixed-point numbers. Y_h consists of the higher-order p bits of Y and Y_l of the lower (m − p) bits of Y. That is, Y_h = 2^0 y_0 + 2^{-1} y_1 + 2^{-2} y_2 + · · · + 2^{-(p-1)} y_{p-1} and Y_l = Y − Y_h. A division operation can then be represented as

X/Y = X/(Y_h + Y_l) = X(Y_h − Y_l)/(Y_h^2 − Y_l^2).  (1)

Since the operands are normalized, the variables in (1) are bounded by 1 ≤ Y_h < 2 and 0 ≤ Y_l < 2^{-(p-1)}. Using the Taylor series, (1) can be expanded in powers of Y_l/Y_h as

X/Y = (X/Y_h)(1 − Y_l/Y_h + (Y_l/Y_h)^2 − (Y_l/Y_h)^3 + · · ·).

Hung's algorithm in [9] combines the first two terms of the Taylor series for the approximation:

Q ≈ X(Y_h − Y_l)/Y_h^2.  (2)
Hung's algorithm multiplies the dividend X by (Y_h − Y_l), obtained from the divisor, and then multiplies the result by 1/Y_h^2, read from a LUT. As mentioned in [4] and [9], (Y_h − Y_l) is computed without a subtraction, because it can be obtained by modifying the Booth encoder of the multiplier [11]. The size of the LUT is estimated to be about 13 KB for single-precision and 470 MB for double-precision. Although this is a very simple architecture, it still requires a large (or, for double-precision, huge) LUT.
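As a concrete illustration, the two-term approximation can be modeled in a few lines of floating-point code. This is a behavioral sketch under our own assumptions, not the hardware in [9]; the helper names are ours.

```python
# Illustrative model of Hung's two-term approximation X/Y ~= X*(Yh - Yl)/Yh^2.
# Behavioral sketch only; split() and hung_divide() are our own names.

def split(y, p):
    """Return the upper p bits Yh of a normalized divisor (1 <= y < 2)
    and the remainder Yl = y - Yh."""
    scale = 1 << (p - 1)            # p bits: one integer bit plus p-1 fraction bits
    yh = int(y * scale) / scale     # truncate y to p bits
    return yh, y - yh

def hung_divide(x, y, p):
    yh, yl = split(y, p)
    # In hardware, (Yh - Yl) comes from a modified Booth encoder and
    # 1/Yh^2 from the LUT addressed by Yh.
    return x * (yh - yl) / (yh * yh)

x, y = 1.9, 1.1
for p in (8, 15, 26):
    print(p, abs(hung_divide(x, y, p) - x / y))  # error shrinks as (Yl/Yh)^2
```

Since Y_l < 2^{-(p-1)}, the neglected terms are of order (Y_l/Y_h)^2, so each extra bit of p roughly quarters the error; this is why the LUT grows so quickly with the required precision.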
Jeong's algorithm in [4] reduces the size of the LUT significantly. It first determines a coarse quotient Q̂ by executing (2) with a very small LUT; the subdividend X̂ is then calculated by multiplying the divisor by this coarse quotient and subtracting the result from the dividend: X̂ = X − Y·Q̂. With this subdividend, a second quotient is obtained again using (2), and the final quotient is the sum of the two calculated quotients. This algorithm can be summarized as

Q = A·X + A·X̂, where A = (Y_h − Y_l)/Y_h^2 and X̂ = X − Y·A·X.  (3)

The hardware architecture of Jeong's algorithm is shown in Figure 1. It consists of a LUT and four multipliers (MULs) and requires four steps to complete its pipeline. The size of the LUT is estimated to be about 208 B for single-precision and 56 KB for double-precision.
To reduce the size of the LUT in [4] for double-precision, Singh's algorithm in [5] applies the approximation a third time, calculating one more corrective quotient in addition to the two computed in [4]; this algorithm is given in (4). The hardware architecture of Singh's algorithm is shown in Figure 2. It consists of one LUT, five MULs, and a carry-propagation adder, and requires five steps to complete its pipeline. Note that the multiplication and addition in the fourth step can be implemented by one multiply-and-accumulate (MAC) operator. The LUT size for double-precision is estimated to be about 2.5 KB. This algorithm reduces the chip area by about 81% compared to the divider proposed in [4].

III. PROPOSED DIVIDER ARCHITECTURE
To derive a latency-effective pipelined hardware divider, we proceeded as follows. First, we performed an error analysis on the two variants of the proposed divider architecture, Case 1 and Case 2, to obtain the basic data for determining the LUT size and the MUL bit-widths. Second, we calculated the optimal size of each block from these data. Finally, the divider architecture was obtained by applying the optimal block sizes.
According to Figure 1 and Figure 2, the value A is produced in the second step, after the LUT is accessed in the first step. These two operations are processed sequentially, which increases the total computational latency. The proposed architecture parallelizes the LUT access and the multiplication to reduce the total latency. Equation (3) of Jeong's algorithm can be rewritten as follows to exploit this parallelism:

Q = [(Y_h − Y_l)X/Y_h^2] · (2 − (Y_h − Y_l)Y/Y_h^2).  (5)

In (5), the LUT access for 1/Y_h^2 and the multiplications (Y_h − Y_l)X and (Y_h − Y_l)Y can be parallelized. Figure 3 illustrates the procedural steps based on (5). The latency of Jeong's algorithm is 1 LUT + 3 MULs, while the latency of the proposed algorithm is 3 MULs. The proposed algorithm requires one more multiplier than Jeong's algorithm.
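The algebraic equivalence behind this restructuring can be checked numerically. The following is a behavioral sketch under our own naming, not the article's implementation:

```python
# Behavioral sketch of the parallelized form Q = A*X * (2 - A*Y),
# with A = (Yh - Yl)/Yh**2. Names are ours, not from the article.

def split(y, p):
    """Upper p bits of a normalized divisor (1 <= y < 2) plus the remainder."""
    scale = 1 << (p - 1)
    yh = int(y * scale) / scale
    return yh, y - yh

def divide_parallel(x, y, p=15):
    yh, yl = split(y, p)
    # Step 1 (parallel): LUT read of 1/Yh^2 alongside the products
    # (Yh - Yl)*X and (Yh - Yl)*Y.
    recip = 1.0 / (yh * yh)
    tx, ty = (yh - yl) * x, (yh - yl) * y
    # Step 2: two parallel multiplications produce AX and AY.
    ax, ay = recip * tx, recip * ty
    # Step 3: bit inversion approximates (2 - AY); the final multiply gives Q.
    return ax * (2.0 - ay)

q = divide_parallel(1.7, 1.3)
print(abs(q - 1.7 / 1.3))   # applying the approximation twice leaves a tiny residual
```

Because the coarse quotient and the correction both reuse the same factor A, no result of the LUT read feeds the first-step multipliers, which is exactly what allows the LUT access to be moved off the critical path.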
Equation (4), based on Singh's algorithm, can be rewritten in the same manner to exploit the parallelism. Again, the LUT access for 1/Y_h^2 and the multiplications (Y_h − Y_l)X and (Y_h − Y_l)Y can be performed in parallel. The proposed procedural steps are depicted in Figure 4. The latency of Singh's algorithm is 1 LUT + 4 MULs, while the latency of the proposed algorithm is 4 MULs with one more multiplier.

IV. ERROR ANALYSIS
The error analysis of the proposed algorithm is essential for the design of the hardware divider, because it provides base data to determine the size of the LUT and the bit-widths of the MULs. In this section, error analysis for the two cases is carried out and then the optimal size of each block is obtained.

A. ERROR ANALYSIS FOR CASE 1
There are four types of error associated with Figure 3. The first is caused by the restriction on the number of entries in the LUT, which results from limiting the bit-width of Y_h to p. This error can be calculated by subtracting the result of (5) from the ideal quotient. The second error is caused by the bit-width restriction of the LUT output, which determines its accuracy. The third error is caused by the rounding positions of the MULs. Finally, the last error is caused by the bit inversion used to calculate (2 − AY), which is always 1 ulp. Note that the first, second, and fourth errors are the same as those in [4].

1) FOUR TYPES OF ERRORS
The first error of the proposed method is the same as that of [4]. This table-entry error E_TE is obtained by subtracting the actual quotient from the ideal quotient, where the ideal quotient is calculated by assuming that the bit-widths of X and Y and the precisions of the MULs are infinite.
When the bit-width of the LUT is q, the values stored in the LUT are 1/Y_h^2 rounded to the upper q bits; this produces the second error, E_TB. There are five MULs in Figure 3. As in [4], the round-to-zero mode is used to eliminate an adder for rounding in each MUL. When the output bit-widths of the MULs are m1, m2, m3, m4, and m5, each MUL contributes a corresponding rounding error E_M1, ..., E_M5. Finally, an error of 1 ulp is generated when (2 − AY) is calculated; this bit-inversion error E_BI depends on the output bit-width of multiplier M4.
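The per-multiplier bound can be illustrated directly: truncating a product to m fractional bits in round-to-zero mode loses at most 2^-m. The following is our own illustration of this property, not the article's error formulas:

```python
# Round-to-zero (truncation) never exceeds the exact value, and the loss
# is bounded by one ulp of the retained width: 0 <= error < 2**-m.

def round_to_zero(value, m):
    """Truncate a nonnegative value to m fractional bits."""
    scale = 1 << m
    return int(value * scale) / scale

exact = 1.3 * 1.7          # stand-in for an exact multiplier output
for m in (8, 16, 24):
    err = exact - round_to_zero(exact, m)
    assert 0 <= err < 2 ** -m
    print(m, err)
```

Because truncation error is one-sided, each E_Mi enters the total error with a known sign, which is what allows the maximum positive and maximum negative errors to be separated in the total-error analysis.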

2) TOTAL ERROR
To calculate the total error, it is important to know how each error propagates. In the first step, three operations are performed in parallel: the LUT access for 1/Y_h^2 and the multiplications (Y_h − Y_l)X and (Y_h − Y_l)Y; these generate the bit-width restriction error of the LUT, the rounding error of M1, and the rounding error of M2, respectively. In the second step, the two multiplications producing AX and AY are performed, each carrying the accumulated errors of its inputs plus its own rounding error. In the third step, the bit-inversion error is added when (2 − AY) is calculated. According to (5), the final quotient Q is then calculated from AX and (2 − AY) in multiplier M5, so the final quotient includes all of these error contributions. The second-order error terms can be ignored because they are very small, and the total error E_total is the sum of the remaining terms. In this sum, E_TE, E_M1, E_M3, E_M5, and E_BI are positive errors, while E_TB is a negative error and the terms including E_M2 and E_M4 are always negative. Thus, the maximum positive error occurs when E_TB, E_M2, and E_M4 take their minimum values and the others take their maximum values; the maximum negative error occurs in the opposite case. Following [4], the bounds can be derived by approximating X = 2 and Y = 1. According to the selection process of [4], we obtain p = 15, q = 28, m1 = 56, m2 = 56, m3 = 56, m4 = 57, and m5 = 53 for double-precision.

B. ERROR ANALYSIS FOR CASE 2
The error analysis of the proposed algorithm for Case 2 is performed based on the method in [5]; refer to [5] for the detailed procedure. Considering the accuracy requirement for double-precision, we obtain p = 11, q = 19, m1 = 57, m2 = 55, m3 = 57, m4 = 58, m5 = 57, and m6 = 53.
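As a sanity check on these parameters, the implied LUT sizes can be computed directly. This is our own back-of-the-envelope arithmetic, assuming the LUT holds one q-bit entry of 1/Y_h^2 per distinct Y_h, with 2^(p-1) entries because the leading bit of a normalized divisor is always 1:

```python
# LUT size implied by the selected parameters: 2**(p-1) entries of q bits each.
# Our own assumption about the table organization, for illustration only.

def lut_bytes(p, q):
    return (2 ** (p - 1)) * q // 8

print(lut_bytes(15, 28) // 1024)   # Case 1 (p=15, q=28): 56 KB
print(lut_bytes(11, 19))           # Case 2 (p=11, q=19): 2432 B
```

Under this assumption, the Case 1 parameters give exactly the 56 KB quoted for [4], and the Case 2 parameters give about 2.4 KB, consistent with the roughly 2.5 KB quoted for [5].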

V. COMPARISON WITH OTHER ALGORITHMS
In this section, the characteristics and performance of the proposed algorithm are compared with those of the previous pipelinable division algorithms and the traditional multiplicative algorithms. The delay and area cost have been calculated based on the analytical methods in [13] and [14]. Delays are expressed in terms of τ, the delay of a complex gate such as one full adder. The unit employed for area cost estimation is the size of one full adder, fa.

A. COMPARISON WITH PREVIOUS PIPELINABLE DIVISION ALGORITHMS
The comparison, in terms of the delay and area cost of the proposed scheme in relation to previous pipelinable division algorithms, is provided in Table 1. Total delay (or total area cost) for each algorithm is calculated by adding the delay (or area cost) of each pipeline step.
According to the results in Table 1, the proposed algorithm (Case 1) could reduce the critical path delay by about 16%, with a 21% larger hardware area compared to [4]. The proposed algorithm (Case 2) could reduce the critical path delay by about 7%, with a 45% larger hardware area compared to [5].
The proposed algorithm (Case 1) reduces the pipeline depth by one step compared to [4], while having the same longest stage delay (12.0τ) within the pipeline. This means that the proposed algorithm (Case 1) improves the pipeline latency by 25% compared to [4] at the same clock frequency. The same reasoning applies to the proposed algorithm (Case 2), which improves the pipeline latency by 20% compared to [5] at the same clock frequency. Compared to [4], the proposed algorithm (Case 2) has the same pipeline depth and reduces the area cost by 28% with an 8% slower clock frequency. For an accurate evaluation, the proposed architecture was also synthesized in a 28 nm process with the Synopsys Design Compiler, and the implementation results were compared with those of the other designs [4], [5]. Table 2 shows the comparison: Case 1 of the proposed algorithm reduces the pipeline latency by 22% with an area increase of 34% compared to [4], while Case 2 reduces the pipeline latency by about 11% with an area increase of 33% compared to [5].
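The latency-improvement percentages above follow directly from the depth reduction at a fixed clock period; the arithmetic can be checked in a couple of lines (illustrative only):

```python
# Latency gain from removing one pipeline step at the same clock frequency:
# each stage takes one clock period, so total latency scales with depth.

def pipeline_gain(old_steps, new_steps):
    return (old_steps - new_steps) / old_steps

print(pipeline_gain(4, 3))   # Case 1 vs [4]: 4 -> 3 steps gives 0.25 (25%)
print(pipeline_gain(5, 4))   # Case 2 vs [5]: 5 -> 4 steps gives 0.20 (20%)
```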
Comparing the results in Table 2 with those in Table 1, the delay reduction of Case 1 relative to [4] increased by 6 percentage points, from 16% to 22%, while its area overhead increased from 21% to 34%. The delay reduction of Case 2 relative to [5] increased from 7% to 11%, while its area overhead decreased from 45% to 33%. Across the two tables, the hardware area increased by an average of about 33% compared to the previous algorithms [4] and [5].
In general, power consumption is correlated with, and roughly proportional to, area. Area-timing products (ATP) are therefore widely used in this field as an indicator of power consumption [17], [18], and we used the ATP model to compare the proposed algorithm with the previous algorithms in terms of power. Table 4 shows the ATP results of the proposed algorithm and the previous algorithms [4] and [5]. For the proposed Cases 1 and 2, the ATP of the first step was calculated separately for the LUT and the multiplier, because the delay and power consumption of the two units differ. The ATP analysis shows that the power consumption of the proposed pipelined divider increases by about 37%, similar to the 33% increase in hardware size compared to the previous algorithms, confirming that area and power consumption are proportional.
Among previous works, a pipelined division algorithm was also used in [7]. The main differences between the proposed design and [7] are the way fractions are calculated and whether an error analysis for double-precision accuracy is performed. Since [7] uses floating-point arithmetic to calculate fractions, its hardware may be larger or its operating frequency lower than the proposed method, which uses fixed-point arithmetic. Moreover, and most importantly, an error analysis was performed in [4], [5], and this article, but not in [7].

B. COMPARISON WITH TRADITIONAL MULTIPLICATIVE ALGORITHMS
A multiplicative algorithm uses hardware integrated with a floating-point multiplier and a LUT, and calculates a quotient by approximation. The Newton-Raphson and series-expansion algorithms are well-known multiplicative algorithms. An important disadvantage of these algorithms is their long latency; they are also difficult to pipeline because of their iterative execution. For example, the LUT for a 16-bit-seed Newton-Raphson or series-expansion divider is 64 KB [15]. These 16-bit-seed dividers execute a double-precision division with three iterations.
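For reference, the iterative structure that makes these algorithms hard to pipeline looks as follows. This is a textbook Newton-Raphson reciprocal sketch with our own parameter names, not the specific design in [15]:

```python
# Newton-Raphson reciprocal: r <- r * (2 - y * r) roughly doubles the number
# of correct bits per iteration, so a 16-bit seed reaches double precision in
# about three dependent (hence hard-to-pipeline) iterations.

def nr_divide(x, y, seed_bits=16, iterations=3):
    scale = 1 << seed_bits
    r = int((1.0 / y) * scale) / scale   # seed reciprocal, as if read from a LUT
    for _ in range(iterations):
        r = r * (2.0 - y * r)            # each step depends on the previous one
    return x * r

print(abs(nr_divide(1.9, 1.3) - 1.9 / 1.3))
```

The data dependence between iterations is the key contrast with the pipelinable algorithms above, where every operation of a given step can start as soon as the previous pipeline stage completes.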
The comparison between the pipelinable division algorithms and the traditional multiplicative algorithms is discussed in detail in [4]. Based on the comparison results of [4], we compare the proposed algorithm with the traditional multiplicative algorithms, as shown in Table 3. As in [4], the functional units of the Newton-Raphson, series-expansion, and accurate quotient approximation algorithms are taken from the tables in [15] and [16].
The proposed algorithm has a larger area cost than Newton-Raphson, series-expansion, and accurate quotient approximation algorithms, but has a shorter latency and can be pipelined. Therefore, the proposed algorithm can be used effectively in systems where the divider is often used.

VI. CONCLUSION
In this article, a novel low-latency pipelined divider architecture for double-precision floating-point numbers was proposed. The pipeline depth is reduced by one step compared to the previous schemes. The proposed algorithm was applied to two conventional divider architectures and reduced the computational latency without increasing the LUT size. The proposed divider is suitable for systems where double-precision division is frequently required.