A Practical Energy/Power Reduction Approach for Parallel Decimal Multiplier

Decimal computation is in high demand in many human-centric applications such as banking, accounting, tax calculation, and currency conversion. Hence, the design and implementation of radix-10 arithmetic units attract the attention of many researchers. Among the basic decimal arithmetic operations, multiplication is not only frequent but also highly complex and power-hungry. Therefore, this paper concentrates on this issue and studies a general design methodology that reduces power/energy consumption by localizing switching activity without compromising target performance. The method decomposes a digit multiplier into smaller ones, like Karatsuba's algorithm, while the multiplicand and the multiplier can be partitioned into different sizes. We take advantage of various partition sizes in two types, symmetric and asymmetric, each of which yields designs with specific characteristics. All designs were implemented in VHDL and synthesized with Synopsys Design Compiler using a TSMC 130 nm technology file. The results are significant: with respect to the original design, a 25% power reduction is achieved under random test vectors, with no effect on latency and a negligible area penalty. Moreover, the experimental results indicate a potential for considerable reduction of power dissipation based on the statistical properties of the expected input data.


I. INTRODUCTION
Despite the fast and effective implementation of binary arithmetic functions, decimal computation has been revived. This revival has three main reasons: (1) the advances in VLSI technology, (2) the appearance of large amounts of decimal data in human-centric applications such as financial, commercial, scientific, and internet-based applications, for which software implementations do not satisfy the high-performance requirements [1], and (3) the lack of an exact binary representation for some decimal fractions (e.g., 0.2). The first made the hardware realization of complex functions possible, while the other two pushed designers to use hardware-implemented decimal arithmetic units to cope with processing massive amounts of data with acceptable precision and time. Due to the importance of decimal arithmetic, decimal representations and related operations were added to the latest revision of the IEEE 754 standard for floating-point arithmetic [2,3]. Moreover, concerted activity on decimal arithmetic is progressing in both industry and academia. Several processors equipped with dedicated decimal arithmetic units have been announced, such as the IBM eServer z900 [4], IBM POWER6 [5], and IBM z10 [6]. In addition, a considerable number of research papers have been published on decimal arithmetic algorithms and hardware units, covering decimal addition (two-operand (e.g., [7]) and multi-operand (e.g., [8])), decimal multiplication (sequential (e.g., [9]) and parallel (e.g., [10])), decimal division (subtractive (e.g., [11]) and multiplicative (e.g., [12])), and other arithmetic functions (e.g., [13]). Among these operations, decimal multiplication is known as one of the most complex: it is frequent, time-consuming, and power-hungry.
In addition, it is used iteratively to implement other useful operations such as division, square root, and function-evaluation circuits like radix-10 exponentiation and logarithm. Hence, high-speed decimal multiplication has attracted particular attention, and several designs have been published on this topic in less than ten years. These works focus on latency and area as the main design parameters, while power/energy consumption is neglected. However, the cooling issue in high-performance processing systems, the limited power budget in embedded systems, and the effects of consumed power on the efficiency and reliability of digital circuits make power/energy consumption one of the most challenging parameters for today's hardware designers [14]. In [15], we provide a comparative study of the leakage and dynamic power consumption of published high-speed decimal multipliers and suggest some guidelines for EDA tools and hardware designers. In this paper, we aim to reduce the power/energy consumption of high-speed (i.e., parallel) decimal multipliers. To achieve the best result, we apply power reduction techniques at the highest level of design abstraction, since power optimizations at higher abstraction levels have the maximum influence. Our method is based on partitioning (like Karatsuba's algorithm) and reduces power/energy consumption by localizing switching activity without any negative impact on performance. We implement a digit multiplier of the size specified by the IEEE 754-2008 standard via smaller ones in two types, symmetric and asymmetric. In symmetric designs, all smaller multiplier units have the same size, while asymmetric designs may use multipliers of different sizes. The rest of this paper is organized as follows.
In Section 2, as background, we briefly describe the structure of parallel decimal multipliers and provide the theoretical foundation of power consumption, which paves the way for the discussions in the following sections. The proposed method, based on the two types of symmetric and asymmetric partitioning, is presented in Section 3. The essential criteria for selecting the primary multiplier units (i.e., the small parallel decimal multipliers) and the chosen decimal multiplication algorithm are discussed in Section 4. In Section 5, we explain the design flow of the synthesis tool used to generate the power consumption reports, along with the experimental results of the various implementations, including a power comparison with the original design. Finally, we conclude the paper in Section 6.

II. BACKGROUND
In this section, the theoretical background is concisely discussed in two separate parts. First, the general structure of parallel decimal multipliers and the previous related work are explained. Then, in the second part, the required background on power/energy consumption is described. This information paves the way for the detailed discussion in the following sections.

A. PARALLEL DECIMAL MULTIPLIERS STRUCTURE
Decimal multipliers, like their binary counterparts, have three main steps, which are called partial product generation (PPG), partial product reduction (PPR), and final addition (or redundant to non-redundant conversion). However, decimal multiplication is more complicated than binary multiplication in all of the aforementioned steps. The PPG in binary multiplication can be done by a simple AND-gate matrix. However, due to the wider range of decimal digits, decimal multiplication must use techniques like lookup tables, decimal digit-multipliers, or pre-computed multiples to provide the various multiples of the multiplicand. Moreover, because decimal numbers are implemented in binary logic with BCD encoding, all the decimal add operations in the PPR and final addition steps need a correction step to compensate for the different carry values in decimal and binary. Furthermore, decimal multipliers, like binary ones, can be designed in a variety of ways, namely sequential [9,16,17], parallel [10,18,19,20], and array [21], which offer area and speed trade-offs. In the sequential approach, the partial products are generated iteratively, one after another, based on the multiplier digits. Each generated partial product is then aligned and added to the previously accumulated partial products. With this technique, the product may be delivered in the minimum area; however, it is time-consuming and cannot satisfy the quest for high-performance processing. In contrast, parallel multipliers present low latency; however, not surprisingly, the area and power/energy consumption of these units are rather significant. Thanks to the astonishing advances in VLSI technology, the area challenge is alleviated. Nevertheless, power/energy consumption remains the most challenging issue for hardware designers [14]. In a parallel decimal multiplier, all partial products are generated simultaneously in the PPG step.
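The decimal correction step mentioned above can be illustrated with a minimal sketch (in Python, as an illustration only; the actual designs are in VHDL): when two BCD digits are added with a plain binary adder, adding 6 whenever the raw sum exceeds 9 realigns the binary carry (which occurs at 16) with the decimal carry (which must occur at 10).

```python
def bcd_digit_add(a, b, carry_in=0):
    """Add two BCD digits (0-9) with a binary adder and apply
    the decimal correction (+6) when the raw sum exceeds 9."""
    assert 0 <= a <= 9 and 0 <= b <= 9
    raw = a + b + carry_in          # what a 4-bit binary adder produces
    if raw > 9:                     # binary carry occurs at 16, decimal at 10
        raw += 6                    # +6 pushes the sum past 16, fixing the carry
    return raw & 0xF, raw >> 4      # (corrected digit, decimal carry out)

# Example: 7 + 5 = 12 -> corrected digit 2, decimal carry 1
digit, carry = bcd_digit_add(7, 5)
```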
As mentioned above, various methods exist for generating each partial product; using pre-computed multiples is dominant. In a naive implementation, all the possible multiples of the multiplicand (i.e., {0X, 1X, ..., 9X}) are needed. However, in practice, just a limited subset of multiples of the multiplicand, called primary multiples, is generated (e.g., {1X, 2X, 4X, 5X} or {±1X, ±2X, 5X, 10X}). The primary multiples can be generated in constant time. The other multiples are computed from the primary multiples (e.g., 9X = 4X + 5X or 9X = 10X − 1X). To construct the PPG matrix (i.e., all the partial products aligned), the multiplier digits are recoded, and the recoded values are used to select the proper multiples. This process is shown in the upper part of Fig. 1. The second step of parallel multiplication, partial product reduction (PPR), can be considered a decimal multi-operand addition. In PPR, a reduction tree computes the final product in a redundant representation. Finally, in the last step, a redundant to non-redundant conversion is done. Since the output of the PPR step is usually presented as two equally weighted numbers and the final product is achieved with a carry-propagating addition (or conversion), the last step is traditionally called final addition. The general architecture of a parallel decimal multiplier is shown in Fig. 1. The first parallel decimal multiplier was implemented in [18]. After that, several designs have been proposed in a period of fewer than ten years. These parallel decimal multipliers are partially or entirely distinguishable in the three main steps (i.e., PPG, PPR, and final addition), except that the inputs and output are represented in binary coded decimal (BCD) encoding.
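The selection of primary multiples can be sketched as follows (Python, illustration only; the decomposition table below is one representative choice, not the paper's recoding, and real designs generate the primary multiples with constant-time hardware tricks rather than integer additions).

```python
def primary_multiples(x):
    """Pre-compute the primary multiples {1X, 2X, 4X, 5X}."""
    m1 = x
    m2 = x + x        # 2X
    m4 = m2 + m2      # 4X
    m5 = m4 + x       # 5X
    return {1: m1, 2: m2, 4: m4, 5: m5}

# One representative decomposition of every digit 0-9 into at most
# two primary multiples (e.g., 9X = 4X + 5X); others are possible.
DECOMPOSITION = {0: (), 1: (1,), 2: (2,), 3: (1, 2), 4: (4,),
                 5: (5,), 6: (2, 4), 7: (2, 5), 8: (4, 4), 9: (4, 5)}

def multiple(x, d):
    """Return d*X using only the pre-computed primary multiples."""
    pm = primary_multiples(x)
    return sum(pm[k] for k in DECOMPOSITION[d])
```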

B. ENERGY/POWER CONSUMPTION
The drive toward more transistors and higher clock frequencies, to quench the high demand for more functionality in smaller and more effective portable devices, makes power consumption one of the most critical issues in digital systems design [14]. In digital CMOS circuits, the total power consumption (P_total) comes from two components, namely static (P_static) and dynamic (P_dynamic). The sub-threshold leakage through OFF transistors (I_sub), the gate leakage through the gate dielectric (I_gate), and the junction leakage from source/drain diffusions (I_junct) are the main parts of static power, which consumes energy without any switching activity in the circuit [28]. On the other hand, the sources of dynamic power are the charging/discharging of load capacitances, besides the short-circuit current when both the pMOS and nMOS stacks are partially ON. Based on the above explanation, the total power consumption in CMOS circuits can be determined by Equation (1):

P_total = P_static + P_dynamic = (I_sub + I_gate + I_junct) · V_DD + α · C · V_DD^2 · f    (1)
Besides the declared currents (I_sub, I_gate, and I_junct), V_DD is the supply voltage, α is the switching probability of a signal, C is the switching capacitance, and f is the operational frequency. In designing a low-power circuit, reducing any of the parameters mentioned above is desirable; however, except for the switching probability (α), they strongly depend on the synthesis technology. Therefore, designers focus on α, which has a direct impact on P_dynamic. Several low-power designs (e.g., [29], [30], and [31]) take advantage of switching activity as a key parameter for saving power. The most commonly used switching activity reduction technique is clock gating, which can be applied at the RTL and gate levels of design abstraction. However, this technique is suitable for sequential designs, where it reduces the toggle rates on registers, and cannot be applied to a fully combinational circuit. Moreover, power optimizations at the algorithm level (i.e., the arithmetic algorithm in this paper), the highest level of design abstraction, provide more choices for the designer to modify the design without compromising target performance or area. Therefore, in this paper, we concentrate on the algorithm level of design abstraction to reduce switching activity by localization. Based on our observations, the primary source of power consumption in parallel decimal multipliers is dynamic power, while static power is negligible. However, static power consumption has grown exponentially with scaling down to nanometer technologies, and considering static power reduction techniques will become mandatory in the near future. It is worth mentioning that the proposed approach also paves the way for applying static power reduction techniques such as power gating. Since reducing the operational frequency (f) results in lower power consumption, power consumption alone is not a convincing metric. Therefore, in this paper, we use the power-delay product, i.e., energy consumption, to compare different designs.
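A minimal numeric sketch of the dynamic term of Equation (1) and of the power-delay product (Python, with illustrative constants only) shows why localizing switching activity pays off: halving α halves P_dynamic regardless of the technology-dependent parameters C, V_DD, and f.

```python
def dynamic_power(alpha, c_sw, v_dd, f):
    """First-order dynamic power: P_dyn = alpha * C * V_DD^2 * f."""
    return alpha * c_sw * v_dd ** 2 * f

def energy_per_op(power_w, delay_s):
    """Power-delay product: the energy metric used to compare designs."""
    return power_w * delay_s

# Halving the switching probability alpha halves dynamic power,
# independent of the technology-dependent C, V_DD, and f.
p_full = dynamic_power(alpha=0.5, c_sw=1e-12, v_dd=1.2, f=100e6)
p_local = dynamic_power(alpha=0.25, c_sw=1e-12, v_dd=1.2, f=100e6)
```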

III. THE PROPOSED METHOD
The Karatsuba algorithm is a well-known divide-and-conquer technique for reducing the delay of multiplying large numbers. Several high-speed multipliers have been proposed based on this algorithm [32], [33]. This paper, however, focuses on the power reduction of decimal multiplication via partitioning, in the manner of the Karatsuba technique. It uses smaller multipliers (i.e., multiplier cells) and provides appropriate granularity for localizing switching activity. Nevertheless, some issues, such as the size of the utilized multiplier cells and the basic multiplication algorithm, should be studied carefully. The former is described in the following subsection, after a short overview of the decimal Karatsuba algorithm, and the latter is explained in the next section.

A. DECIMAL MULTIPLICATION VIA PARTITIONING
As mentioned above, the Karatsuba algorithm accelerates the multiplication of large numbers via divide and conquer. It was originally formulated for binary multiplication [34]. Applying this algorithm to decimal multiplication, for computing P = X × Y, where X = Σ x_i · 10^i, each operand can be recursively divided into two parts, X_H, X_L and Y_H, Y_L, respectively. Here, X_H and Y_H denote the most significant parts, while X_L and Y_L denote the least significant parts. Thus, assuming n = 2m and equal partitioning, the operands can be rewritten in the form of Equation (2):

X = X_H · 10^m + X_L    (2)

After partitioning, based on the Karatsuba-like multiplication algorithm, the product P = X × Y is constructed by Equation (3).
P = (X_H · 10^m + X_L)(Y_H · 10^m + Y_L) = X_H Y_H · 10^(2m) + (X_H Y_L + X_L Y_H) · 10^m + X_L Y_L    (3)

Implementing Equation (3) requires four m×m-digit decimal multiplications and two additions, of 2m and 3m digits. This algorithm can be applied recursively until 1×1-digit multiplications are reached. The abstract architecture of decimal multiplication based on partitioning is illustrated in Fig. 2. In the above explanation, equal partitioning is assumed, which makes the smaller multiplier cells uniform (i.e., symmetric). In practice, however, multiplier cells of different sizes are possible (i.e., asymmetric). Since we want to consider all promising configurations and find the best one based on the statistical properties of possible inputs, we study the implementation of a 16×16-digit multiplier (i.e., the operand size of the IEEE 754-2008 standard) in the two types, symmetric and asymmetric, in the following subsections.
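Equation (3) can be sketched functionally as follows (Python, for illustration only; the hardware computes the same four sub-products but combines them with aligned multi-operand addition rather than integer arithmetic):

```python
def split(x, m):
    """Split a non-negative decimal number into (high, low) at digit m."""
    return divmod(x, 10 ** m)

def partitioned_mul(x, y, m):
    """Compute x*y via Equation (3): four m-digit sub-products
    plus aligned additions (schoolbook decomposition)."""
    xh, xl = split(x, m)
    yh, yl = split(y, m)
    return (xh * yh * 10 ** (2 * m)          # X_H*Y_H aligned at 10^(2m)
            + (xh * yl + xl * yh) * 10 ** m  # cross terms aligned at 10^m
            + xl * yl)                       # X_L*Y_L, no alignment needed
```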

B. SYMMETRIC APPROACH
For symmetric designs, we consider three partitioning sizes, in which each operand is split into two, four, or eight equal parts. In the first design, the multiplier and multiplicand are partitioned into two equal parts, so four 8×8-digit multipliers are used to create a 16×16-digit multiplier according to Equation (3). The outputs of the multiplier cells must be aligned before the final reduction and conversion steps. This alignment is shown in Fig. 3, where the eight least significant digits form the least significant part of the final result without any extra computation. However, digit positions 8 to 23 need a multi-operand addition. Also, an increment operation is required for the eight most significant digits (increment with the carry out of the multi-operand addition). It should be mentioned that the algorithm of the multiplier cells is essential, since it has a significant impact on the detailed design and low-level architecture of the main multiplier. Thus, the discussion of the multiplier cells and the influential characteristics of their algorithm is presented in Section 4. The idea of divide and conquer can be used recursively to implement each multiplier cell with smaller ones. For example, a 4×4-digit multiplier needs four 2×2-digit multipliers, as mentioned above. With further partitioning, each 2×2-digit multiplier can be implemented via four 1×1-digit multipliers. Accordingly, we can implement a 16×16-digit multiplier by using sixteen 4×4-digit multipliers, as shown in Fig. 4. The extreme point of partitioning yields a parallel multiplier whose partial products are generated via digit-by-digit multipliers. Since all the state-of-the-art multipliers use pre-computed multiples (see Section 2), we define the 2×2-digit multiplier as the smallest multiplier cell.
According to this definition, we can partition the operands into eight equal parts and construct a 16×16-digit multiplier by using sixty-four 2×2-digit multipliers, as shown in Fig. 5. Reviewing the aforementioned architectures, two important issues should be considered. The first is the depth of the multi-operand adder, and the second is the output format of the multiplier cells. The depths of the multi-operand adders in Figs. 3, 4, and 5 are 3, 7, and 15, respectively. Obviously, using smaller multiplier cells provides more granularity; however, it increases the area consumption. Therefore, a detailed analysis with power dissipation monitoring is required to trade off the size of the multiplier cells against the area consumption. Regarding the second issue, in Fig. 2 the output of the multiplier cells is in BCD format, which is produced after a redundant to non-redundant conversion. This causes an unnecessary, time-consuming carry propagation that can be omitted, since the outputs of the multiplier cells feed a multi-operand adder. Of course, this adder then has to accept redundant inputs. This issue is related to the algorithm of the multiplier cells, which is presented in Section 4.
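The symmetric configurations above can be summarized with a small helper (Python sketch; the adder-depth expression 2k − 1 simply restates the depths 3, 7, and 15 reported for the two-, four-, and eight-way partitions, and is not claimed beyond those cases):

```python
def symmetric_config(n_digits, k):
    """For an n x n-digit multiplier split into k equal parts per operand:
    number of multiplier cells, cell size, and multi-operand adder depth
    (2k - 1, matching the depths 3/7/15 reported for k = 2/4/8)."""
    assert n_digits % k == 0
    cell = n_digits // k
    return {"cells": k * k, "cell_size": cell, "adder_depth": 2 * k - 1}

# The three symmetric 16x16-digit designs studied in this section:
configs = [symmetric_config(16, k) for k in (2, 4, 8)]
```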

C. ASYMMETRIC APPROACH
As mentioned before, multiplier cells of different sizes are possible. Although a bigger multiplier cell has more delay, it does not necessarily increase the total delay of the main multiplier, because the delay paths are already imbalanced. Moreover, an asymmetric design can show considerable superiority for input patterns with particular statistical properties (the experimental results are provided in Section 5). Since many partitionings are possible for asymmetric designs, the advantage of this category is shown by studying just one straightforward partitioning, to illustrate a specific pattern of power consumption in our proposed method. Fig. 6 shows the arrangement of the asymmetrically partitioned 16×16-digit multiplier. This multiplier is composed of various multiplier cells (i.e., four 4×4-digit multipliers, two 4×8-digit multipliers, two 8×4-digit multipliers, and one 8×8-digit multiplier).
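The asymmetric arrangement can be sketched as a general partitioned multiplication (Python, illustration only; the [4, 4, 8] split of each operand is an assumption, chosen because it is consistent with the cell mix listed above):

```python
def asymmetric_mul(x, y, x_parts, y_parts):
    """Multiply via arbitrary (possibly unequal) digit partitions.
    x_parts/y_parts list part sizes from least to most significant.
    A [4, 4, 8] split of both 16-digit operands yields the cell mix of
    the asymmetric design: four 4x4, two 4x8, two 8x4, one 8x8."""
    def split_parts(v, sizes):
        parts, pos = [], 0
        for s in sizes:
            parts.append(((v // 10 ** pos) % 10 ** s, pos))
            pos += s
        return parts
    total = 0
    for xp, xoff in split_parts(x, x_parts):
        for yp, yoff in split_parts(y, y_parts):
            total += xp * yp * 10 ** (xoff + yoff)   # one multiplier cell
    return total
```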

IV. MULTIPLIER CELLS ALGORITHM AND MULTI-OPERAND ADDER STRUCTURE
This section discusses the algorithm of the multiplier cells and the structure of the multi-operand adder, which processes the results of the multiplier cells. Among the mentioned characteristics, the reduced partial product digit set (RPDS) and encoding (RPDE) are the most important, since they reflect the output format of the multiplier cells. The reduced partial products of all designs need five or more bits for representation, except those of [20], [25], and [10], which are representable with just four bits (note that more bits mean more area and power consumption). The signed outputs of [25] and [10] cause difficulty in the multi-operand adder. Moreover, their multi-operand adders require a structure and algorithm totally different from those used in the reduction parts of the multiplier cells. In [20], however, not only are the outputs of the multiplier cells unsigned, but the same structure is also used in all the reduction levels. In this design, the primary multiples X, 2X, 4X, and 5X (PMV) are generated in parallel in a carry-free manner. Then, for partial product reduction, a novel 2-to-1 reduction module (i.e., the ODDS adder) is used, whose output digit set (RPDS) and encoding (RPDE) are [0,15] and overloaded decimal, respectively. Therefore, at the end of the reduction, a redundant to non-redundant conversion unit converts [0,15] digits to [0,9] digits with BCD encoding. In this paper, the algorithm of the multiplier cells and the structure of the multi-operand adder follow [20]. In other words, the smaller multipliers (i.e., the 2×2-, 4×4-, 4×8-, 8×4-, and 8×8-digit multiplier cells) mimic the proposed 16×16-digit multiplier of [20]; only the final conversion, as explained at the end of Section 3.2, is removed. It is worth noting that this technique is general and can be applied to any decimal multiplier. However, as mentioned above, we select the algorithm of [20] due to its flexibility in providing multipliers of different sizes. Moreover, the advantage of this algorithm in terms of power consumption was studied in our previous work [15].
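The final redundant to non-redundant conversion can be sketched as follows (Python, an illustrative model of the data representation only, not of the hardware): overloaded-decimal digits in [0, 15] with weight 10^i are converted to BCD digits in [0, 9] by a single carry-propagating pass.

```python
def odds_to_bcd(digits):
    """Convert an overloaded-decimal number (digits in [0, 15],
    weight 10^i, least significant first) to BCD digits in [0, 9]
    by one carry-propagating pass."""
    out, carry = [], 0
    for d in digits:
        assert 0 <= d <= 15
        carry, r = divmod(d + carry, 10)   # fold the excess into a carry
        out.append(r)
    while carry:                           # flush any remaining carry
        carry, r = divmod(carry, 10)
        out.append(r)
    return out

# [15, 12] represents 15 + 12*10 = 135 -> BCD digits [5, 3, 1]
```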

V. EXPERIMENTAL RESULTS
In this section, we discuss how to implement, test, verify, and synthesize different designs. After that, we compare the power/energy consumption of various designs and the impact of the proposed technique on power/energy and area.

A. DESIGN AND IMPLEMENTATION
We modeled all designs in VHDL, using a top-down methodology to manage the hardware complexity. After describing the hardware, all designs were verified using a large number of test vectors, including corner cases. The ModelSim toolbox was used for test and simulation.

B. LOGICAL SYNTHESIS AND POWER EXTRACTION FLOW
In order to achieve synthesis results with maximum accuracy, we used the Synopsys Design Compiler suite and TSMC 130 nm technology to synthesize the designs. First, we synthesized the designs to generate netlists and ".saif" files, with a critical path delay constraint of 10 ns. Then, in the second step, we used a ".do" file in ModelSim to annotate the netlist created in the previous phase and save the annotated switching activity in a ".back-saif" file. Finally, we used the back-annotated SAIF file to produce the detailed power consumption reports. Fig. 7 shows the data flow and the different phases of synthesis with Design Compiler. It should be mentioned that all designs met the 10 ns time constraint.

C. SYNTHESIS RESULTS
Tables II and III contain the synthesis results of the original multiplier [20] and the proposed architectures based on partitioning. In these tables, the detailed power consumption (i.e., switching power, internal power, leakage power, total power, and the ratio with respect to the original design) under various scenarios is summarized. The last design presents the results of the asymmetric architecture, which contains one 8×8-, four 4×4-, two 4×8-, and two 8×4-digit multipliers. The block diagrams of these designs were shown in Figs. 3 to 6. As two extreme cases, in the first scenario all the bits of the multiplicand and multiplier are active (i.e., they change during the simulation), while in the second all the bits of the multiplicand are constant and equal to zero. The other scenarios are obtained by partitioning the multiplicand and multiplier into equal halves, as illustrated in rows 3 to 8. Finer-granularity partitioning is presented in Table III, where the multiplicand and multiplier are partitioned into four equal parts and the scenarios are labeled 9 to 16. The total power consumption of the sixteen scenarios is plotted in Fig. 8, which provides a better view for comparing these architectures. Based on these results, the proposed designs consume less power/energy than the original multiplier in all scenarios. However, the Mult16-2 design cannot compete with the others, because its tiny multiplier cells require a large multi-operand addition and, as a result, higher power/energy consumption. Mult16-8 shows the best results in scenarios #2 and #11, since these scenarios match the architecture of Mult16-8 (see Fig. 3). The asymmetric architecture (Mult16-8-4) provides the best results in scenarios #5, #9, #10, and #12. In the other cases, Mult16-4 is the best.
Moreover, this architecture can provide about 25% power/energy reduction in scenario #1, which can be considered the general case in which all signals are active. Fig. 9 shows the area consumption of each design. As expected, the partitioning-based implementations incur an area overhead; this is also reflected in the leakage power figures in Tables II and III. Circuit complexity and area usually increase when power reduction techniques are applied. However, the main target of these techniques, as explained in Section 2, is decreasing the switching activity of the entire circuit.
The presented experimental results show that lower power/energy consumption is achievable with an acceptable area overhead.

VI. CONCLUSION
Due to the importance of decimal computation, this paper studies a design methodology for reducing the power consumption of decimal multipliers. The suggested method is based on partitioning, which constructs a large multiplier from smaller ones. The proposed architectures and the original multiplier (whose algorithm is also used for the multiplier cells) were implemented in structural VHDL and synthesized with Synopsys Design Compiler under the TSMC 130 nm technology library in typical conditions. The experimental results demonstrate a 25% power/energy reduction when all input signals are active. Moreover, they show that the architecture should be selected based on the statistical properties of the expected input data; notably, the asymmetric architecture provides the best results in specific scenarios.