A Low Power Radix-4 Booth Multiplier With Pre-Encoded Mechanism

The radix-4 Booth algorithm is widely used to improve the performance of multiplier because it can reduce the number of partial products by half. However, numerous additional encoders and decoders would cause the power consumption of the Booth multiplier to be considerable. In this paper, a new radix-4 Booth pre-encoded mechanism is proposed to reduce the power consumption of the Booth multiplier. The proposed design can effectively reduce the power of the Booth multiplier dissipated in the redundant activities by disabling the Booth encoders and decoders from unnecessary working. Particularly, since the control signals are generated early at the pipeline input register before the multiplier, the performance of our design is better than the traditional Booth multiplier. Based on the TSMC 40 nm technology, the simulation results show that the proposed pre-encoded mechanism can reduce the dynamic and static power by 45% and 65%, respectively, compared to the traditional 16-bit radix-4 Booth multiplier. Compared to the previous designs, the proposed design keeps the feature of race-free and has lower power consumption. Even compared to the approximate design, the proposed design has better power efficiency and can provide the exact products.


I. INTRODUCTION
Many digital signal processing (DSP) and machine learning applications are dominated by multiplication [1]-[4]; e.g., more than 90% of the computations in convolutional neural networks (CNNs) are multiply-accumulate (MAC) operations [5], [6]. Therefore, the multiplier is an important component in various hardware platforms. Conventional multiplication includes three major phases [3], [7]-[9]: (1) the two inputs (multiplier and multiplicand) are multiplied to generate the partial products (PPs); (2) the PP matrix is reduced to two rows by a partial product reduction scheme; (3) the remaining two rows of PPs are summed by the final carry-propagate addition. In particular, the second phase plays a significant role in power consumption, cost, and overall performance [7]-[9]. The radix-4 Booth algorithm can improve the performance of multiplication because it reduces the number of PP rows by half [3], [10], [11].
The associate editor coordinating the review of this manuscript and approving it for publication was Abdallah Kassem.
The authors of [11] provided a simple and intuitive encoding/decoding method to implement the radix-4 Booth algorithm; reference [12] provided a modified sign extension structure to reduce the cost and improve performance. In [16], the author found that the traditional radix-4 Booth implementations [13]-[15] may produce unnecessary glitches on the PPs. Thus, the author of [16] proposed a glitch-free Booth encoder and partial product generator to eliminate these glitches. However, these traditional designs still suffer from the high power consumption and high cost of the Booth encoders and decoders. Thus, the authors of [17] proposed a high-performance, low-cost radix-4 Booth decoder. The decoder of [17] also remains race-free, and its cost is lower than that of [16]. Research [18] developed a neg/two/one-nf generator (encoder and decoder) to reduce the glitches in the second phase of multiplication. This neg/two/one-nf generator has fewer encoded signals, and its signal paths are more balanced than those of other schemes. In [19], in order to reduce the cost of the Booth encoder and decoder, the authors proposed a novel modified Booth encoder (NMBE) scheme based on pass transistor logic (PTL).
For some error-tolerant applications, the approximate circuits can be employed to achieve low power, low circuit complexity, and high performance [9], [20]- [22]. The traditional radix-8 Booth algorithm can generate fewer PPs than the radix-4 Booth algorithm, but it needs additional adders to process the operation of odd multiples of the multiplicand. Therefore, the approximate 2-bit adder [20] was proposed to generate the triple multiplicand with no carry propagation to improve the performance. In [21], the approximate radix-4 Booth multipliers were proposed by using their approximate Booth encoders and approximate Wallace tree structure. In [22], three approximation techniques for the radix-4 Booth multiplier were proposed and these designs can reduce the logic complexity of the PP generator.
In this paper, we propose a radix-4 Booth multiplier with a pre-encoded mechanism to improve the power efficiency of multiplication. One specific feature of the radix-4 Booth algorithm is that when three consecutive bits of the multiplier Y ($y_{2i+1}$, $y_{2i}$, $y_{2i-1}$) have the same value, the corresponding PPs are 0. This feature inspired us to detect the "0X" case early so as to reduce the unnecessary switching activities of the radix-4 Booth encoders and decoders. Thus, we propose a pre-encoded scheme that detects the "0X" case before every multiplication. When the "0X" case occurs, the proposed pre-encoder turns off the Booth encoders and decoders to save power and sets the corresponding PPs to 0 directly, before the multiplication starts. The proposed design is simulated with Taiwan Semiconductor Manufacturing Company (TSMC) 40 nm technology. Compared with the traditional design [16] and the related designs [18], [19], [22], the simulation results show that the proposed design outperforms these designs in terms of transistor count, delay, and power consumption.
The rest of this paper is organized as follows. Section II reviews the traditional radix-4 Booth multiplier and the related works. Section III describes the low power radix-4 Booth multiplier with pre-encoded mechanism in detail. Section IV shows the simulation results of the proposed design. Section V offers a brief conclusion of this paper.

II. TRADITIONAL RADIX-4 BOOTH MULTIPLIER AND RELATED WORKS
Multiplication is a basic arithmetic operation; many DSP and machine learning applications are highly multiply-intensive [1]-[4]. Therefore, the power consumption and performance of the multiplier are important. However, the traditional array multiplier generates many PPs (an n × n multiplication has n PP rows) and accumulates all of them to obtain the final product; it consumes considerable power and is not power efficient. The Booth algorithm (radix-2 Booth algorithm) was then proposed to improve the performance of multiplication [23]; the radix-4 Booth algorithm (also called the modified Booth algorithm) [24] reduces the number of PP rows by half. In this section, we introduce the traditional radix-4 Booth algorithm and the related works.

A. TRADITIONAL RADIX-4 BOOTH ALGORITHM
The radix-4 Booth algorithm is a powerful method to improve the performance of multiplication and applies to two's complement operands. It partitions the multiplier Y ($y_{n-1} y_{n-2} \ldots y_0$) into overlapping groups of three consecutive bits. Each group is encoded and then decoded with the multiplicand X ($x_{n-1} x_{n-2} \ldots x_0$) to generate the corresponding PPs. The n-bit multiplication of the radix-4 Booth algorithm can be expressed as
$$X \times Y = \sum_{i=0}^{n/2-1} M_i \cdot 2^{2i} \cdot X \tag{1}$$
where the multiplicand X and the multiplier Y are n-bit two's complement numbers. From the three consecutive bits of the multiplier Y ($y_{2i+1}$, $y_{2i}$, $y_{2i-1}$), the radix-4 Booth algorithm generates the corresponding coefficient $M_i$, which has five possible values (±1, ±2, or 0) as shown in Table 1. According to (1), the radix-4 Booth algorithm reduces the number of PP rows by half, and one specific feature is that the corresponding PPs are 0s ($M_i = 0$) when the three consecutive bits ($y_{2i+1}$, $y_{2i}$, $y_{2i-1}$) have the same value. As shown in Table 1, this situation is the "0X" case. Notice that $y_{-1}$ is always 0 (the case i = 0).
Although the radix-4 Booth algorithm reduces the number of PPs, the cost of the radix-4 Booth multiplier is still high. Reference [12] provided a modified sign extension structure that further reduces the number of PPs and the cost. Fig. 1 shows the traditional radix-4 Booth multiplier with the modified sign extension structure. Each PP row has one Booth encoder and n + 1 Booth decoders. The Booth encoders and decoders generate the PPs; the full adders and the carry-propagate adder sum all PPs to obtain the final product. However, unnecessary glitches and switching activities occur in the traditional radix-4 Booth multiplier because of the unbalanced signal paths of the radix-4 Booth encoders and decoders, which leads to high power consumption [16]-[18].
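The recoding described above can be sketched in a few lines of Python. This is an illustrative model rather than the authors' implementation: it assumes the standard digit formula $M_i = -2y_{2i+1} + y_{2i} + y_{2i-1}$, which yields the five values (±1, ±2, 0), and the function names are hypothetical.

```python
def booth_radix4_digits(y: int, n: int) -> list:
    """Recode an n-bit two's complement multiplier y into n/2 digits in {-2..2}."""
    if n % 2 != 0:
        raise ValueError("n must be even for radix-4 recoding")
    # Python's >> on negative ints acts like an infinite two's complement
    # shift, so the low n bits are the n-bit two's complement pattern of y.
    bits = [(y >> k) & 1 for k in range(n)]
    digits = []
    y_prev = 0  # y_{-1} is always 0
    for i in range(n // 2):
        y_lo, y_hi = bits[2 * i], bits[2 * i + 1]
        digits.append(-2 * y_hi + y_lo + y_prev)  # assumed standard formula
        y_prev = y_hi  # overlapping bit shared with the next group
    return digits


def booth_multiply(x: int, y: int, n: int) -> int:
    """Sum the weighted partial products M_i * 4**i * x (n/2 rows)."""
    return sum(m * 4**i * x for i, m in enumerate(booth_radix4_digits(y, n)))


print(booth_radix4_digits(6, 8))  # [-2, 2, 0, 0]: two of four rows are "0X"
print(booth_multiply(-7, 6, 8))   # -42
```

For y = 6 only two of the four rows are non-zero, which is the property the pre-encoded mechanism exploits.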
The overall power consumption of a CMOS circuit can be defined as
$$P = a \cdot C \cdot V_{dd}^{2} \cdot f + I_{leakage} \cdot V_{dd} \tag{2}$$
where the first and second terms are the dynamic and static power consumption, respectively; a is the switching activity parameter, C is the total capacitance load, $V_{dd}$ is the supply voltage, f is the operating frequency, and $I_{leakage}$ is the leakage current [25], [26]. According to (2), the dynamic power consumption can be reduced by minimizing the switching activities.
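As a rough numerical illustration of the two terms, the sketch below assumes the usual model in which dynamic power is $a C V_{dd}^2 f$ and static power is $I_{leakage} V_{dd}$. The supply voltage and clock frequency match the simulation settings used later in this paper (1.0 V, 100 MHz); the capacitance, leakage current, and activity values are hypothetical placeholders, not measured data.

```python
def cmos_power(a, c_load, v_dd, f, i_leakage):
    """Return (dynamic, static) power in watts for the two-term CMOS model."""
    dynamic = a * c_load * v_dd**2 * f  # switching term
    static = i_leakage * v_dd           # leakage term
    return dynamic, static


V_DD = 1.0       # supply voltage in volts
F_CLK = 100e6    # clock frequency in hertz
C_LOAD = 1e-12   # hypothetical total load capacitance in farads
I_LEAK = 1e-6    # hypothetical leakage current in amperes

busy = cmos_power(0.2, C_LOAD, V_DD, F_CLK, I_LEAK)
quiet = cmos_power(0.1, C_LOAD, V_DD, F_CLK, I_LEAK)
print(busy, quiet)
```

Halving the switching activity halves only the dynamic term; the static term changes only if the leakage path itself is cut, which is why the proposed design combines activity reduction with gating.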

B. RELATED WORKS
The radix-4 Booth encoder and decoder of [11] is the most common implementation of the radix-4 Booth algorithm, and the authors of [13]-[15] provided more compact implementations. However, the author of [16] found that these designs cause unnecessary glitches on the PPs. Therefore, a glitch-free Booth encoder and partial product generator (decoder) were proposed to reduce these glitches. In [16], the propagation delay from inputs X and Y to the PP is only two units (one XOR/XNOR and the complex gate for the output), and all paths have almost the same propagation delay. In [17], the authors proposed a high-performance, low-cost radix-4 Booth decoder that is also race-free. In particular, the cost of [17] is lower than that of [16].
In [18], the authors proposed a neg/two/one-nf (nf means neg-first) generator as shown in Fig. 2, and the neg-first radix-4 Booth algorithm is shown in Table 2. The neg-first radix-4 Booth algorithm is a three-signal scheme; signals $one_i$, $two_i$, and $neg_i$ indicate which case ($y_{2i+1}$, $y_{2i}$, $y_{2i-1}$) belongs to. The signal $neg_i$ is for the negation operation and $cor_i$ is the correction bit for the negative operation. Neg-first means that the negation operation is done before the selection between "1X" and "2X". To generate a PP, the neg/two/one-nf decoder adopts an OR-AND-INV (OAI) gate, and all input signals of this OAI gate arrive at almost the same time (about one XNOR gate delay) as shown in Fig. 2. Therefore, the signal paths are more balanced than in other schemes and the glitches can be reduced. However, in the "−0X" case, $neg_i = y_{2i-1}$ may lead to more switching activities in the XNOR gate, since signals $one_i$ and $two_i$ can set the PP to 0 regardless of $neg_i$. Note that the neg-first design needs one more compact decoder to generate the first nx signal, and this decoder does not need to generate a PP, as shown in Fig. 3. Therefore, each PP row of the neg-first design has one encoder and n + 2 decoders.
The NMBE scheme [19], which is based on PTL, has a lower cost than the traditional design. However, this scheme suffers from the weak 1/strong 0 problem, and its signal paths are not balanced. Research [22] provided three approximation techniques for radix-4 Booth multipliers, one of which is called approximate Booth multiplier model 1 (ABM-M1). ABM-M1 is composed of exact partial product generators (radix-4 Booth encoder and decoder) and approximate 2-signal partial product generators (called PPG-2S). Taking the 8-bit ABM-M1 with m = 4 as an example, the PPs with a significance less than 4 are generated by PPG-2Ss and the remaining PPs are generated by the exact partial product generators. ABM-M1 can provide useful results with a low area-power product.

III. THE PROPOSED RADIX-4 BOOTH MULTIPLIER WITH PRE-ENCODED MECHANISM
In this paper, we propose a pre-encoded mechanism to reduce the power consumption of the radix-4 Booth multiplier. As mentioned above, unnecessary switching activities make the multiplier consume more power, and the PPs must be 0s in the "0X" case. For that reason, we propose a pre-encoded mechanism that detects the "0X" case early to reduce the unnecessary switching activities of the radix-4 Booth encoders and decoders. When it detects the "0X" case, the proposed pre-encoded mechanism turns off the Booth encoders and decoders immediately to save power.
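To get a feel for how often the "0X" case can arise, the following sketch counts the groups whose three bits are identical. It is an illustration under an assumption that the paper does not make, namely uniformly distributed multiplier values; the helper names are hypothetical.

```python
def zero_groups(y: int, n: int) -> list:
    """Flag each radix-4 group (y_{2i+1}, y_{2i}, y_{2i-1}) that is all-equal."""
    bits = [(y >> k) & 1 for k in range(n)]
    flags = []
    y_prev = 0  # y_{-1} is always 0
    for i in range(n // 2):
        flags.append(bits[2 * i + 1] == bits[2 * i] == y_prev)
        y_prev = bits[2 * i + 1]
    return flags


def zero_fraction(n: int) -> float:
    """Fraction of "0X" groups over all 2**n multiplier bit patterns."""
    hits = total = 0
    for y in range(2**n):
        flags = zero_groups(y, n)
        hits += sum(flags)
        total += len(flags)
    return hits / total


print(zero_fraction(8))            # 0.25 under the uniform assumption
print(sum(zero_groups(3, 16)))     # 6 of the 8 groups are "0X" for y = 3
```

Under the uniform assumption one quarter of the PP rows can be gated. Small-magnitude operands have long runs of identical sign bits, so they produce more "0X" groups, as the y = 3 example shows.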
The architecture and the timing chart of the proposed pre-encoded mechanism are shown in Fig. 4 and Fig. 5, respectively. As shown in Fig. 4, the proposed pre-encoded mechanism is composed of the multiplicand and multiplier registers, the additional proposed pre-encoders, and the radix-4 Booth multiplier (which includes the adders and the proposed low cost Booth encoders and decoders). The signals $x_{n-1\_db} \ldots x_{0\_db}$ (db means data bus) and $y_{n-1\_db} \ldots y_{0\_db}$ denote the multiplicand X and the multiplier Y on the data bus, respectively; $x_{n-1} \ldots x_0$ and $y_{n-1} \ldots y_0$ denote the outputs of the multiplicand and multiplier registers. As shown in Fig. 5, the multiplicand X and the multiplier Y are placed on the data bus during the data setup time before the multiplication (the multiplication phase). The proposed mechanism detects the "0X" case during this setup time (denoted as the pre-encode phase). To detect the "0X" case before the multiplication, the proposed design needs additional pre-encoders. If the "0X" case occurs, the proposed pre-encoders immediately turn off the corresponding Booth encoders and decoders, and the corresponding PPs are set to 0s to reduce the switching activities. Otherwise, the Booth encoders and decoders work as usual. Because the pre-encoders already handle the "0X" case, the proposed Booth encoders and decoders only need to process the "±1X" and "±2X" cases, so their cost can be reduced. Accordingly, the proposed pre-encoded mechanism has a lower cost than the other designs.

A. THE PROPOSED PRE-ENCODER
The proposed pre-encoded mechanism needs a pre-encoder to detect the "0X" case in the pre-encode phase. Table 3 shows the proposed pre-encoded radix-4 Booth algorithm. The proposed pre-encoded mechanism has three encoded signals. Signal $zero_i$, which is generated by the proposed pre-encoder, indicates whether the three consecutive bits of the multiplier Y on the data bus are the same. Signals $neg_i$ and $ot_i$ are generated by the proposed encoder. Signal $neg_i$ is for the negation operation; signals $neg_i$ and $ot_i$ together cover the remaining cases ("±1X" and "±2X"). Note that signal $zero_i$ has the highest priority, $cor_i$ is the correction bit for the negative operation, and d means "don't care".
According to Table 3, when $y_{2i+1}$, $y_{2i}$, and $y_{2i-1}$ on the data bus have the same value, the "0X" case occurs and the signal $zero_i$ is 1. The equation of $zero_i$ can be written as
$$zero_i = y_{2i+1\_db} \cdot y_{2i\_db} \cdot y_{2i-1\_db} + \overline{y_{2i+1\_db}} \cdot \overline{y_{2i\_db}} \cdot \overline{y_{2i-1\_db}} \tag{3}$$
where db means that $y_{2i+1}$, $y_{2i}$, and $y_{2i-1}$ arrive at the data bus in the pre-encode phase. According to (3), the proposed pre-encoder can be implemented with an AND-OR-INV (AOI) gate and a latch-like circuit as shown in Fig. 6(a). Fig. 6(b) shows the proposed pre-encoder at the transistor level. In the pre-encode phase, when the multiplier Y arrives on the data bus, the pre-encoder starts to detect the "0X" case. The signal $zero_i$ is generated to control the proposed low cost encoder and decoders (introduced in Section III-B) of the ith PP row; whether the encoder and decoders work in the multiplication phase depends on $zero_i$.
1) "Non-0X" cases: In the pre-encode phase, if $y_{2i+1\_db}$, $y_{2i\_db}$, and $y_{2i-1\_db}$ are not the same, the signal $zero_i$ is set to 0 by the proposed pre-encoder. Therefore, the proposed encoder and decoders of the ith PP row work as normal to generate the corresponding PPs in the multiplication phase.
2) "0X" case: In the pre-encode phase, if $y_{2i+1\_db}$, $y_{2i\_db}$, and $y_{2i-1\_db}$ are the same, the signal $zero_i$ is set to 1 by the proposed pre-encoder. Thus, the proposed encoder and decoders of the ith PP row are powered off to reduce power consumption, and the corresponding PPs are set to 0s directly in the pre-encode phase. In the multiplication phase, these gated encoder and decoders do not need to work since they are already turned off and the corresponding PPs are already set to 0s.
Note that the latch-like circuit shown in Fig. 6 is added to prevent the unnecessary switching activities that may occur when the "0X" case changes to a "non-0X" case. Fig. 7 shows the operation of the latch-like circuit in this special situation. As shown in Fig. 7(a), when the "0X" case changes to a "non-0X" case (signal $Z_i$ changes from 0 to 1) in the pre-encode phase, a pre-encoder without the latch-like circuit drives $zero_i$ to 0 immediately, which turns on the encoder and decoders and makes them redo the multiplication of the previous cycle (the "0X" case). To avoid this redundant multiplication, the latch-like circuit is required. As shown in Fig. 7(b), when the "0X" case changes to a "non-0X" case in the pre-encode phase, Clk is 0 and N2 is turned off. The pre-encoder with the latch-like circuit keeps $zero_i$ at its previous high voltage level to keep the encoder and decoders turned off, so the redundant multiplication is avoided. Once Clk changes from 0 to 1, N2 is turned on and the pre-encoder with the latch-like circuit pulls $zero_i$ to 0 through N0 and N2 to power on the encoder and decoders; the "non-0X" multiplication then starts normally in the multiplication phase.
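The behavior of the pre-encoder with its latch-like circuit can be summarized by a small behavioral model. This is a logic-level abstraction written for illustration: it ignores transistor-level details such as N0 and N2, and the function names are hypothetical. It assumes that the AOI output $Z_i$ is 0 exactly when the three data-bus bits are equal.

```python
def aoi_z(y_hi_db: int, y_mid_db: int, y_lo_db: int) -> int:
    """AND-OR-INV output Z_i: 0 when the three data-bus bits are all equal."""
    all_ones = y_hi_db & y_mid_db & y_lo_db
    all_zeros = (1 - y_hi_db) & (1 - y_mid_db) & (1 - y_lo_db)
    return 1 - (all_ones | all_zeros)


def pre_encoder_step(zero_prev: int, clk: int,
                     y_hi_db: int, y_mid_db: int, y_lo_db: int) -> int:
    """Return the next value of zero_i for the given clock level and bus bits."""
    if aoi_z(y_hi_db, y_mid_db, y_lo_db) == 0:
        return 1          # "0X" detected: gate the row immediately
    if clk == 1:
        return 0          # "non-0X" and clock high: release the row
    return zero_prev      # "non-0X" while clock low: hold the previous value


zero = pre_encoder_step(1, 0, 0, 0, 0)     # "0X" on the bus
zero = pre_encoder_step(zero, 0, 1, 1, 0)  # bus changes, Clk still 0
print(zero)                                # 1: the row stays gated
zero = pre_encoder_step(zero, 1, 1, 1, 0)  # rising edge of Clk
print(zero)                                # 0: the row is powered on
```

The asymmetry is the point of the latch: entering the "0X" case takes effect at once, while leaving it waits for the clock.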

B. THE PROPOSED LOW COST RADIX-4 BOOTH ENCODER AND DECODER
Because the "0X" case has been processed by the proposed pre-encoder, the encoder and decoder only need to process the remaining cases ("±1X" and "±2X"). For that reason, the costs of the radix-4 Booth encoder and decoder can be reduced. In this paper, we propose the low cost radix-4 Booth encoder and decoder shown in Fig. 8 and Fig. 9 to reduce the power consumption further. The gating techniques (power gating and ground gating) and the low cost of our design reduce both the dynamic and the static power consumption effectively. In particular, reducing static power consumption becomes more and more important as technology progresses [26], [27]. Table 4 shows a summary of the proposed pre-encoded radix-4 Booth algorithm. Signal $zero_i$ is used as the control signal of the gating transistors that are added in the proposed encoder and decoder. When $zero_i$ is 1, the proposed encoder and decoder are gated to reduce power consumption. Otherwise, the proposed encoder and decoder work as usual.
Fig. 8(a) and Fig. 8(b) show the details of the proposed encoder. As in the traditional design, the inputs $y_{2i+1}$, $y_{2i}$, and $y_{2i-1}$ of the proposed encoder are the outputs of the corresponding multiplier registers. Note that the inputs of the proposed pre-encoder are the multiplier Y on the data bus, since the pre-encoder needs to detect the "0X" case earlier, in the pre-encode phase. Because of this pre-encoder, the proposed encoder only needs to generate signals $neg_i$ and $ot_i$. According to Table 3, the expressions of $neg_i$ and $ot_i$ are
$$neg_i = y_{2i+1} \tag{4}$$
$$ot_i = y_{2i} \oplus y_{2i-1} \tag{5}$$
Signal $neg_i$ is equal to $y_{2i+1}$ and signal $ot_i$ can be generated by an XOR gate according to (4) and (5), respectively. The XOR gate can be implemented with CMOS logic [28].
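The three encoded signals can be checked against the Booth coefficient with a short script. It is a sketch under stated assumptions: $neg_i$ equals $y_{2i+1}$ and $ot_i$ is the XOR of $y_{2i}$ and $y_{2i-1}$, as the text says, while the polarity (here $ot_i = 1$ selects "±1X" and $ot_i = 0$ selects "±2X") and the function names are assumptions made for illustration. The model also ignores that $zero_i$ is derived from the data bus and the other two signals from the register outputs.

```python
from itertools import product


def pre_encode(y_hi: int, y_mid: int, y_lo: int) -> tuple:
    """Return (zero, neg, ot) for one group of three multiplier bits."""
    zero = 1 if y_hi == y_mid == y_lo else 0  # pre-encoder output
    neg = y_hi                                # negation flag
    ot = y_mid ^ y_lo                         # one/two select (assumed polarity)
    return zero, neg, ot


def digit_from_signals(zero: int, neg: int, ot: int) -> int:
    """Map the encoded signals back to the Booth coefficient M_i."""
    if zero:
        return 0                              # zero has the highest priority
    magnitude = 1 if ot else 2
    return -magnitude if neg else magnitude


for y_hi, y_mid, y_lo in product((0, 1), repeat=3):
    signals = pre_encode(y_hi, y_mid, y_lo)
    print((y_hi, y_mid, y_lo), signals, digit_from_signals(*signals))
```

Because $zero_i$ overrides the other two signals, $neg_i$ and $ot_i$ are "don't care" for the patterns 000 and 111.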
To reduce the power consumption in the "0X" case, the XOR gate of the proposed encoder adopts the gating techniques. As shown in Fig. 8(b), the gating transistor $P_{E0}$ is added between the power supply and the XNOR logic and is controlled by $zero_i$. The gating transistor $N_{E0}$ is added between GND and the XNOR logic and is controlled by $\overline{zero_i}$. Signals $zero_i$ and $\overline{zero_i}$ are generated by the proposed pre-encoder as introduced above.
1) "Non-0X" cases: In the pre-encode phase, if $y_{2i+1}$, $y_{2i}$, and $y_{2i-1}$ are not the same, the signal $zero_i$ is 0. Thus, the gating transistors $P_{E0}$ and $N_{E0}$ of the proposed encoder are turned on. The proposed encoder works as normal to generate the corresponding encoded signals in the multiplication phase.
2) "0X" case: In the pre-encode phase, if $y_{2i+1}$, $y_{2i}$, and $y_{2i-1}$ are the same, the signal $zero_i$ is 1. Thus, the gating transistors $P_{E0}$ and $N_{E0}$ of the proposed encoder are turned off to reduce the dynamic power consumption and the leakage currents. In the multiplication phase, the proposed encoder stays in standby mode to save power.
According to Table 3, the expression of the correction bit $cor_i$ can be written as
$$cor_i = \overline{zero_i} \cdot y_{2i+1} \tag{6}$$
Based on (6), the correction bit can be generated by an AND gate, which is simpler than the circuitry in [18]. The correction bit $cor_i$ is 0 in the "0X" case; otherwise, the value of $cor_i$ depends on $y_{2i+1}$ ($neg_i$) as summarized in Table 4.
Fig. 9(a) and Fig. 9(b) show the details of the proposed decoder. As in the traditional design, the input $x_j$ of the proposed decoder is connected to the output of the corresponding multiplicand register. From the encoded signals of the proposed pre-encoder and encoder, the proposed decoder generates the corresponding PP. As introduced before, the PP should be 0 when the "0X" case occurs. The PP of "+1X" is $x_j$ and the PP of "−1X" is $\overline{x_j}$. The PP of "+2X" is $x_{j-1}$ and the PP of "−2X" is $\overline{x_{j-1}}$.
Table 4 summarizes the PP value of each case; the PP values of ''±1X '' and ''±2X '' can be defined as (7) and (8), respectively.
Based on Table 4, the expression of the PP value can be written as
$$PP_{ij} = \overline{zero_i} \cdot \left( ot_i \cdot nx_j + \overline{ot_i} \cdot nx_{j-1} \right), \quad nx_j = neg_i \oplus x_j \tag{9}$$
According to (9), the proposed decoder can be composed of an XOR gate and two multiplexers (MUXs). The XOR gate is implemented with CMOS logic [28] and its output is shared with the neighboring decoder. The first MUX, which is controlled by $ot_i$, is implemented with transmission gates (TGs); the second MUX, which is controlled by $zero_i$, can be implemented easily with an NMOS transistor. Like the proposed encoder, the proposed decoder adopts the gating techniques. As shown in Fig. 9(b), the gating transistor $P_{D0}$, which is controlled by $zero_i$, is added between the power supply and the XOR logic; the gating transistor $N_{D0}$, which is controlled by $\overline{zero_i}$, is added between GND and the XOR logic.
1) "Non-0X" cases: In the pre-encode phase, the signal $zero_i$ is 0 when $y_{2i+1}$, $y_{2i}$, and $y_{2i-1}$ are not the same. In the multiplication phase, the gating transistors $P_{D0}$ and $N_{D0}$ of the proposed decoder are turned on to generate the signal $nx_j$ as normal. Therefore, the proposed decoder generates the corresponding PP according to the encoded signals $ot_i$ and $zero_i$.
2) "0X" case: In the pre-encode phase, the signal $zero_i$ is 1 when $y_{2i+1}$, $y_{2i}$, and $y_{2i-1}$ are the same. Since $zero_i$ is 1, the PP is set to 0 directly in the pre-encode phase. In addition, the gating transistors $P_{D0}$ and $N_{D0}$ of the proposed decoder are turned off to reduce the dynamic power consumption and the leakage currents. In the multiplication phase, the PP is 0, and the proposed decoder stays in standby mode to save power.
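The decoder behavior in the two cases above can be modeled at the bit level as a check that one PP row plus its correction bit equals $M_i \cdot X$. The sketch below is an illustrative model, not the circuit of Fig. 9: it assumes $x_{-1} = 0$, uses plain sign extension to n + 1 bits instead of the modified sign extension structure of [12], assumes that $ot_i = 1$ selects "±1X", and all function names are hypothetical.

```python
def pp_row(x: int, y_hi: int, y_mid: int, y_lo: int, n: int) -> tuple:
    """Return (bits, cor): n+1 PP bits (LSB first) and the correction bit."""
    zero = 1 if y_hi == y_mid == y_lo else 0
    neg = y_hi
    ot = y_mid ^ y_lo                  # assumed: 1 selects x_j, 0 selects x_{j-1}
    cor = (1 - zero) & neg             # AND of NOT zero and neg
    x_bits = [(x >> k) & 1 for k in range(n)]

    def x_bit(j: int) -> int:
        if j == -1:
            return 0                   # assumed x_{-1} = 0
        if j >= n:
            return x_bits[n - 1]       # plain sign extension (assumption)
        return x_bits[j]

    bits = []
    for j in range(n + 1):
        nx_one = neg ^ x_bit(j)        # XOR output for the "1X" path
        nx_two = neg ^ x_bit(j - 1)    # shared XOR output of the neighbor
        selected = nx_one if ot else nx_two
        bits.append(0 if zero else selected)
    return bits, cor


def row_value(bits: list, cor: int) -> int:
    """Interpret the bits as two's complement and add the correction bit."""
    unsigned = sum(b * 2**k for k, b in enumerate(bits))
    signed = unsigned - 2**len(bits) if bits[-1] else unsigned
    return signed + cor


bits, cor = pp_row(0b01010101, 1, 1, 0, 8)   # "-1X" with X = 85
print(bits, cor, row_value(bits, cor))       # row value is -85
```

In the "0X" case every bit and the correction bit are 0, so nothing downstream depends on the gated XOR outputs.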
As shown in Fig. 10, each PP row of the proposed pre-encoded mechanism has one pre-encoder, one encoder, and n + 2 decoders (10 decoders for n = 8), which is similar to [18]. Clearly, the decoders account for the majority of the circuitry cost of each PP row. To minimize the cost of the proposed decoders, the additional gating transistors can be shared. As shown in Fig. 10, every five decoders share a set of gating transistors. Taking 8-bit multiplication (n = 8) as an example, there are 10 decoders including the compact decoder in each PP row, and these decoders can be divided into two groups. In the first group, decoder 0 to decoder 3 and the compact decoder share the gating transistors $P_{D0}$ and $N_{D0}$. Notice that the compact decoder is implemented only by the XOR gate with the gating transistors because it does not need to generate a PP. In the second group, decoder 4 to decoder 8 share the gating transistors $P_{D1}$ and $N_{D1}$. When $zero_i$ is 0, these two groups work as usual to generate the corresponding PPs. When $zero_i$ is 1, these two groups are turned off to reduce power consumption and the PPs of this row are set to 0s immediately.
Table 5 summarizes the comparison between the proposed pre-encoded design and the related designs [16], [18], [19], [22]. The traditional design [16] is a four-signal scheme and has balanced signal propagation paths, but its cost is higher than that of the others. The NMBE design [19] is a three-signal scheme based on PTL. The cost of the NMBE design [19] is low, but its signal paths are not balanced. Moreover, the NMBE design [19] suffers from the weak 1/strong 0 problem, which increases the static power. To reduce the circuit complexity, the ABM-M1 design [22] adopts approximate decoders (PPG-2S) in some least-significant bits. The encoder of [22] has 50 transistors, more than the others, since this encoder needs to encode for both the exact and the approximate decoders.
The ABM-M1 design [22] can only be used in the error-tolerant applications. Same as the neg-first design [18], the proposed design has three encoded signals and balanced signal propagation paths. However, the neg-first design [18] has the ''−0X '' case that may lead to more switching activities in XNOR gates of decoders. For n-bit multiplication of design [18] and the proposed design, one PP row needs one encoder and n+ 2 decoders (one additional pre-encoder for the proposed design). Obviously, the proposed design has less cost than the others even though the proposed design needs the additional pre-encoder. Because of the pre-encoder, the proposed design can turn off the encoder and decoders to save power in the ''0X '' case. Therefore, we expect that the proposed design has the better power efficiency than the other designs.

IV. SIMULATION RESULTS
In this paper, the related works [16], [18], [19], [22], and the proposed pre-encoded mechanism are simulated by using TSMC 40 nm CMOS technology. The supply voltage is 1.0V, the clock frequency is 100 MHz, and the simulation is done by HSPICE tool. We simulate generating one PP row of n-bit multiplication (n = 8 or n = 16); we provide the power consumption, the performance, and transistor count (TC) to show the effectiveness of the proposed pre-encoded mechanism. We also provide the overall comparisons of multipliers to prove the superiority of the proposed design over the related works.

A. FUNCTIONALITY
To verify the feasibility and correctness of the proposed pre-encoded mechanism, the TSMC 40 nm technology is used to simulate two specific scenes with HSPICE. One scene is the change from the "0X" case to a "non-0X" case and the other is the change from a "non-0X" case to the "0X" case. Fig. 11 and Fig. 12 show the waveforms of these two scenes. The black solid line is the clock signal Clk, the blue solid line is the encoded signal $zero_i$, and the green solid line is the decoder output $PP_{ij}$. The dotted red and solid purple lines indicate $y_{2i\_db}$ on the data bus and the register output $y_{2i}$, respectively.
Fig. 11 shows the waveforms of the "0X" case changing to a "non-0X" case. We choose i = 1 and j = 1 to describe the waveforms in detail. Suppose that the multiplicand X is fixed at 01010101 and the three consecutive bits of the multiplier Y ($y_3$, $y_2$, $y_1$) change from (0, 0, 0) to (1, 1, 0), that is, the "0X" case changes to the "−1X" case. In the beginning, ($y_3$, $y_2$, $y_1$) = (0, 0, 0); thus, $zero_1$ is 1 and $PP_{11}$ is 0. When $y_{3\_db}$ and $y_{2\_db}$ are set to 1 ($y_3$ and $y_2$ are still 0) in the pre-encode phase, the pre-encoder output $zero_1$ stays at the high voltage level to prevent unnecessary switching activities, as introduced in Section III-A, until Clk changes from 0 to 1. When Clk changes from 0 to 1 (the multiplication phase), the registers store the new data from the data bus, and $zero_1$ changes from 1 to 0 so that the encoder and decoders work as normal. $PP_{11}$ changes from 0 to 1 because $y_3$ ($neg_1$) is 1, $x_1$ is 0, and $zero_1$ is 0.
Fig. 12 shows the waveforms of a "non-0X" case changing to the "0X" case. We again choose i = 1 and j = 1 to describe the waveforms in detail. Suppose that the multiplicand X is fixed at 01010101 and the three consecutive bits of the multiplier Y ($y_3$, $y_2$, $y_1$) change from (0, 1, 1) to (0, 0, 0), that is, the "+2X" case changes to the "0X" case.
In the beginning, ($y_3$, $y_2$, $y_1$) = (0, 1, 1); therefore, $zero_1$ is 0 and $PP_{11}$ is 1 ($y_3$ = 0 and $x_0$ = 1). When $y_{2\_db}$ and $y_{1\_db}$ are set to 0 ($y_2$ and $y_1$ are still 1) in the pre-encode phase, the pre-encoder output $zero_1$ is set to 1 directly to turn off the encoder and decoders. Notice that $PP_{11}$ is also set to 0 immediately in the pre-encode phase because the pre-encoder detects the "0X" case. When Clk changes from 0 to 1 (the multiplication phase), the registers store the new data from the data bus, and the encoder and decoders have already been turned off to reduce power consumption. Fig. 11 and Fig. 12 confirm the correctness and feasibility of the proposed pre-encoded design.
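The steady-state values quoted for the two scenes can be reproduced with a logic-level check of a single decoder output. The sketch covers only the settled values of $PP_{11}$ before and after each transition, not the timing of the waveforms; it assumes $x_{-1} = 0$ and that the XOR of $y_{2i}$ and $y_{2i-1}$ selects "±1X" when it is 1. The names are hypothetical.

```python
X_BITS = "01010101"  # multiplicand used in both scenes, MSB first


def x_bit(j: int) -> int:
    """Bit x_j of the multiplicand; x_{-1} is taken as 0 (assumption)."""
    if j == -1:
        return 0
    return int(X_BITS[len(X_BITS) - 1 - j])


def pp_bit(y_hi: int, y_mid: int, y_lo: int, j: int) -> int:
    """Settled decoder output PP_ij for one group of multiplier bits."""
    if y_hi == y_mid == y_lo:
        return 0                       # "0X": the output is forced to 0
    neg = y_hi
    ot = y_mid ^ y_lo                  # assumed: 1 selects x_j, 0 selects x_{j-1}
    source = x_bit(j) if ot else x_bit(j - 1)
    return neg ^ source


# Scene of Fig. 11: (0, 0, 0) -> (1, 1, 0), i.e. "0X" -> "-1X"
print(pp_bit(0, 0, 0, 1), pp_bit(1, 1, 0, 1))   # 0 1
# Scene of Fig. 12: (0, 1, 1) -> (0, 0, 0), i.e. "+2X" -> "0X"
print(pp_bit(0, 1, 1, 1), pp_bit(0, 0, 0, 1))   # 1 0
```

Both pairs agree with the values of $PP_{11}$ described for the waveforms.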

B. POWER CONSUMPTION ANALYSIS
The proposed pre-encoded design is compared with the related works [16], [18], [19], [22]; we simulate generating one PP row with 8-bit multiplication and 16-bit multiplication for each design. The continuous three bits of multiplier Y (y 2i+1 , y 2i , y 2i−1 ) change from one pattern to each pattern. We simulate these pattern switches with the 100 MHz clock frequency and provide the average power consumption for each pattern. The power consumption of the proposed design is composed of the power consumption of pre-encoder, encoder, and decoders.
The power consumption of each related work is composed of the power consumption of its encoder and decoders. Note that we simulate one row of the ABM-M1 design [22] with the approximation factor m = 4 (m = 8) for the 8-bit (16-bit) multiplication. When m = 4 (m = 8), one row of the ABM-M1 design [22] is composed of one encoder, five (nine) exact decoders, and four (eight) PPG-2Ss.
Fig. 13 shows the average dynamic power consumption for each pattern switch. For instance, "000" in Fig. 13 represents the average power consumption for changing from each pattern to pattern "000". The dynamic power consumption of the proposed design is lower than that of the other designs for each pattern switch because the proposed design has a lower cost and is race-free. The dynamic power consumption of the NMBE design [19] is larger than that of the traditional design [16] in some cases (especially "000") because of the strong 0/weak 1 problem. Table 6 summarizes the average dynamic power consumption of the pattern switches for generating one PP row. The NMBE design [19] has the largest dynamic power consumption due to its poor voltage levels. The results of the ABM-M1 design [22] in Table 6 are simulated with approximation factor m = 4 (8-bit) and m = 8 (16-bit). For the 8-bit (16-bit) case, compared to the traditional design [16] and the neg-first design [18], the proposed design reduces the dynamic power consumption by 70.4% (75.4%) and 24.9% (32.1%), respectively; compared to the approximate design [22], the proposed design reduces the dynamic power consumption by 43.3% (48.6%) while providing exact products.
Fig. 14 shows the static power consumption for each pattern. Taking "000" as an example, the multiplier Y ($y_{2i+1}$, $y_{2i}$, $y_{2i-1}$) is fixed at 000 to obtain the static power consumption of the "000" pattern. As shown in Fig. 14, in patterns "000" and "111", the proposed design saves more static power since the proposed pre-encoder turns off the encoder and decoders in these patterns (the "0X" case). The static power consumption of the NMBE design [19] is larger than that of the traditional design [16] in some cases (especially "000") because of its poor voltage levels. Table 6 summarizes the average static power consumption of the eight patterns for one PP row. The NMBE design [19] again has the largest static power consumption. For the 8-bit (16-bit) case, compared to the traditional design [16] and the neg-first design [18], the proposed design reduces the static power consumption by 71.5% (75.8%) and 23.3% (27.4%), respectively; compared to the approximate design [22], the proposed design reduces the static power consumption by 40.6% (44.6%).

C. PERFORMANCE AND COST ANALYSES
The performance and cost analyses for one PP row are summarized in Table 6. The worst-case delay is the metric used to evaluate performance. For example, the worst-case delay of the neg-first design [18] occurs when signal two_i changes from 0 to 1 or from 1 to 0. The worst-case delay of the proposed design occurs when the encoded signals change from the "0X" case to a "non-0X" case. As shown in Table 6, the NMBE design [19] has the longest delay because of the weak drivability of its signals. For the 8-bit (16-bit) case, compared to the traditional design [16] and the neg-first design [18], the delay reductions of the proposed design are 20.9% (34.4%) and 20.5% (20.7%), respectively. As shown in Table 6, the worst-case delay of the ABM-M1 design [22] is determined by the exact decoder. Compared to the ABM-M1 design [22], the delay reduction of the proposed design is 18.7% (23.6%).

Table 6 also provides the transistor count (TC) to evaluate the cost of each design. For one n-bit PP row, unlike the other designs, both the neg-first design [18] and the proposed design need n + 2 decoders (one of which is the compact decoder). Taking the 8-bit case as an example, the neg-first design [18] needs 182 transistors for one PP row, which is composed of one 30T encoder, one 8T compact decoder, and nine 16T decoders. The proposed design needs 164 transistors for one PP row, which is composed of one 17T pre-encoder, one 18T encoder, one 8T compact decoder, nine 13T decoders, and four shared gating transistors (two groups). As shown in Table 6, the traditional design has the largest TC; the other designs effectively reduce the cost, and the proposed design has the lowest TC.
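The 8-bit transistor counts quoted above can be checked with simple arithmetic. The sketch below only re-adds the component counts stated in the text; the function names are hypothetical, and treating the number of gating transistors as a parameter is an assumption, since the text gives that figure only for the 8-bit row.

```python
def neg_first_row_tc(n):
    """TC of one n-bit PP row of the neg-first design, from the stated components."""
    encoder = 30                 # one 30T encoder
    compact_decoder = 8          # one 8T compact decoder
    decoders = (n + 1) * 16      # n + 2 decoders in total, one of them compact
    return encoder + compact_decoder + decoders


def proposed_row_tc(n, gating_transistors):
    """TC of one n-bit PP row of the proposed design, from the stated components.

    gating_transistors is left as a parameter (assumption): the text states
    four shared gating transistors only for the 8-bit case.
    """
    pre_encoder = 17             # one 17T pre-encoder
    encoder = 18                 # one 18T encoder
    compact_decoder = 8          # one 8T compact decoder
    decoders = (n + 1) * 13      # nine 13T decoders when n = 8
    return pre_encoder + encoder + compact_decoder + decoders + gating_transistors


if __name__ == "__main__":
    print(neg_first_row_tc(8))                        # 182
    print(proposed_row_tc(8, gating_transistors=4))   # 164
```

Both results match the per-row totals given in the text for the 8-bit case.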

D. COMPARISON OF RADIX-4 BOOTH MULTIPLIERS
To show the effectiveness of our design, we also compare the complete radix-4 Booth multipliers, as shown in Table 7. For a fair comparison, each design adopts the same adder array for PP accumulation, as shown in Fig. 1. Only the multiplier of the ABM-M1 design [22] is an approximate multiplier, with approximation factor m = 4 for 8-bit multiplication and m = 8 for 16-bit multiplication. Taking 16-bit multiplication with m = 8 as an example, the PPs with a significance less than 8 are generated by PPG-2Ss and the remaining PPs are generated by the exact decoders.
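The split between PPG-2Ss and exact decoders can be sketched as a count per PP row. This is an illustration under stated assumptions, not the ABM-M1 implementation: it assumes each row has n + 1 partial product bits (inferred from the decoder counts quoted for one row) and that bit j of row i has significance 2i + j, as in the usual radix-4 Booth PP layout. The function name `abm_m1_row_split` is hypothetical.

```python
def abm_m1_row_split(n, m, row=0):
    """Count approximate and exact PP generators in one row.

    Assumptions (not confirmed by the source): a row holds n + 1 PP bits,
    and bit j of row `row` has significance 2 * row + j.
    Bits with significance below m use a PPG-2S; the rest use exact decoders.
    """
    positions = n + 1
    approximate = sum(1 for j in range(positions) if 2 * row + j < m)
    return {"ppg_2s": approximate, "exact_decoders": positions - approximate}


if __name__ == "__main__":
    print(abm_m1_row_split(8, 4))    # {'ppg_2s': 4, 'exact_decoders': 5}
    print(abm_m1_row_split(16, 8))   # {'ppg_2s': 8, 'exact_decoders': 9}
```

Under these assumptions, the lowest row reproduces the composition stated earlier for the simulated row: four PPG-2Ss and five exact decoders for m = 4, and eight and nine for m = 8.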
Compared to the traditional design [16], the other designs reduce the dynamic power consumption and TC. However, the NMBE design [19] is the worst in terms of static power consumption and delay because its poor voltage levels increase the leakage currents and the delay. Because the proposed pre-encoded design is race-free, uses conditional power gating, and has a low cost, it is the best design in terms of dynamic power, static power, delay, power-delay product (PDP), and TC, as shown in Table 7. For the 8-bit (16-bit) case, compared to the traditional design [16], the proposed design reduces the dynamic power and static power by 43.9% (44.9%) and 61.6% (65.2%), respectively, with a 4.0% (5.0%) performance improvement. Compared to the neg-first design [18], the proposed design reduces the dynamic power and static power by 5.9% (6.9%) and 26.1% (29.4%), respectively. Compared to the approximate design [22], the proposed design provides exact results and reduces the dynamic power and static power by 26.5% (28.0%) and 38.6% (41.1%), respectively. The proposed design also has the lowest PDP. Fig. 13, Fig. 14, Table 6, and Table 7 demonstrate the advantages of the proposed design over the other designs.
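Since PDP is the product of power and delay, a relative reduction in each factor combines multiplicatively. The sketch below is a back-of-the-envelope illustration only: it assumes the PDP is formed from dynamic power and the reported delay improvement, which may differ from how Table 7 defines it, so the derived percentages are not results reported in Table 7. The function name is hypothetical.

```python
def pdp_reduction(power_reduction, delay_reduction):
    """Relative PDP reduction when power and delay shrink by the given fractions.

    PDP = power * delay, so the new PDP is (1 - p) * (1 - d) times the old one.
    """
    return 1.0 - (1.0 - power_reduction) * (1.0 - delay_reduction)


if __name__ == "__main__":
    # Inputs are the reductions versus the traditional design quoted above.
    print(f"8-bit:  {pdp_reduction(0.439, 0.040):.3f}")
    print(f"16-bit: {pdp_reduction(0.449, 0.050):.3f}")
```

The point of the formula is that a large power saving dominates the PDP even when the delay improvement is only a few percent.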

V. CONCLUSION
In this paper, a low-power radix-4 Booth pre-encoded mechanism has been proposed to reduce the unnecessary switching activities of the encoders and decoders in the "0X" case. The proposed pre-encoded mechanism detects the "0X" case early and applies gating techniques. When the "0X" case occurs, the encoder and decoders are turned off immediately by the proposed pre-encoder to reduce the power consumption and leakage currents. The simulations are performed with HSPICE in the TSMC 40 nm technology, and the results show that the proposed design provides significant reductions in power consumption, delay, and transistor count compared with the state-of-the-art encoded designs. For 16-bit multiplication, compared to the traditional radix-4 Booth multiplier, the proposed pre-encoded mechanism achieves a 35% reduction in transistor count and a 5% improvement in performance, and reduces the dynamic and static power consumption by 45% and 65%, respectively. Compared to the neg-first design and the NMBE design, the proposed design has better performance, a lower transistor count, and lower power consumption. Even compared to the approximate design, the proposed design provides exact products and achieves a 28% dynamic power reduction and a 41% static power reduction.

His research interests include computer and microprocessor architecture, digital integrated circuit design, low-power memory design, and approximate circuit design.

He is currently a Testing Engineer with Winbond Electronics Corporation. His research interests include digital integrated circuit design, low-power integrated circuit design, and Booth multiplier.
CHUN-HUO HSIAO received the B.S. degree in computer science from the National Taichung University of Education (NTCU), Taichung City, Taiwan, in 2018. He is currently pursuing the M.S. degree with the Department of Computer Science and Engineering, National Chung Hsing University (NCHU).
His research interests include digital integrated circuit design, Booth multiplier, and approximate circuit design.

VOLUME 8, 2020