Designs of portable consumer electronic devices such as mobile phones, PDAs, video games and other embedded systems increasingly demand low power consumption to maximize battery life, reduce weight and improve reliability. These power-sensitive devices are usually equipped with microprocessors as the processing elements and memories as the storage units. With current CMOS technology, a large portion of power is consumed in the form of dynamic power^{1}, which in turn is determined by the bit switching and the switched load capacitance. Since the microprocessor fetches instructions over the memory bus every clock cycle and bus lines to memory are often much longer than buses within the processor, the power consumed by the bus due to instruction fetch is significant.

In this paper, we target a system consisting of a processor with Harvard architecture, where instruction memory and data memory are separated and each memory has different buses for address and data transmission. We want to reduce switching on the instruction data bus.

Research on instruction data bus switching reduction has generally concentrated on code compression. Compressed code requires fewer memory accesses, thus reducing the bus activity. However, compression requires complicated compression/decompression units, which reside in the critical path and can considerably affect the overall system performance. In this paper, we investigate a different approach: bus encoding.

Most existing bus encoding schemes are effective for address or data memory buses, and mainly utilize correlations of the transferred data. For example, T0 [2] and Gray encodings [3] use the temporal correlation of data on address buses, while bus-invert [4] exploits the spatial transition correlation among the data bits.

We investigated the data on the instruction data buses and found that the bit switching behavior of the instruction data bus is different from the other types of buses. Fig. 1 shows an experimental result of the bit transition probability for three different memory buses: instruction memory address bus *(imab)*, instruction memory data bus *(imdb)* and data memory data bus *(dmdb)* (all over the 32-bit bus space). The SimpleScalar PISA [5] and the application of audio encoding [6] have been used for our investigation.

As can be seen from Fig. 1, switching activity on the instruction address bus concentrates on the low-order bits, largely due to the sequential access of instruction memory. For the data memory data bus, the switching activity spreads over all bus bits with almost 50% switching probability. For the instruction data bus, however, the switching probability is not evenly distributed, and some bits show very low switching activity. Therefore, most existing encodings for address buses and data memory data buses are not suited to instruction data buses.

Since there are some bits on the instruction buses with high switching frequency, it is possible to use segmental bus-invert encoding: a set of bus segments is selected, and the traditional bus-invert (**BI**) encoding is applied to each segment so that the bus switching activity can be reduced.

We further investigated the bit correlation of the instruction data and found that there is little correlation, as illustrated by our experimental results in Fig. 2, which gives the percentage of bit pairs of the instruction data buses (and the address buses for comparison) in different correlation coefficient ranges. The larger the coefficient, the more strongly the two bits of a pair are correlated. The figure shows that over 80% of bit pairs on the instruction data bus are hardly correlated, with a correlation coefficient below 0.3. In comparison, the address bus data are highly correlated, with about 60% of the bus bit pairs having a correlation coefficient over 0.3. Therefore, approaches based on the correlation of bit pairs are not effective for instruction data bus switching reduction.

In this paper, we develop a segmental bus-invert coding (**SBI**) and a fast segment searching algorithm to effectively reduce instruction data bus switching with as little hardware overhead as possible. Our main contributions are:

an analytical model of bus switching reduction for bus segments with the bus-invert encoding, and

a fast segment search algorithm using the instruction-field based search space partition and the Hamming distance (**HD**) of bus segments.

The rest of the paper is organized as follows. Section II reviews some existing bus coding schemes for low power system design. Section III analyzes the effect of bus-invert encoding on switching reduction and area cost, based on which we propose the segmental bus encoding design in Section IV. Section V presents the experimental setup, followed by the simulation results and related discussions. Finally, Section VI concludes the paper.

Bus encoding techniques for low power consumption have been studied in the last couple of decades.

The Gray encoding [3] was proposed for the instruction address bus where binary addresses are converted into Gray code for bus transmission. When instructions are sequentially executed, the address bus has only one bit flip per instruction.
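The one-bit-flip property can be checked with a short sketch. The helper names `to_gray` and `hamming` and the address range are illustrative assumptions, not taken from [3]; the conversion used is the standard reflected binary Gray code.

```python
def to_gray(n: int) -> int:
    """Convert a non-negative integer to its reflected binary Gray code."""
    return n ^ (n >> 1)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


# Hypothetical run of sequential instruction addresses (illustrative only).
addresses = range(8, 16)

# Bus bit flips when stepping from each address to the next one.
binary_flips = sum(hamming(a, a + 1) for a in addresses)
gray_flips = sum(hamming(to_gray(a), to_gray(a + 1)) for a in addresses)

print("binary-coded flips:", binary_flips)
print("Gray-coded flips:  ", gray_flips)
```

With Gray coding every sequential step costs exactly one flip, whereas plain binary pays for every carry chain.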

Another approach [2] for address bus encoding is the asymptotic zero-transition activity encoding, known as **T0**. For the instructions of a program to be executed sequentially without any branches, T0 can ideally achieve zero bus switching.

In [7], Henkel and Lekatsas presented an adaptive address bus encoding (*A*^{2} *BC*) for low power address buses in the deep sub-micron design, where the coupling effects of bus lines were considered.

Stan and Burleson [4] proposed the bus-invert (BI) coding for bus switching activity reduction. This method uses either the original or the inverted value to encode the data bus. If the current value to be sent over the bus would cause more than half of the bus bits to switch, its inverted value is transferred instead. An extra invert control line is required to indicate whether the data is inverted or not. This approach achieves a good switching reduction if the transferred data are random and evenly distributed over the whole data range. For wider data buses whose data are not evenly and randomly distributed, the same authors proposed a partitioned bus-invert coding, which partitions the wider bus into several narrower sub-buses and applies the BI encoding to each sub-bus. This partitioning approach improves the switching reduction at the cost of extra invert control lines.
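A minimal software sketch of the BI rule described above is given below. It assumes the bus lines start at all zeros, and the function names are hypothetical; a real implementation is combinational logic rather than Python.

```python
def bus_invert_encode(words, width):
    """Encode a sequence of `width`-bit words with bus-invert coding.

    Returns a list of (bus_value, invert_bit) pairs.
    Assumption: the bus lines initially hold all zeros.
    """
    mask = (1 << width) - 1
    prev_bus = 0
    encoded = []
    for word in words:
        word &= mask
        # Hamming distance between the new word and the value on the bus.
        flips = bin(word ^ prev_bus).count("1")
        # Invert only if more than half of the lines would switch.
        invert = 1 if flips > width / 2 else 0
        bus = (~word & mask) if invert else word
        encoded.append((bus, invert))
        prev_bus = bus
    return encoded


def bus_invert_decode(encoded, width):
    """Recover the original words from (bus_value, invert_bit) pairs."""
    mask = (1 << width) - 1
    return [(~bus & mask) if invert else bus for bus, invert in encoded]


# Illustrative 8-bit trace (made-up data).
trace = [0x00, 0xFF, 0x0F, 0xF0]
coded = bus_invert_encode(trace, 8)
assert bus_invert_decode(coded, 8) == trace
```

The decoder needs only the invert line and a row of XOR gates, which is why the scheme is attractive when the decoding logic sits on a critical path.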

The partitioned bus-invert approach has been modified and proposed as partial bus-invert (**PBI**) coding [8] for the address bus. The approach selects and encodes a subgroup of bus lines that are correlated and frequently switched. In the same paper, they extended this approach to multi-way partial bus-invert (**MPBI**), where highly correlated bus lines were clustered into multiple sub-buses and each of them was encoded independently.

A dictionary-based approach to reduce the data bus power consumption has been introduced in [9]. This approach exploits frequent data patterns detected from the application trace and uses two synchronized dictionaries on both sides of the bus. The dictionaries cache recently transferred data so that the same data that can be accessed in the local dictionary will not be transferred on the bus to reduce bus switching activity.

For instruction bus power reduction, most previous researchers have focused on code compression. The pioneering work by Wolfe and Chanin [10] mainly aimed at program memory reduction. With their approach, the total bus switching activity can also be reduced because compressed code is transferred over the bus. A decompression unit is required to restore each instruction before execution.

The scheme in [11] also compresses instructions and packs several compressed instructions into one bus word to reduce the total number of memory accesses, and hence the total number of bus switches. This code compression scheme was recently extended in [12] to further reduce switching between consecutive instruction words.

Petrov and Orailoglu [13] introduced an instruction bus encoding, where the major loops are encoded and stored in the memory so that when they are fetched, the switching activity on the bus is minimized. This approach can achieve good switching reduction but requires a complex code transformation and control in the decoding logic.

The bus encoding problem was generalized by Ramprasad *et al.* in [14]. In that work, the authors present an encoding framework in which an encoding is abstracted as a two-step process: decorrelating and encoding. Data to be transferred over the bus is first decorrelated for high entropy, which then leads to short codes and minimal bus bit switching.

In this paper, we propose a bus encoding for instruction data buses. Our approach is similar to the PBI/MPBI approach in that we both apply the bus invert (BI) encoding to a set of sub-buses. But there exists a major difference: their approach to finding bus sub-sets for BI application is based on data bit-pair correlations. We found that there is very little bit-pair correlation on instruction data; therefore, their approach is not effective for instruction data bus switching reduction. We propose a segment search algorithm based on Hamming distance to achieve a better result, as will be demonstrated in our results in Section V.

## III. Bus Invert Encoding

The effectiveness of our approach is closely related to the segments selected for the bus encoding. Therefore, we first study the effect of BI encoding on switching reduction and the hardware overhead, which leads to a search criterion for our design space exploration.

### A. Switching Reduction Rate With BI Encoding

For a sequence of *w*-bit code words, assume their Hamming distances are {*h*_{1}, *h*_{2},…,*h*_{n}},

$$h_i =\sum^w_{j=1} s_{(i-1)j}\oplus s_{ij},$$

where *n* is the length of the code sequence, *s*_{ij} the *j*th bit of the word *i* (denoted by *s*_{i}) in the sequence. The number of bit switches (*SA*) if the sequence is transferred on the bus, is

$$SA = \sum^n_{i=1}h_i.\eqno{\hbox{(1)}}$$

When *BI* is applied to this sequence, some words will be bit-inverted, if their Hamming distances are larger than *w*/2, the half of word width. The associated Hamming distances will be changed accordingly. For a word, *s*_{i}, if *s*_{i−1} has been inverted, the new Hamming distance is^{2}

$$\sum^w_{j=1}\overline{s_{(i-1)j}}\oplus s_{ij} = w - h_i.$$

Therefore the Hamming distance of word *s*_{i} can be generalized as

$$H_i = c_{i-1}(w - h_i) + (1-c_{i-1})h_i,\eqno{\hbox{(2)}}$$

where *c*_{i−1} is the invert control of the previous transfer; when it equals 1, the previous transferred value has been bit inverted.

The total bit switching saving (*SA*_{save}) is

$$SA_{save} = \sum^n_{i=1}((2h_i -w)c_i-c_i).\eqno{\hbox{(3)}}$$

The switching reduction rate (*r*) is

$$r = SA_{save}/SA = {\sum^n_{i=1}((2h_i -w)c_i-c_i) \over \sum^n_{i=1}h_i},\eqno{\hbox{(4)}}$$

where *c*_{i} = 1 when *h*_{i} > *w*/2; otherwise, *c*_{i} = 0.

As can be seen from Eq. (4), when the Hamming distance of each word in the sequence is close to the maximum value, *w*, the reduction rate is close to 100%. If the average HD, *E*(*HD*), is around *w*/2, a larger deviation of HD, *Dev*(*HD*), can achieve a better reduction. If the average HD is small and *E*(*HD*) + *Dev*(*HD*) ≤ *w*/2, the reduction rate becomes very small. Therefore, we use

$$\delta = E(HD) + Dev(HD)\eqno{\hbox{(5)}}$$

as a criterion parameter in searching instruction word segments for *BI* encoding. For a segment to be selected for *BI* encoding, we want δ > *w*/2 and δ to be as large as possible.
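The quantities in Eqs. (1), (3), (4) and (5) can be evaluated directly on a trace, as in the following sketch. Two assumptions are made that the text does not fix: the first word of the trace plays the role of *s*_{0}, and *Dev*(*HD*) is taken to be the population standard deviation. The function names and the 4-bit trace are illustrative.

```python
from statistics import mean, pstdev


def hamming_distances(words, width):
    """h_i between consecutive words; words[0] is treated as s_0."""
    mask = (1 << width) - 1
    return [bin((a ^ b) & mask).count("1") for a, b in zip(words, words[1:])]


def bi_metrics(words, width):
    """Evaluate Eqs. (1), (3), (4) and (5) as written in the text."""
    h = hamming_distances(words, width)
    sa = sum(h)                                        # Eq. (1)
    c = [1 if hi > width / 2 else 0 for hi in h]       # invert decisions
    sa_save = sum((2 * hi - width) * ci - ci
                  for hi, ci in zip(h, c))             # Eq. (3)
    r = sa_save / sa if sa else 0.0                    # Eq. (4)
    # Assumption: Dev(HD) is the population standard deviation.
    delta = mean(h) + pstdev(h)                        # Eq. (5)
    return {"SA": sa, "SA_save": sa_save, "r": r, "delta": delta}


# Made-up 4-bit trace, for illustration only.
m = bi_metrics([0b0000, 0b1111, 0b0001, 0b1110, 0b1111], width=4)
print(m)
```

A segment whose `delta` exceeds `width / 2` is a candidate for BI encoding under the criterion above.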

### B. Bus-Invert Control Logic

For each segment to be applied with bus-invert encoding, there needs to be some control logic for bus-invert operation as illustrated in Fig. 3, where *w* bit lines are applied with the bus-invert encoding.

The logic checks whether the Hamming distance of current data value is larger than half of the segment size and determines the actual bus value to be transferred. The logic circuit contains several computing components: a *w*-bit *inverter (INV)* to invert the input data value; a *w*-bit *register*, made of *D flip-flops*, to store previously transferred data; a *w*-bit *logic xor* (⊕) to find bit transitions; an *adder* (+) to calculate Hamming distance of data transition on the *w* bit segment; a *w*-bit *comparator* ( > ) to compare the Hamming distance with half of the segment size and a *w*-bit *multiplexor* (Mux) to choose between the inverted and un-inverted data values.

The area of each component, except for the adder that has *w* log(*w*) area complexity, is linearly proportional to the number of bits of the input data, *w*. Since the area of the adder increases dramatically when its input bit size becomes large, we want the segment size to be small. This will be used as a guide in our segment search algorithm.

## IV. Approach for Segmental Bus-Invert Encoding

A full-space search of multiple segments for optimal switching reduction is a time-consuming process, since there is a large number of possibilities. Just for a single segment in an *n*-bit instruction space, the number of solutions is ∑_{i = 2}^{n} *C*_{n}^{i} (at least 2 bits are required for BI encoding). These solutions form a huge group of subsets if *n* is large. To speed up the search process, we propose to partition the search space into several bit divisions and perform the search on each of the divisions.
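As a quick check of the size of this search space (a worked identity, taking *n* = 32 purely as an illustrative bus width), the binomial theorem gives a closed form for the single-segment count:

$$\sum_{i=2}^{n} C_n^i = 2^n - C_n^0 - C_n^1 = 2^n - n - 1,\qquad \hbox{e.g., } n = 32:\ 2^{32} - 33 = 4\,294\,967\,263.$$

This counts candidates for one segment only; selecting several segments at once enlarges the space further.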

#### Search Space Partition

We investigated the percentage of transferred segment words whose Hamming distance is greater than half of the segment size for three different cases: 1) no partition, 2) even partition, and 3) partition based on the instruction fields of PISA [5]. Fig. 4 shows, for one application, the percentage of transfers in which more than half of the segment's bus bits switch. For the case without partition (hence only one segment), just 5% of bus transfers have more than half of the bus bits switching. In the case of the even partition, where the bus is partitioned into four segments of equal size, the percentage is below 20%. With the instruction field-based partition, the segment size varies, but all four segments have a higher percentage than in the other two cases, which allows for more bit switching reduction if BI encoding is applied. Thus we base our bit space partition on the instruction fields.

The execution of an application can be represented with a basic block diagram. Instructions within a basic block are executed sequentially. Often, the switching activity is mainly determined by the frequently executed loop blocks (called **dominant blocks** in this paper).

To find a partition, we use instruction types in the dominant blocks. Fields that are sensitive to the input are grouped as one division and the other fields are each treated as a separate division.

#### BI Segment Search

For each bit space division, we investigate all bus segments of different sizes and locations. We use the leftmost bit of the segment to mark the segment location. For each location, we start from the smallest segment, a two-bit window, and then extend the window rightward by 1 bit at a time to form a new segment. We compare the new segment with the one currently deemed the best. If the following condition is satisfied,

$$\delta_{new} - \delta_{best} \geq (w_{new} - w_{best})/2,\eqno{\hbox{(6)}}$$

that is, if the extra *w*_{new} − *w*_{best} bits of the new segment statistically increase the switching reduction, the new segment is recorded as the best segment; otherwise, it is discarded. After all possible window sizes are explored for a location, we continue with another segment location, starting from the two-bit window again. This time, it is possible that δ_{new} > δ_{best} but *w*_{new} < *w*_{best}. In this case, the new segment, with smaller size *w* but larger δ, is always recorded as the best segment. This process is repeated until all possible cases are finished.
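The search within one division can be sketched as follows. This is an illustrative reading of the procedure, not the exact implementation: bit 0 is taken as the leftmost bit of the word, *Dev*(*HD*) is assumed to be the population standard deviation, and a smaller candidate whose δ is not strictly larger than the current best is discarded (a case the text leaves open). All function names are hypothetical.

```python
from statistics import mean, pstdev


def segment_delta(words, width, left, size):
    """delta = E(HD) + Dev(HD) for bits [left, left + size) of each word.

    Assumption: bit 0 is the leftmost (most significant) bit.
    """
    shift = width - left - size
    mask = (1 << size) - 1
    seg = [(w >> shift) & mask for w in words]
    h = [bin(a ^ b).count("1") for a, b in zip(seg, seg[1:])]
    return mean(h) + pstdev(h)


def search_division(words, width, div_left, div_size):
    """Return (left, size, delta) of the best segment in one division."""
    best = None
    div_end = div_left + div_size
    for left in range(div_left, div_end - 1):
        # Grow the window rightward from two bits to the division edge.
        for size in range(2, div_end - left + 1):
            d = segment_delta(words, width, left, size)
            if best is None:
                best = (left, size, d)
                continue
            _, w_best, d_best = best
            if size >= w_best:
                better = d - d_best >= (size - w_best) / 2   # Eq. (6)
            else:
                # Smaller window: keep it only if delta is strictly larger.
                better = d > d_best
            if better:
                best = (left, size, d)
    return best


# Made-up 4-bit trace in which only the two middle bits toggle.
trace = [0b0110, 0b0000] * 4
print(search_division(trace, width=4, div_left=0, div_size=4))
```

A caller would still apply the δ > *w*/2 test from Section III before committing a segment to BI encoding.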

#### BI Segment Merge

Some locally generated *BI segments* can be merged to save bus-invert control lines with the same or improved switching reduction.

Fig. 5 shows an example of merging two code segment sequences. With BI encoding, no bit switching is saved for segment one, while five bit switchings can be saved for segment two. If the two segments are merged, eight bit switchings can be saved. If we apply BI to a subset of the bus bits in the newly merged segment, as illustrated in the shaded area, one further bit switching can be saved. Therefore, for each merge attempt, we re-run the segment search.

Since a large segment may result in large invert control logic, we start the merge operation from small segments, so that after merging we have as few segments as possible while keeping each segment inexpensive.

## V. Experimental Results

To examine the efficiency of our segmental bus-invert coding, we applied this approach to a set of applications from MiBench [6] and compared our approach with the most related encodings: traditional Bus-Invert, Partitioned Bus-Invert, Partial Bus-Invert, and Multi-way Partial Bus Invert. We used ASIPMeister [15] to generate the processor VHDL model as the experimental platform for the applications. The SimpleScalar PISA [5] has been chosen as the target processor instruction set architecture. The instruction format of this architecture can be extended to 64 bits, but 40 bits are actually used in normal designs. Therefore, our simulations adopted the 40-bit instruction format.

The experiment starts with a given application written in C, which is compiled by the SimpleScalar tool and simulated on the processor VHDL model generated by ASIPMeister. The instruction trace over the instruction data bus is extracted during the simulation. This trace is used to determine the bus segments for BI encoding based on our encoding design approach proposed in Section IV. The related BI encoding/decoding and control logic is implemented in the processor VHDL model, which is then synthesized by Synopsys Design Compiler for area and power evaluation based on the Tower 0.18-micron standard cells [16].

#### Bus Switching Reduction and Overheads

Table I gives the simulation results obtained from different bus encoding approaches for each application listed in Column 1. Columns 2 and 3 provide the total number of bit switches and the average number of switching bits per instruction for each application without any bus encoding. The percentage of switching reduced with the traditional bus-invert encoding [4] (BI) is presented in Column 4. The switching reduction data and related overheads from the Partial Bus-Invert (PBI) encoding and the Multi-way Partial Bus-Invert (MPBI) encoding are shown in Columns 5–8 and Columns 9–13, respectively, where %Red stands for the switching reduction rate, *I* for the number of invert control bus lines, *S* for the total number of bus lines under bus-invert encoding, and *Area* and *Power* for the area (in μ*m*^{2}) and power (in *mW*) overhead of the encoding/decoding logic. Columns 14–18 give the simulation results from our encoding approach (SBI).

From the table, we can see that the traditional BI encoding achieves very little switching reduction (on average, only 4.1%). The other encodings are also of limited effectiveness: the average reduction rates of PBI and MPBI are 11.2% and 17.6%, respectively, and for some applications the reduction rate is as small as 7.1%. With our segmental bus-invert encoding approach, however, we can achieve from 22.3% up to 42.1% switching reduction. On average, 30.3% of bus switching can be eliminated with SBI.

Switching reduction is achieved at the cost of extra control lines and the associated control logic for each BI segment, which incurs area and power overhead. As BI has extremely low switching reduction efficiency and is therefore not suitable for the instruction data bus, we only compare PBI and MPBI with our approach for encoding/decoding logic overhead. As can be seen from Table I, SBI achieves considerable switching reduction at a lower cost than MPBI. Of the three, PBI is the cheapest to implement, but it is also the least effective.

#### Power Savings Estimation

We use the following formula to estimate the net power savings of the SBI, PBI and MPBI encodings:

$$P_{save} = 0.5 \ast C_{bus} \ast V_{dd}^2 \ast f \ast (switch./insn.) \ast Red\hbox{\%} - P_{logic},$$

where *C*_{bus} is the bus load capacitance, *V*_{dd} the supply voltage, *f* the frequency, (*switch./insn.*) the switched bus bits per instruction, *Red*% the switching reduction rate, and *P*_{logic} the encoding/decoding logic power consumption estimated with the Design Compiler.

The bus capacitance varies with the system architecture and low-level implementation. The load capacitance of an off-chip bus is normally multiple orders of magnitude higher than that of standard cells. Based on the 0.3 pF standard cell capacitance, the supply voltage (1.8 V) and clock frequency (100 MHz) used in Synopsys DesignPower, as well as the average (*switch./insn.*) and %*Red* from our experiment (last row in Table I), we calculate the power savings for bus capacitances ranging from 3 pF to 300 pF. The results are plotted in Fig. 6. It can be seen from the plots that SBI brings higher savings than the other two encodings. As the bus capacitance increases, the power saving of each encoding approaches its switching reduction rate, as depicted by the horizontal lines in the figure. On the other hand, when the bus capacitance is decreased to a certain value (e.g., 3 pF, or 10 times the cell capacitance, for PBI), the power overhead of the encoding/decoding logic cancels out the power saving from the bus switching reduction.
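The formula and the break-even behaviour can be explored numerically with the sketch below. The supply voltage, frequency and the 30.3% SBI reduction rate come from the text; the switches-per-instruction figure and the logic power are placeholder values chosen only for illustration (the actual values are in Table I), and the function names are hypothetical.

```python
def net_power_saving(c_bus, v_dd, freq, switches_per_insn, reduction, p_logic):
    """Net saving in watts: bus power saved minus encoder/decoder power."""
    return 0.5 * c_bus * v_dd ** 2 * freq * switches_per_insn * reduction - p_logic


def break_even_capacitance(v_dd, freq, switches_per_insn, reduction, p_logic):
    """Bus capacitance (farads) at which the net saving is exactly zero."""
    return p_logic / (0.5 * v_dd ** 2 * freq * switches_per_insn * reduction)


V_DD = 1.8                 # volts, from the text
FREQ = 100e6               # hertz, from the text
REDUCTION = 0.303          # average SBI reduction rate, from the text
SWITCHES_PER_INSN = 10.0   # placeholder, NOT a value from Table I
P_LOGIC = 1e-3             # watts, placeholder, NOT a value from Table I

for c_pf in (3, 30, 300):
    saving = net_power_saving(c_pf * 1e-12, V_DD, FREQ,
                              SWITCHES_PER_INSN, REDUCTION, P_LOGIC)
    print(f"C_bus = {c_pf:3d} pF -> net saving = {saving * 1e3:.3f} mW")

c0 = break_even_capacitance(V_DD, FREQ, SWITCHES_PER_INSN, REDUCTION, P_LOGIC)
print(f"break-even C_bus = {c0 * 1e12:.2f} pF")
```

Because the saved bus power scales linearly with *C*_{bus} while *P*_{logic} is fixed, the fixed overhead matters less and less as the bus load grows.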

In this paper, we have discussed the switching reduction of the instruction memory data bus for low-power processor-based systems with the Harvard architecture.

We found that the data on the instruction data bus have little temporal correlation, and that the randomness of the data can hardly be exploited by the existing bus encodings because the bit switching is unevenly distributed. We proposed a segmental bus-invert encoding that retains the simplicity of bus-invert encoding while effectively reducing bus switching activity.

Our encoding idea is similar to the existing multi-way partial bus-invert (MPBI) approach, but we use a different search algorithm for bus segments. By applying the bus-invert encoding to each of the selected segments, we achieve an average 30% switching reduction on a set of benchmarks, in contrast to the 17.6% obtained by MPBI. In addition, compared to the traditional bus-invert encoding, our approach comes with reduced area for the encoding/decoding logic, with an average of two extra control lines. In contrast, MPBI requires three additional lines.