Today's Systems-on-Chip (SoCs) are increasingly complex and require many computational resources, which implies a large volume of data to be stored or transmitted. To transfer these data from memory to a processor or from one processor to another, on-chip interconnect buses or networks have to be used. In current SoCs, the interconnect can represent up to 50% of the total power consumption [1]. Moreover, the scaling of transistor and wire dimensions has an increasingly strong impact on the propagation time and the energy due to wires [2]. Therefore, the estimation and optimization of power and delay due to interconnections have become a major issue in SoC design. Many works have addressed interconnect power optimization at different abstraction levels [3], [4], [5], [6], [7], [8], [9]. Unfortunately, most of the proposed techniques are not efficient at reducing the power consumption due to interconnects. The first reason is that the interconnect lengths used in their experiments do not correspond to those found in today's SoCs; the efficiency of most techniques designed for long interconnects therefore no longer holds [10]. The second reason is the hardware complexity of the codecs of the published techniques, which consume too much power compared to the power saved on the buses.

Finally, previous works showed that, when crosstalk effects are taken into account, their impact on delay is quite different from their impact on power.

In this paper, we propose a data-coding technique called “Convolutional Encoder for Crosstalk Reduction” (CECR), which improves power consumption, propagation time and noise for on-chip buses (a theoretical approach was introduced in [11]). This paper is organized as follows. Section II presents the previous works that led us to develop the proposed optimization technique. The CECR technique and its hardware implementation are described in Section III. In Section IV, experimental results in terms of power consumption, propagation time and noise variation using the technique are discussed. The last section concludes this paper.

Crosstalk is the effect of the coupling capacitance between a victim wire and its neighboring wires and depends on their transitions and states.

First, results presented in [12] show that the classification of transitions (according to the crosstalk capacitance (*C*_{c}) seen by the victim wire) differs depending on whether power consumption or delay is considered. Table I, extracted from [12], shows that the transition classification, from a power consumption point of view, starts with rising transitions followed by falling transitions. Therefore, a first key point for power optimization is to encode data so that falling transitions on the bus occur with the lowest crosstalk capacitance and thus consume as little energy as possible.

Secondly, when the activity profile of the data stimuli files is analyzed, it can be noticed that applying optimization techniques to the least significant bits (LSBs) has a greater impact on power consumption reduction. This is because the LSBs have the highest activity, as demonstrated in [13]. For instance, the Partial Bus Invert technique (i.e., Bus Invert [6] applied to the LSBs only) gives a better power consumption reduction than the classical Bus Invert.

Finally, to be fair, the energy saving should take into account the whole system (i.e., the codecs and the wires).

Based on this analysis, the next section introduces the concept of the CECR technique for the optimization of delay, power consumption and noise for on-chip buses.

SECTION III

## Proposed Optimization Technique

The problem can be stated as follows: how to encode a sequence of words *E*_{k} of *n* bits into a sequence of words *S*_{k} of *m* bits (i.e., the *m* encoded wires) so that two consecutive values *S*_{k} and *S*_{k+1} minimize both the Hamming distance *H*(*S*_{k}, *S*_{k+1}) between them (power dissipation minimization) and the maximum delay *D* induced by crosstalk. In this paper, the maximum delay is fixed to 2 (i.e., the simultaneous switching of two neighbouring wires is forbidden). In other words, *S*_{k} ⊕ *S*_{k+1} never contains two consecutive 1s (delay condition).
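The two constraints above can be checked mechanically. The following Python sketch is only an illustration: the function names are hypothetical, and words are assumed to be represented as tuples of 0/1 values, one entry per wire. It tests the delay condition, computes the Hamming distance, and enumerates the toggle vectors *S*_{k} ⊕ *S*_{k+1} that satisfy the delay condition for a given *m*.

```python
from itertools import product


def hamming_distance(prev_word, next_word):
    """Number of wires that change value between two consecutive words."""
    return sum(a ^ b for a, b in zip(prev_word, next_word))


def satisfies_delay_condition(prev_word, next_word):
    """True if no two adjacent wires switch between the two words."""
    toggles = [a ^ b for a, b in zip(prev_word, next_word)]
    return all(not (x and y) for x, y in zip(toggles, toggles[1:]))


def allowed_toggle_patterns(m):
    """Toggle vectors of length m that contain no two consecutive 1s."""
    zero = (0,) * m
    return [t for t in product((0, 1), repeat=m)
            if satisfies_delay_condition(zero, t)]


if __name__ == "__main__":
    # Print each allowed pattern for m = 3 with its Hamming weight.
    for pattern in allowed_toggle_patterns(3):
        print(pattern, "weight", sum(pattern))
```

For *m* = 3 the enumeration yields five patterns, four of which have a Hamming weight of at most one.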

In some sense, this problem appears to be the dual of the problem of error-correcting code construction. In the latter, the code construction tries to maximize the Hamming distance between codewords (or sequences of codewords). Since memory is involved (*S*_{k} and *S*_{k+1} are related), the problem can be solved by looking for a rate *r* = *n*/*m* convolutional encoder that minimizes the Hamming distance between two consecutive symbols and satisfies the crosstalk condition. This general problem can be optimally solved for *n* = 2 and *m* = 3. Indeed, the input symbol *E*_{k} can take 4 different values. Since *S*_{k} ⊕ *S*_{k+1} must not contain two consecutive 1s, it takes its values in the set {000, 100, 010, 001, 101}, which contains 5 symbols. Symbol 101 is discarded because it corresponds to a Hamming distance of 2 between *S*_{k} and *S*_{k+1}. The four remaining symbols are then enough to code the 4 input values. The simplest encoder/decoder using this coding/decoding scheme is given by the following equations (the aim is to transform the two input bits using a demux structure):
$$\begin{aligned}C_k(1) &= \bar{E}_k(1)\,E_k(0)\\ C_k(2) &= E_k(1)\,\bar{E}_k(0)\\ C_k(3) &= E_k(1)\,E_k(0)\end{aligned}\tag{1}$$

A transition is generated on each wire where *C*_{k}(*m*) equals 1:

$$S_k(m) = S_{k-1}(m)\oplus C_k(m)\quad {\rm for}\ m = 1,2,3\tag{2}$$

The values *S*_{k}(*m*), *m* = 1, 2, 3, are sent on the three wires.

At the receiver side, the decoding process is symmetrical:
$$D_k(m) = S_{k-1}(m)\oplus S_k(m)\quad {\rm for}\ m = 1,2,3\tag{3}$$

Note that, by construction, *D*_{k}(*m*) = *C*_{k}(*m*). Then, symbols *E*_{k}(0) and *E*_{k}(1) are computed as:

$$\begin{aligned}E_k(0) &= D_k(1)\ {\rm OR}\ D_k(3)\\ E_k(1) &= D_k(2)\ {\rm OR}\ D_k(3)\end{aligned}\tag{4}$$

The architecture of the entire corresponding coding/decoding process is illustrated in Fig. 1.
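As a behavioural illustration of Eqs. (1) to (4), a minimal Python sketch of one encoder/decoder pair is given below. It is a software model only, not the hardware implementation of Fig. 1. The function names are hypothetical, and the sketch assumes that the transmitter and the receiver start from the same all-zero wire state, a point the equations leave open.

```python
def encode_step(e1, e0, prev_s):
    """One encoding step for input bits E_k(1)=e1, E_k(0)=e0.

    prev_s is the previous wire state (S_{k-1}(1), S_{k-1}(2), S_{k-1}(3)).
    """
    # Eq. (1): C_k is all-zero or one-hot (demux of the two input bits).
    c = ((1 - e1) & e0, e1 & (1 - e0), e1 & e0)
    # Eq. (2): toggle the wire selected by C_k.
    return tuple(s ^ ci for s, ci in zip(prev_s, c))


def decode_step(prev_s, cur_s):
    """Recover (E_k(1), E_k(0)) from two consecutive wire states."""
    d = tuple(a ^ b for a, b in zip(prev_s, cur_s))  # Eq. (3)
    e0 = d[0] | d[2]                                 # Eq. (4)
    e1 = d[1] | d[2]
    return e1, e0


def transmit(symbols, initial_state=(0, 0, 0)):
    """Encode then decode a list of (e1, e0) pairs.

    Returns the successive wire states and the decoded symbols.
    The shared all-zero initial state is an assumption of this sketch.
    """
    tx_state = rx_state = initial_state
    wire_states, decoded = [], []
    for e1, e0 in symbols:
        tx_state = encode_step(e1, e0, tx_state)
        wire_states.append(tx_state)
        decoded.append(decode_step(rx_state, tx_state))
        rx_state = tx_state
    return wire_states, decoded


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 1), (0, 1)]
    states, recovered = transmit(data)
    print(states)
    print(recovered == data)
```

In this model, at most one of the three wires changes value per cycle, whatever the input sequence.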

SECTION IV

## Experimental Results

For simulations, the buses have been modeled with a distributed *RC*Π_{3} model including crosstalk capacitances, as defined in [12]. Experimental results have been obtained using a SPICE simulator (ELDO v5.7) for different technologies (130 nm, 90 nm and 65 nm), with a fully random data file and an image stimuli file (it is important to note that our results follow the same behavior for other data files such as music or speech). As each technology has a specific number of metal layers, SPICE simulations have been performed on all metal layers, from the lowest ones (mostly reserved for short wires) to the highest ones (mostly reserved for buses, which are our topic of interest). As propagation time becomes critical on interconnects, techniques have been proposed in [14], [15] to accelerate data propagation by inserting buffers on wires. Thus, in our experiments, both buffered and non-buffered interconnects have been simulated. The power consumption results include the extra power consumed by the codecs.

### A. Effects on Delay

As stated in the previous section, to minimize the switching activity, the maximum Hamming distance between two consecutive values is one; in other words, only one of the three encoded wires can switch at a time. Thus, the worst transition pattern with respect to delay belongs to the *C*_{s} + 2*C*_{c} class, namely (−,↑,−) or (−,↓,−).
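The class of a transition pattern can be illustrated with the first-order model commonly used for coupled wires, in which each neighbour contributes 0, 1 or 2 times *C*_{c} depending on whether it switches with the victim, stays stable, or switches against it. This model is an assumption of the sketch below and is not taken from Table I of [12]; the function name and the +1/0/−1 encoding of transitions are likewise hypothetical.

```python
def delay_class(left, victim, right):
    """Return k such that the switching victim sees roughly C_s + k*C_c.

    Transitions are encoded as +1 (rising), -1 (falling), 0 (stable).
    Returns None when the victim itself does not switch.
    """
    if victim == 0:
        return None
    k = 0
    for neighbour in (left, right):
        if neighbour == victim:
            k += 0   # same direction: coupling capacitance not charged
        elif neighbour == 0:
            k += 1   # stable neighbour: one C_c
        else:
            k += 2   # opposite direction: Miller effect, two C_c
    return k


if __name__ == "__main__":
    print(delay_class(0, +1, 0))    # (-, rising, -) pattern
    print(delay_class(-1, +1, -1))  # both neighbours switch against the victim
```

Under this model, the (−,↑,−) pattern gives *k* = 2, while a pattern such as (↓,↑,↓), which the coding rules out within an encoded block, would give *k* = 4.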

SPICE simulation results show that, for a 65 nm technology and a 1 mm length, the worst-case propagation delay is reduced by 20% when the CECR technique is used (the computed delay also includes the coding and decoding logic).

### B. Effects on Energy Consumption

As stated in the preliminary section, applying techniques to the bits that have the highest activity leads to the best results. First, results are presented for the worst consumption case, i.e., applying the technique to a fully random bus, in order to determine the best power consumption reduction the technique can bring. Then, results are presented for the CECR solution applied to the 4 least significant bits (which have the same switching activity as random bits, as shown in Section IV.C) of an 8-bit data transmission bus; this means that two coding/decoding blocks are used. As the MSBs are more correlated (i.e., their toggling probability is very low, as demonstrated in [13]), applying the CECR technique to them would not have any real impact.

The coding and decoding processes also consume power, which has to be taken into account to evaluate the quality of the proposed method. To do so, SPICE simulations have been performed with the previously defined set of parameters.

First, as shown in Fig. 2, the efficiency of the CECR technique increases as the technology shrinks, which is an important property for energy consumption reduction in current and future technologies.

Secondly, the CECR technique is already efficient on the lowest metal layers, but it is more efficient on the highest metal layers reserved in particular for long buses.

Thirdly, Fig. 2 shows that the longer the bus, the higher the energy consumption reduction. The reduction can reach up to 12% for a fully random data bus and up to 7% for a normal image data flow (and can be higher for buses longer than the longest simulated one, 10 mm). In state-of-the-art optimization techniques, the interconnect lengths used for experimental results are often not realistic. For instance, the technique used in [7] claims an energy consumption reduction for a 7.5 cm bus length. In addition, results presented in [10] show that many coding techniques only start to be efficient for very long buses because of the extra consumption due to codecs (e.g., 2 cm for Bus Invert). Moreover, many coding techniques ([8], [9] for instance) do not always take the extra consumption due to codecs into account when presenting power consumption results for buses.

### C. Effects on Switching Activity

When the codecs are applied to bits that have an average switching activity of 1/2 in data transmission (as shown in [13]), the average activity of each encoded wire is 1/4, as illustrated in Table II. In one cycle, 3/4 of a transition occurs on average over a block of three encoded wires, compared to an activity of 1 for the direct transmission of two bits (two wires, each with a transition probability of 1/2).
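These averages can be reproduced by exhaustive enumeration. The Python sketch below is an illustration only: it assumes that the 2-bit input symbols are independent and uniformly distributed (which corresponds to an activity of 1/2 per uncoded bit), and its function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def toggle_vector(e1, e0):
    """Toggle vector C_k of the encoder for input bits (e1, e0)."""
    return ((1 - e1) & e0, e1 & (1 - e0), e1 & e0)


def encoded_activity():
    """Average toggles per cycle over the 3 encoded wires, and per wire."""
    inputs = list(product((0, 1), repeat=2))
    total = sum(sum(toggle_vector(e1, e0)) for e1, e0 in inputs)
    per_cycle = Fraction(total, len(inputs))
    return per_cycle, per_cycle / 3


def uncoded_activity():
    """Average toggles per cycle when the 2 bits are sent on 2 wires."""
    pairs = list(product(product((0, 1), repeat=2), repeat=2))
    total = sum((a[0] ^ b[0]) + (a[1] ^ b[1]) for a, b in pairs)
    return Fraction(total, len(pairs))


if __name__ == "__main__":
    per_cycle, per_wire = encoded_activity()
    print("encoded:", per_cycle, "per cycle,", per_wire, "per wire")
    print("uncoded:", uncoded_activity(), "per cycle")
```

Under the uniform-input assumption, the enumeration gives 3/4 transition per cycle for the encoded block (1/4 per wire) and 1 transition per cycle for the two uncoded wires.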

### D. Effects on Noise

The CECR technique can also contribute significantly to noise reduction on the encoded wires. For a wire that remains at a stable logic level, the worst cases for noise occur when one or both of its neighbours switch in the same direction [16] (i.e., (↑,*GND*,↑), (↓,*V*_{dd},↓), (↑,*GND*,−) or (↓,*V*_{dd},−) transitions). Fig. 3 presents the (↑,*GND*,↑) and (↑,*GND*,−) cases, where the victim wire is the central wire. The resulting unwanted voltage noise peak (above *GND* or below *V*_{dd}) can cause errors if it crosses the threshold voltage of the buffer at the end of the bus. As illustrated in Table III, using the CECR technique reduces the overall number of worst-case transitions by up to 51% for a random data flow. These results have been obtained by counting the transitions in which the two neighbours switch simultaneously in the same direction.
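The counting of worst-case noise transitions can be sketched as follows. This Python fragment is a hypothetical illustration, not the procedure used to produce Table III: it assumes that bus words are tuples of 0/1 values, and it counts only the (↑,*GND*,↑) and (↓,*V*_{dd},↓) patterns, where both neighbours of a stable victim switch in the same direction.

```python
def count_two_neighbour_noise_events(words):
    """Count (cycle, victim) pairs matching (rise, GND, rise) or (fall, Vdd, fall).

    words is a sequence of bus states, each a tuple of 0/1 values.
    Only inner wires (with two neighbours) are considered as victims.
    """
    events = 0
    for prev, cur in zip(words, words[1:]):
        for i in range(1, len(prev) - 1):
            if prev[i] != cur[i]:
                continue  # the victim must stay stable
            left = cur[i - 1] - prev[i - 1]    # +1 rising, -1 falling, 0 stable
            right = cur[i + 1] - prev[i + 1]
            if left == 0 or left != right:
                continue  # both neighbours must switch the same way
            rising_on_gnd = left == 1 and prev[i] == 0
            falling_on_vdd = left == -1 and prev[i] == 1
            if rising_on_gnd or falling_on_vdd:
                events += 1
    return events


if __name__ == "__main__":
    print(count_two_neighbour_noise_events([(0, 0, 0), (1, 0, 1)]))
    print(count_two_neighbour_noise_events([(0, 0, 0), (1, 0, 0)]))
```

Since only one wire of an encoded block switches per cycle, such a pattern cannot occur with all three wires inside the same encoded block.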

Based on previous analyses of interconnect delay and power optimization, a new optimization technique called “Convolutional Encoder for Crosstalk Reduction” (CECR) has been presented. This technique aims at lowering as much as possible the switching activity on the most power-consuming wires (i.e., the LSBs) of on-chip data buses. After the presentation of the concept of the technique, one implementation has been proposed. Then, experimental results in terms of power consumption, delay, switching activity and noise reduction have been presented for three different technologies and their associated metal layers, under varying technological parameters. Results are given for the technologies and bus lengths used in today's SoCs. The energy consumption reduction can reach up to 12% for a 10 mm bus in the 65 nm technology, and more if buses are longer. The technique also accelerates data propagation by 20% and reduces the overall number of worst-case noise transitions by 51%.

### Acknowledgment

This work has been supported by the European Union and the Brittany Region in the context of Programme Objectif 2 Bretagne 2000–2006.