IEEE Journals & Magazines
With the continuing scaling of CMOS technology, on-chip temperature and thermal-induced variations have become a major design concern. To effectively limit the high temperature in a chip equipped with a cost-effective cooling system, thermal specific approaches, besides low power techniques, are necessary at the chip design level. The high temperature in hotspots and large thermal gradients are caused by the high local power density and the nonuniform power dissipation across the chip. With the objective of reducing power density in hotspots, we propose two placement techniques that spread cells in hotspots over a larger area. Increasing the area occupied by the hotspot directly reduces its power density, leading to a reduction in peak temperature and thermal gradient. To minimize the introduced overhead in delay and dynamic power, we maintain the relative positions of the coupling cells in the new layout. We compare the proposed methods in terms of temperature reduction, timing, and area overhead to the baseline method, which enlarges the circuit area uniformly. The experimental results showed that our methods achieve a larger reduction in both peak temperature and thermal gradient than the baseline method. The baseline method, although reducing peak temperature in most cases, has little impact on thermal gradient.
IEEE Conference Publications
Using Field Programmable Gate Arrays (FPGAs) to accelerate financial derivative calculations is becoming very common. In this work, we implement an FPGA-based specific processor for European option pricing using Monte Carlo simulations, and we compare its performance and power dissipation to the execution on a CPU. The experimental results show that impressive speed-up and energy savings can be obtained by using FPGA-based accelerators, at the expense of a longer development time.
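As a rough software illustration of the kind of computation this abstract refers to, the sketch below prices a European call option by Monte Carlo simulation. It is not the authors' processor: the geometric Brownian motion model, the parameter values, and the function names are assumptions chosen for the example, and the Black-Scholes formula is included only as a sanity reference.

```python
import math
import random


def black_scholes_call(s0, k, r, sigma, t):
    """Closed-form European call price, used only as a sanity reference."""
    d1 = (math.log(s0 / k) + (r + 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)

    def cdf(x):
        # Standard normal cumulative distribution function.
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    return s0 * cdf(d1) - k * math.exp(-r * t) * cdf(d2)


def monte_carlo_call(s0, k, r, sigma, t, n_paths, seed=1):
    """Estimate the call price by averaging discounted payoffs over random paths."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma ** 2) * t
    vol = sigma * math.sqrt(t)
    total = 0.0
    for _ in range(n_paths):
        # Terminal asset price under the assumed geometric Brownian motion.
        st = s0 * math.exp(drift + vol * rng.gauss(0.0, 1.0))
        total += max(st - k, 0.0)
    return math.exp(-r * t) * total / n_paths


if __name__ == "__main__":
    args = (100.0, 100.0, 0.05, 0.2, 1.0)   # hypothetical s0, k, r, sigma, t
    print(monte_carlo_call(*args, n_paths=50000), black_scholes_call(*args))
```

Each path is independent of the others, which is one reason this workload maps well onto parallel hardware.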
It is well known that the Residue Number System (RNS) provides an efficient implementation of parallel FIR filters, especially when the filter order and the dynamic range are high. The two main drawbacks of RNS, the need for converters and the coding overhead, make a serialized implementation of the FIR filter potentially disadvantageous with respect to filters implemented in conventional number systems. In this work, we show a number of solutions which demonstrate that the power efficiency of RNS FIR filters implemented serially is maintained in ASIC technology, while in modern FPGA technology RNS implementations are less efficient.
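For readers unfamiliar with RNS, the following minimal sketch shows the idea behind an RNS multiply-accumulate: operands are converted to residues, each modulus channel computes independently, and the Chinese Remainder Theorem converts the result back. The moduli set and function names are assumptions made for illustration, and only non-negative values are handled.

```python
from math import prod

MODULI = (7, 11, 13, 15)   # pairwise coprime; an arbitrary illustrative choice
M = prod(MODULI)           # dynamic range: 15015 distinct values


def to_rns(x):
    """Forward conversion: one residue per modulus."""
    return tuple(x % m for m in MODULI)


def from_rns(residues):
    """Reverse conversion with the Chinese Remainder Theorem."""
    total = 0
    for r, m in zip(residues, MODULI):
        mi = M // m
        total += r * mi * pow(mi, -1, m)   # pow(mi, -1, m) is the modular inverse
    return total % M


def fir_tap_sum_rns(coeffs, samples):
    """Multiply-accumulate carried out independently in each modulus channel."""
    acc = [0] * len(MODULI)
    for c, s in zip(coeffs, samples):
        cr, sr = to_rns(c), to_rns(s)
        acc = [(a + ci * si) % m for a, ci, si, m in zip(acc, cr, sr, MODULI)]
    return tuple(acc)


if __name__ == "__main__":
    y = fir_tap_sum_rns([3, 5, 2, 7], [10, 20, 30, 40])
    print(y, from_rns(y))
```

The two conversion functions correspond to the converter overhead named in the abstract; the channel arithmetic itself never needs a carry between moduli.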
Sometimes reducing the precision of a numerical processor, by introducing errors, can lead to significant performance (delay, area and power dissipation) improvements without compromising the overall quality of the processing. In this work, we show how to perform the two basic operations, addition and multiplication, in an imprecise manner by simplifying the hardware implementation. With the proposed “sloppy” operations, we obtain a reduction in delay, area and power dissipation, and the error introduced is still acceptable for applications such as image processing.
IET Journals & Magazines
For original article see Kaivani, et al., ibid, vol. 5, pp. 393-404 (2011). Lang and Nannarelli comment on the paper of Kaivani, et al., which reported a proposed unit ~46% faster than the unit from their study. Lang and Nannarelli show in this comment that the evaluation done by Kaivani, et al. is based on wrong assumptions and the results of the comparison are erroneous.
Although division and square root are not frequent operations, most processors implement them in hardware so as not to compromise the overall performance. Two classes of algorithms implement division or square root: digit-recurrence and multiplicative (e.g., Newton-Raphson) algorithms. Previous work shows that division and square root units based on the digit-recurrence algorithm offer the best delay-area-power tradeoff. Moreover, the two operations can be combined in a single unit. Here, we present a radix-16 combined division and square root unit obtained by overlapping two radix-4 stages. The proposed unit is compared to similar solutions based on the digit-recurrence algorithm and to a unit based on the multiplicative Newton-Raphson algorithm.
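The digit-recurrence idea can be sketched in a few lines. The version below is a simplified software model, not the unit described in the abstract: it selects each quotient digit by direct comparison with a non-redundant digit set, whereas hardware units typically use redundant digits and carry-save residuals. The function names are assumptions.

```python
from fractions import Fraction


def digit_recurrence_divide(x, d, radix, n_digits):
    """Return n_digits quotient digits of x/d (0 <= x < d) and the final residual."""
    assert 0 <= x < d
    w = x                        # residual w[0]
    digits = []
    for _ in range(n_digits):
        shifted = radix * w      # r * w[j]
        q = shifted // d         # digit selection by comparison (a simplification)
        w = shifted - q * d      # w[j+1] = r * w[j] - q[j+1] * d
        digits.append(q)
    return digits, w


def digits_to_fraction(digits, radix):
    """Weight digit j (most significant first) by radix^-(j+1)."""
    return sum(Fraction(q, radix ** (j + 1)) for j, q in enumerate(digits))


if __name__ == "__main__":
    digits, w = digit_recurrence_divide(3, 7, 16, 8)
    print(digits, float(digits_to_fraction(digits, 16)))
```

One radix-16 iteration retires four quotient bits, the same as two radix-4 iterations, which is why overlapping two radix-4 stages yields a radix-16 step.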
Applications in non-conventional number systems can benefit from accelerators implemented on reconfigurable platforms, such as Field Programmable Gate-Arrays (FPGAs). In this paper, we show that applications requiring decimal operations, such as the ones necessary in accounting or financial transactions, can be accelerated by Application Specific Processors (ASPs) implemented on FPGAs. For the case of a telephone billing application, we demonstrate that by accelerating the program execution on an FPGA board connected to the computer by a standard bus, we obtain a significant speed-up over its execution on the CPU of the hosting computer.
In this paper a review of different techniques used to implement highly optimized DSP systems is presented. The case study is the implementation of parallel FIR filters aimed at applications characterized by high speed and high selectivity in frequency where, at the same time, low power dissipation is mandatory. After a review of the possible “standard” optimization techniques, the paper addresses aggressive methodologies where power and area savings are obtained by introducing the concept of “Degrading Precision Arithmetic” (DPA). Three different approaches are discussed: DPA-I, based on selective bit freezing, DPA-II, based on VDD voltage scaling, and DPA-III, based on power gating. Some theoretical/simulative analysis of the introduced arithmetic errors and some implementation results are shown. The suitability of these methodologies for standard cell technologies and FPGAs is also discussed. In our experience, these techniques are well known in the scientific community, but they are not extensively known in the design community and, consequently, they are scarcely utilized.
Field Programmable Gate-Arrays (FPGAs) can efficiently implement application specific processors in non-conventional number systems, such as the decimal (Binary-Coded Decimal, or BCD) number system required for accounting accuracy in financial applications. The main purpose of this work is to show that applications requiring several decimal (BCD) operations can be accelerated by a processor implemented on an FPGA board connected to the computer by a standard bus. For the case of a telephone billing application, we demonstrate that even a basic implementation of the decimal processor on the FPGA, without an advanced input/output interface, can achieve a speed-up of about 10 over its execution on the CPU of the hosting computer.
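To make the BCD arithmetic mentioned here concrete, the sketch below adds two numbers digit by digit, using a binary add followed by the usual +6 correction whenever a digit sum exceeds 9. It is a generic textbook scheme written for illustration, not the processor from the abstract; the helper names are assumptions.

```python
def to_bcd(n):
    """Non-negative integer -> list of decimal digits, least significant first."""
    return [int(ch) for ch in reversed(str(n))]


def from_bcd(digits):
    """Inverse of to_bcd."""
    return int("".join(str(d) for d in reversed(digits)))


def bcd_add(a, b):
    """Add two BCD digit lists using 4-bit binary adds plus the +6 correction."""
    result, carry = [], 0
    for i in range(max(len(a), len(b))):
        da = a[i] if i < len(a) else 0
        db = b[i] if i < len(b) else 0
        s = da + db + carry        # binary sum of two digits, range 0..19
        if s > 9:
            s = (s + 6) & 0xF      # skip the six unused 4-bit codes
            carry = 1
        else:
            carry = 0
        result.append(s)
    if carry:
        result.append(1)
    return result


if __name__ == "__main__":
    print(from_bcd(bcd_add(to_bcd(1999), to_bcd(2))))
    print(0.1 + 0.2 == 0.3)   # binary floating point cannot represent 0.1 exactly
```

The last line of the usage example hints at why decimal arithmetic is required for accounting: decimal fractions such as 0.1 have no exact binary floating-point representation.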
Due to large variations in temperature in VLSI circuits and the linear relationship between metal resistance and temperature, the delay through wires of the same length can be different. Traditional thermal-aware floorplanning algorithms use wirelength to estimate delay and routability. In this work, we show that using wirelength as the evaluation metric does not always produce a floorplan with the shortest delay. We propose a temperature-dependent wire delay estimation method for thermal-aware floorplanning algorithms, which takes into account the thermal effect on wire delay. The experimental results show that a shorter delay can be achieved using the proposed method. In addition, we also discuss the congestion and reliability issues, as they are closely related to routing and temperature.
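The effect described in this abstract can be illustrated with a small model: a linear resistance-temperature relation combined with the Elmore delay of a uniform RC wire. This is only a sketch under stated assumptions and not the estimation method proposed by the authors; the temperature coefficient, the per-micron resistance and capacitance, and the function names are all hypothetical values picked for the example.

```python
def wire_resistance_per_um(r0, temp_c, t0_c=25.0, beta=0.0039):
    """Linear model R(T) = R0 * (1 + beta * (T - T0)).

    beta is an assumed temperature coefficient (roughly that of copper).
    """
    return r0 * (1.0 + beta * (temp_c - t0_c))


def elmore_wire_delay(length_um, r_per_um, c_per_um):
    """Elmore delay of a uniform distributed RC line: 0.5 * r * c * L^2."""
    return 0.5 * r_per_um * c_per_um * length_um ** 2


if __name__ == "__main__":
    length = 2000.0            # hypothetical wire length in micrometres
    r0, c = 0.1, 0.2e-15       # hypothetical ohm/um at 25 C and farad/um
    for temp in (45.0, 105.0):
        delay = elmore_wire_delay(length, wire_resistance_per_um(r0, temp), c)
        print(temp, delay)
```

In this model two wires of identical length differ in delay by the ratio of their resistances, which is why wirelength alone can mislead a thermal-aware floorplanner.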
Division and square root, based on the digit-recurrence algorithm, can be implemented in a combined unit. Several implementations of combined division/square root units have been presented, mostly for radices 2 and 4. Here, we present a combined radix-16 unit obtained by overlapping two radix-4 result digit selection functions, as is normally done for division-only units. The latency of the unit is reduced by retiming, and low-power methods are applied as well. The proposed unit is compared to a radix-4 combined division/square root unit, and to a radix-16 unit, obtained by cascading two radix-4 stages, which is similar to the one implemented in a state-of-the-art processor.
In this work, we revisit the implementation of polyphase filter banks in Quadratic Residue Number System (QRNS) for banks with a large number of channels by developing a new design methodology suitable for large systems required in the new generation of satellites. Furthermore, we compare the QRNS filter bank with an equivalent bank implemented in the traditional Complex Two's Complement System (CTCS) in terms of throughput, area and power dissipation. The results for large filter banks confirm the savings in power consumption obtained by using the QRNS.
Fused Multiply-Add (FMA) units are quite popular in floating-point execution units in state-of-the-art multicore processors. It has been shown that, for division operations, using digit-recurrence units consumes much less power and energy than using FMA units, which are based on Newton-Raphson approximation algorithms. In this work, we show that digit-recurrence division units can also reduce on-chip thermal coupling from hot blocks (e.g. FMAs) to cool blocks such as caches. By placing power-efficient dividers between FMAs and a cache block, we lower the average temperature in caches by 5°C and consequently reduce leakage by 12%. The total power consumption in caches is reduced by 8.44%.
With increased densities on chips and the growing popularity of multicore processors and general-purpose graphics processing units (GPGPUs), power dissipation and energy consumption pose a serious challenge in the design of systems-on-chip (SoCs) and raise the cost of heat removal. In this work, we analyze the impact of power dissipation in floating-point (FP) units and we consider different alternatives in the implementation of FP-division that lead to substantial energy savings. We compare the implementation of division in a Fused Multiply-Add (FMA) unit based on the Newton-Raphson approximation algorithm to the implementation in a dedicated digit-recurrence unit. The results show a significant reduction of energy in a typical scientific application when the division digit-recurrence unit is used. In addition, we model the thermal behavior of the considered FP-units.
Sometimes reducing the power dissipation of resource-constrained electronic systems, such as those built for deep-space probes or for wearable devices, is a top priority. In signal processing, it is possible to retain an acceptable signal quality even when some errors are introduced. In this work, we analyze two methods to degrade the precision of arithmetic operations in DSP to save power. The first method is based on disabling the lower (least-significant) portion of the datapath by clock-gating and forcing zeros. The second method is based on lowering the supply voltage and re-designing the carry-chains in the datapath to adapt to the increased delays.
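The arithmetic effect of the first method (forcing the least-significant bits to zero) can be modelled in software as below. This sketch only captures the numerical error, not the power savings or the clock-gating itself, and the function names and the choice of unsigned operands are assumptions.

```python
def truncate_low_bits(value, k):
    """Force the k least-significant bits to zero, as a disabled lower datapath would."""
    return value & ~((1 << k) - 1)


def degraded_add(a, b, k):
    """Add two non-negative integers with the lower k bits of each operand zeroed."""
    return truncate_low_bits(a, k) + truncate_low_bits(b, k)


def max_error(k):
    """Worst-case error: both discarded fields are all ones, 2 * (2^k - 1)."""
    return 2 * ((1 << k) - 1)


if __name__ == "__main__":
    a, b, k = 1000, 535, 4
    print(a + b, degraded_add(a, b, k), max_error(k))
```

The error is always non-negative and bounded by `max_error(k)`, so the designer can trade the number of disabled bits against the signal quality the application tolerates.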
With technology scaling into the deep-submicron era, temperature and temperature gradient have emerged as important design criteria. We propose two post-placement techniques to reduce peak temperature by intelligently allocating whitespace in the hotspots. Both methods are fully compliant with commercial technologies, and can be easily integrated with state-of-the-art thermal-aware design flows. Experiments on a set of circuits implemented in STM 65nm technology show that our methods achieve better peak temperature reduction than directly increasing the circuit area.
In this work, we present a radix-10 division unit that is based on the digit-recurrence algorithm and implements binary encodings (binary integer decimal or BID) for significands. Recent decimal division designs are all based on the binary coded decimal (BCD) encoding. We adapt the radix-10 digit-recurrence algorithm to BID representation and implement the division unit in standard cell technology. The implementation of the proposed BID division unit is compared to that of a BCD based unit implementing the same algorithm. The comparison shows that for normalized operands the BID unit has the same latency as the BCD unit and reduced area, but the normalization is more expensive when implemented in BID.
In this work, we develop an architecture to implement a deblocking filter to improve the quality of video decoded from MPEG. The filter is controlled by the quantization scale parameter, which is derived from the decoded stream based on a novel algorithm. The hardware implementation is targeting an FPGA device similar to those currently used in post-processing units of high end flat panel TV sets. The designed filter shows good performance in terms of robust signal-to-noise ratio and its implementation meets the area and frequency constraints.
Several algorithms have been developed over the years to reduce the number of additions needed for Multiple Constant Multiplication (MCM) and optimize the area. In this work, we present an approach to MCM which is based on the properties of the Residue Number System (RNS). Experimental results on a set of digital filters, which represent a typical application of MCM, show that the proposed RNS method has a lower power dissipation in most cases, and a reduced area for high throughput filters.
In this paper, we present the hardware design of a combined decimal and binary floating-point divider, based on specifications in the IEEE 754-2008 Standard for Floating-point Arithmetic. In contrast to most recent decimal divider designs, which are based on the Binary Coded Decimal (BCD) encoding, our divider operates on either 64-bit binary encoded decimal floating-point (DFP) numbers or 64-bit binary floating-point (BFP) numbers. The division approach implemented in our design is based on a digit-recurrence algorithm. We describe the hardware resources shared between the two floating-point datatypes and demonstrate that hardware sharing is advantageous. Compared to a standalone DFP divider, the combined divider has the same worst case delay and 17% more area.
Floorplanning is becoming more and more important in VLSI design flows, especially for system-on-chip (SoC) designs where IP blocks dominate standard cells. Moreover, in deep sub-micron technologies, where process variations can introduce extra signal skew, it is desirable to have floorplans with balanced net delays to increase the safety margins of the design. In this paper, we investigate the properties of floorplanning based on the elastic energy model. The B*-tree, which is based on an ordered binary tree, is used for circuit representation and the elastic energy is used as the cost function. To evaluate how well a net is balanced, we introduce a new metric, 'unbalancing'. A more balanced net has a smaller 'unbalancing' value. Experimental results show that our approach can not only meet fixed-outline constraints, but also achieve significant improvements in net balance for all the circuits in the MCNC benchmark.
We consider the problem of adding the partial products in the combinational decimal multiplier presented by Lang and Nannarelli. In the original paper this addition is done with a tree of decimal carry-save adders. In this paper, we treat the problem using the multi-operand decimal addition previously published by Dadda, where the sum of each column of the partial product array is obtained first in binary form and then converted to decimal. The multiplication, using a 90 nm CMOS technology, in this modified scheme takes 2.51 ns, while in the original scheme it takes 2.65 ns. The area of the two schemes is roughly the same.
Low cost microprocessors and DSPs are optimized to perform general arithmetic and logic operations on the native wordlength. On the other hand, their efficiency decreases when they process shorter data (more clock cycles per operation are required). Recently, different solutions have been proposed to overcome this problem. Among those, the one based on a main processor with a reconfigurable unit (RU) used as a coprocessor (to speed up fine grained operations) is the most common. Typically those coprocessors, similar to FPGAs, are composed of look-up tables (LUTs) and pass-transistor interconnects. Due to the great number of reconfiguration bits, it is then impossible to obtain both run-time reconfiguration and an efficient implementation that avoids idle hardware resources. This paper proposes a new dynamic reconfigurable architecture that can be embedded in microprocessors or low cost DSPs to accelerate the execution of the above mentioned operations. The goal of ADAPTO (adder-based dynamic architecture for processing tailored operators) is to reduce the hardware complexity and the reconfiguration time with respect to a typical LUT based reconfigurable array. ADAPTO supports both hardware reconfiguration and instruction execution in the same processor clock cycle. This goal has been obtained by using a new reconfigurable unit based on full adders, instead of LUTs, and by simplifying the interconnect network.
A few classes of algorithms to implement division in hardware have been used over the years: division by digit-recurrence, by reciprocal approximation by iterative methods and by polynomial approximation. Due to the differences in the algorithms, a comparison among their implementations in terms of performance and precision is sometimes hard to make. In this work, we use power dissipation and energy consumption as metrics to compare those different classes of algorithms. There are no previous works in the literature presenting such a comparison.
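As an illustration of the "reciprocal approximation by iterative methods" class named in this abstract, the sketch below runs the Newton-Raphson reciprocal iteration in floating point. It is a generic textbook formulation, not any of the compared implementations; the operand range, the linear seed, and the function name are assumptions made for the example.

```python
def newton_raphson_reciprocal(d, iterations):
    """Approximate 1/d for 0.5 <= d < 1 with the iteration x <- x * (2 - d * x)."""
    assert 0.5 <= d < 1.0
    # Linear seed commonly used for this range; its relative error is at most 1/17.
    x = 48.0 / 17.0 - (32.0 / 17.0) * d
    for _ in range(iterations):
        # If e = 1 - d*x, the new error is e**2: quadratic convergence.
        x = x * (2.0 - d * x)
    return x


def divide(n, d, iterations=4):
    """Division as multiplication by the approximated reciprocal."""
    return n * newton_raphson_reciprocal(d, iterations)


if __name__ == "__main__":
    for steps in range(5):
        print(steps, abs(1.0 - 0.75 * newton_raphson_reciprocal(0.75, steps)))
```

Each iteration costs two multiplications, which is why this class of algorithm is usually mapped onto an existing multiplier or FMA unit rather than a dedicated datapath.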
Fast accumulation is required for units such as direct digital frequency synthesis (DDFS) processors which, together with a digital to analog converter, generate periodic waveforms. In these units, waveforms with high frequency resolution are obtained if the clocking frequency of the digital processor is high (GHz range in today's technologies). Accumulators necessary for DDFS are then deeply pipelined down to the bit-level with two main consequences: high power dissipation, due to the large number of latches/flip-flops, and large latency dependent on the granularity of the applied pipelining. In this work, we address the two issues of reducing the power dissipation in the accumulator by applying selective clock gating, and reducing the accumulation latency by pipelining the adder to adapt the delay of the carry-chain to the necessary clock period.
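The role of the accumulator in a DDFS can be shown with a short behavioural model: a phase register is incremented by a frequency control word every clock cycle and its value indexes a sine function. The model below is a sketch that ignores pipelining, clock gating and the converter; the names `fcw` and `acc_bits` are assumptions.

```python
import math


def ddfs_samples(fcw, acc_bits, n_samples):
    """Phase accumulator followed by an ideal sine lookup (no converter modelled)."""
    modulus = 1 << acc_bits
    phase = 0
    out = []
    for _ in range(n_samples):
        out.append(math.sin(2.0 * math.pi * phase / modulus))
        phase = (phase + fcw) % modulus    # one accumulation per clock cycle
    return out


def output_frequency(fcw, acc_bits, f_clk):
    """f_out = fcw * f_clk / 2^N, so the frequency resolution is f_clk / 2^N."""
    return fcw * f_clk / (1 << acc_bits)


if __name__ == "__main__":
    print(output_frequency(1, 32, 1.0e9))          # resolution at a 1 GHz clock
    print([round(s, 3) for s in ddfs_samples(64, 8, 8)])
```

A wide accumulator clocked at a high frequency gives fine resolution, but the full-width addition must complete every cycle, which is what motivates the deep pipelining discussed in the abstract.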
In this work we extend a previously proposed digit-recurrence radix-10 division unit so that it can also perform radix-16 division. The extension is simplified by the fact that in the radix-10 implementation the quotient digit is decomposed into two parts and that this decomposition is also appropriate for the radix-16 case. Moreover, to reduce the latency in radix-10, the most-significant portion of the datapath, including the selection function, has been implemented in radix-2, so that the modifications of that part to include radix-16 consist mainly of combining the two modules to obtain the selection constants. The rest of the modifications relate to the generation of multiples, to the carry-save adder, to the carry-propagate adder, and to the on-the-fly conversion and rounding. The implementation results show that the delay of an iteration is similar to that of the radix-10 case and that the area is about thirty percent larger.
In this paper a low-power implementation of an adaptive FIR filter is presented. The filter is designed to meet the constraints of channel equalization for fixed wireless communications that typically requires a large number of taps, but a serial updating of the filter coefficients, based on the least mean squares (LMS) algorithm, is allowed. Previous work showed that the use of the residue number system (RNS) for the variable FIR filter grants advantages both in area and power consumption. On the other hand, the use of a binary serial implementation of the adaptation algorithm eliminates the need for complex scaling circuits in RNS. The advantages in terms of area and speed of the presented filter, with respect to its two's complement counterpart, are evaluated for implementations in standard cells.
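For reference, the LMS coefficient update that this abstract relies on can be sketched in floating point as follows. The example identifies an unknown FIR channel from noise-free data; it does not model the RNS datapath or the serial binary update of the paper, and the step size, channel taps, and function name are assumptions.

```python
import random


def lms_identify(unknown_taps, n_samples, mu=0.1, seed=0):
    """Adapt FIR weights so the filter output tracks an unknown FIR channel."""
    rng = random.Random(seed)
    n = len(unknown_taps)
    weights = [0.0] * n
    window = [0.0] * n                      # most recent inputs, newest first
    for _ in range(n_samples):
        window = [rng.uniform(-1.0, 1.0)] + window[:-1]
        desired = sum(h * x for h, x in zip(unknown_taps, window))
        output = sum(w * x for w, x in zip(weights, window))
        error = desired - output
        # LMS update: w <- w + mu * error * x
        weights = [w + mu * error * x for w, x in zip(weights, window)]
    return weights


if __name__ == "__main__":
    print(lms_identify([0.5, -0.3, 0.2], 3000))
```

The filtering step (the two sums) runs at the sample rate, while the weight update is a separate computation, which is what makes a split between a fast filter datapath and a slower serial adaptation block possible.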
In this paper a design space exploration for FIR filter implementations in residue number system (RNS) is presented. The exploration regards different aspects of the RNS FIR filter design such as the dynamic range, the overhead due to the coding of the RNS base with respect to the application dynamic range, and delay-area tradeoffs. The design space exploration and its results are helpful in evaluating the effects of the RNS coding overhead and in choosing an efficient filter architecture that trades off filter order, dynamic range, clock frequency and area.
In previous works (Cardarilli et al., 2000) we performed different experiments implementing FIR filtering structures. Each filter was implemented using both the two's complement system (TCS) and the residue number system (RNS) number representations. The comparison of these two implementations allows us to conclude that, for these applications, the RNS uses less power than the TCS counterpart. The aim of the present paper is to highlight the reasons for this reduction in power consumption.
In this work, we present a radix-10 division unit that is based on the digit-recurrence algorithm. The previous decimal division designs do not include recent developments in the theory and practice of this type of algorithm, which were developed for radix-2^k dividers. In addition to the adaptation of these features, the radix-10 quotient digit is decomposed into a radix-2 digit and a radix-5 digit in such a way that only five and two times the divisor are required in the recurrence. Moreover, the most significant slice of the recurrence, which includes the selection function, is implemented in radix-2, avoiding the additional delay introduced by the radix-10 carry-save additions and allowing the balancing of the paths to reduce the cycle delay. The results of the implementation of the proposed radix-10 division unit show that its latency is close to that of radix-16 division units (comparable dynamic range of significands) and it has a shorter latency than a radix-10 unit based on the Newton-Raphson approximation.
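The benefit of splitting the quotient digit can be seen with a small enumeration. The sketch below assumes a signed digit set of -7 to 7 written as q = 5*q_hi + q_lo, with q_hi in {-1, 0, 1} and q_lo in {-2, ..., 2}; these exact sets are an assumption made for illustration, since the abstract only states that five and two times the divisor are the multiples required.

```python
def decompose_radix10_digit(q):
    """Split q in [-7, 7] as q = 5*q_hi + q_lo, q_hi in {-1, 0, 1}, q_lo in {-2..2}."""
    for q_hi in (-1, 0, 1):
        q_lo = q - 5 * q_hi
        if -2 <= q_lo <= 2:
            return q_hi, q_lo
    raise ValueError("digit outside the assumed set [-7, 7]")


def digit_times_divisor(q, d):
    """Form q*d using only the multiples 5d and 2d, plus d itself."""
    q_hi, q_lo = decompose_radix10_digit(q)
    five_d, two_d = 5 * d, 2 * d
    low = {-2: -two_d, -1: -d, 0: 0, 1: d, 2: two_d}[q_lo]
    return q_hi * five_d + low


if __name__ == "__main__":
    for q in range(-7, 8):
        print(q, decompose_radix10_digit(q), digit_times_divisor(q, 123))
```

Under these assumed sets every digit has exactly one decomposition, and multiples such as 3d, 4d, 6d and 7d never need to be generated explicitly.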
In this work, we present a combinational decimal multiply unit which can be pipelined to reach the desired throughput. With respect to previous implementations of decimal multiplication, the proposed unit is combinational (parallel) and not sequential, has a simpler recoding of the operands which reduces the number of partial product precomputations, and uses counters to eliminate the need for the decimal equivalent of a 4:2 adder. The results of the implementation show that the combinational decimal multiplier offers a good compromise between latency and area when compared to other decimal multiply units and to binary double-precision multipliers.
In this work a hybrid residue number system (RNS) implementation of an adaptive FIR filter is presented. The adaptation algorithm used is the least mean squares (LMS). The filter has been designed to meet the constraints of a specific class of applications: it is suitable for applications requiring a large number of taps where a serial updating of the filter coefficients is feasible (channel equalization or echo cancellation). In the literature, it has been shown that the RNS implementation of FIR filters grants savings in area and power consumption due to the introduced arithmetic simplifications. Conversely, the RNS implementation of the adaptation algorithm needs scaling circuits that are complex and expensive in RNS arithmetic. For this reason, a serial binary implementation of the adaptation algorithm is chosen. The advantages in terms of area and speed of the RNS adaptive filter with respect to the two's complement one have been evaluated for a standard cells implementation.
The reciprocal operation 1/d is a frequent operation performed in graphics processors (GPUs). In this work, we present the design of a radix-16 reciprocal unit based on the algorithm combining the traditional digit-by-digit algorithm and the approximation of the reciprocal by one Newton-Raphson iteration. We design a fully pipelined single-precision unit to be used in GPUs. The results of the implementation show that the proposed unit can sustain a higher throughput than that of a unit implementing the normal Newton-Raphson approximation, and its area is smaller.
The reciprocal and square-root reciprocal operations are important in several applications. For these operations, we present algorithms that combine a digit-by-digit module and one iteration of a quadratic-convergence approximation. The latter is implemented by a digit-recurrence, which uses the digits produced by the digit-by-digit part. In this way, both parts execute in an overlapped manner, so that the total number of cycles is about half of the number that would be required by the digit-by-digit part alone. Because of the approximation, correct rounding of the result cannot be obtained directly in all cases; we propose a variable-time implementation that produces the correctly rounded result with a small average overhead. Radix-4 implementations are described and have been synthesized. They achieve the same cycle time as the standard digit-by-digit implementation, resulting in a speed-up of about 2 and, because of the approximation part, the area factor is also about 2. We also show a combined implementation for both operations that has essentially the same complexity as that for square-root reciprocal alone.
The scaling operation, i.e. the division by a constant factor followed by rounding, is a commonly used technique for reducing the dynamic range in digital signal processing (DSP) systems. Usually, the constant is a power of two, and the implementation of the scaling is reduced to a right shift. This basic operation is not easily implementable in the residue number system (RNS) due to its non-positional nature. A number of different algorithms have been presented in the literature for the RNS scaling. In this paper, several RNS dynamic reduction techniques have been analyzed and the selected one is applied to a polyphase filter bank. A comparison of the filter bank scaled with RNS-to-binary and binary-to-RNS conversions and the RNS scaled implementation is presented. A reduction of area and power consumption of about 30% for the scaling block is obtained.
In this paper, we propose a class of division algorithms with the aim of reducing the delay of the selection of the quotient digit by introducing more concurrency and flexibility in its computation. From the proposed class of algorithms, we select one that moves part of the selection function out of the critical path, with a corresponding reduction in the critical path compared with existing alternatives: we present the algorithm and describe the architectures for radix 4 and for radix 16. For radix 16, we use the scheme of overlapping two radix-4 stages. In both cases, radix 4 and radix 16, we show that our algorithms allow the design of units with well-balanced critical paths with consequent decreases of the cycle times. Moreover, in the radix-16 case, we include some additional speculation techniques. To estimate the speedup, we used a rough timing model based on logical effort. For both radices, we estimate a speedup of about 25 percent with respect to previous implementations. In the radix-4 case, this is achieved by using roughly the same area, while, in the radix-16 case, the area is increased by about 30 percent. We verified our estimations by performing a synthesis of the radix-4 units.
The aim of this work is the reduction of the power dissipated in digital filters, while maintaining the timing unchanged. A polyphase filter bank in the Quadratic Residue Number System (QRNS) has been implemented and then compared, in terms of performance, area, and power dissipation to the implementation of a polyphase filter bank in the traditional two's complement system (TCS). The resulting implementations, designed to have the same clock rates, show that the QRNS filter is smaller and consumes less power than the TCS one.
Although digital filters based on the residue number system (RNS) show high performance and low power dissipation, RNS filters are not widely used in DSP systems, because of the complexity of the algorithms involved. We present a tool to design RNS FIR filters which hides the RNS algorithms from the designer, and generates a synthesizable VHDL description of the filter taking into account several design constraints such as delay, area, and energy.
In this paper we present some tradeoffs between delay and power consumption in the design of digital processors based on the Residue Number System (RNS). We focus on reducing the switching capacitance, and therefore the power, in modular adders and isomorph multipliers. Results on architectures such as FIR filters show that the techniques used to reduce the switching capacitance not only lead to more power efficient circuits, but also to a better performance.
Since a large portion of the critical path in an implementation of radix-4 division corresponds to the delay of the quotient-digit selection module, it is of interest to reduce this delay. The proposal of this paper extends the approach presented recently of prestoring the selection constants corresponding to the actual value of the divisor and to perform the determination of the quotient digit by carry-free subtraction and sign detection. This extension consists in advancing the subtraction so that it is outside of the critical path. This advancement also provides the possibility of placing the registers so as to minimize the cycle time. We present the method and report results of synthesis using a family of standard cells. We conclude that the extension results in a speedup of 1.35 with respect to the basic implementation and of 1.3 with respect to the previously mentioned approach. We estimate that the areas of all three units are about the same.
In this paper we describe a possible approach to implementing a reconfigurable datapath for digital signal processing. The datapath should be programmable in terms of dynamic range and of the type and sequence of operations. We chose to implement it in the Residue Number System (RNS), because the RNS offers high speed and low power dissipation. Results show that the RNS reconfigurable datapath offers better performance and lower power dissipation when compared, on the same set of applications, with a traditional FIR filter of the same characteristics.
Energy consumption of the processor-to-memory path normally accounts for a large fraction of the total energy budget of modern embedded systems. A novel approach for reducing energy consumption in core processors used in systems with cache-based architectures is presented. In this scheme, instructions are fetched and stored in the I-cache in compressed form. The beneficial effect is an increase of the cache hit ratio; therefore, the number of accesses to the main memory is reduced, and so is the energy required to fetch the instructions. Static code size reduction is achieved as a by-product. A hardware decompression unit performs fast, low-energy, on-the-fly instruction decompression at each cache look-up. The decompressor is placed outside the core boundaries; therefore, the processor architecture does not need any modification, making the proposed compression approach suitable for IP-based designs. The viability and effectiveness of this solution are assessed through extensive benchmarking performed on a number of typical embedded programs. Besides code size, energy and performance optimisation results, the authors also report data regarding the synthesis and implementation of the decompression unit. The energy penalty it introduces is taken into account in the evaluation of the achieved energy savings.
The evaluation of power consumption in complex digital systems is a hard task that normally requires long simulation times and complicated models. In this work, we obtain power consumption estimates by measuring the average current absorption of digital filters mapped on a field programmable gate array (FPGA). We also compare the measurements with the results previously obtained for a standard-cell implementation of the same filters. Moreover, we explore the possibility of measuring other electrical parameters on hardware to extract information on a system, instead of simulating its behavior with complicated models.
This paper contributes a novel approach for reducing static code size and instruction fetch energy for cache-based core processors running embedded applications. Our implementation of the decompression unit guarantees fast and low-energy, on-the-fly instruction decompression at each cache lookup. The decompressor is placed outside the core boundaries; therefore, the processor architecture does not need any modification, making the proposed compression approach suitable for IP-based designs. The viability of our solution is assessed through extensive benchmarking performed on a number of typical embedded programs.
The use of the Residue Number System (RNS) in modern telecommunication and multimedia applications is becoming more and more important because it offers advantages in terms of precision, power consumption and speed. Generally, the output conversion from residue to binary is the crucial point in effective realizations of application-specific architectures based on residue arithmetic. This paper presents a general conversion procedure based on an N-moduli set. The algorithm can process both unsigned and signed numbers. Based on this algorithm, an architecture which efficiently implements the output conversion is illustrated. The architecture has been mapped on an FPGA.
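The abstract does not spell out the conversion procedure, so the following is only a minimal sketch of residue-to-binary conversion using the textbook Chinese Remainder Theorem, not the paper's algorithm. It assumes pairwise coprime moduli and, for signed numbers, the common convention that the upper half of the dynamic range encodes negative values. The function names and the moduli set {3, 5, 7} used in the note below are illustrative.

```python
from math import prod


def to_residues(x, moduli):
    """Forward conversion: represent integer x by its residues."""
    return [x % m for m in moduli]


def from_residues(residues, moduli, signed=False):
    """Reverse conversion via the Chinese Remainder Theorem.

    The moduli are assumed to be pairwise coprime. With signed=True,
    values in the upper half of the range [0, M) are interpreted as
    negative numbers (an assumed convention).
    """
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        # pow(Mi, -1, m) is the multiplicative inverse of Mi modulo m
        # (requires Python 3.8 or later).
        total += r * Mi * pow(Mi, -1, m)
    x = total % M
    if signed and x >= (M + 1) // 2:
        x -= M
    return x
```

With the moduli set {3, 5, 7} the dynamic range is M = 105, so unsigned values 0 to 104 or signed values -52 to 52 convert back without ambiguity. The final reduction modulo M is the wide operation that makes output conversion costly in hardware.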
In this work, we present the implementation of a finite impulse response (FIR) filter in the residue number system (RNS), in which we use a carry-save scheme in the binary representation of the residues to speed up modular additions. We compare the carry-save RNS implementation with the implementations of the same filter in the traditional binary system and in plain RNS. Results show that the carry-save RNS filter is much faster and that its energy dissipation per cycle is comparable. Furthermore, we show that a multiple supply voltage approach for the plain RNS filter can lead to an additional reduction in power dissipation without performance degradation.
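The carry-save idea can be sketched in a few lines, under simplifying assumptions: the running total is kept as a (sum, carry) pair so that each accumulation step needs no carry propagation, and the slow carry-propagating addition and modular reduction happen once at the end. A real carry-save RNS datapath would also keep the two words bounded by the modulus width, which this sketch omits. The function names are illustrative.

```python
def carry_save_add(a, b, c):
    """3:2 compression: return (s, carry) with s + carry == a + b + c.

    Each output bit depends only on the three input bits at the same
    position, so no carry ripples across the word.
    """
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


def modular_sum_carry_save(values, m):
    """Accumulate non-negative values in carry-save form, reduce once."""
    s, carry = 0, 0
    for v in values:
        s, carry = carry_save_add(s, carry, v % m)
    # Single carry-propagating addition, followed by one reduction.
    return (s + carry) % m
```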
In this work, a study on the implementation of FIR filters in the Residue Number System (RNS) is carried out. For different configurations, RNS filters are compared with filters realized in the traditional two's complement system (TCS) in terms of delay, area and power dissipation. The resulting implementations show that the RNS filters are smaller and consume less power than the corresponding ones in TCS, when the number of taps is larger than sixteen.
In this paper, an application of fast prototyping techniques to the hardware simulation of telecommunication systems is shown. In particular, a hardware simulator, based on an FPGA board, of a nonstationary satellite link for mobile communication is developed. For this kind of system, the complexity of the software simulation in terms of computation time is often unacceptable, especially when global quality factors, such as the bit error rate (BER), must be estimated. The hardware simulation of the system reduces the simulation time by several orders of magnitude.
The aim of this work is to reduce the power dissipated in high-order finite impulse response (FIR) filters while maintaining the delay unchanged. We compare, in terms of performance, area, and power dissipation, the implementation of a traditional FIR filter with a residue number system (RNS) based one. The resulting implementations, designed to work at the same clock rate, show that the RNS filter is smaller and consumes less power than the traditional one for a number of taps larger than eight.
This paper compares, in terms of performance, area and power dissipation, a complex FIR filter realized in the traditional two's complement system with a Quadratic Residue Number System (QRNS) based one. The resulting implementations, designed to work at the same clock rate, show that the QRNS filter is almost half the size of the traditional one and dissipates about one third of the energy.
Because of the similarities in the algorithms, it is quite common to implement division and square root in the same unit. The purpose of this work is to implement a low-power combined radix-4 division and square root floating-point double-precision unit and to compare its performance and energy consumption with a radix-4 division-only unit. Previous work has been done on reducing the energy dissipated in a divider. Here we apply the same techniques to the combined division and square root unit and consider modifications and tradeoffs. Results show that the energy dissipation for the combined division/square root unit can be reduced by about 35% without affecting the latency, and an additional 20% reduction can be obtained using a dual voltage. Moreover, the combined unit is 5% slower than a divider and its energy dissipation is 15% higher.
The general objective of our work is to develop methods to reduce the energy consumption of arithmetic modules while maintaining the delay unchanged and keeping the increase in the area to a minimum. Here, we illustrate some techniques for dividers realized in CMOS technology. The energy dissipation reduction is carried out at different levels of abstraction: from the algorithm level down to the implementation, or gate, level. We describe the use of techniques such as switching off inactive blocks, retiming, dual voltage, and equalizing the paths to reduce glitches. Also, we describe modifications in the on-the-fly conversion and rounding algorithm and in the redundant representation of the residual in order to reduce the energy dissipation. The techniques and modifications mentioned above are applied to a radix-4 divider, realized with static CMOS standard cells, for which an energy reduction of 40 percent is obtained with respect to the standard implementation. This reduction is expected to be about 60 percent if low-voltage gates, for dual voltage implementation, are available. The techniques used here should be applicable to a variety of arithmetic modules which have similar characteristics.
Although division is less frequent than addition and multiplication, because of its longer latency it dissipates a substantial part of the energy in floating-point units. In this paper we explore the relation between the radix and the energy dissipated. Previous work has been done on radix-4 and radix-8 division. Here we extend this study to a radix-16 scheme with two overlapped radix-4 stages and compare the latency, area, and energy of the three implementations. Results show that by applying the low-power techniques the energy dissipation is reduced by 30% to 40% with respect to the standard implementation. An additional 20% reduction can be obtained using a dual voltage. Moreover, the energy dissipated to complete the division is roughly the same for the three radices. However, the power dissipation, proportional to the average current, increases with the radix. If reducing the energy is the priority, for the same latency, radix-16 with dual voltage produces the smallest energy dissipation.
This work describes the design of a double-precision radix-8 divider. Low-power techniques are applied in the design of the unit, and energy-delay tradeoffs are considered. The energy dissipation in the divider can be reduced by up to 70% with respect to a standard implementation not optimized for energy, without penalizing the latency. The radix-8 divider is compared with the one obtained by overlapping three radix-2 stages and with a radix-4 divider. Results show that the latency of our divider is similar to that of the divider with overlapped stages, but the area is smaller. The speed-up of the radix-8 over the radix-4 is about 20% and the energy dissipated to complete a division is almost the same, although the area of the radix-8 is 50% larger.
The use of higher radices in division reduces the number of iterations to complete the operation, but increases the complexity of the circuit. In this paper we explore the influence of the radix on the power dissipation of a floating-point divider and the power-delay tradeoffs. We compare the performance and the energy consumption per operation for a radix-4 and a radix-8 divider, realized in CMOS technology. A reduction of about 40% in the energy consumption is obtained for both radices (about 70% if low-voltage gates, for dual voltage implementation, are available). The results also show that the radix-8 divider is about 20% faster than the radix-4 divider, while the energy dissipated to perform a division is about the same.
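The effect of the radix on the iteration count can be made concrete with a small calculation. A radix-2^k digit-recurrence divider retires k quotient bits per iteration, so the count is the quotient width divided by k, rounded up. The sketch below assumes a 53-bit quotient (the IEEE double-precision significand) and ignores the extra guard and rounding bits and the initialization and termination cycles that a real unit needs. The function name is illustrative.

```python
def division_iterations(quotient_bits, radix):
    """Iterations needed by a digit-recurrence divider of the given radix.

    The radix is assumed to be a power of two, so each iteration
    produces log2(radix) quotient bits.
    """
    if radix < 2 or radix & (radix - 1):
        raise ValueError("radix must be a power of two")
    bits_per_iteration = radix.bit_length() - 1
    # Ceiling division without floating point.
    return -(-quotient_bits // bits_per_iteration)


# Iteration counts for a 53-bit quotient (an assumption of this sketch).
ITERATIONS_53 = {r: division_iterations(53, r) for r in (2, 4, 8, 16)}
```

Under these assumptions radix-4 needs 27 iterations and radix-8 needs 18, a ratio of 1.5. The speed-up of about 20% reported above is smaller than that ratio, which is consistent with the more complex radix-8 circuit requiring a longer cycle.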
The general objective of our work is to develop methods to reduce the power consumption of arithmetic modules while maintaining the delay unchanged and keeping the increase in the area to a minimum. Here we illustrate some techniques for a radix-4 divider realized in 0.6 μm CMOS technology. Using techniques such as switching off inactive blocks, retiming the recurrence, equalizing the paths to reduce glitches, using gates with lower drive capability, and changing the redundant representation, we obtained a power consumption reduction of 35% with respect to the standard implementation. The techniques used here should be applicable to a variety of arithmetic modules which have similar characteristics.
A not-for-profit organization, IEEE is the world's largest professional association for the advancement of technology. © Copyright 2013 IEEE - All rights reserved. Use of this web site signifies your agreement to the terms and conditions.