Search Results

You searched for: nannarelli
59 Results returned

    Layout-Driven Post-Placement Techniques for Temperature Reduction and Thermal Gradient Minimization

    Wei Liu ; Calimera, A. ; Macii, A. ; Macii, E. ; Nannarelli, A. ; Poncino, M.
    Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on

    Volume: 32 , Issue: 3
    Digital Object Identifier: 10.1109/TCAD.2012.2228267
    Publication Year: 2013 , Page(s): 406 - 418

    IEEE Journals & Magazines

    With the continuing scaling of CMOS technology, on-chip temperature and thermal-induced variations have become a major design concern. To effectively limit the high temperature in a chip equipped with a cost-effective cooling system, thermal specific approaches, besides low power techniques, are necessary at the chip design level. The high temperature in hotspots and large thermal gradients are caused by the high local power density and the nonuniform power dissipation across the chip. With the objective of reducing power density in hotspots, we propose two placement techniques that spread cells in hotspots over a larger area. Increasing the area occupied by the hotspot directly reduces its power density, leading to a reduction in peak temperature and thermal gradient. To minimize the introduced overhead in delay and dynamic power, we maintain the relative positions of the coupling cells in the new layout. We compare the proposed methods in terms of temperature reduction, timing, and area overhead to the baseline method, which enlarges the circuit area uniformly. The experimental results showed that our methods achieve a larger reduction in both peak temperature and thermal gradient than the baseline method. The baseline method, although reducing peak temperature in most cases, has little impact on thermal gradient.

    Design of power efficient FPGA based hardware accelerators for financial applications

    Hegner, J.S. ; Sindholt, J. ; Nannarelli, A.
    NORCHIP, 2012

    Digital Object Identifier: 10.1109/NORCHP.2012.6403096
    Publication Year: 2012 , Page(s): 1 - 4

    IEEE Conference Publications

    Using Field Programmable Gate Arrays (FPGAs) to accelerate financial derivative calculations is becoming very common. In this work, we implement an FPGA-based specific processor for European option pricing using Monte Carlo simulations, and we compare its performance and power dissipation to the execution on a CPU. The experimental results show that impressive results, in terms of speed-up and energy savings, can be obtained by using FPGA-based accelerators at the expense of a longer development time.
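
    The abstract above names Monte Carlo simulation for European option pricing as the accelerated workload. As a rough software illustration of that kind of computation, here is a minimal Python sketch. It assumes a call option under geometric Brownian motion and checks the estimate against the Black-Scholes closed form; the model, the function names, and the parameters are assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def mc_european_call(s0, k, r, sigma, t, n_paths, seed=0):
    """Monte Carlo estimate of a European call price (assumed GBM model)."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma ** 2) * t
    vol = sigma * math.sqrt(t)
    payoff_sum = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)               # one standard normal draw per path
        s_t = s0 * math.exp(drift + vol * z)  # terminal asset price
        payoff_sum += max(s_t - k, 0.0)       # call payoff at expiry
    # Discount the average payoff back to today
    return math.exp(-r * t) * payoff_sum / n_paths


def black_scholes_call(s0, k, r, sigma, t):
    """Closed-form reference price, used only to sanity-check the estimate."""
    d1 = (math.log(s0 / k) + (r + 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)

    def cdf(x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    return s0 * cdf(d1) - k * math.exp(-r * t) * cdf(d2)


if __name__ == "__main__":
    print(mc_european_call(100.0, 100.0, 0.05, 0.2, 1.0, 100_000, seed=1))
    print(black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0))
```

    Each path is independent of the others, which is the property that makes this kind of loop a natural candidate for parallel hardware.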

    Power efficient design of parallel/serial FIR filters in RNS

    Petricca, M. ; Albicocco, P. ; Cardarilli, G.C. ; Nannarelli, A. ; Re, M.
    Signals, Systems and Computers (ASILOMAR), 2012 Conference Record of the Forty Sixth Asilomar Conference on

    Digital Object Identifier: 10.1109/ACSSC.2012.6489171
    Publication Year: 2012 , Page(s): 1015 - 1019

    IEEE Conference Publications

    It is well known that the Residue Number System (RNS) provides an efficient implementation of parallel FIR filters, especially when the filter order and the dynamic range are high. The two main drawbacks of RNS, the need for converters and the coding overhead, make a serialized implementation of the FIR filter potentially disadvantageous with respect to filters implemented in the conventional number systems. In this work, we show a number of solutions which demonstrate that the power efficiency of RNS FIR filters implemented serially is maintained in ASIC technology, while in modern FPGA technology RNS implementations are less efficient.

    Imprecise arithmetic for low power image processing

    Albicocco, P. ; Cardarilli, G.C. ; Nannarelli, A. ; Petricca, M. ; Re, M.
    Signals, Systems and Computers (ASILOMAR), 2012 Conference Record of the Forty Sixth Asilomar Conference on

    Digital Object Identifier: 10.1109/ACSSC.2012.6489164
    Publication Year: 2012 , Page(s): 983 - 987

    IEEE Conference Publications

    Sometimes reducing the precision of a numerical processor, by introducing errors, can lead to significant performance (delay, area and power dissipation) improvements without compromising the overall quality of the processing. In this work, we show how to perform the two basic operations, addition and multiplication, in an imprecise manner by simplifying the hardware implementation. With the proposed “sloppy” operations, we obtain a reduction in delay, area and power dissipation, and the error introduced is still acceptable for applications such as image processing.

    Comments on 'improving the speed of decimal division'

    Lang, T. ; Nannarelli, A.
    Computers & Digital Techniques, IET

    Volume: 6 , Issue: 6
    Digital Object Identifier: 10.1049/iet-cdt.2012.0090
    Publication Year: 2012 , Page(s): 370 - 371

    IET Journals & Magazines

    For original article see Kaivani, et al., ibid, vol. 5, pp. 393-404 (2011). Lang and Nannarelli comment on the paper of Kaivani, et al., which reported a proposed unit ~46% faster than the unit from their study. Lang and Nannarelli show in this comment that the evaluation done by Kaivani, et al. is based on wrong assumptions and the results of the comparison are erroneous.

    Power Efficient Division and Square Root Unit

    Wei Liu ; Nannarelli, A.
    Computers, IEEE Transactions on

    Volume: 61 , Issue: 8
    Digital Object Identifier: 10.1109/TC.2012.82
    Publication Year: 2012 , Page(s): 1059 - 1070

    IEEE Journals & Magazines

    Although division and square root are not frequent operations, most processors implement them in hardware to not compromise the overall performance. Two classes of algorithms implement division or square root: digit-recurrence and multiplicative (e.g., Newton-Raphson) algorithms. Previous work shows that division and square root units based on the digit-recurrence algorithm offer the best delay-area-power tradeoff. Moreover, the two operations can be combined in a single unit. Here, we present a radix-16 combined division and square root unit obtained by overlapping two radix-4 stages. The proposed unit is compared to similar solutions based on the digit-recurrence algorithm and to a unit based on the multiplicative Newton-Raphson algorithm.

    FPGA implementation of decimal processors for hardware acceleration

    Borup, N. ; Dindorp, J. ; Nannarelli, A.
    NORCHIP, 2011

    Digital Object Identifier: 10.1109/NORCHP.2011.6126729
    Publication Year: 2011 , Page(s): 1 - 4

    IEEE Conference Publications

    Applications in non-conventional number systems can benefit from accelerators implemented on reconfigurable platforms, such as Field Programmable Gate-Arrays (FPGAs). In this paper, we show that applications requiring decimal operations, such as the ones necessary in accounting or financial transactions, can be accelerated by Application Specific Processors (ASPs) implemented on FPGAs. For the case of a telephone billing application, we demonstrate that by accelerating the program execution on an FPGA board connected to the computer by a standard bus, we obtain a significant speed-up over its execution on the CPU of the hosting computer.

    Degrading precision arithmetics for low-power FIR implementation

    Albicocco, P. ; Cardarilli, G.C. ; Nannarelli, A. ; Petricca, M. ; Re, M.
    Circuits and Systems (MWSCAS), 2011 IEEE 54th International Midwest Symposium on

    Digital Object Identifier: 10.1109/MWSCAS.2011.6026265
    Publication Year: 2011 , Page(s): 1 - 4

    IEEE Conference Publications

    In this paper, a review of different techniques used to implement highly optimized DSP systems is presented. The case study is the implementation of parallel FIR filters aimed at applications characterized by high speed and high selectivity in frequency where, at the same time, low power dissipation is mandatory. After a review of the possible “standard” optimization techniques, the paper addresses aggressive methodologies where power and area savings are obtained by introducing the concept of “Degrading Precision Arithmetic” (DPA). Three different approaches are discussed: DPA-I, based on selective bit freezing, DPA-II, based on VDD voltage scaling, and DPA-III, based on power gating. Some theoretical and simulation-based analysis of the introduced arithmetic errors and some implementation results are shown. The suitability of these methodologies for standard cell technologies and FPGAs is also discussed. In our experience, these techniques are well known in the scientific community, but they are not extensively known in the design community, and, consequently, they are scarcely utilized.

    FPGA Based Acceleration of Decimal Operations

    Nannarelli, A.
    Reconfigurable Computing and FPGAs (ReConFig), 2011 International Conference on

    Digital Object Identifier: 10.1109/ReConFig.2011.39
    Publication Year: 2011 , Page(s): 146 - 151

    IEEE Conference Publications

    Field Programmable Gate-Arrays (FPGAs) can efficiently implement application specific processors in non-conventional number systems, such as the decimal (Binary-Coded Decimal, or BCD) number system required for accounting accuracy in financial applications. The main purpose of this work is to show that applications requiring several decimal (BCD) operations can be accelerated by a processor implemented on an FPGA board connected to the computer by a standard bus. For the case of a telephone billing application, we demonstrate that even a basic implementation of the decimal processor on the FPGA, without an advanced input/output interface, can achieve a speed-up of about 10 over its execution on the CPU of the hosting computer.

    Temperature dependent wire delay estimation in floorplanning

    Winther, A.T. ; Wei Liu ; Nannarelli, A. ; Vrudhula, S.
    NORCHIP, 2011

    Digital Object Identifier: 10.1109/NORCHP.2011.6126741
    Publication Year: 2011 , Page(s): 1 - 4
    Cited by 1

    IEEE Conference Publications

    Due to large variations in temperature in VLSI circuits and the linear relationship between metal resistance and temperature, the delay through wires of the same length can be different. Traditional thermal aware floorplanning algorithms use wirelength to estimate delay and routability. In this work, we show that using wirelength as the evaluation metric does not always produce a floorplan with the shortest delay. We propose a temperature dependent wire delay estimation method for thermal aware floorplanning algorithms, which takes into account the thermal effect on wire delay. The experimental results show that a shorter delay can be achieved using the proposed method. In addition, we also discuss the congestion and reliability issues as they are closely related to routing and temperature.

    Radix-16 Combined Division and Square Root Unit

    Nannarelli, A.
    Computer Arithmetic (ARITH), 2011 20th IEEE Symposium on

    Digital Object Identifier: 10.1109/ARITH.2011.30
    Publication Year: 2011 , Page(s): 169 - 176
    Cited by 1

    IEEE Conference Publications

    Division and square root, based on the digit-recurrence algorithm, can be implemented in a combined unit. Several implementations of combined division/square root units have been presented mostly for radices 2 and 4. Here, we present a combined radix-16 unit obtained by overlapping two radix-4 result digit selection functions, as it is normally done for division only units. The latency of the unit is reduced by retiming and low power methods are applied as well. The proposed unit is compared to a radix-4 combined division/square root unit, and to a radix-16 unit, obtained by cascading two radix-4 stages, which is similar to the one implemented in a state-of-the-art processor.

    Design of large polyphase filters in the Quadratic Residue Number System

    Cardarilli, G.C. ; Nannarelli, A. ; Oster, Y. ; Petricca, M. ; Re, M.
    Signals, Systems and Computers (ASILOMAR), 2010 Conference Record of the Forty Fourth Asilomar Conference on

    Digital Object Identifier: 10.1109/ACSSC.2010.5757589
    Publication Year: 2010 , Page(s): 410 - 413

    IEEE Conference Publications

    In this work, we revisit the implementation of polyphase filter banks in Quadratic Residue Number System (QRNS) for banks with a large number of channels by developing a new design methodology suitable for large systems required in the new generation of satellites. Furthermore, we compare the QRNS filter bank with an equivalent bank implemented in the traditional Complex Two's Complement System (CTCS) in terms of throughput, area and power dissipation. The results for large filter banks confirm the savings in power consumption obtained by using the QRNS.

    Temperature aware power optimization for multicore floating-point units

    Wei Liu ; Nannarelli, A.
    Signals, Systems and Computers (ASILOMAR), 2010 Conference Record of the Forty Fourth Asilomar Conference on

    Digital Object Identifier: 10.1109/ACSSC.2010.5757581
    Publication Year: 2010 , Page(s): 1134 - 1138

    IEEE Conference Publications

    Fused Multiply-Add (FMA) units are quite popular in floating-point execution units in state-of-the-art multicore processors. It has been shown that, for division operations, using digit-recurrence units consumes much less power and energy than using FMA units which are based on Newton-Raphson approximation algorithms. In this work, we show that digit-recurrence division units can also reduce on-chip thermal coupling from hot blocks (e.g. FMAs) to cool blocks such as caches. By placing power efficient dividers between FMAs and a cache block, we lower the average temperature in caches by 5°C and consequently reduce leakage by 12%. The total power consumption in caches is reduced by 8.44%.

    Power dissipation challenges in multicore floating-point units

    Wei Liu ; Nannarelli, A.
    Application-specific Systems Architectures and Processors (ASAP), 2010 21st IEEE International Conference on

    Digital Object Identifier: 10.1109/ASAP.2010.5540986
    Publication Year: 2010 , Page(s): 257 - 264
    Cited by 1

    IEEE Conference Publications

    With increased densities on chips and the growing popularity of multicore processors and general-purpose graphics processing units (GPGPUs), power dissipation and energy consumption pose a serious challenge in the design of systems-on-chip (SoCs) and raise the cost of heat removal. In this work, we analyze the impact of power dissipation in floating-point (FP) units and we consider different alternatives in the implementation of FP-division that lead to substantial energy savings. We compare the implementation of division in a Fused Multiply-Add (FMA) unit based on the Newton-Raphson approximation algorithm to the implementation in a dedicated digit-recurrence unit. The results show a significant reduction of energy in a typical scientific application when the division digit-recurrence unit is used. In addition, we model the thermal behavior of the considered FP-units.

    Degrading precision arithmetic for low power signal processing

    Petricca, M. ; Cardarilli, G.C. ; Nannarelli, A. ; Re, M. ; Albicocco, P.
    Signals, Systems and Computers (ASILOMAR), 2010 Conference Record of the Forty Fourth Asilomar Conference on

    Digital Object Identifier: 10.1109/ACSSC.2010.5757713
    Publication Year: 2010 , Page(s): 1163 - 1167
    Cited by 1

    IEEE Conference Publications

    Sometimes reducing the power dissipation of resource constrained electronic systems, such as those built for deep-space probes or for wearable devices, is a top priority. In signal processing, it is possible to have an acceptable quality of the signal even when some errors are introduced. In this work, we analyze two methods to degrade the precision of arithmetic operations in DSP to save power. The first method is based on disabling the lower (least-significant) portion of the datapath by clock-gating and forcing zeros. The second method is based on lowering the supply voltage and re-designing the carry-chains in the datapath to adapt to the increased delays.
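
    The first method in the abstract above, forcing the least-significant bits of the datapath to zero, can be modeled in software to see how much error it introduces. The following Python sketch is a behavioral illustration only: it assumes non-negative integer operands, and the function names and the choice of `k` are hypothetical. It says nothing about the clock-gating hardware itself.

```python
def freeze_low_bits(x, k):
    """Force the k least-significant bits of a non-negative integer to zero."""
    return x & ~((1 << k) - 1)


def degraded_add(a, b, k):
    """Add two operands after zeroing their k low bits (behavioral model only)."""
    return freeze_low_bits(a, k) + freeze_low_bits(b, k)


def max_add_error(k):
    """Worst-case error of degraded_add: each operand loses at most 2**k - 1."""
    return 2 * ((1 << k) - 1)


if __name__ == "__main__":
    a, b, k = 0b10111, 0b01101, 2
    exact = a + b
    approx = degraded_add(a, b, k)
    print(exact, approx, exact - approx, max_add_error(k))
```

    Because truncation only ever removes value, the degraded sum never exceeds the exact sum, and the error is bounded by `max_add_error(k)`.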

    Post-placement temperature reduction techniques

    Wei Liu ; Nannarelli, A. ; Calimera, A. ; Macii, E. ; Poncino, M.
    Design, Automation & Test in Europe Conference & Exhibition (DATE), 2010

    Digital Object Identifier: 10.1109/DATE.2010.5457127
    Publication Year: 2010 , Page(s): 634 - 637

    IEEE Conference Publications

    With technology scaling into the deep-submicron era, temperature and temperature gradient have emerged as important design criteria. We propose two post-placement techniques to reduce peak temperature by intelligently allocating whitespace in the hotspots. Both methods are fully compliant with commercial technologies, and can be easily integrated with state-of-the-art thermal-aware design flow. Experiments on a set of test circuits implemented in STM 65nm technologies show that our methods achieve better peak temperature reduction than directly increasing the circuit's area.

    Division Unit for Binary Integer Decimals

    Lang, T. ; Nannarelli, A.
    Application-specific Systems, Architectures and Processors, 2009. ASAP 2009. 20th IEEE International Conference on

    Digital Object Identifier: 10.1109/ASAP.2009.42
    Publication Year: 2009 , Page(s): 1 - 7
    Cited by 2

    IEEE Conference Publications

    In this work, we present a radix-10 division unit that is based on the digit-recurrence algorithm and implements binary encodings (binary integer decimal or BID) for significands. Recent decimal division designs are all based on the binary coded decimal (BCD) encoding. We adapt the radix-10 digit-recurrence algorithm to BID representation and implement the division unit in standard cell technology. The implementation of the proposed BID division unit is compared to that of a BCD based unit implementing the same algorithm. The comparison shows that for normalized operands the BID unit has the same latency as the BCD unit and reduced area, but the normalization is more expensive when implemented in BID.

    Hardware implementation of MPEG analysis and deblocking for video enhancement

    Petricca, M. ; Huiying Li ; Forchhammer, S. ; Nannarelli, A. ; Re, M. ; Andersen, J.D. ; Cardarilli, G.C.
    Signals, Systems and Computers, 2009 Conference Record of the Forty-Third Asilomar Conference on

    Digital Object Identifier: 10.1109/ACSSC.2009.5469955
    Publication Year: 2009 , Page(s): 754 - 758

    IEEE Conference Publications

    In this work, we develop an architecture to implement a deblocking filter to improve the quality of video decoded from MPEG. The filter is controlled by the quantization scale parameter, which is derived from the decoded stream based on a novel algorithm. The hardware implementation is targeting an FPGA device similar to those currently used in post-processing units of high end flat panel TV sets. The designed filter shows good performance in terms of robust signal-to-noise ratio and its implementation meets the area and frequency constraints.

    Multiple Constant Multiplication through Residue Number System

    Shuli, I. ; Petricca, M. ; Cardarilli, G.C. ; Nannarelli, A. ; Re, M.
    Signals, Systems and Computers, 2009 Conference Record of the Forty-Third Asilomar Conference on

    Digital Object Identifier: 10.1109/ACSSC.2009.5469949
    Publication Year: 2009 , Page(s): 736 - 739
    Cited by 1

    IEEE Conference Publications

    Several algorithms have been developed over the years to reduce the number of additions needed for Multiple Constant Multiplication (MCM) and optimize the area. In this work, we present an approach to MCM which is based on the properties of the Residue Number System (RNS). Experimental results on a set of digital filters, which represent a typical application of MCM, show that the proposed RNS method has a lower power dissipation in most cases, and a reduced area for high throughput filters.

    A combined decimal and binary floating-point divider

    Gonzalez-Navarro, S. ; Nannarelli, A. ; Schulte, M. ; Tsen, S.
    Signals, Systems and Computers, 2009 Conference Record of the Forty-Third Asilomar Conference on

    Digital Object Identifier: 10.1109/ACSSC.2009.5470014
    Publication Year: 2009 , Page(s): 930 - 934

    IEEE Conference Publications

    In this paper, we present the hardware design of a combined decimal and binary floating-point divider, based on specifications in the IEEE 754-2008 Standard for Floating-point Arithmetic. In contrast to most recent decimal divider designs, which are based on the Binary Coded Decimal (BCD) encoding, our divider operates on either 64-bit binary encoded decimal floating-point (DFP) numbers or 64-bit binary floating-point (BFP) numbers. The division approach implemented in our design is based on a digit-recurrence algorithm. We describe the hardware resources shared between the two floating-point datatypes and demonstrate that hardware sharing is advantageous. Compared to a standalone DFP divider, the combined divider has the same worst case delay and 17% more area.

    Net Balanced Floorplanning Based on Elastic Energy Model

    Wei Liu ; Nannarelli, A.
    NORCHIP, 2008.

    Digital Object Identifier: 10.1109/NORCHP.2008.4738323
    Publication Year: 2008 , Page(s): 258 - 263

    IEEE Conference Publications

    Floorplanning is becoming more and more important in VLSI design flows, especially for system-on-chip (SoC) designs where IP blocks dominate standard cells. Moreover, in deep sub-micron technologies, where process variations can introduce extra signal skew, it is desirable to have floorplans with balanced net delays to increase the safety margins of the design. In this paper, we investigate the properties of floorplanning based on the elastic energy model. The B*-tree, which is based on an ordered binary tree, is used for circuit representation and the elastic energy is used as the cost function. To evaluate how well a net is balanced, we introduced a new metric 'unbalancing'. A more balanced net would have a smaller 'unbalancing' value. Experimental results show that our approach can not only meet fixed-outline constraints, but also achieve significant improvements in net balance for all the circuits in the MCNC benchmark.

    Session TP8b1: Computer arithmetic II

    Nannarelli, Alberto
    Signals, Systems and Computers, 2008 42nd Asilomar Conference on

    Digital Object Identifier: 10.1109/ACSSC.2008.5074733
    Publication Year: 2008 , Page(s): 1782 - 1784

    IEEE Conference Publications

    A variant of a radix-10 combinational multiplier

    Dadda, L. ; Nannarelli, A.
    Circuits and Systems, 2008. ISCAS 2008. IEEE International Symposium on

    Digital Object Identifier: 10.1109/ISCAS.2008.4542181
    Publication Year: 2008 , Page(s): 3370 - 3373
    Cited by 5

    IEEE Conference Publications

    We consider the problem of adding the partial products in the combinational decimal multiplier presented by Lang and Nannarelli. In the original paper this addition is done with a tree of decimal carry-save adders. In this paper, we treat the problem using the multi-operand decimal addition previously published by Dadda, where the sum of each column of the partial product array is obtained first in binary form and then converted to decimal. The multiplication, using a 90 nm CMOS technology, in this modified scheme takes 2.51 ns, while in the original scheme it takes 2.65 ns. The area of the two schemes is roughly the same.
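
    The idea described in the abstract above, summing each decimal column in binary and only then converting to decimal, can be illustrated with a small behavioral model. The Python sketch below is an assumption-laden illustration and not the hardware structure from the paper: operands are hypothetical lists of decimal digits stored least-significant first, and the conversion step is modeled with plain integer division and remainder.

```python
def decimal_multioperand_add(operands, n_digits):
    """Add several decimal operands column by column.

    operands: lists of decimal digits, least-significant digit first,
              each of length n_digits (hypothetical representation).
    Returns the digits of the sum, least-significant digit first.
    """
    # Step 1: sum each digit column as an ordinary binary integer.
    column_sums = [sum(op[i] for op in operands) for i in range(n_digits)]

    # Step 2: convert each binary column sum to decimal, passing the
    # tens (and higher) part on to the next column as a carry.
    result = []
    carry = 0
    for s in column_sums:
        total = s + carry
        result.append(total % 10)
        carry = total // 10
    while carry:
        result.append(carry % 10)
        carry //= 10
    return result


if __name__ == "__main__":
    # 987 + 654 + 321, digits stored least-significant first
    print(decimal_multioperand_add([[7, 8, 9], [4, 5, 6], [1, 2, 3]], 3))
```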

    ADAPTO: full-adder based reconfigurable architecture for bit level operations

    Cardarilli, G.C. ; Di Nunzio, L. ; Re, M. ; Nannarelli, A.
    Circuits and Systems, 2008. ISCAS 2008. IEEE International Symposium on

    Digital Object Identifier: 10.1109/ISCAS.2008.4542197
    Publication Year: 2008 , Page(s): 3434 - 3437
    Cited by 4

    IEEE Conference Publications

    Low cost microprocessors and DSPs are optimized to perform general arithmetic and logic operations on native wordlength. On the other hand, the efficiency decreases when they process shorter data (more clock cycles per operation are required). Recently, different solutions have been proposed to overcome this problem. Among those, the one based on a main processor with a reconfigurable unit (RU) used as coprocessor (to speed up fine grained operations) is the most common. Typically those coprocessors, similar to FPGAs, are composed of look-up tables (LUTs) and pass-transistor interconnects. In this way, due to the great number of reconfiguration bits, it is impossible to obtain together a run-time reconfiguration and an efficient implementation, avoiding idle hardware resources. This paper proposes a new dynamic reconfigurable architecture that can be embedded in microprocessors or low cost DSPs to accelerate the execution of the above mentioned operations. The goal of ADAPTO (adder-based dynamic architecture for processing tailored operators) is to reduce the hardware complexity and the reconfiguration time, with respect to typical LUT based reconfigurable array. ADAPTO supports both hardware reconfiguration and instruction execution in the same processor clock cycle. This goal has been obtained by using a new reconfigurable unit based on full adders, instead of LUTs, and simplifying the network interconnect.

    Power dissipation in division

    Wei Liu ; Nannarelli, A.
    Signals, Systems and Computers, 2008 42nd Asilomar Conference on

    Digital Object Identifier: 10.1109/ACSSC.2008.5074735
    Publication Year: 2008 , Page(s): 1790 - 1794
    Cited by 1

    IEEE Conference Publications

    A few classes of algorithms to implement division in hardware have been used over the years: division by digit-recurrence, by reciprocal approximation by iterative methods and by polynomial approximation. Due to the differences in the algorithms, a comparison among their implementation in terms of performance and precision is sometimes hard to make. In this work, we use power dissipation and energy consumption as metrics to compare among those different classes of algorithms. There are no previous works in the literature presenting such a comparison.

    Reducing power dissipation in pipelined accumulators

    Cardarilli, G.C. ; Nannarelli, A. ; Re, M.
    Signals, Systems and Computers, 2008 42nd Asilomar Conference on

    Digital Object Identifier: 10.1109/ACSSC.2008.5074803
    Publication Year: 2008 , Page(s): 2098 - 2102
    Cited by 1

    IEEE Conference Publications

    Fast accumulation is required for units such as direct digital frequency synthesis (DDFS) processors which, together with a digital to analog converter, generate periodic waveforms. In these units, waveforms with high frequency resolution are obtained if the clocking frequency of the digital processor is high (GHz range in today's technologies). Accumulators necessary for DDFS are then deeply pipelined down to the bit-level with two main consequences: high power dissipation, due to the large number of latches/flip-flops, and large latency dependent on the granularity of the applied pipelining. In this work, we address the two issues of reducing the power dissipation in the accumulator by applying selective clock gating, and reducing the accumulation latency by pipelining the adder to adapt the delay of the carry-chain to the necessary clock period.

    Combined Radix-10 and Radix-16 Division Unit

    Lang, T. ; Nannarelli, A.
    Signals, Systems and Computers, 2007. ACSSC 2007. Conference Record of the Forty-First Asilomar Conference on

    Digital Object Identifier: 10.1109/ACSSC.2007.4487363
    Publication Year: 2007 , Page(s): 967 - 971
    Cited by 2

    IEEE Conference Publications

    In this work we extend a previously proposed digit- recurrence radix-10 division unit to be able to perform also radix-16 division. The extension is simplified by the fact that in the radix-10 implementation the quotient digit is decomposed into two parts and that this decomposition is also appropriate for the radix-16 case. Moreover, to reduce the latency in the radix- 10 the most-significant portion of the datapath, including the selection function, has been implemented in radix-2, so that the modifications of that part to include radix-16 consists mainly in combining the two modules to obtain the selection constants. The rest of the modifications relate to the generation of multiples, to the carry-save adder, to the carry-propagate adder, and to the on-the-fly conversion and rounding. The implementation results show that the delay of an iteration is similar to that of the radix-10 case and that the area is about thirty percent larger. View full abstract»

    Low-power adaptive filter based on RNS components

    Bernocchi, G.L. ; Cardarilli, G.C. ; Del Re, A. ; Nannarelli, A. ; Re, M.
    Circuits and Systems, 2007. ISCAS 2007. IEEE International Symposium on

    Digital Object Identifier: 10.1109/ISCAS.2007.378155
    Publication Year: 2007 , Page(s): 3211 - 3214
    Cited by 10

    IEEE Conference Publications

    In this paper a low-power implementation of an adaptive FIR filter is presented. The filter is designed to meet the constraints of channel equalization for fixed wireless communications that typically requires a large number of taps, but a serial updating of the filter coefficients, based on the least mean squares (LMS) algorithm, is allowed. Previous work showed that the use of the residue number system (RNS) for the variable FIR filter grants advantages both in area and power consumption. On the other hand, the use of a binary serial implementation of the adaptation algorithm eliminates the need for complex scaling circuits in RNS. The advantages in terms of area and speed of the presented filter, with respect to its two's complement counterpart, are evaluated for implementations in standard cells.

    Impact of RNS Coding Overhead on FIR Filters Performance

    Cardarilli, G.C. ; Del Re, A. ; Nannarelli, A. ; Re, M.
    Signals, Systems and Computers, 2007. ACSSC 2007. Conference Record of the Forty-First Asilomar Conference on

    Digital Object Identifier: 10.1109/ACSSC.2007.4487464
    Publication Year: 2007 , Page(s): 1426 - 1429
    Cited by 2

    IEEE Conference Publications

    In this paper a design space exploration for FIR filter implementations in residue number system (RNS) is presented. The exploration regards different aspects of the RNS FIR filter design such as the dynamic range, the overhead due to the coding of the RNS base with respect to the application dynamic range, and delay-area tradeoffs. The design space exploration and its results are helpful in evaluating the effects of the RNS coding overhead and in choosing an efficient filter architecture trading off filter order, dynamic range, clock frequency and area.

    Residue Number System for Low-Power DSP Applications

    Cardarilli, G.C. ; Nannarelli, A. ; Re, M.
    Signals, Systems and Computers, 2007. ACSSC 2007. Conference Record of the Forty-First Asilomar Conference on

    Digital Object Identifier: 10.1109/ACSSC.2007.4487461
    Publication Year: 2007 , Page(s): 1412 - 1416
    Cited by 12

    IEEE Conference Publications

    In previous works (Cardarilli et al., 2000) we performed different experiments implementing FIR filtering structures. Each filter was implemented using both the two's complement system (TCS) and the residue number system (RNS) number representations. The comparison of these two implementations allows us to conclude that, for these applications, the RNS uses less power than the TCS counterpart. The aim of the present paper is to highlight the reasons for this reduction in power consumption.

    A Radix-10 Digit-Recurrence Division Unit: Algorithm and Architecture

    Lang, T. ; Nannarelli, A.
    Computers, IEEE Transactions on

    Volume: 56 , Issue: 6
    Digital Object Identifier: 10.1109/TC.2007.1038
    Publication Year: 2007 , Page(s): 727 - 739
    Cited by 16

    IEEE Journals & Magazines

    In this work, we present a radix-10 division unit that is based on the digit-recurrence algorithm. The previous decimal division designs do not include recent developments in the theory and practice of this type of algorithm, which were developed for radix-2^k dividers. In addition to the adaptation of these features, the radix-10 quotient digit is decomposed into a radix-2 digit and a radix-5 digit in such a way that only five and two times the divisor are required in the recurrence. Moreover, the most significant slice of the recurrence, which includes the selection function, is implemented in radix-2, avoiding the additional delay introduced by the radix-10 carry-save additions and allowing the balancing of the paths to reduce the cycle delay. The results of the implementation of the proposed radix-10 division unit show that its latency is close to that of radix-16 division units (comparable dynamic range of significand) and it has a shorter latency than a radix-10 unit based on the Newton-Raphson approximation.

    A Radix-10 Combinational Multiplier

    Lang, T. ; Nannarelli, A.
    Signals, Systems and Computers, 2006. ACSSC '06. Fortieth Asilomar Conference on

    Digital Object Identifier: 10.1109/ACSSC.2006.354758
    Publication Year: 2006 , Page(s): 313 - 317
    Cited by 23

    IEEE Conference Publications

    In this work, we present a combinational decimal multiply unit which can be pipelined to reach the desired throughput. With respect to previous implementations of decimal multiplication, the proposed unit is combinational (parallel) and not sequential, has a simpler recoding of the operands which reduces the number of partial product precomputations, and uses counters to eliminate the need for the decimal equivalent of a 4:2 adder. The results of the implementation show that the combinational decimal multiplier offers a good compromise between latency and area when compared to other decimal multiply units and to binary double-precision multipliers.

    A Hybrid RNS Adaptive Filter for Channel Equalization

    Bernocchi, G.L. ; Cardarilli, G.C. ; Del Re, A. ; Nannarelli, A. ; Re, M.
    Signals, Systems and Computers, 2006. ACSSC '06. Fortieth Asilomar Conference on

    Digital Object Identifier: 10.1109/ACSSC.2006.355052
    Publication Year: 2006 , Page(s): 1706 - 1710
    Cited by 1

    IEEE Conference Publications

    In this work a hybrid residue number system (RNS) implementation of an adaptive FIR filter is presented. The adaptation algorithm used is the least mean squares (LMS). The filter has been designed to meet the constraints of a specific class of applications. In fact, it is suitable for applications requiring a large number of taps where a serial updating of the filter coefficients is feasible (channel equalization or echo cancellation). In the literature, it has been shown that the RNS implementation of FIR filters yields savings in area and power consumption due to the introduced arithmetic simplifications. Conversely, the RNS implementation of the adaptation algorithm needs scaling circuits that are complex and expensive in RNS arithmetic. For this reason, a serial binary implementation of the adaptation algorithm is chosen. The advantages in terms of area and speed of the RNS adaptive filter with respect to the two's complement one have been evaluated for a standard cells implementation.

    A 1.5 GFLOPS Reciprocal Unit for Computer Graphics

    Nannarelli, A. ; Rasmussen, M.S. ; Stuart, M.B.
    Signals, Systems and Computers, 2006. ACSSC '06. Fortieth Asilomar Conference on

    Digital Object Identifier: 10.1109/ACSSC.2006.355047
    Publication Year: 2006 , Page(s): 1682 - 1686

    IEEE Conference Publications

    The reciprocal operation 1/d is a frequent operation performed in graphics processors (GPUs). In this work, we present the design of a radix-16 reciprocal unit based on the algorithm combining the traditional digit-by-digit algorithm and the approximation of the reciprocal by one Newton-Raphson iteration. We design a fully pipelined single-precision unit to be used in GPUs. The results of the implementation show that the proposed unit can sustain a higher throughput than that of a unit implementing the normal Newton-Raphson approximation, and its area is smaller.

    Low latency digit-recurrence reciprocal and square-root reciprocal algorithm and architecture

    Antelo, E. ; Lang, T. ; Montuschi, P. ; Nannarelli, A.
    Computer Arithmetic, 2005. ARITH-17 2005. 17th IEEE Symposium on

    Digital Object Identifier: 10.1109/ARITH.2005.29
    Publication Year: 2005 , Page(s): 147 - 154
    Cited by 1

    IEEE Conference Publications

    The reciprocal and square-root reciprocal operations are important in several applications. For these operations, we present algorithms that combine a digit-by-digit module and one iteration of a quadratic-convergence approximation. The latter is implemented by a digit-recurrence, which uses the digits produced by the digit-by-digit part. In this way, both parts execute in an overlapped manner, so that the total number of cycles is about half of the number that would be required by the digit-by-digit part alone. Because of the approximation, correct rounding of the result cannot be obtained directly in all cases; we propose a variable-time implementation that produces the correctly rounded result with a small average overhead. Radix-4 implementations are described and have been synthesized. They achieve the same cycle time as the standard digit-by-digit implementation, resulting in a speed-up of about 2 and, because of the approximation part, the area factor is also about 2. We also show a combined implementation for both operations that has essentially the same complexity as that for square-root reciprocal alone.

    Low Power and Low Leakage Implementation of RNS FIR Filters

    Cardarilli, G.C. ; Del Re, A. ; Nannarelli, A. ; Re, M.
    Signals, Systems and Computers, 2005. Conference Record of the Thirty-Ninth Asilomar Conference on

    Digital Object Identifier: 10.1109/ACSSC.2005.1600042
    Publication Year: 2005 , Page(s): 1620 - 1624
    Cited by 8

    IEEE Conference Publications

    Programmable power-of-two RNS scaler and its application to a QRNS polyphase filter

    Cardarilli, G.C. ; Del Re, A. ; Nannarelli, A. ; Re, M.
    Circuits and Systems, 2005. ISCAS 2005. IEEE International Symposium on

    Digital Object Identifier: 10.1109/ISCAS.2005.1464785
    Publication Year: 2005 , Page(s): 1102 - 1105 Vol. 2
    Cited by 3

    IEEE Conference Publications

    The scaling operation, i.e. the division by a constant factor followed by rounding, is a commonly used technique for reducing the dynamic range in digital signal processing (DSP) systems. Usually, the constant is a power of two, and the implementation of the scaling is reduced to a right shift. This basic operation is not easily implementable in the residue number system (RNS) due to its non-positional nature. A number of different algorithms have been presented in the literature for RNS scaling. In this paper, several RNS dynamic range reduction techniques are analyzed and the selected one is applied to a polyphase filter bank. A comparison between the filter bank scaled by means of RNS-to-binary and binary-to-RNS conversions and the RNS-scaled implementation is presented. A reduction of area and power consumption of about 30% for the scaling block is obtained.

    Digit-recurrence dividers with reduced logical depth

    Antelo, E. ; Lang, T. ; Montuschi, P. ; Nannarelli, A.
    Computers, IEEE Transactions on

    Volume: 54 , Issue: 7
    Digital Object Identifier: 10.1109/TC.2005.115
    Publication Year: 2005 , Page(s): 837 - 851
    Cited by 12

    IEEE Journals & Magazines

    In this paper, we propose a class of division algorithms with the aim of reducing the delay of the selection of the quotient digit by introducing more concurrency and flexibility in its computation. From the proposed class of algorithms, we select one that moves part of the selection function out of the critical path, with a corresponding reduction in the critical path compared with existing alternatives: we present the algorithm and describe the architectures for radix 4 and for radix 16. For radix 16, we use the scheme of overlapping two radix-4 stages. In both cases, radix 4 and radix 16, we show that our algorithms allow the design of units with well-balanced critical paths with consequent decreases of the cycle times. Moreover, in the radix-16 case, we include some additional speculation techniques. To estimate the speedup, we used a rough timing model based on logical effort. For both radices, we estimate a speedup of about 25 percent with respect to previous implementations. In the radix-4 case, this is achieved by using roughly the same area, while, in the radix-16 case, the area is increased by about 30 percent. We verified our estimations by performing a synthesis of the radix-4 units.

    Low-power implementation of polyphase filters in Quadratic Residue Number system

    Cardarilli, G.C. ; Del Re, A. ; Nannarelli, A. ; Re, M.
    Circuits and Systems, 2004. ISCAS '04. Proceedings of the 2004 International Symposium on

    Volume: 2
    Digital Object Identifier: 10.1109/ISCAS.2004.1329374
    Publication Year: 2004 , Page(s): II - 725-8 Vol.2
    Cited by 2

    IEEE Conference Publications

    The aim of this work is the reduction of the power dissipated in digital filters, while maintaining the timing unchanged. A polyphase filter bank in the Quadratic Residue Number System (QRNS) has been implemented and then compared, in terms of performance, area, and power dissipation to the implementation of a polyphase filter bank in the traditional two's complement system (TCS). The resulting implementations, designed to have the same clock rates, show that the QRNS filter is smaller and consumes less power than the TCS one.

    A tool for automatic generation of RTL-level VHDL description of RNS FIR filters

    Del Re, A. ; Nannarelli, A. ; Re, M.
    Design, Automation and Test in Europe Conference and Exhibition, 2004. Proceedings

    Volume: 1
    Digital Object Identifier: 10.1109/DATE.2004.1268931
    Publication Year: 2004 , Page(s): 686 - 687 Vol.1
    Cited by 3

    IEEE Conference Publications

    Although digital filters based on the residue number system (RNS) show high performance and low power dissipation, RNS filters are not widely used in DSP systems, because of the complexity of the algorithms involved. We present a tool to design RNS FIR filters which hides the RNS algorithms from the designer and generates a synthesizable VHDL description of the filter, taking into account several design constraints such as delay, area, and energy.

    RNS implementation of high performance filters for satellite demultiplexing

    Cardarilli, G.C. ; Re, A.D. ; Lojacono, R. ; Nannarelli, A. ; Re, M.
    Aerospace Conference, 2003. Proceedings. 2003 IEEE

    Volume: 3
    Digital Object Identifier: 10.1109/AERO.2003.1235253
    Publication Year: 2003 , Page(s): 3_1365 - 3_1379
    Cited by 5

    IEEE Conference Publications

    Power-delay tradeoffs in residue number system

    Nannarelli, A. ; Cardarilli, G.C. ; Re, M.
    Circuits and Systems, 2003. ISCAS '03. Proceedings of the 2003 International Symposium on

    Volume: 5
    Digital Object Identifier: 10.1109/ISCAS.2003.1206300
    Publication Year: 2003 , Page(s): V-413 - V-416 vol.5

    IEEE Conference Publications

    In this paper we present some tradeoffs between delay and power consumption in the design of digital processors based on the Residue Number System (RNS). We focus on reducing the switching capacitance, and therefore the power, in modular adders and isomorph multipliers. Results on architectures such as FIR filters show that the techniques used to reduce the switching capacitance lead not only to more power-efficient circuits, but also to better performance.

    Fast radix-4 retimed division with selection by comparisons

    Antelo, E. ; Lang, T. ; Montuschi, P. ; Nannarelli, A.
    Application-Specific Systems, Architectures and Processors, 2002. Proceedings. The IEEE International Conference on

    Digital Object Identifier: 10.1109/ASAP.2002.1030718
    Publication Year: 2002 , Page(s): 185 - 196
    Cited by 3

    IEEE Conference Publications

    Since a large portion of the critical path in an implementation of radix-4 division corresponds to the delay of the quotient-digit selection module, it is of interest to reduce this delay. The proposal of this paper extends the approach presented recently of prestoring the selection constants corresponding to the actual value of the divisor and to perform the determination of the quotient digit by carry-free subtraction and sign detection. This extension consists in advancing the subtraction so that it is outside of the critical path. This advancement also provides the possibility of placing the registers so as to minimize the cycle time. We present the method and report results of synthesis using a family of standard cells. We conclude that the extension results in a speedup of 1.35 with respect to the basic implementation and of 1.3 with respect to the previously mentioned approach. We estimate that the areas of all three units are about the same.

    Residue number system reconfigurable datapath

    Cardarilli, G.C. ; Del Re, A. ; Nannarelli, A. ; Re, M.
    Circuits and Systems, 2002. ISCAS 2002. IEEE International Symposium on

    Volume: 2
    Digital Object Identifier: 10.1109/ISCAS.2002.1011463
    Publication Year: 2002 , Page(s): II-756 - II-759 vol.2

    IEEE Conference Publications

    In this paper we describe a possible approach to implement a reconfigurable datapath for digital signal processing. The datapath should be programmable in terms of dynamic range, type and sequence of operations. We chose to implement it in the Residue Number System (RNS), because the RNS offers high speed and low power dissipation. Results show that the RNS reconfigurable datapath offers better performance and lower power dissipation when compared, on the same set of applications, with a traditional FIR filter of the same characteristics.

    Code compression architecture for cache energy minimisation in embedded systems

    Benini, L. ; Macii, A. ; Nannarelli, A.
    Computers and Digital Techniques, IEE Proceedings -

    Volume: 149 , Issue: 4
    Digital Object Identifier: 10.1049/ip-cdt:20020467
    Publication Year: 2002 , Page(s): 157 - 163
    Cited by 3

    IET Journals & Magazines

    Energy consumption of the processor-to-memory path normally accounts for a large fraction of the total energy budget of modern embedded systems. A novel approach for reducing energy consumption in core processors used in systems with cache-based architectures is presented. In this scheme, instructions are fetched and stored in the I-cache in compressed form. The beneficial effect is an increase of the cache hit ratio; therefore, the number of accesses to the main memory is reduced, and so is the energy required to fetch the instructions. Static code size reduction is achieved as a by-product. A hardware decompression unit performs fast low-energy on-the-fly instruction decompression at each cache look-up. The decompressor is placed outside the core boundaries; therefore, processor architecture does not need any modification, making the proposed compression approach suitable to IP-based designs. The viability and effectiveness of this solution are assessed through extensive benchmarking performed on a number of typical embedded programs. Besides code size, energy and performance optimisation results, the authors also report data regarding the synthesis and implementation of the decompression unit. The energy penalty it introduces is taken into account in the evaluation of the achieved energy savings.

    Power characterization of digital filters implemented on FPGA

    Cardarilli, G.C. ; Del Re, A. ; Nannarelli, A. ; Re, M.
    Circuits and Systems, 2002. ISCAS 2002. IEEE International Symposium on

    Volume: 5
    Digital Object Identifier: 10.1109/ISCAS.2002.1010825
    Publication Year: 2002 , Page(s): V-801 - V-804 vol.5
    Cited by 4

    IEEE Conference Publications

    The evaluation of power consumption in complex digital systems is a hard task that normally requires long simulation time and complicated models. In this work, we obtain power consumption estimates from the measurement of the average current absorption of digital filters mapped on a field programmable gate array (FPGA). We also compare the measurements made with the results previously obtained for a standard cells implementation of the same filters. Moreover, we explore the possibility of carrying out measurements of other electrical parameters on hardware to extract information on a system, instead of simulating its behavior with complicated models.

    Cached-code compression for energy minimization in embedded processors

    Benini, L. ; Macii, A. ; Nannarelli, A.
    Low Power Electronics and Design, International Symposium on, 2001.

    Digital Object Identifier: 10.1109/LPE.2001.945426
    Publication Year: 2001 , Page(s): 322 - 327
    Cited by 8

    IEEE Conference Publications

    This paper contributes a novel approach for reducing static code size and instruction fetch energy for cache-based core processors running embedded applications. Our implementation of the decompression unit guarantees fast and low-energy, on-the-fly instruction decompression at each cache lookup. The decompressor is placed outside the core boundaries; therefore, processor architecture does not need any modification, making the proposed compression approach suitable to IP-based designs. Viability of our solution is assessed through extensive benchmarking performed on a number of typical embedded programs.

    FPGA realization of RNS to binary signed conversion architecture

    Re, M. ; Nannarelli, A. ; Cardarilli, G.C. ; Lojacono, R.
    Circuits and Systems, 2001. ISCAS 2001. The 2001 IEEE International Symposium on

    Volume: 4
    Digital Object Identifier: 10.1109/ISCAS.2001.922245
    Publication Year: 2001 , Page(s): 350 - 353 vol. 4
    Cited by 2

    IEEE Conference Publications

    The use of the Residue Number System (RNS) in modern telecommunication and multimedia applications is becoming more and more important because it offers interesting advantages in terms of precision, power consumption and speed. Generally, the output conversion from residue to binary is the crucial point in effective realizations of application-specific architectures based on residue arithmetic. This paper presents a general conversion procedure based on an N-moduli set. The algorithm can process both unsigned and signed numbers. Based on this algorithm, an architecture which efficiently implements the output conversion is illustrated. The architecture has been mapped on an FPGA.

    Implementation of digital filters in carry-save residue number system

    Del Re, A. ; Nannarelli, A. ; Re, M.
    Signals, Systems and Computers, 2001. Conference Record of the Thirty-Fifth Asilomar Conference on

    Volume: 2
    Digital Object Identifier: 10.1109/ACSSC.2001.987702
    Publication Year: 2001 , Page(s): 1309 - 1313 vol.2
    Cited by 6

    IEEE Conference Publications

    In this work, we present the implementation of a finite impulse response (FIR) filter in the residue number system (RNS), in which we use a carry-save scheme in the binary representation of the residues to speed up modular additions. We compare the carry-save RNS implementation with the implementations of the same filter in the traditional binary system and in plain RNS. Results show that the carry-save RNS filter is much faster and that its energy dissipation per cycle is comparable. Furthermore, we show that a multiple supply voltage approach for the plain RNS filter can lead to an additional reduction in power dissipation without performance degradation.

    Tradeoffs between residue number system and traditional FIR filters

    Nannarelli, A. ; Re, M. ; Cardarilli, G.C.
    Circuits and Systems, 2001. ISCAS 2001. The 2001 IEEE International Symposium on

    Volume: 2
    Digital Object Identifier: 10.1109/ISCAS.2001.921068
    Publication Year: 2001 , Page(s): 305 - 308 vol. 2
    Cited by 14

    IEEE Conference Publications

    In this work, a study on the implementation of FIR filters in the Residue Number System (RNS) is carried out. For different configurations, RNS filters are compared with filters realized in the traditional two's complement system (TCS) in terms of delay, area and power dissipation. The resulting implementations show that the RNS filters are smaller and consume less power than the corresponding ones in TCS when the number of taps is larger than sixteen.

    Fast prototyping techniques applied to the hardware simulation of telecommunication systems

    Del Re, A. ; Nannarelli, A. ; Re, M.
    Signals, Systems and Computers, 2001. Conference Record of the Thirty-Fifth Asilomar Conference on

    Volume: 2
    Digital Object Identifier: 10.1109/ACSSC.2001.987703
    Publication Year: 2001 , Page(s): 1314 - 1317 vol.2

    IEEE Conference Publications

    In this paper, an application of fast prototyping techniques to the hardware simulation of telecommunication systems is shown. In particular, a hardware simulator, based on an FPGA board, of a nonstationary satellite link for mobile communication is developed. For this kind of system, the complexity of the software simulation in terms of computation time is often unacceptable, especially when the estimation of global quality factors, such as the bit error rate (BER), must be carried out. The hardware simulation of the system guarantees the reduction of the simulation time by several orders of magnitude.

    Reducing power dissipation in FIR filters using the residue number system

    Cardarilli, G.C. ; Nannarelli, A. ; Re, M.
    Circuits and Systems, 2000. Proceedings of the 43rd IEEE Midwest Symposium on

    Volume: 1
    Digital Object Identifier: 10.1109/MWSCAS.2000.951651
    Publication Year: 2000 , Page(s): 320 - 323 vol.1
    Cited by 11

    IEEE Conference Publications

    The aim of this work is to reduce the power dissipated in high order finite impulse response (FIR) filters, while maintaining the delay unchanged. We compare, in terms of performance, area, and power dissipation, the implementation of a traditional FIR filter with a residue number system (RNS) based one. The resulting implementations, designed to work at the same clock rate, show that the RNS filter is smaller and consumes less power than the traditional one for a number of taps larger than eight.

    Reducing power dissipation in complex digital filters by using the quadratic residue number system

    D'Amora, A. ; Nannarelli, A. ; Re, M. ; Cardarilli, G.C.
    Signals, Systems and Computers, 2000. Conference Record of the Thirty-Fourth Asilomar Conference on

    Volume: 2
    Digital Object Identifier: 10.1109/ACSSC.2000.910639
    Publication Year: 2000 , Page(s): 879 - 883 vol.2
    Cited by 2

    IEEE Conference Publications

    This paper compares in terms of performance, area and power dissipation, a complex FIR filter realized in the traditional two's complement system with a Quadratic Residue Number System (QRNS) based one. The resulting implementations, designed to work at the same clock rate, show that the QRNS filter is almost half the size of the traditional one, and dissipates about one third of the energy.

    Low-power radix-4 combined division and square root

    Nannarelli, A. ; Lang, T.
    Computer Design, 1999. (ICCD '99) International Conference on

    Digital Object Identifier: 10.1109/ICCD.1999.808431
    Publication Year: 1999 , Page(s): 236 - 242
    Cited by 3

    IEEE Conference Publications

    Because of the similarities in the algorithm, it is quite common to implement division and square root in the same unit. The purpose of this work is to implement a low-power combined radix-4 division and square root floating-point double precision unit and to compare its performance and energy consumption with a radix-4 division only unit. Previous work has been done on reducing the energy dissipated in a divider. Here we apply the same techniques to the combined division and square root unit and consider modifications and tradeoffs. Results show that the energy dissipation for the combined division/square root unit can be reduced by about 35% without affecting the latency, and an additional 20% reduction can be obtained using a dual voltage. Moreover, the unit is 5% slower than a divider and its energy dissipation is 15% higher.

    Low-power divider

    Nannarelli, A. ; Lang, T.
    Computers, IEEE Transactions on

    Volume: 48 , Issue: 1
    Digital Object Identifier: 10.1109/12.743407
    Publication Year: 1999 , Page(s): 2 - 14
    Cited by 14

    IEEE Journals & Magazines

    The general objective of our work is to develop methods to reduce the energy consumption of arithmetic modules while maintaining the delay unchanged and keeping the increase in the area to a minimum. Here, we illustrate some techniques for dividers realized in CMOS technology. The energy dissipation reduction is carried out at different levels of abstraction: from the algorithm level down to the implementation, or gate, level. We describe the use of techniques such as switching off inactive blocks, retiming, dual voltage, and equalizing the paths to reduce glitches. Also, we describe modifications in the on-the-fly conversion and rounding algorithm and in the redundant representation of the residual in order to reduce the energy dissipation. The techniques and modifications mentioned above are applied to a radix-4 divider, realized with static CMOS standard cells, for which a reduction of 40 percent is obtained with respect to the standard implementation. This reduction is expected to be about 60 percent if low-voltage gates, for dual voltage implementation, are available. The techniques used here should be applicable to a variety of arithmetic modules which have similar characteristics.

  • Full text access may be available. Click article title to sign in or learn about subscription options.

    Low-power division: comparison among implementations of radix 4, 8 and 16

    Nannarelli, A. ; Lang, T.
    Computer Arithmetic, 1999. Proceedings. 14th IEEE Symposium on

    Digital Object Identifier: 10.1109/ARITH.1999.762829
    Publication Year: 1999 , Page(s): 60 - 67
    Cited by 2

    IEEE Conference Publications

    Although division is less frequent than addition and multiplication, because of its longer latency it dissipates a substantial part of the energy in floating-point units. In this paper we explore the relation between the radix and the energy dissipated. Previous work has been done on radix-4 and radix-8 division. Here we extend this study to a radix-16 scheme with two overlapped radix-4 stages and compare the latency, area, and energy of the three implementations. Results show that by applying the low-power techniques the energy dissipation is reduced by 30% to 40% with respect to the standard implementation. An additional 20% reduction can be obtained using a dual voltage. Moreover, the energy dissipated to complete the division is roughly the same for the three radices. However, the power dissipation, proportional to the average current, increases with the radix. If reducing the energy is the priority, for the same latency radix-16 with dual voltage produces the smallest energy dissipation.

  • Full text access may be available. Click article title to sign in or learn about subscription options.

    Low-power radix-8 divider

    Nannarelli, A. ; Lang, T.
    Computer Design: VLSI in Computers and Processors, 1998. ICCD '98. Proceedings. International Conference on

    Digital Object Identifier: 10.1109/ICCD.1998.727084
    Publication Year: 1998 , Page(s): 420 - 426
    Cited by 3

    IEEE Conference Publications

    This work describes the design of a double-precision radix-8 divider. Low-power techniques are applied in the design of the unit, and energy-delay tradeoffs are considered. The energy dissipation in the divider can be reduced by up to 70% with respect to a standard implementation not optimized for energy, without penalizing the latency. The radix-8 divider is compared with the one obtained by overlapping three radix-2 stages and with a radix-4 divider. Results show that the latency of our divider is similar to that of the divider with overlapped stages, but the area is smaller. The speed-up of the radix-8 over the radix-4 is about 20% and the energy dissipated to complete a division is almost the same, although the area of the radix-8 is 50% larger.

  • Full text access may be available. Click article title to sign in or learn about subscription options.

    Power-delay tradeoffs for radix-4 and radix-8 dividers

    Nannarelli, A. ; Lang, T.
    Low Power Electronics and Design, 1998. Proceedings. 1998 International Symposium on

    Publication Year: 1998 , Page(s): 109 - 111

    IEEE Conference Publications

    The use of higher radices in division reduces the number of iterations to complete the operation, but increases the complexity of the circuit. In this paper we explore the influence of the radix on the power dissipation of a floating-point divider and the power-delay tradeoffs. We compare the performance and the energy consumption per operation for a radix-4 and a radix-8 divider, realized in CMOS technology. A reduction of about 40% in the energy consumption is obtained for both radices (about 70% if low-voltage gates, for dual voltage implementation, are available). The results also show that the radix-8 divider is about 20% faster than the radix-4 divider, while the energy dissipated to perform a division is about the same.

  • Full text access may be available. Click article title to sign in or learn about subscription options.

    Low-power radix-4 divider

    Nannarelli, A. ; Lang, T.
    Low Power Electronics and Design, 1996., International Symposium on

    Digital Object Identifier: 10.1109/LPE.1996.547508
    Publication Year: 1996 , Page(s): 205 - 208
    Cited by 4

    IEEE Conference Publications

    The general objective of our work is to develop methods to reduce the power consumption of arithmetic modules, while maintaining the delay unchanged and keeping the increase in the area to a minimum. Here we illustrate some techniques for a radix-4 divider realized in 0.6 μm CMOS technology. Using techniques such as switching off inactive blocks, retiming the recurrence, equalizing the paths to reduce glitches, using gates with lower drive capability, and changing the redundant representation, we obtained a power consumption reduction of 35% with respect to the standard implementation. The techniques used here should be applicable to a variety of arithmetic modules which have similar characteristics.

