
Computer Arithmetic (ARITH), 2011 20th IEEE Symposium on

Date: 25-27 July 2011


  • [Front cover]

    Publication Year: 2011 , Page(s): C1
  • [Title page i]

    Publication Year: 2011 , Page(s): i
  • [Title page iii]

    Publication Year: 2011 , Page(s): iii
  • [Copyright notice]

    Publication Year: 2011 , Page(s): iv
  • Table of contents

    Publication Year: 2011 , Page(s): v - viii
  • Foreword

    Publication Year: 2011 , Page(s): ix
  • Dedication

    Publication Year: 2011 , Page(s): x - xiv
  • Steering Committee

    Publication Year: 2011 , Page(s): xv
  • Symposium Committee

    Publication Year: 2011 , Page(s): xvi
  • Program Committee

    Publication Year: 2011 , Page(s): xvii
  • Additional Reviewers

    Publication Year: 2011 , Page(s): xviii
  • Corporate Sponsors

    Publication Year: 2011 , Page(s): xix
  • High Intelligence Computing: The New Era of High Performance Computing

    Publication Year: 2011 , Page(s): 3

    This paper discusses high performance computing, including the introduction of the fused multiply-add dataflow and innovations in vector computing and multiprocessing. These developments have led to a new era in high performance computing that has created human intelligence in computers.

  • Short Division of Long Integers

    Publication Year: 2011 , Page(s): 7 - 14

    We consider the problem of short division - i.e., approximate quotient - of multiple-precision integers. We present ready-to-implement algorithms that yield an approximation of the quotient, with tight and rigorous error bounds. We exhibit speedups of up to 30% with respect to GMP division with remainder, and up to 10% with respect to GMP short division, with room for further improvements. This work enables one to implement fast correctly rounded division routines in multiple-precision software tools.

  • High Degree Toom'n'Half for Balanced and Unbalanced Multiplication

    Publication Year: 2011 , Page(s): 15 - 22
    Cited by:  Papers (1)

    We present some hints and tricks for automatically obtaining high-degree Toom-Cook implementations, i.e., functions for integer or polynomial multiplication with reduced complexity. The described method generates quite an efficient sequence of operations, and the memory footprint is kept low by using a new strategy: mixing the evaluation, interpolation and recomposition phases. It is possible to automatise the whole procedure, obtaining a general Toom-n function, and to extend the method to polynomials in any characteristic except two.
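
The evaluation, pointwise multiplication and interpolation phases that Toom-Cook methods rely on can be illustrated with a small sketch. It is not the operation sequence generated by the paper's method: it uses generic Lagrange interpolation over exact fractions, the helper names (`evaluate`, `interpolate`, `toom_style_multiply`) are hypothetical, and the evaluation points 0, 1, -1, 2, -2, ... are an assumption chosen for simplicity (practical Toom-Cook variants also use the point at infinity and hand-tuned interpolation sequences).

```python
from fractions import Fraction

def evaluate(poly, x):
    """Evaluate a polynomial (coefficients lowest degree first) at x by Horner's rule."""
    acc = 0
    for c in reversed(poly):
        acc = acc * x + c
    return acc

def interpolate(points, values):
    """Return the integer coefficients of the unique polynomial through the samples."""
    n = len(points)
    coeffs = [Fraction(0)] * n
    for i in range(n):
        basis = [Fraction(1)]   # numerator of the i-th Lagrange basis polynomial
        denom = Fraction(1)
        for j in range(n):
            if j == i:
                continue
            shifted = [Fraction(0)] + basis      # basis * x
            for k, c in enumerate(basis):
                shifted[k] -= c * points[j]      # minus basis * x_j
            basis = shifted
            denom *= points[i] - points[j]
        for k, c in enumerate(basis):
            coeffs[k] += values[i] * c / denom
    # For integer inputs the product has integer coefficients.
    assert all(c.denominator == 1 for c in coeffs)
    return [int(c) for c in coeffs]

def toom_style_multiply(a, b):
    """Multiply two polynomials by evaluation, pointwise products and interpolation."""
    n = len(a) + len(b) - 1                      # number of product coefficients
    points = [(k + 1) // 2 * (1 if k % 2 else -1) for k in range(n)]  # 0, 1, -1, 2, ...
    values = [evaluate(a, x) * evaluate(b, x) for x in points]
    return interpolate(points, values)
```

For integer multiplication, the operands would first be split into fixed-size pieces that serve as the polynomial coefficients, and the product coefficients would be recombined with shifts and additions (the recomposition phase).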

  • Augmented Precision Square Roots and 2-D Norms, and Discussion on Correctly Rounding sqrt(x^2+y^2)

    Publication Year: 2011 , Page(s): 23 - 30

    Define an "augmented precision" algorithm as an algorithm that returns, in precision-p floating-point arithmetic, its result as the unevaluated sum of two floating-point numbers, with a relative error of the order of 2^(-2p). Assuming an FMA instruction is available, we perform a tight error analysis of an augmented precision algorithm for the square root, and introduce two slightly different augmented precision algorithms for the 2D-norm sqrt(x^2+y^2). Then we give tight lower bounds on the minimum distance (in ulps) between sqrt(x^2+y^2) and a midpoint when sqrt(x^2+y^2) is not itself a midpoint. This allows us to determine cases when our algorithms make it possible to return correctly-rounded 2D-norms.
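
As an illustration of what an "unevaluated sum of two floating-point numbers" looks like for the square root, here is a minimal sketch in binary64 (p = 53). It follows the commonly used scheme of rounding the square root, computing the residual x - s*s exactly, and dividing by 2s; it is not claimed to be the exact algorithm analysed in the paper. The function name `augmented_sqrt` is hypothetical, and exact rational arithmetic is used as a stand-in for the FMA instruction, which is an assumption made for portability.

```python
import math
from fractions import Fraction

def augmented_sqrt(x):
    """Return (s_hi, s_lo) with s_hi + s_lo approximating sqrt(x), for finite x > 0."""
    s_hi = math.sqrt(x)                       # correctly rounded square root
    # Emulates fma(-s_hi, s_hi, x): the residual is formed exactly and rounded once.
    residual = float(Fraction(x) - Fraction(s_hi) * Fraction(s_hi))
    s_lo = residual / (2.0 * s_hi)            # first-order correction term
    return s_hi, s_lo
```

The pair can be checked against exact arithmetic: for x = 2, squaring s_hi + s_lo as rationals recovers 2 far more accurately than squaring s_hi alone.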

  • Towards a Quaternion Complex Logarithmic Number System

    Publication Year: 2011 , Page(s): 33 - 42
    Cited by:  Papers (1)

    The well-known generalization of real to complex arithmetic (two reals) extends further to the more obscure quaternion arithmetic (four reals), which has applications in signal processing, aerospace, graphics and virtual reality. Quaternion multiplication implements 3D rotation, but is expensive (usually 16 floating-point multiplications and 12 additions). This paper proposes an alternative quaternion representation using logarithms to reduce multiplication cost. The real Logarithmic Number System (LNS) allows fast and inexpensive multiplication and division in embedded and FPGA-based systems. Recent advances in the Complex LNS (CLNS) have made fast log-polar complex representation affordable. Although the quaternion logarithm function is also well-defined, it is not useful for simplifying multiplication (in the way real and complex logarithms are) because quaternion multiplication is not commutative but quaternion addition is. To overcome this, we propose a novel Quaternion Complex LNS (QCLNS) representation using a pair of CLNS numbers. This representation implements quaternion multiplication using only the theoretical minimum of 8 LNS multipliers (i.e., fixed-point adders) and two CLNS adders. Because CLNS numbers are more compact than the ordinary rectangular complex representation, single-precision QCLNS occupies 10.9 percent less memory than the conventional quaternion representation. Extrapolating conventional LNS and floating-point synthesis data from Fu et al., QCLNS saves on average 10 percent of FPGA resources for precisions between 13 and 45 bits.
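
The cost figure quoted in the abstract (16 multiplications and 12 additions) corresponds to the conventional Hamilton product on four real components. The sketch below shows that baseline only, not the QCLNS representation; the function name `quaternion_multiply` and the (w, x, y, z) tuple layout are assumptions made for illustration.

```python
def quaternion_multiply(p, q):
    """Hamilton product of quaternions given as (w, x, y, z) tuples.

    Each of the four output components uses 4 multiplications and
    3 additions/subtractions: 16 multiplications and 12 additions in total.
    """
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )
```

Multiplying the basis elements i and j in both orders gives k and -k respectively, which is the non-commutativity the abstract refers to.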

  • ROM-less LNS

    Publication Year: 2011 , Page(s): 43 - 51
    Cited by:  Papers (3)

    The logarithmic number system has been proposed as an alternative to floating-point arithmetic. Multiplication, division and square-root operations are accomplished with fixed-point methods, but addition and subtraction are considerably more challenging. Recent work has demonstrated that these operations too can be done with similar speed and accuracy to their FP equivalents, but the necessary circuitry is complex. In particular, it is dominated by the need for large ROM tables for the storage of non-linear functions. This paper describes two algorithms, a new co-transformation procedure and an improvement to an existing interpolation method, that reduce these tables to an extent that allows their easy synthesis in logic. An implementation shows substantial reductions in area and delay from the previous best 32-bit realisation, with equivalent accuracy.
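
A small sketch may help show why multiplication is cheap in a logarithmic number system while addition needs a non-linear function. It models positive values only and stores the base-2 logarithm as a Python float rather than a fixed-point word; both are simplifying assumptions, and the function names are hypothetical. The term log2(1 + 2^d) in `lns_add` is the kind of non-linear function that hardware designs typically tabulate or interpolate; the paper's co-transformation and interpolation methods are not reproduced here.

```python
import math

def lns_encode(x):
    """Represent a positive real by its base-2 logarithm."""
    if x <= 0:
        raise ValueError("this sketch handles positive values only")
    return math.log2(x)

def lns_decode(e):
    """Convert a logarithmic representation back to a real value."""
    return 2.0 ** e

def lns_mul(ea, eb):
    """Multiplication becomes addition of the logarithms."""
    return ea + eb

def lns_div(ea, eb):
    """Division becomes subtraction of the logarithms."""
    return ea - eb

def lns_sqrt(ea):
    """Square root becomes halving (a one-bit shift in fixed point)."""
    return ea / 2.0

def lns_add(ea, eb):
    """Addition needs the non-linear function log2(1 + 2**d) with d <= 0."""
    hi, lo = max(ea, eb), min(ea, eb)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))
```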

  • Composite Iterative Algorithm and Architecture for q-th Root Calculation

    Publication Year: 2011 , Page(s): 52 - 61
    Cited by:  Papers (2)

    An algorithm for q-th root extraction, q being any integer, is presented in this paper. The algorithm is based on an optimized implementation of X^(1/q) = 2^((1/q) log2(X)) by a sequence of parallel and/or overlapped operations: (1) reciprocal, (2) digit-recurrence logarithm, (3) left-to-right carry-free multiplication and (4) on-line exponential. A detailed error analysis is given and two architectures are proposed, for low precision q and for higher precision q. The execution time and hardware requirements are estimated for single precision floating-point computations for several radices; this helps to determine which radices result in the most efficient implementations. The proposed architectures improve on the features of other architectures for q-th root extraction.
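
The identity behind the algorithm can be followed step by step in software. The sketch below mirrors the four operations named in the abstract, with ordinary library calls standing in for the digit-recurrence and on-line hardware units; that substitution and the function name `qth_root` are assumptions made for illustration.

```python
import math

def qth_root(x, q):
    """Compute x**(1/q) for x > 0 and a nonzero integer q via 2**((1/q) * log2(x))."""
    if x <= 0 or q == 0:
        raise ValueError("requires x > 0 and q != 0")
    reciprocal = 1.0 / q          # (1) reciprocal of q
    logarithm = math.log2(x)      # (2) base-2 logarithm of x
    product = reciprocal * logarithm   # (3) multiplication
    return 2.0 ** product         # (4) base-2 exponential
```

Negative q is accepted as well, giving the reciprocal of the corresponding root.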

  • On the Fixed-Point Accuracy Analysis and Optimization of FFT Units with CORDIC Multipliers

    Publication Year: 2011 , Page(s): 62 - 69
    Cited by:  Papers (5)

    Fixed-point Fast Fourier Transform (FFT) units are widely used in digital communication systems. The twiddle multipliers required for realizing large FFTs are typically implemented with the Coordinate Rotation Digital Computer (CORDIC) algorithm to restrict memory requirements. Recent approaches aiming to optimize the bit-widths of FFT units while satisfying a given maximum bound on Mean-Square-Error (MSE) mostly focus on architectures with integer multipliers. They ignore the quantization error of coefficients, which prevents them from analyzing the exact error, defined as the difference between the fixed-point circuit and the reference floating-point model. This paper presents an efficient analysis of MSE as well as an optimization algorithm for CORDIC-based FFT units, which is applicable to other Linear-Time-Invariant (LTI) circuits as well.
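
To make the role of CORDIC as a twiddle multiplier concrete, here is a minimal rotation-mode sketch. It rotates a point (x, y) by a given angle using only shift-like scalings by powers of two and additions, then removes the accumulated gain. Floating-point values are used instead of fixed-point words, so it shows the iteration but not the quantization effects the paper analyses; the function name and the default of 24 iterations are assumptions.

```python
import math

def cordic_rotate(x, y, angle, iterations=24):
    """Rotate (x, y) by angle (radians, roughly |angle| <= 1.74) with CORDIC."""
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))   # growth of each micro-rotation
    z = angle                                       # residual angle still to rotate
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0                 # direction that reduces |z|
        scale = 2.0 ** -i                           # a right shift by i in hardware
        x, y = x - d * y * scale, y + d * x * scale
        z -= d * math.atan(scale)                   # table of arctangents in hardware
    return x / gain, y / gain                       # compensate the constant gain
```

Multiplying a complex sample by the twiddle factor exp(j*theta) corresponds to calling `cordic_rotate(re, im, theta)`.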

  • Self Checking in Current Floating-Point Units

    Publication Year: 2011 , Page(s): 73 - 76
    Cited by:  Papers (7)

    High performance microprocessors are protected against transient and early end-of-life failures using a variety of error detection and fault isolation technologies. Execution units can be protected with duplication, parity prediction, or residue checking. Residue checking has an advantage due to its small size. A modulus is selected based on the radix of the numbers being checked. In a decimal floating-point unit there are two types of numbers in different bases: base 10 decimal numbers and base 2 integers. A residue checking system that makes it easy to check both base 2 and base 10 numbers is discussed. State-of-the-art designs currently in use are described, as well as a novel hybrid moduli 9 and 3 residue system. The checking systems for the decimal and binary floating-point units of some recent IBM microprocessors, including the Power6, Power7, z10, and z196 microprocessors, are detailed.
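
The principle of residue checking can be sketched in a few lines: the residue of a result must equal the same operation applied to the residues of the operands. The sketch below checks a multiplication with modulus 9 computed from decimal digits and modulus 3 computed from pairs of bits, which illustrates why those moduli suit base 10 and base 2 respectively. It is a software illustration under the assumption of non-negative integer operands, not a description of the IBM designs; all function names are hypothetical.

```python
def residue_mod9_decimal(n):
    """Residue mod 9 from decimal digits (10 is congruent to 1 mod 9)."""
    return sum(int(digit) for digit in str(n)) % 9

def residue_mod3_binary(n):
    """Residue mod 3 from 2-bit groups (4 is congruent to 1 mod 3)."""
    total = 0
    while n:
        total += n & 3      # add the lowest pair of bits
        n >>= 2
    return total % 3

def check_product(a, b, claimed, residue):
    """Return True if the claimed product is consistent with the operand residues."""
    return residue(claimed) == residue(residue(a) * residue(b))
```

A single flipped bit changes the result by a power of two, which is never a multiple of 3 or 9, so either check detects it.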

  • How to Square Floats Accurately and Efficiently on the ST231 Integer Processor

    Publication Year: 2011 , Page(s): 77 - 81

    We consider the problem of computing IEEE floating-point squares by means of integer arithmetic. We show how to exploit the specific properties of squaring in order to design and implement algorithms that have much lower latency than those for general multiplication, while still guaranteeing correct rounding. Our algorithms are parameterized by the floating-point format, aim at high instruction-level parallelism (ILP) exposure, and cover all rounding modes. We show further that their C implementation for the binary32 format yields efficient codes for targets like the ST231 VLIW integer processor from ST Microelectronics, with a latency at least 1.75x smaller than that of general multiplication in the same context.
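
To give a feel for what "floating-point squares by means of integer arithmetic" involves, the following sketch squares a binary32 value using only integer operations on its bit pattern. It is a straightforward textbook-style routine, not the low-latency algorithm of the paper, and it makes several simplifying assumptions: round-to-nearest-even only, a normal input, and a result that stays in the normal range (no zeros, subnormals, infinities, NaNs, overflow or underflow). The helper names are hypothetical.

```python
import struct

def f32_bits(x):
    """Bit pattern of x after rounding it to binary32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(bits):
    """binary32 value with the given bit pattern, as a Python float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def square_f32_bits(bits):
    """Square a normal binary32 number given by its bit pattern (round to nearest even)."""
    biased_exp = (bits >> 23) & 0xFF
    significand = (bits & 0x7FFFFF) | 0x800000   # 24 bits, implicit leading one restored
    product = significand * significand          # exact, between 2**46 and 2**48
    exponent = 2 * (biased_exp - 127)            # unbiased exponent before normalization
    if product >> 47:                            # squared significand lies in [2, 4)
        shift = 24
        exponent += 1
    else:                                        # squared significand lies in [1, 2)
        shift = 23
    kept = product >> shift                      # 24 most significant bits
    remainder = product & ((1 << shift) - 1)     # discarded low bits
    half = 1 << (shift - 1)
    if remainder > half or (remainder == half and (kept & 1)):
        kept += 1                                # round up (ties go to even)
        if kept >> 24:                           # rounding carried out of 24 bits
            kept >>= 1
            exponent += 1
    # The sign bit is left at zero because a square is never negative.
    return ((exponent + 127) << 23) | (kept & 0x7FFFFF)
```

Because the exact square of a binary32 value fits in a binary64 float, the result can be cross-checked by squaring in double precision and rounding once to binary32.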

  • A 1.5 GHz VLIW DSP CPU with Integrated Floating Point and Fixed Point Instructions in 40 nm CMOS

    Publication Year: 2011 , Page(s): 82 - 86
    Cited by:  Papers (2)

    A next generation VLIW DSP Central Processing Unit (CPU) which has an integrated fixed point and floating point Instruction Set Architecture (ISA) is presented. It is designed to meet a 1.5 GHz core clock frequency in a 40 nm process with aggressive area and power goals. In this paper, the benchmarking process and the benefits of newly defined instructions such as complex matrix multiply are explained. Also, the CPU data path is described in detail, highlighting several novel micro-architecture features. Finally, our design methodology, as well as the verification methodology that ensures functional correctness using formal equivalence verification, is described.

  • The POWER7 Binary Floating-Point Unit

    Publication Year: 2011 , Page(s): 87 - 91
    Cited by:  Papers (1)

    The binary Floating-Point Unit (FPU) of the POWER7 processor is a 5.5 cycle Fused Multiply-Add (FMA) design, fully compliant with the IEEE 754-2008 standard. Unlike previous PowerPC designs, the POWER7 FPU merges the scalar and vector FPUs into a single unit executing three floating-point instruction sets: the single and double precision scalar set, the single precision VMX vector set, and the new single and double precision VSX vector and scalar set. Due to a compact buffer-free floor plan and several optimizations in the data and control flow, the streamlined POWER7 FPU achieves a factor of 2 area reduction over the POWER6 design, beyond the normal technology shrink. This results in a very power- and area-efficient FPU design, supporting a chip frequency of 4.14 GHz. A single 64-bit FPU instance measures only 0.26 mm^2 in 45 nm CMOS SOI.

  • Accelerating Computations on FPGA Carry Chains by Operand Compaction

    Publication Year: 2011 , Page(s): 95 - 102
    Cited by:  Papers (1)

    This work describes the carry-compact addition (CCA), a novel addition scheme that allows the acceleration of carry-chain computations on contemporary FPGA devices. While the CCA is based on concepts known from carry-lookahead addition and from parallel prefix adders, its adaptation of them takes into account the context of an FPGA as the implementation environment. FPGAs typically provide carry-chain structures to accelerate the simple ripple-carry addition (RCA). Rather than contrasting this scheme with the hierarchical addition approaches favored in hard-core VLSI designs, the CCA combines the benefits of both and uses hierarchical structures to shorten the critical path, which is still left on a core carry chain. In contrast to previous studies examining the asymptotically superior parallel prefix adders on FPGAs, the CCA is shown to outperform the standard RCA already for operand widths starting at 50 bits. Wider adders, such as those used in extended-precision floating-point units and in cryptographic applications, benefit from even greater speedups. The concrete mapping of the CCA achieved for current Xilinx and Altera architectures is described and shown to be very favorable, yielding a high speedup for a very modest investment of additional LUT resources.
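
The general idea of shortening a carry chain with generate/propagate signals, which the abstract attributes to carry-lookahead and parallel prefix adders, can be modelled in software. The sketch below summarises each block of bits by a generate flag and a propagate flag, so the carry only has to ripple once per block rather than once per bit. It illustrates that textbook concept only and is not the CCA mapping described in the paper; the function names, the 4-bit block size, and the requirement that the total width be a multiple of the block size are assumptions.

```python
def block_generate_propagate(a, b, width):
    """Summarise a width-bit block: does it generate a carry, does it propagate one?"""
    mask = (1 << width) - 1
    generate = (a + b) >> width          # carry-out even with carry-in 0
    propagate = int((a ^ b) == mask)     # carry-in would pass straight through
    return generate, propagate

def blockwise_add(a, b, total_width, block=4):
    """Add two total_width-bit numbers, rippling the carry across blocks only."""
    mask = (1 << block) - 1
    carry = 0
    result = 0
    for offset in range(0, total_width, block):
        blk_a = (a >> offset) & mask
        blk_b = (b >> offset) & mask
        generate, propagate = block_generate_propagate(blk_a, blk_b, block)
        result |= ((blk_a + blk_b + carry) & mask) << offset   # block sum bits
        carry = generate | (propagate & carry)                  # short carry chain
    return result, carry
```

In hardware the per-block flags are computed in parallel, so only the final one-step-per-block recurrence lies on the critical path.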
