Application Specific Array Processors, 1995. Proceedings. International Conference on

Date 24-26 July 1995

  • Proceedings The International Conference on Application Specific Array Processors

  • Index of Authors


    Presents an index of the authors whose papers are published in the conference.

  • A scalable halftoning coprocessor architecture

    Page(s): 76 - 84

    Exact-angle superscreen dithering requires large dither tiles. Since storing precomputed screen elements for each intensity level would require too much memory, dithering must be executed on the fly at halftoning time. For this purpose a dithering coprocessor is presented which generates halftoned images at high speed. The proposed hardware architecture is based on a pipelined and scalable design which speeds up halftoning by a factor of twenty compared with modern RISC software-based solutions. We describe the architecture of the coprocessor and show to what extent it can be scaled to improve performance. The proposed coprocessor could find applications in digital color copiers which need to print scanned color images at high speed.
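
As a rough illustration of dithering "on the fly", the sketch below compares each pixel with a threshold computed from a small dither tile at halftoning time, instead of storing one precomputed screen per intensity level. It is only an assumption-laden sketch: the 4×4 Bayer tile, the function name `dither`, and the threshold scaling are illustrative choices, not the exact-angle superscreen or the coprocessor pipeline described in the paper.

```python
# Illustrative 4x4 Bayer tile (an assumption; the paper uses large
# exact-angle superscreen tiles, which are not reproduced here).
BAYER_4X4 = [
    [0, 8, 2, 10],
    [12, 4, 14, 6],
    [3, 11, 1, 9],
    [15, 7, 13, 5],
]


def dither(image, tile=BAYER_4X4, levels=256):
    """Halftone a grayscale image (list of rows) to 0/1 using a tiled threshold array."""
    n = len(tile)
    cells = n * n
    out = []
    for y, row in enumerate(image):
        out_row = []
        for x, value in enumerate(row):
            # The threshold is derived per pixel from the tile entry,
            # so only one tile is stored, not one screen per intensity.
            threshold = (tile[y % n][x % n] + 0.5) * levels / cells
            out_row.append(1 if value > threshold else 0)
        out.append(out_row)
    return out
```

With this tile, a uniform mid-gray 4×4 patch (value 128) turns on half of its sixteen cells, which is the expected behavior for ordered dithering.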

  • Parallel sequence comparison and alignment

    Page(s): 137 - 140

    Sequence comparison, a vital research tool in computational biology, is based on a simple O(n²) algorithm that easily maps to a linear array of processors. This paper reviews and compares high-performance sequence analysis on general-purpose supercomputers and single-purpose, reconfigurable, and programmable co-processors. The difficulty of comparing hardware from published performance figures is also noted.
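
To make the O(n²) algorithm concrete, here is a minimal sketch of dynamic-programming edit distance, evaluated one anti-diagonal at a time. Cells on the same anti-diagonal do not depend on each other, which is why the computation maps naturally onto a linear array of processors. The unit costs and the function name `edit_distance` are assumptions for illustration; the paper itself surveys several scoring schemes and machines.

```python
def edit_distance(a, b):
    """Unit-cost edit distance, filled in anti-diagonal (wavefront) order."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j  # insert all of b[:j]
    # Anti-diagonal k holds the cells with i + j == k. Each cell needs only
    # diagonals k-1 and k-2, so all cells of one diagonal could run in parallel.
    for k in range(2, m + n + 1):
        for i in range(max(1, k - n), min(m, k - 1) + 1):
            j = k - i
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d[m][n]
```

For example, `edit_distance("kitten", "sitting")` evaluates to 3.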

  • Precise tiling for uniform loop nests

    Page(s): 330 - 337

    The subject of this article is a hyperplane partitioning problem applied to perfect loop nests. This work is aimed at increasing the computation granularity to reduce the overhead due to communication. This study is different from previous work as it takes redundant communication into account. We propose an algorithm giving the optimal solution and various examples to show the validity of this report.

  • Implementation of parallel arithmetic in a cellular automaton

    Page(s): 238 - 245

    We describe an approach to parallel computation based on particle propagation and collisions in a one-dimensional cellular automaton, using a particle model called a Particle Machine (PM). Such a machine has the parallelism, structural regularity, and local connectivity of systolic arrays, but is general and programmable. It contains no explicit multipliers, adders, or other fixed arithmetic operations; these are implemented using fine-grain interactions of logical particles which are injected into the medium of the cellular automaton, and which represent both data and processors. We give parallel, linear-time implementations of addition, subtraction, multiplication and division.

  • Column compression pipelined multipliers

    Page(s): 93 - 103

    The paper presents a study on the introduction of pipelining in parallel VLSI multipliers built according to the column compression (CC) design techniques. A number of CC multiplier schemes have been proposed in the literature, aimed at reducing the number of adder stages necessary to compute a multiplication. More recently, CC multiplier schemes aimed at optimising the required silicon area and the regularity and locality of the interconnections among the adders have been proposed. The paper addresses the introduction of pipelining into these latter structures and compares the results obtained with existing structures, in terms of required number of components and operating frequency.

  • Multilayer cellular algorithm for complex number multiplication

    Page(s): 290 - 297

    A new multilayer cellular algorithm for complex number multiplication is presented. An upper estimate of the time complexity is obtained. The design is based on an original model of distributed computation called the Parallel Substitution Algorithm.

  • The MGAP's programming environment and the *C++ language

    Page(s): 121 - 124

    The MGAP is a special-purpose workstation co-processor board in which the computing elements are fine-grain processors implemented as custom ASICs. In this paper we present the language *C++, used for programming the MGAP. Using the class concept of C++, we create special parallel data types such as bit, digit, word and array, and overload operators to manipulate the parallel data required by the MGAP. The hierarchical relationships among the data types are used by the compiler to generate parallel code for the MGAP. We demonstrate that by using the same high-level language and the same program we can operate on data at all levels of granularity, from bits to arrays, without any loss in performance.

  • The systolic design of a block regularised parameter estimator using hierarchical signal flow graphs

    Page(s): 141 - 144

    Hierarchical Signal Flow Graphs (HSFGs) are used to illustrate the computations and the data flow required for the block regularised parameter estimation algorithm. This algorithm protects the parameter estimation from numerical difficulties associated with insufficiently exciting data or where the behaviour of the underlying model is unknown. HSFGs aid the user's understanding of the algorithm, as they not only clearly show how the algorithm differs from exponentially weighted recursive least squares but also allow the user to develop fast, efficient parallel algorithms easily and effectively, as demonstrated.

  • Design and implementation of a parallel image processor chip for a SIMD array processor

    Page(s): 66 - 75

    This paper presents the design and implementation of a sliding memory plane (SliM) image processor chip used to build a mesh-connected SIMD architecture called a SliM array processor. The SliM image processor chip consists of 5×5 processing elements (PEs) connected in a mesh topology. A set of SliM image processor chips can form the SliM array processor. Due to the idea of sliding, that is, overlapping inter-PE communication with computation, the SliM image processor can greatly reduce the inter-PE communication overhead, a significant disadvantage of existing SIMD array processors. In addition, using the by-passing path provides eight-way connectivity even with four physical links. This paper addresses the architecture of the SliM image processor chip, the design of an instruction set, and implementation issues. The chip has 55255 gates and twenty-five 128×9-bit SRAM modules. It was simulated at 18 MHz under worst-case conditions and will actually run at a higher clock rate. The package type is the 144-pin MQFP. A performance evaluation of the chip shows a significant improvement.

  • The VLSI design and implementation of the array processors of a multilayer vision system architecture

    Page(s): 125 - 128

    This paper describes the VLSI design and simulation of the lower layer processors of the KYDON vision system. KYDON is a completely autonomous, hierarchical, multilayered image understanding system. The VLSI design of the individual components as well as the timing simulation results of the processor array are presented. The system runs at 50 MHz and promises a high processing rate of 300 image frames/sec.

  • Digit on-line large radix CORDIC rotator

    Page(s): 246 - 257

    Many applications require the evaluation of rotations at high speed. However, there is a trade-off between chip area and latency. In this paper we develop a digit on-line pipelined array architecture based on the radix-4 CORDIC algorithm in rotation mode. The radix-4 CORDIC algorithm halves the number of microrotations with respect to the traditional radix-2 algorithm, with the drawback of a non-constant scale factor. Seeking a good compromise between silicon area and latency, we have used digit on-line processing. In this way the data enter the processor in blocks of bits (digits), most significant digit (MSD) first. We have used redundant carry-save arithmetic to allow carry-free additions and on-line processing. The designed processor is shown to perform better than previous digit on-line architectures.
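
For orientation, the following is a minimal floating-point sketch of the traditional radix-2 CORDIC algorithm in rotation mode, the baseline the abstract compares against. It is not the radix-4, digit on-line, carry-save design of the paper. The function name `cordic_rotate` and the iteration count are assumptions for illustration. In radix-2 every microrotation is taken with direction +1 or -1, so the scale factor is a constant that can be divided out at the end.

```python
import math


def cordic_rotate(x, y, angle, iterations=32):
    """Rotate (x, y) by `angle` radians with radix-2 CORDIC microrotations.

    Converges for |angle| up to roughly 1.74 rad (the sum of all atan(2**-i)).
    """
    # Constant gain of the radix-2 iteration: product of sqrt(1 + 2**(-2i)).
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    z = angle  # residual angle still to be rotated
    for i in range(iterations):
        sigma = 1.0 if z >= 0 else -1.0  # direction of this microrotation
        shift = 2.0 ** -i                # a right shift by i bits in hardware
        x, y = x - sigma * y * shift, y + sigma * x * shift
        z -= sigma * math.atan(shift)
    return x / gain, y / gain
```

Calling `cordic_rotate(1.0, 0.0, math.pi / 4)` returns a pair close to (cos π/4, sin π/4). A hardware version would replace the multiplications by shifts and read `atan(2**-i)` from a small table.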

  • MOVIE: a building block for the design of real time simulator of moving pictures compression algorithms

    Page(s): 193 - 203

    This paper shows how a real-time simulator of moving pictures compression algorithms can be rapidly assembled using a basic building block, here called MOVIE (MOdule for Video Experimentation). The internal architecture of the MOVIE VLSI chip can be compared to a small systolic machine made of a 32-bit I/O processor, a reduced linear array of 16-bit computation processors and data video input/output mechanisms. Externally, the chip is provided with four 16-bit bidirectional data ports and three 16-bit bidirectional data video ports. Several MOVIE chips can be easily clustered to increase the size of the linear array of computation processors. The MOVIE chip is fully programmable in a high-level language in order to make program development easier.

  • Synthesis of VLSI architectures for two-dimensional discrete wavelet transforms

    Page(s): 174 - 181

    We propose VLSI architectures with parallel I/O capability to compute the Two-Dimensional Discrete Wavelet Transform. Our design can handle large images arriving at high frame rates. A video codec based on our architecture can support multiple channels in parallel and can provide the needed performance for network-based video applications. Our architecture with parallel I/O offers a solution for the low power needs of mobile/visual communication systems. Our architecture employs block-based I/O and a dual memory buffer to store intermediate results to schedule the filter operations. This leads to a high throughput rate of n pixels per clock cycle and a small memory size of j(l-1)/(N+n)+2n², for an N×N input image, where n×n is the block size, l is the filter length, and j is the number of octaves. The resulting architecture has a latency of 2l+n for each octave and a total execution time of N²/n+2l+n+3jn.

  • Time-optimal ranking algorithms on sorted matrices

    Page(s): 42 - 53

    Answering rank queries is a recurring operation in various application domains including geographic data processing, information retrieval, database design, information management, and medical image processing. Many of these applications involve data stored in a matrix satisfying a number of properties. One property that occurs time and again in applications specifies that the rows and the columns of the matrix are independently sorted. It is customary to refer to such a matrix as sorted. An instance of the Batched Ranking problem (BR, for short) involves a sorted matrix A of items from a totally ordered universe, along with a collection Q of queries of the following type: for a query qj one is interested in the number of items in A that are smaller than qj. The BR problem asks for solving all queries in Q. In this work, we consider the BR problem in the following context: the matrix A is pretiled, one item per processor, onto an enhanced mesh of size √n×√n; the m queries are stored, one per processor, in the first m/√n columns of the platform. Our main contribution is twofold. First, we show that any algorithm that solves the BR problem must take at least Ω(log n+√m) time in the worst case. Second, we show that this time lower bound is tight on meshes of size √n×√n enhanced with multiple broadcasting, by exhibiting an algorithm solving the BR problem in O(log n+√m) time on such a platform.
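
To pin down what a rank query on a sorted matrix asks for, here is a small sequential sketch. It walks a staircase from the top-right corner and counts the items smaller than the query, in time proportional to rows plus columns. This is a plain single-processor illustration with hypothetical names (`rank`, `batched_rank`); it is not the enhanced-mesh algorithm of the paper, whose O(log n+√m) bound relies on multiple broadcasting.

```python
def rank(matrix, q):
    """Count items smaller than q in a matrix whose rows and columns are sorted ascending."""
    if not matrix or not matrix[0]:
        return 0
    count = 0
    col = len(matrix[0]) - 1
    for row in matrix:
        # Columns are sorted too, so the boundary can only move left
        # as we go down the rows.
        while col >= 0 and row[col] >= q:
            col -= 1
        count += col + 1  # entries row[0..col] are all smaller than q
    return count


def batched_rank(matrix, queries):
    """Answer every query in the batch independently."""
    return [rank(matrix, q) for q in queries]
```

For the matrix `[[1, 4, 7], [2, 5, 8], [3, 6, 9]]`, `batched_rank` on the queries `[1, 5, 10]` yields `[0, 4, 9]`.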

  • A processor for staggered interval arithmetic

    Page(s): 104 - 112

    The paper presents the design of a high-speed processor which performs staggered interval arithmetic. Each staggered interval is represented as the sum of a set of floating point numbers plus an interval, which consists of two floating point endpoints. Staggered interval arithmetic allows the precision of the computation to be specified and the accuracy of the result to be determined. Efficient arithmetic algorithms, which reduce the number of floating point operations needed to perform staggered interval arithmetic, are introduced. To achieve high performance, the processor employs an array of pipelined floating point arithmetic units and two long accumulators. The processor provides direct hardware support for accurate and numerically reliable vector and matrix computations.

  • Minimizing synchronization overhead in statically scheduled multiprocessor systems

    Page(s): 298 - 309

    Synchronization overhead can significantly degrade performance in embedded multiprocessor systems. This paper develops techniques to determine a minimal set of processor synchronizations that are essential for correct execution in an embedded multiprocessor implementation. Our study is set in the context of self-timed execution of iterative dataflow programs; dataflow programming in this form has been applied extensively, particularly in the context of signal processing software. Self-timed execution refers to a combined compile-time/run-time scheduling strategy in which processors synchronize with one another based only on inter-processor communication requirements; thus, synchronization of processors at the end of each loop iteration does not generally occur. We introduce a new graph-theoretic framework, based on a data structure called the synchronization graph, for analyzing and optimizing synchronization overhead in self-timed, iterative dataflow programs. We also present an optimization that involves converting a synchronization graph that is not strongly connected into a strongly connected graph.

  • A parallelizing compilation method for the map-oriented machine

    Page(s): 129 - 132

    The paper introduces a novel parallelizing compilation method for the MoM. The MoM (Map-oriented Machine) is an Xputer architecture featuring multiple data sequencers and “soft ALUs”. The compiler accepts C source, which is restructured and partitioned into structural and sequential code, providing parallelism at expression and statement level.

  • Techniques for yield enhancement of VLSI adders

    Page(s): 222 - 229

    For VLSI application-specific arrays and other regular VLSI circuits, two techniques are available for yield enhancement, namely defect tolerance and layout modification. In this paper, we compare these two yield enhancement approaches by using adders as an example. Our yield projections indicate that the layout modification technique is more efficient when the defect density is low, while reconfiguration is more efficient for a high defect density. However, from the point of view of effective yield, layout modification is superior to defect tolerance in the practical range of defect density.

  • A solid translation engine using ray representation

    Page(s): 157 - 165

    We describe an extension to the geometric domain of solid modeling to include solids defined by spatial sweeping and Minkowski sums. We develop an efficient, parallel algorithm for the translation of such solid models. An architecture and design of an array processor that implements this algorithm are presented. We discuss some applications of the new computer to solid modeling in CAD/CAM and modeling of large biomolecules (proteins) for rational drug design.

  • Systolic filter for fast DNA similarity search

    Page(s): 145 - 156

    This paper presents a systolic filter for speeding up the scan of DNA databases. The filter acts as a co-processor which performs the more intensive computations occurring during the process. Our validation, based on an FPGA prototype board tightly connected to a workstation, has shown that the filter may boost the performance of the machine by a factor ranging from 50 to 400 over current workstations.

  • An array processor for inner product computations using a Fermat number ALU

    Page(s): 270 - 281

    This paper explores an architecture for parallel independent computations of inner products over the direct product ring ℜ257×17. The structure is based on the polynomial mapping of the Modulus Replication RNS for calculations over dynamic ranges much larger than the product of the computational moduli. We show that the computational ring is optimal for our purposes, and introduce basic cells for the efficient calculation of all elements of the polynomial ring computations.
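
As a simplified illustration of independent computation over a direct product of rings, the sketch below evaluates an inner product separately modulo 257 and modulo 17 (both Fermat primes) and recombines the two residues with the Chinese Remainder Theorem. It is plain residue arithmetic with a dynamic range of only 257·17 = 4369; the polynomial mapping of the Modulus Replication RNS, which the paper uses to exceed that range, is not modeled here. All function names are hypothetical.

```python
MODULI = (257, 17)  # Fermat primes 2**8 + 1 and 2**4 + 1


def inner_product_residues(a, b):
    """Compute the inner product of a and b independently in each modulus."""
    return tuple(
        sum((x % m) * (y % m) for x, y in zip(a, b)) % m
        for m in MODULI
    )


def crt_combine(r257, r17):
    """Recombine residues into the unique value in range(257 * 17)."""
    inv = pow(257, -1, 17)  # modular inverse of 257 mod 17 (Python 3.8+)
    return r257 + 257 * (((r17 - r257) * inv) % 17)


def inner_product(a, b):
    """Inner product modulo 4369, computed channel by channel."""
    return crt_combine(*inner_product_residues(a, b))
```

The two channels never exchange data until the final recombination, which is what makes this kind of arithmetic attractive for array implementations. Results are correct only modulo 4369, so `inner_product([100, 200], [10, 20])` gives 5000 mod 4369 = 631.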

  • VLSI algorithms for compressed pattern search using tree based codes

    Page(s): 133 - 136

    Data compression methods are used to reduce the redundancy in data representation in order to decrease the data storage requirements and communication costs. In order to exploit the benefits of data compression to conserve internal processor storage and computation resources, it is desirable to perform operations on compressed data without decompressing it. We present hardware algorithms and VLSI implementation of a chip to search a compressed text with respect to keys or patterns in compressed form using Huffman-type tree-based codes.

  • Parallel implementation of the full search block matching algorithm for motion estimation

    Page(s): 182 - 192

    Motion estimation is a key technique in most algorithms for video compression, particularly in the MPEG and H.261 standards. The most frequently used technique is based on a Full Search Block Matching Algorithm, which is highly compute-intensive and requires the use of special-purpose architectures to obtain real-time performance. We propose an approach to the parallel implementation of the Full Search Block Matching Algorithm which is suitable for massively parallel architectures ranging from large-scale SIMD computers to dedicated processor arrays realized in ASICs. While the first alternative can be used for the implementation of high-performance coders, the second is particularly attractive for low-cost video compression devices. This paper describes the proposed approach and its implementation in an ASIC.
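
To show why full search block matching is so compute-intensive, here is a minimal sequential sketch: for one block of the current frame, it evaluates a matching cost at every displacement in a ±p search window of the reference frame and keeps the best. The sum of absolute differences (SAD) cost, the function names, and the parameter names are assumptions for illustration; the abstract does not specify the cost function or the parallel mapping.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between an n x n block of `cur` at (bx, by)
    and the block of `ref` displaced by (dx, dy)."""
    return sum(
        abs(cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i])
        for j in range(n)
        for i in range(n)
    )


def full_search(cur, ref, bx, by, n, p):
    """Return (cost, dx, dy) of the best match within displacements -p..p."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            # Skip candidates that fall outside the reference frame.
            if not (0 <= bx + dx and bx + dx + n <= width
                    and 0 <= by + dy and by + dy + n <= height):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

Each block costs on the order of (2p+1)²·n² operations, and every candidate displacement is independent of the others, which is the parallelism that SIMD machines and dedicated processor arrays can exploit.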
