
Proceedings of the 1993 International Conference on Application-Specific Array Processors (ASAP '93)

Date: 25-27 Oct. 1993


Displaying Results 1 - 25 of 63
  • Proceedings of International Conference on Application Specific Array Processors (ASAP '93)

  • An application specific processor for implementing stack filters

    Page(s): 196 - 199

    Stack filters, which are a generalization of rank-order filters, have great practical importance in image processing. The work reported in this paper is based on the new realization that stack filters can be implemented without undergoing threshold decomposition (i.e., the complexity of implementation increases linearly with the precision of the input data). The authors present the design of a specialized processor which implements all window-width-three stack filters with a precision of eight bits. This kind of programmable filter can be highly useful in fields such as biomedical instrumentation, image processing, speech processing and communication systems. (An illustrative sketch follows this entry.)

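    The stacking property underlying the abstract above can be illustrated with a minimal Python sketch (an editorial illustration, not the paper's processor design): the median is used as the example window-width-three stack filter, and its direct multi-bit evaluation is checked against the classical threshold-decomposition definition. The function names and the 8-bit range are assumptions made for the example.

        # Minimal sketch: a window-width-3 stack filter defined by the positive
        # Boolean function of the 3-input median, f(a,b,c) = ab + bc + ca,
        # evaluated two ways on 8-bit samples.
        import random

        def f_bool(a, b, c):
            # positive Boolean function: majority of three bits
            return (a & b) | (b & c) | (c & a)

        def stack_filter_threshold(window, levels=256):
            # classical definition: threshold-decompose the samples, filter each
            # binary slice with f, and add the slices back up
            return sum(f_bool(*[1 if x >= t else 0 for x in window])
                       for t in range(1, levels))

        def stack_filter_direct(window):
            # direct multi-bit evaluation: replace AND by min and OR by max,
            # which is valid because f is positive (the stacking property)
            a, b, c = window
            return max(min(a, b), min(b, c), min(c, a))

        for _ in range(1000):
            w = [random.randrange(256) for _ in range(3)]
            assert stack_filter_threshold(w) == stack_filter_direct(w)
        print("direct evaluation matches threshold decomposition")
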
  • Node merging: A transformation on bit-level dependence graphs for efficient VLSI array design

    Page(s): 442 - 453

    The authors present a transformation technique, called node merging, on bit-level dependence graphs to systematically explore trade-offs between area and various system performance measures, such as clock period, pipelining period, block pipelining period, computation time, and dynamic power dissipation, in order to obtain optimal VLSI array processors for bit-level regular algorithms. By merging several DG nodes into one node, multi-bit-level array processors can be designed using formal regular array synthesis methods, significantly reducing the number of pipelining registers required compared to bit-pipelined array processors. In general, delay paths within a node in bit-level dependence graphs are unbalanced, and the clock period is determined by the critical path delay. By merging nodes along noncritical paths, the authors improve computation time as well as VLSI area with a relatively small increase in clock period and pipelining period. They also expand the complexity of node functions enough to apply meaningful logic optimization or performance enhancement using well-known logic synthesis tools such as SIS, for even further improvement. Since the transformation results in a new DG, it can easily be combined with conventional VLSI array synthesis techniques for efficient bit-level array processor design. The method therefore provides an efficient way to explore a significantly broader design space in VLSI array processor design.

  • A highly-parallel match architecture for AI production systems using application-specific associative matching processors

    Page(s): 180 - 183

    Here, a highly-parallel two-layer match architecture using application-specific associative matching processors (AMPs) is proposed to speed up the match process of AI production systems. Each AMP comprises a 2D array of content-addressable memories, called CAM blocks. The architecture first compiles the left-hand side (LHS) of each production into a symbolic form, and then assigns a number of contiguous CAM blocks in an AMP to the patterns in the LHS of each production individually. These CAM blocks are used not only to buffer the database of current assertions (also called the working memory, WM), but also to support the parallel evaluation of inter-conditions among the patterns of productions. The set of productions affected during a match cycle can be evaluated in parallel and independently among their associated CAM blocks residing in the AMPs. Preliminary simulation results show that the architecture provides the opportunity to improve the performance of conventional forward-chaining production systems by at least ten-fold.

  • Multi-rate transformation of directional affine recurrence equations

    Page(s): 392 - 403

    There has been increased attention to the synthesis of algorithm-specific pipeline arrays such as systolic arrays. Most existing synthesis techniques are based on transforming the algorithm from a class of recurrence equations such as Uniform Recurrence Equations (UREs). However, many algorithms cannot be transformed to a URE, and the temporal locality of systolic arrays results in additional delay time. The temporal locality constraint can be removed by using the multi-rate array (MRA) structure, in which the variables are propagated at different rates. By allowing data transmission at different clock rates, data can be propagated transparently or with small delays. It is shown that, using MRAs, a broader class of recurrence equations termed directional affine recurrence equations (DAREs) can be mapped onto pipeline arrays. The authors provide the definition and a synthesis technique for mapping DAREs onto multi-rate arrays. Conditions for mapping AREs onto MRAs are given and the corresponding timing and allocation functions are derived. Applications of multi-rate arrays to signal processing algorithms are also presented.

  • RELACS for systolic programming

    Page(s): 132 - 135

    The RELACS language is a systolic programming language which simplifies the programmer's task by making the data flow of systolic algorithms explicit and by exposing the data delivery mechanism. The underlying architecture model differs from other SIMD architectures in that it physically separates computation and data management. The authors introduce the RELACS language as a syntactic and semantic extension of the C language. It is shown that the RELACS programming model provides a simple programming method for systolic algorithms which is applicable to a variety of parallel machines.

  • A period-processor-time-minimal schedule for cubical mesh algorithms

    Page(s): 261 - 272

    The paper, using a directed acyclic graph (dag) model of algorithms, investigates precedence-constrained multiprocessor schedules for the n × n × n directed mesh. This cubical mesh is fundamental, representing the standard algorithm for square matrix product as well as many other algorithms. Its completion requires at least 3n - 2 multiprocessor steps. Time-minimal multiprocessor schedules that use as few processors as possible are called processor-time-minimal. For the cubical mesh, such a schedule requires at least ⌈3n²/4⌉ processors. Among such schedules, one with the minimum period (i.e., maximum throughput) is referred to as a period-processor-time-minimal schedule. The period of any processor-time-minimal schedule for the cubical mesh is at least 3n/2 steps. This lower bound is shown to be exact by constructing, for n a multiple of 6, a period-processor-time-minimal multiprocessor schedule that can be realized on a systolic array whose topology is a toroidally-connected n/2 × n/2 × 3 mesh. (An illustrative sketch follows this entry.)

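    A small editorial check of the quantities quoted in the abstract above (not the paper's schedule construction): in any schedule of length 3n - 2, node (i, j, k) of the cubical mesh has zero slack and must run at step i + j + k, so the processor count of a time-minimal schedule is bounded below by the largest diagonal slice of the cube, which matches ⌈3n²/4⌉.

        # Count the dag nodes (i, j, k) of the n x n x n mesh per time step
        # i + j + k and compare against the bounds stated in the abstract.
        from math import ceil

        def slice_profile(n):
            counts = [0] * (3 * n - 2)
            for i in range(n):
                for j in range(n):
                    for k in range(n):
                        counts[i + j + k] += 1
            return counts

        for n in (2, 3, 4, 6, 12):
            counts = slice_profile(n)
            assert len(counts) == 3 * n - 2            # completion time bound
            assert max(counts) == ceil(3 * n * n / 4)  # processor lower bound
            print(n, len(counts), max(counts))
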
  • Time-optimal visibility-related algorithms on meshes with multiple broadcasting

    Page(s): 226 - 237

    The compaction step of integrated circuit design motivates the study of various visibility problems among vertical segments in the plane. One popular variant is referred to as the Vertical Segment Visibility problem (VSV, for short) and is stated as follows: given a collection S of n disjoint vertical line segments in the plane, for every endpoint of a segment in S determine the first line segment, if any, intersected by a horizontal ray to the right (resp. left) originating from that endpoint. The contribution of this paper is a time-optimal algorithm for the VSV problem on meshes with multiple broadcasting. The authors then use this algorithm to derive time-optimal solutions for two related problems. All the algorithms run in O(log n) time on a mesh with multiple broadcasting of size n × n. This is the first instance of time-optimal solutions for these problems known to us. (An illustrative sketch follows this entry.)

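    For reference, the problem statement in the abstract above can be captured by a brute-force O(n²) sketch (an editorial illustration only; the paper's contribution is the O(log n) mesh-with-multiple-broadcasting algorithm, which is not shown here). The segment encoding and names are assumptions.

        # Brute-force reference for the Vertical Segment Visibility problem:
        # for every endpoint of every vertical segment, report the first
        # segment hit by a horizontal ray shot to the right (None if no hit).
        # Segments are encoded as (x, y_low, y_high).

        def first_hit_right(px, py, segments, skip):
            best_idx, best_x = None, float("inf")
            for idx, (x, ylo, yhi) in enumerate(segments):
                if idx == skip:
                    continue
                if x > px and ylo <= py <= yhi and x < best_x:
                    best_idx, best_x = idx, x
            return best_idx

        segments = [(0.0, 0.0, 4.0), (2.0, 1.0, 3.0), (5.0, -1.0, 2.0)]
        for i, (x, ylo, yhi) in enumerate(segments):
            for y in (ylo, yhi):
                print((x, y), "->", first_hit_right(x, y, segments, i))
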
  • Realization of a real time phasecorrelation chipset used in a hierarchical two step HDTV motion vector estimator

    Page(s): 152 - 155

    The phase correlation algorithm, as a method for motion estimation, is a key component of today's TV and tomorrow's HDTV systems. One advantage of a hardware realization of this algorithm for efficient real-time processing, in contrast to block matching, is the possibility of processing multiple pixels per system clock cycle. A suitable partitioning using three different VLSI circuits to perform the phase correlation algorithm is proposed. The system is able to handle block sizes from 32 × 16 up to 128 × 128, so that motion vectors limited to (-64..+64) pixels can be estimated. The motion vectors estimated by the phase correlation are handed to a subsequent block matching unit and help to reduce hardware effort by limiting its search range. The combination of both algorithms leads to a two-step hierarchical motion estimator. To minimize the physical volume, all external components except RAMs have been integrated on three specialized chips, where all parameters are freely programmable and no glue logic is required. (An illustrative sketch follows this entry.)

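    The phase correlation principle the chipset implements can be sketched in a few lines of NumPy (a software illustration of the algorithm, not the paper's three-chip partitioning): the normalized cross-power spectrum of two blocks has an inverse FFT that peaks at their relative displacement. The block size and random test data are assumptions.

        # Phase correlation between two blocks: normalize the cross-power
        # spectrum to keep only phase, inverse-transform, and read the peak
        # position as the displacement.
        import numpy as np

        def phase_correlation(block_a, block_b):
            fa, fb = np.fft.fft2(block_a), np.fft.fft2(block_b)
            cross = fa * np.conj(fb)
            cross /= np.abs(cross) + 1e-12          # keep phase only
            surface = np.real(np.fft.ifft2(cross))  # correlation surface
            dy, dx = np.unravel_index(np.argmax(surface), surface.shape)
            h, w = surface.shape
            if dy > h // 2: dy -= h                 # map to signed shifts
            if dx > w // 2: dx -= w
            return dy, dx

        rng = np.random.default_rng(0)
        a = rng.random((32, 16))
        b = np.roll(a, shift=(3, -5), axis=(0, 1))  # shift a by (3, -5)
        print(phase_correlation(b, a))              # expected: (3, -5)
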
  • Mapping algorithms onto a multiple-chip data-driven array

    Page(s): 41 - 52

    Data-driven arrays provide high levels of parallelism and pipelining for algorithms with no internal regularity. Most of the methods previously developed for mapping algorithms onto processor arrays assume an unbounded array, i.e., one in which there is always a sufficient number of processing elements (PEs) for the mapping. Implementing such an array is not practical; a more practical approach is to assign the PEs to chips and map the given algorithm onto the new array of chips. The authors describe a way to map algorithms directly onto a multiple-chip data-driven array, where each chip contains a limited number of PEs. There are two optimization steps in the mapping. The first is to produce an efficient mapping by minimizing the area (i.e., the number of PEs used) as well as optimizing the performance (pipeline period and latency) for the given algorithm, or by finding a trade-off between area and performance. The second is to divide the unbounded array among several chips, each containing a bounded number of PEs.

  • Parallel processing architectures for rank order and stack filters

    Page(s): 65 - 76

    Achieving additional speedup in rank order and stack filter architectures requires parallel processing techniques such as pipelining and block processing. Pipelining is well understood, but few block architectures have been developed for rank order and stack filtering. Block processing is essential when the architecture reaches the throughput limits imposed by the underlying technology. A trivial block structure repeats a single-input, single-output structure to generate a multiple-input, multiple-output structure and can achieve speedups equal to the block size (i.e., the number of outputs). Unlike linear filters, rank order and stack filter outputs are calculated using comparisons, and these comparisons can be shared within the block structure. The authors introduce a systematic method for applying block processing to rank order and stack filters. This method takes advantage of shared comparisons within the block structure to generate a block filter with shared substructures of reduced complexity. Furthermore, block processing is important for generating low-power designs: trivial block structures yield low-power designs only up to a certain limit, and the authors demonstrate how block structures with shared substructures can be used to generate designs with arbitrarily low power. (An illustrative sketch follows this entry.)

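    The idea of sharing comparisons inside a block can be illustrated with the simplest case (an editorial sketch, not the authors' architecture): a block of two window-3 medians in which both outputs reuse the single compare of their overlapping sample pair.

        # A block of two window-3 medians (the simplest rank order filter)
        # that shares the single compare of the overlapping sample pair.
        import random

        def sort2(a, b):
            return (a, b) if a <= b else (b, a)    # one comparison

        def median3_block2(x0, x1, x2, x3):
            # outputs y1 = med(x0, x1, x2) and y2 = med(x1, x2, x3)
            lo, hi = sort2(x1, x2)                 # shared by both outputs
            y1 = max(lo, min(hi, x0))              # clamp x0 into [lo, hi]
            y2 = max(lo, min(hi, x3))              # clamp x3 into [lo, hi]
            return y1, y2

        med = lambda a, b, c: sorted((a, b, c))[1]
        for _ in range(1000):
            x = [random.randrange(256) for _ in range(4)]
            assert median3_block2(*x) == (med(x[0], x[1], x[2]),
                                          med(x[1], x[2], x[3]))
        print("block-of-two median with shared comparison verified")
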
  • Reconfigurable hardware for molecular biology computing systems

    Page(s): 184 - 187

    The authors explore the possibilities of implementing molecular biology algorithms on a reconfigurable-hardware-based architecture. In order to demonstrate both the flexibility and the power of reconfigurable hardware, two algorithms have been implemented on an architecture consisting of 23 FPGAs (Xilinx XC3090) and 4 MB of SRAM. These two algorithms were chosen because of their biological relevance. One is a classical dynamic programming algorithm that is implemented as a one-dimensional systolic array. The other, as described in this paper, is based on pattern detection in long sequences of letters and has been implemented as a rapid content-addressable memory. It is important to stress that these performances are achieved thanks to the full use of reconfigurability, which makes it possible to exploit every level of parallelism in the algorithm. The concept of reconfigurable hardware suits the needs of molecular biology well. The performances demonstrated in this paper are several orders of magnitude greater than those of a 40 MIPS workstation. (An illustrative sketch follows this entry.)

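    As an editorial illustration of the kind of "classical dynamic programming algorithm" referred to above (the abstract does not name the exact recurrence; plain edit distance is used here as a stand-in), note that the cells along each anti-diagonal of the table are mutually independent, which is what a one-dimensional systolic array exploits.

        # Edit distance between two sequences, computed row by row; the cells
        # along each anti-diagonal are independent of one another.

        def edit_distance(s, t):
            prev = list(range(len(t) + 1))
            for i, a in enumerate(s, 1):
                curr = [i]
                for j, b in enumerate(t, 1):
                    curr.append(min(prev[j] + 1,              # deletion
                                    curr[j - 1] + 1,          # insertion
                                    prev[j - 1] + (a != b)))  # (mis)match
                prev = curr
            return prev[-1]

        print(edit_distance("GATTACA", "GATACA"))  # prints 1 (one deletion)
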
  • VLSI array synthesis for polynomial GCD computation

    Page(s): 536 - 547

    Polynomial GCD (greatest common divisor) finding is an important problem in algebraic computation, especially in decoding error-correcting codes. The authors show a new systolic array structure for the polynomial GCD problem using a systematic array synthesis technique. The VLSI implementation of the array structure is area-efficient and achieves maximum throughput with pipelining. The dependency graph (DG) of the Euclidean GCD algorithm is drawn using iterated polynomial division. The resulting DG is data-dependent and variable-sized. The authors consider the worst-case implementation to make the DG data-independent and fixed-size, where the data dependences are hidden inside by introducing four different working modes in each DG node. This novel approach requires just a few additional multiplexors and can be generalized to other data-dependent and variable-sized computations. The authors then map the DG to a one-dimensional systolic array using a linear mapping. The new array structure has m0 + n0 + 1 processing elements, where m0 and n0 are the degrees of the two polynomials. It can find the GCD of any two polynomials of total degree less than or equal to m0 + n0. The block pipeline period is one, which means that a new GCD computation can start in the very next cycle. Unlike the array of Brent and Kung, a pre-processing step for extracting a common factor x^i is not necessary, and the size of the processing element (PE) does not depend on m0 and n0. The authors extend this new array structure to the extended polynomial GCD algorithm, which is closely related to the decoding of BCH and Reed-Solomon codes. To verify the structure, they used the VERILOG simulator and implemented a 2 μm CMOS test chip. (An illustrative sketch follows this entry.)

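    The computation the array implements, Euclid's algorithm by iterated polynomial division, can be sketched in software (this shows the arithmetic only, not the systolic mapping); encoding polynomials over GF(2) as bit masks is an assumption chosen for brevity.

        # Euclid's algorithm by iterated polynomial division over GF(2);
        # polynomials are encoded as integer bit masks (bit i = coeff. of x^i).

        def poly_divmod_gf2(a, b):
            q = 0
            while a and a.bit_length() >= b.bit_length():
                shift = a.bit_length() - b.bit_length()
                q ^= 1 << shift
                a ^= b << shift
            return q, a                       # quotient, remainder

        def poly_gcd_gf2(a, b):
            while b:
                _, r = poly_divmod_gf2(a, b)
                a, b = b, r
            return a

        # (x^3 + 1) = (x + 1)(x^2 + x + 1) and (x^2 + 1) = (x + 1)^2 over GF(2)
        print(bin(poly_gcd_gf2(0b1001, 0b101)))   # prints 0b11, i.e. x + 1
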
  • Synthesis of dedicated SIMD processors

    Page(s): 416 - 427

    In this paper, a synthesis method for dedicated architectures is introduced. Its aim is to produce optimized systems derived from the algorithmic expression of a numerical application. The approach addresses the design of dedicated systems for applications that require intensive numerical computation. Efficient utilization of the hardware resources is achieved through the use of vector processing with an SIMD implementation. The synthesis algorithm simultaneously carries out the design of the SIMD structures and the generation of the microcode needed to implement software pipelining of the operations of the input program.

  • Data flow graphs granularity for overhead reduction within a PE in multiprocessor systems

    Page(s): 136 - 139

    The authors propose a method to implement Acyclic Data Flow Graphs (ADFGs) on any general-purpose multiprocessor system supporting a CSP-type language. The granularity of ADFG nodes is discussed. During ADFG analysis the authors use fine granularity to exploit all the parallelism inherent in the problem. When the graph G has been allocated, it is divided into P subgraphs Gk (P is the number of processors in the network), each of which is assigned to a processor Pk. Gk still presents some parallelism which cannot be usefully exploited and which implies a certain amount of overhead during the execution of Gk; the authors reduce this overhead by performing a post-allocation analysis which increases the granularity of the Gk nodes. In order to evaluate the performance of GTP, the Overhead Reduction Ratio (ORR) is introduced. The ORR is computed using the results of several executions of GTP on randomly generated ADFGs; the presented tests show the effectiveness of GTP.

  • Low-power polygon renderer for computer graphics

    Page(s): 200 - 213

    Polygon rasterization is the most computation- and memory-intensive stage in rendering synthesized computer images. The authors present a low-power, real-time hardware implementation for this task. Rasterization of two-dimensional Gouraud-shaded polygons at 90,000 polygons/sec is achievable with a computational power consumption of about 12 mW at 1.5 V operation, using an array configuration of 16 render engines for a 512 × 512-pixel frame. A transmission format for wireless implementations is proposed, with a typical bandwidth of 4 MHz. This screen- and format-configurable design has potential applications in portable devices, wireless communication and head-mounted displays for virtual reality systems.

  • A real-time systolic algorithm for on-the-fly hidden surface removal

    Page(s): 238 - 249

    Hidden surface removal for real-time realistic display of complex scenes requires intensive computation and justifies the use of parallelism to provide the needed response time. The authors present a systolic algorithm that identifies visible segments on a scanline with a "real-time" characteristic: visible segments are output on-the-fly as soon as segments are input to the systolic array. The proposed systolic architecture consists of a linear array of simple cells. The size of the systolic array is equal to the maximum number of input segments crossed by a vertical line. The correctness of the systolic algorithm is proven by first establishing the correctness of an initial version, which is then successively transformed into more efficient versions while preserving the properties needed for correctness.

  • GENES IV: A bit-serial processing element for a multi-model neural-network accelerator

    Page(s): 345 - 356

    A systolic array of dedicated processing elements (PEs) is presented as the heart of a multi-model neural-network accelerator supporting several widely-used neural models, including multilayer perceptrons with the backpropagation learning rule and Kohonen feature maps. Each PE holds an element of the synaptic weight matrix. An instantaneous swapping mechanism for the weight matrix allows the implementation of neural networks larger than the physical PE array. A systolically-flowing instruction accompanies each input vector propagating in the array. This avoids the need to empty and refill the array when the operating mode of the array is changed. Both the GENES IV chip, containing a matrix of 2 × 2 PEs, and an auxiliary arithmetic circuit have been manufactured and successfully tested. The MANTRA I machine has been built around these chips. Peak performance of the full system is between 200 and 400 MCPS in the evaluation phase and between 100 and 200 MCUPS during the learning phase (depending on the algorithm being implemented).

  • A novel architecture for a decision-feedback equalizer using extended signed-digit feedback

    Page(s): 490 - 501

    A novel bit-level systolic array architecture for implementing a bit-parallel decision-feedback equalizer (DFE) is presented. The core of the architecture is an array multiplier using redundant arithmetic in combination with bit-level feedback. The use of signed-digit (SD) circuitry allows each digit to be fed back as soon as it is available, so the recursive computation can be executed most significant digit first (MSD first). In this way a very high data throughput rate is achieved for large wordlengths. Combining two SD digits into one feedback digit allows the throughput to be increased further. The nonlinear quantization operation of the DFE is implemented by a combination of a saturation and an integer operation. The saturation is done by an MSD-first saturation unit for redundant digits. The integer operation is implemented by removing the fractional part of the intermediate result, and the error caused by removing the fractional part of a redundant number is compensated by a correction unit. For a second-order DFE with two complex coefficients, a throughput of one sample every three clock cycles is accomplished. The clock period is 4.75 full-adder delays. This throughput is constant for large wordlengths. The architecture can be extended easily to higher-order filters. (An illustrative sketch follows this entry.)

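    The recursion whose feedback loop the signed-digit arithmetic shortens can be modelled behaviourally (an editorial sketch of a second-order DFE with two complex coefficients, not the bit-level systolic hardware); the coefficients, the QPSK slicer and the test channel below are assumptions.

        # Second-order decision-feedback equalizer, behavioural model.
        import numpy as np

        def dfe(received, b1, b2):
            decisions = np.zeros(len(received), dtype=complex)
            d1 = d2 = 0 + 0j                        # two previous decisions
            for n, x in enumerate(received):
                y = x - b1 * d1 - b2 * d2           # cancel post-cursor ISI
                d = complex(np.sign(y.real), np.sign(y.imag))  # QPSK slicer
                decisions[n] = d
                d1, d2 = d, d1
            return decisions

        rng = np.random.default_rng(1)
        n_sym = 100
        symbols = rng.choice([1, -1], n_sym) + 1j * rng.choice([1, -1], n_sym)
        b1, b2 = 0.4 + 0.1j, 0.2 - 0.05j            # assumed channel taps
        received = np.array([symbols[n]
                             + (b1 * symbols[n - 1] if n >= 1 else 0)
                             + (b2 * symbols[n - 2] if n >= 2 else 0)
                             for n in range(n_sym)])
        print(np.all(dfe(received, b1, b2) == symbols))  # True if ISI removed
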
  • Digit systolic algorithms for fine-grain architectures

    Page(s): 466 - 477

    In this paper, the authors present a novel scheme for performing arithmetic efficiently on fine-grain programmable architectures and FPGA-based systems. They achieve an O(n) speedup over the bit-serial methods of existing fine-grain systems such as the DAP, the MPP and the CM2, within the constraints of regular, near-neighbor communication and only a small amount of on-chip memory. This is made possible by digit systolic algorithms which avoid broadcast and operate in a fully systolic manner at the digit level. The authors use digit on-line techniques coupled with a base-4, signed-digit number system to limit carry propagation. Although the algorithms are bit-serial, they are able to match the performance of bit-parallel methods while retaining low communication complexity. Efficient O(n)-time algorithms for multiplication and division of fixed-point, variable-precision numbers are given. By using the organization of logic blocks suggested in this paper, the placement and routing problems that arise in FPGA-based systems can be avoided. Since the algorithms are amenable to pipelining, very high throughput can be obtained. (An illustrative sketch follows this entry.)

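    The base-4 signed-digit number system mentioned above can be illustrated with a small conversion sketch (the digit-selection rule below is one simple choice, not the paper's encoding): the digit set {-2, ..., 2} is redundant, which is what allows additions to confine carry propagation to a neighbouring digit.

        # Base-4 signed-digit representation with digits in {-2, ..., 2}
        # (least-significant digit first); one simple digit-selection rule.

        def to_sd4(x):
            digits = []
            while x:
                r = x % 4                   # r in {0, 1, 2, 3}
                if r == 3:
                    r = -1                  # fold 3 into -1, carry upward
                digits.append(r)
                x = (x - r) // 4
            return digits or [0]

        def from_sd4(digits):
            return sum(d * 4 ** i for i, d in enumerate(digits))

        for x in (0, 1, 7, 100, 2**31 - 1):
            d = to_sd4(x)
            assert from_sd4(d) == x and all(-2 <= di <= 2 for di in d)
            print(x, d)
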
  • Matrix-matrix multiplications and fault tolerance on hypercube multiprocessors

    Page(s): 176 - 180

    Several new algorithms for matrix-matrix multiplication on hypercube multiprocessors are presented and evaluated based on the numbers of multiplications, additions, and transfers. The matrices to be multiplied are distributed uniformly over all processors of the hypercube system. Each processor owns some submatrices, which are obtained by partitioning the source matrices. Each submatrix multiplication can then be performed independently within a processor. All the partial results are then summed and transferred to a single processor. An orthogonal tree is used for efficient communication. The time complexity is O(log2 p) if p × p processors are used. In addition, the UDD (Uniform Data Distribution) approach is employed when some processors do not work properly and the faulty effects have been detected. Two classes of fault patterns are considered and evaluated. (An illustrative sketch follows this entry.)

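    A serial emulation of the data distribution described above (an editorial sketch, not the hypercube or orthogonal-tree code): each "processor" multiplies its own pair of submatrices independently and the partial products are then summed. The block layout and sizes are assumptions.

        # Each "processor" k multiplies its own block pair independently;
        # the full product is the sum of the partial products.
        import numpy as np

        def blocked_matmul(A, B, p):
            A_blocks = np.array_split(A, p, axis=1)   # block columns of A
            B_blocks = np.array_split(B, p, axis=0)   # block rows of B
            partials = [A_blocks[k] @ B_blocks[k] for k in range(p)]
            return sum(partials)                      # reduction step

        rng = np.random.default_rng(0)
        A, B = rng.random((6, 8)), rng.random((8, 5))
        assert np.allclose(blocked_matmul(A, B, p=4), A @ B)
        print("blocked partial-sum product matches A @ B")
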
  • I/O data management on SIMD systolic arrays

    Page(s): 273 - 284

    A mechanism for overlapping I/O management operations and computations on SIMD linear systolic arrays is presented. This mechanism is based on two synchronized controllers and allows a speedup factor of two over SIMD machines without this overlap facility. Optimal code generation is achieved using the ReLaCS environment, specifically designed for the architectural features of overlapped SIMD systolic arrays.

  • Asynchronous relaxation of locally-coupled automata networks, with application to parallel VLSI implementation of iterative image processing algorithms

    Page(s): 156 - 159

    Array processors tailored to mesh-based iterative algorithms benefit from shifting to an asynchronous mode. An architecture implementing this functionally asynchronous state-space update with self-timed elementary processors can dispense with the overhead of classical data exchange protocols and offers a flexible hierarchical mapping of the state lattice onto the array. The performance and practical feasibility of this approach are assessed on two application examples: iterative elliptic PDE resolution and image motion estimation using mean-field annealing based on Markov random field models. (An illustrative sketch follows this entry.)

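    A toy software model of asynchronous relaxation on an elliptic problem (an editorial sketch only; the paper's contribution is the self-timed array, not this code): Laplace's equation on a grid, with sites updated in an arbitrary order instead of globally synchronized sweeps. The grid size, sweep count and boundary values are assumptions.

        # Asynchronous (chaotic) relaxation of Laplace's equation on a grid:
        # interior sites are updated in a random order each sweep, with no
        # global synchronization, and the iteration still converges.
        import numpy as np

        def async_laplace(n=32, sweeps=400, seed=0):
            rng = np.random.default_rng(seed)
            u = np.zeros((n, n))
            u[0, :] = 1.0                            # top edge held at 1
            interior = [(i, j) for i in range(1, n - 1) for j in range(1, n - 1)]
            for _ in range(sweeps):
                for idx in rng.permutation(len(interior)):  # arbitrary order
                    i, j = interior[idx]
                    u[i, j] = 0.25 * (u[i - 1, j] + u[i + 1, j]
                                      + u[i, j - 1] + u[i, j + 1])
            return u

        u = async_laplace()
        residual = np.abs(u[1:-1, 1:-1] - 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1]
                                                  + u[1:-1, :-2] + u[1:-1, 2:]))
        print("max residual:", residual.max())
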
  • Systolic design of a new finite field division/inverse algorithm

    Page(s): 188 - 191

    A systolic architecture for a newly developed algorithm for performing division and inversion over GF(2^m) has been successfully realized. It is novel in that the normal inverse/multiplication steps are integrated and the generator polynomial is selectable. The new design, with its inherent regularity, offers an expandable, fully pipelined, high-performance circuit that is very suitable for VLSI implementation. A GF(2^8) divider has been successfully implemented under the EUROCHIP program. (An illustrative sketch follows this entry.)

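    A software model of GF(2^m) division with a selectable generator polynomial (it shows the field arithmetic only, not the paper's integration of the inverse/multiplication steps into one systolic pass); the polynomial 0x11B used for GF(2^8) below is a common choice and an assumption here.

        # GF(2^m) arithmetic with a selectable generator polynomial:
        # multiplication by shift-and-reduce, inversion by Fermat's identity
        # a^(2^m - 2), division as a * b^(-1).

        def gf_mul(a, b, m, poly):
            r = 0
            while b:
                if b & 1:
                    r ^= a
                b >>= 1
                a <<= 1
                if a >> m:                  # reduce modulo the generator
                    a ^= poly
            return r

        def gf_inv(a, m, poly):
            result, base, e = 1, a, (1 << m) - 2
            while e:                        # square-and-multiply
                if e & 1:
                    result = gf_mul(result, base, m, poly)
                base = gf_mul(base, base, m, poly)
                e >>= 1
            return result

        def gf_div(a, b, m, poly):
            return gf_mul(a, gf_inv(b, m, poly), m, poly)

        M, POLY = 8, 0x11B                  # x^8 + x^4 + x^3 + x + 1 (assumed)
        a, b = 0x57, 0x83
        q = gf_div(a, b, M, POLY)
        assert gf_mul(q, b, M, POLY) == a   # check: (a / b) * b == a
        print(hex(q))
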
  • Heterogeneous BISR techniques for yield and reliability enhancement using high level synthesis transformations

    Page(s): 454 - 465

    Built-In Self-Repair (BISR) is a fault tolerance technique against permanent faults in which, in addition to the core operational modules, a set of spare modules is provided. If a faulty core module is detected, it is replaced with a spare module. The BISR methodology has so far been used only in situations where a failed module of one type can only be replaced by a backup module of the same type. The authors propose a new BISR approach for ASIC design which removes this constraint and enables modules of different types to be replaced by the same spare units, by exploiting the design-space exploration abilities provided by the use of transformations in high level synthesis. Fast and efficient high level synthesis algorithms which take into account the peculiarities of transformation-based design for BISR are presented. The potential of the approach is demonstrated on a set of benchmark examples by showing significant yield and relative productivity improvements, which are calculated using state-of-the-art yield modeling techniques.
