Application Specific Array Processors, 1990. Proceedings of the International Conference on

Date 5-7 Sept. 1990

Displaying Results 1 - 25 of 69
  • Proceedings of the International Conference on Application Specific Array Processors (Cat. No.90CH2920-7)

    Publication Year: 1990
    Freely Available from IEEE
  • Algorithmic mapping of neural network models onto parallel SIMD machines

    Publication Year: 1990 , Page(s): 259 - 271
    Cited by:  Papers (4)

    The authors consider parallel implementation of neural network computations on fine-grain SIMD machines. They show a mapping of a neural network having n nodes and e connections onto a parallel machine having (n+e) PEs arranged in an array of √(n+e) × √(n+e) PEs such that routing for each update iteration of the recall phase can be performed in 24(√(n+e) − 1) elemental data shifts. The array uses simple PEs and few local registers to perform the routing and computations. The method is simple and is well suited for implementation of various classes of neural networks on many currently available parallel machines.
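As a rough illustration of the figures quoted in the abstract above, the sketch below computes the array side and the routing shift count for a given network size. The function name `mapping_cost` and the round-up for PE counts that are not perfect squares are assumptions made for this example, not details taken from the paper.

```python
import math


def mapping_cost(n: int, e: int) -> tuple[int, int]:
    """Return (array side, elemental shifts per update iteration).

    n is the number of nodes and e the number of connections, so the
    machine needs n + e PEs arranged as a side x side array.
    """
    side = math.isqrt(n + e)
    if side * side != n + e:
        # Assumption: round up to the next whole side length when
        # n + e is not a perfect square.
        side += 1
    shifts = 24 * (side - 1)  # 24(sqrt(n+e) - 1) from the abstract
    return side, shifts


# 64 nodes and 192 connections need 256 PEs, i.e. a 16 x 16 array.
print(mapping_cost(64, 192))  # (16, 360)
```

The shift count grows with the side of the array rather than with the number of PEs, which is the point of the square arrangement.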
  • PASIC. A sensor/processor array for computer vision

    Publication Year: 1990 , Page(s): 352 - 366
    Cited by:  Papers (1)  |  Patents (1)

    The PASIC prototype chip contains 256×256 photosensors, a linear array of 256 A/D converters, two 256 8-bit shift registers, 256 bit-serial processors, and a 256×256 bit dynamic RAM. It appears to be a viable architecture for low-level vision processing. The processors operate in SIMD mode at 20 MHz. To avoid high-speed transfer of analog data, an A/D converter in the form of a linear array of comparators is used. The architecture of the processing part conforms to the row-parallel output from the A/D converters. A simple but efficient processor, well suited to the special VLSI constraints of the sensor, was designed. The pitch in the present version of PASIC is 30 μm, and it was possible to fit the A/D converter circuitry, the shift register, the ALU, and the memory into this narrow slot. A key factor is the unified structure achieved by extending the memory data bus to all other units within the same column. The versatility of the chip is shown using three algorithms: edge detection, shading correction, and histogram-based thresholding. Each is executed in approximately 10 ms.
  • Analysing parametrised designs by non-standard interpretation

    Publication Year: 1990 , Page(s): 133 - 144
    Cited by:  Papers (2)

    The authors consider the use of a nonstandard interpretation to analyze parametrized circuit descriptions, in particular for array-based architectures. Various metrics are employed to characterize the performance tradeoffs for generic designs. The objective is to facilitate the comparison of feasible design alternatives at an early stage of development. The research centers on techniques for extracting various performance attributes, such as critical path and latency, from a single generic design representation. The features of this approach include uniformity, modularity, reusability, flexibility, and computerized support.
  • The bit-serial systolic back-projection engine (BSSBPE)

    Publication Year: 1990 , Page(s): 43 - 54
    Cited by:  Papers (1)

    The author presents a machine designed with a two-phase approach. First, the selection of an efficient algorithm, based on the quality of the final image and on the computational efficiency, is undertaken. Second, the algorithm is realized in hardware which incorporates efficient array processing structures, with the aim of creating regular repeated structures. The design is based on the S.Y. Kung and C.E. Leiserson (1978) approach, although certain elements of the architecture deviate from a true systolic architecture. These include bidirectional communication, which allows other image processing operations to be performed. The bit-serial machine is designed to perform the image reconstruction operation known as back-projection. The machine offers significant speed improvement over the general-purpose pipeline architectures used at present.
  • Parallel algorithm for traveling salesman problem on SIMD machines using simulated annealing

    Publication Year: 1990 , Page(s): 712 - 721
    Cited by:  Papers (1)

    The authors present a fast parallel simulated annealing algorithm for solving the traveling salesman problem (TSP) on single-instruction multiple-data (SIMD) machines with linear interconnections among processing elements. In the algorithm for TSP, it is shown that with the proper data distribution and movement schemes, the generation of a new configuration and the calculation of the energy difference can be done in constant time, and the overall time complexity of the move operation is proportional to the time taken to broadcast one bit of information on an acceptance decision to all the other processing elements (PEs) on a linear class computer with the same number of PEs as cities. Therefore, if a control unit has the broadcasting capability, as is often the case with SIMD machines, the move operation can be done in constant time and the whole simulated annealing algorithm has a time complexity proportional only to the number of moves.
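For readers unfamiliar with the move operation the abstract above refers to, here is a minimal sequential sketch of simulated annealing for the TSP. It is not the authors' parallel SIMD algorithm: the segment-reversal (2-opt) move, the geometric cooling schedule, and all names and default parameters are assumptions chosen for illustration. It does show the two ingredients the abstract mentions, an energy difference computed in constant time and a one-bit acceptance decision.

```python
import math
import random


def tour_length(tour, pts):
    """Total length of the closed tour visiting pts in the given order."""
    n = len(tour)
    return sum(math.dist(pts[tour[k]], pts[tour[(k + 1) % n]]) for k in range(n))


def anneal(pts, moves=20000, temp=1.0, cooling=0.9995, seed=0):
    """Sequential simulated annealing with segment-reversal moves."""
    rng = random.Random(seed)
    n = len(pts)
    tour = list(range(n))
    length = tour_length(tour, pts)
    for _ in range(moves):
        i, j = sorted(rng.sample(range(n), 2))
        a, b = pts[tour[i]], pts[tour[i + 1]]
        c, d = pts[tour[j]], pts[tour[(j + 1) % n]]
        # Energy difference from the four affected cities only: constant time.
        delta = math.dist(a, c) + math.dist(b, d) - math.dist(a, b) - math.dist(c, d)
        # Metropolis acceptance: the one-bit decision per move.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
            length += delta
        # Assumed geometric cooling, floored to avoid division by zero.
        temp = max(temp * cooling, 1e-12)
    return tour, length


cities = [(0, 0), (1, 0), (1, 1), (0, 1), (0.5, 1.5), (2, 2)]
tour, length = anneal(cities, moves=2000)
print(tour, round(length, 3))
```

In this sequential version each move costs constant time apart from the list reversal; the abstract's claim is that, on a linear SIMD array with one PE per city, the whole move can be made constant-time given a broadcast facility.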
  • A processor-time minimal systolic array for transitive closure

    Publication Year: 1990 , Page(s): 19 - 30
    Cited by:  Papers (5)

    A directed acyclic graph (DAG) model of algorithms is used. For a given DAG the authors focus on processor-time minimal multiprocessor schedules: time-minimal multiprocessor schedules that use as few processors as possible. The Kung, Lo and Lewis (KLL) algorithm (S.-Y. Kung et al., 1987) for computing the transitive closure of a relation over a set of n elements requires at least 5n−4 steps. Their systolic array comprises n² processing elements. Here, it is first shown that any multiprocessor that achieves this 5n−4 time bound needs at least ⌈n²/3⌉ processing elements. Then, a processor-time minimal systolic array realizing the KLL algorithm's DAG is constructed. Its ⌈n²/3⌉ processing elements are organized as a cylindrically connected 2-D mesh when n ≡ 0 (mod 3). When n ≢ 0 (mod 3), the 2-D mesh is connected as a twisted torus.
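To make the bounds in the abstract above concrete, the following sketch tabulates them for a given n. It reads the bracketed n²/3 term as a ceiling, which is an assumption about the garbled notation; the function name `kll_bounds` and the dictionary keys are likewise hypothetical.

```python
def kll_bounds(n: int) -> dict:
    """Step and processor counts quoted for transitive closure on n elements."""
    return {
        "steps": 5 * n - 4,              # time bound of the KLL algorithm
        "kll_pes": n * n,                # PEs in the original KLL array
        "min_pes": -(-(n * n) // 3),     # ceil(n^2 / 3), assumed ceiling
        "topology": (
            "cylindrically connected 2-D mesh" if n % 3 == 0 else "twisted torus"
        ),
    }


for n in (6, 7):
    print(n, kll_bounds(n))
```

For n = 6 this gives 26 steps on 12 PEs instead of 36; for n = 7, 31 steps on 17 PEs instead of 49.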
  • Building blocks for a new generation of application specific computing systems

    Publication Year: 1990 , Page(s): 190 - 201
    Cited by:  Papers (5)  |  Patents (3)

    The iWarp processor, which integrates both communication and computation functions on a single VLSI component, is described. The iWarp component and subsystems including it are powerful building blocks for constructing a new generation of application-specific computing systems. These special-purpose systems can achieve very high performance, while maintaining a high degree of flexibility to address different needs of an application. In particular, iWarp systems deliver high computation bandwidth (up to 20 GFLOPS for a 1024-cell system), as well as high communication bandwidth (320 Mbytes/s per cell). Programming these systems is assisted by modern tools such as optimizing compilers and parallel program generators.
  • A database machine based on surrogate files

    Publication Year: 1990 , Page(s): 55 - 66
    Cited by:  Papers (1)

    Concatenated code word (CCW) surrogate files are useful as indexes for very large knowledge bases to support logic programming inference mechanisms because of their small size and simple maintenance requirements. A parallel back-end database machine is proposed to speed up relational operations based on the CCW surrogate files. The basic idea of the machine is to reduce the amount of fact data to be transferred from the secondary storage systems to satisfy a query by performing relational operations on the CCW surrogate files first. The database machine consists of a number of surrogate file processors (SFPs) and extensional database processors (EDBPs) operating in SIMD mode. The performance of the system for parallel relational operations is evaluated for various cases.
  • GRAPE: a special-purpose computer for N-body problems

    Publication Year: 1990 , Page(s): 180 - 189

    GRAPE (GRAvity PipE) is a special-purpose computer designed to accelerate the numerical integration of the astrophysical N-body problem. The prototype hardware, GRAPE-1, is designed as the backend processor that calculates the gravitational interaction between particles. All other calculations are performed on the host computer connected to GRAPE-1. For large-N calculations (N ≳ 10⁴), GRAPE-1 achieves about 200 Mflops equivalent on one board of about 40 cm by 30 cm, consuming 2.5 W of power. The specialized pipelined architecture of GRAPE-1, optimized for large-N calculation, is the key to the high performance. The authors describe the design, construction and programming of GRAPE-1. The architecture is quite simple, and it is easy to put one pipeline into one LSI chip and make many pipelines work in parallel, without creating a communication bottleneck.
  • Systolic array implementation of nested loop programs

    Publication Year: 1990 , Page(s): 31 - 42
    Cited by:  Papers (17)

    The authors consider a formal and systematic method to convert a class of nested loop programs to single assignment codes and, when possible, to regular algorithms (RAs) for systolic array implementation. The authors concentrate on the analysis of certain imperative nested loop programs in view of the ultimate objective, which is the (semi-)automatic design of systolic arrays from such initial behavioral specifications. The nested loop programs can be represented by a graph displaying the variable dependences between iterations. Characteristic properties of this dependence graph are given. In terms of this dependence graph, a procedure is described to transform the program into a single assignment program in which each variable takes on a unique value during the execution of the program. The method provides a systematic way to analyze data dependences in imperative nested loop programs. The approach is complementary to those used in parallelizing/vectorizing compilers. Instead of searching for independent iterations or statements (i.e. parallelism), iterations that are strictly dependent on each other are detected.
  • Linear arrays for residue mappers

    Publication Year: 1990 , Page(s): 309 - 316

    Pipelined structures based on the residue number system (RNS) have been found suitable for high-speed arithmetic. The polynomial RNS (PRNS) can speed up digital signal processing (DSP)-related tasks like correlations and convolutions. The authors introduce pipelined arrays able to serve as mapping modules for PRNS-based functional units. Such mappings involve polynomial evaluation coupled with modulo operations. The authors show how VLSI array processors can perform modulo operations in a parallel environment. A methodology is presented by which the reliability of such fast architectures can be ensured simply by probing into the mechanics of the computations involved. The proposed techniques provide a hardware base for PRNS implementations. At the same time, a reasonable degree of fault tolerance can be guaranteed in the face of high system throughputs.
  • Reconfigurable vector register windows for fast matrix computation on the orthogonal multiprocessor

    Publication Year: 1990 , Page(s): 202 - 213
    Cited by:  Papers (3)

    The authors present the concept of vector register windows (VRWs) geared towards large-scale matrix computation and image processing applications. The VRWs consist of multiple windows for vector registers providing parallel access and manipulation of large matrix data in the orthogonal multiprocessor (OMP). The number of windows and the number of registers in a window are dynamically reconfigurable over a range of values to match the application problem size. An associated index manipulator provides programmable and on-the-fly data manipulation. The index manipulation feature is shown to be quite powerful for carrying out complex data manipulation functions like row (column) shift, row (column) exchange, matrix rotation, etc. Some matrix algorithms that use the VRWs efficiently are illustrated.
  • Systolic two-port adaptor for high performance wave digital filtering

    Publication Year: 1990 , Page(s): 379 - 388
    Cited by:  Papers (4)

    The authors present a VLSI circuit for implementing wave digital filter (WDF) two-port adaptors. Considerable speedups over conventional designs have been obtained using fine-grained pipelining. This has been achieved through the use of most significant bit (MSB) first carry-save arithmetic, which allows systems to be designed in which latency L is small and independent of either coefficient or input data wordlength. L is determined by the online delay associated with the computation required at each node in the circuit (in this case a multiply/add plus two separate additions). This in turn means that pipelining can be used to considerably enhance the sampling rate of a recursive digital filter. The level of pipelining which will offer enhancement is determined by L and is fine-grained rather than bit-level. In the case of the circuit considered, L=3. For this reason pipeline delays (half latches) have been introduced between every two rows of cells to produce a system with a once-every-cycle sample rate.
  • Scheduling affine parameterized recurrences by means of

    Publication Year: 1990 , Page(s): 100 - 110
    Cited by:  Papers (8)

    The authors present new scheduling techniques for systems of affine recurrence equations. They show that it is possible to extend earlier results on affine scheduling to the case when each variable of the system is scheduled independently of the others by an affine timing-function. This new technique makes it possible to analyze systems of recurrence equations with variables in different index spaces, and multi-step systolic algorithms. This theory applies directly to many problems, such as dynamic programming, LU decomposition, and 2-D convolution, and it avoids in particular preliminary heuristic rewriting of the equations.
  • Fault-tolerant array processors using N-and-half-track switches

    Publication Year: 1990 , Page(s): 426 - 437
    Cited by:  Papers (1)

    The author addresses the fault tolerance issue for rectangular arrays of a large number of processors. An array grid model based on n 1/2-track switches is adopted. This model is a generalization of previous models using 1 1/2-track switches and 2 1/2-track switches. A reconfigurability theorem for n 1/2-track arrays is established and a concept of pseudo processing elements (PEs) is introduced to decompose a routing problem into problems with smaller track numbers. Therefore, with the decomposition technique, only the routing algorithm developed for 1 1/2-track arrays is required. Simulation results for the 1 1/2-track array and 2 1/2-track array are given.
  • Two-level pipelined implementation of systolic block Householder transformation with application to RLS algorithm

    Publication Year: 1990 , Page(s): 758 - 769
    Cited by:  Papers (2)

    The authors propose a systolic block Householder transformation (SBHT) approach to implement the Householder transformation (HT) on a systolic array as well as its application to the recursive-least-squares (RLS) algorithm. Since the data are fetched in a block manner, vector operations are in general required for the vectorized array. However, by using a modified HT algorithm, a two-level pipelined implementation can be used to pipeline the SBHT systolic array both at the vector and word levels. The throughput can be as fast as that of the Givens rotation method. The approach makes the HT amenable to VLSI implementation as well as applicable to real-time, high-throughput applications of modern signal processing. The constrained RLS problem using the SBHT RLS systolic array is also considered.
  • Embedding pyramids in array processors with pipelined busses

    Publication Year: 1990 , Page(s): 665 - 676
    Cited by:  Papers (3)

    The concept of pipelined buses for parallel architectures diverges from the conventional exclusive access buses and offers both possibilities and challenges for significantly improving the efficiency of interprocessor communications in parallel computers. The authors present an efficient embedding of pyramids in array processors with pipelined buses. The embedding has the property that all the neighboring nodes in the pyramid are mapped to the same bus. Thus, any two neighbors in the embedded pyramid can communicate with each other using a single bus cycle.
  • Implementation of systolic algorithms using pipelined functional units

    Publication Year: 1990 , Page(s): 272 - 283
    Cited by:  Patents (15)

    The authors present a method to implement systolic algorithms (SAs) using pipelined functional units (PFUs). This kind of unit makes it possible to improve the throughput of a processor because a new operation can be initiated before the previous one has been completed. The method permits transformation of an SA so that it can be efficiently executed using PFUs. The method is based on two temporal transformations (slowdown and retiming) and one spatial transformation (coalescing). The temporal transformations permit the modification of the SA in such a way that dependences established by the PFU are preserved. The spatial transformation improves the hardware utilization. The method was applied to 1-D SAs with data contraflow. To demonstrate the effectiveness of the method, the authors describe an efficient implementation of a non-time-homogeneous SA with data contraflow for QR decomposition.
  • An improved multilayer neural model and array processor implementation

    Publication Year: 1990 , Page(s): 389 - 400

    The authors present a method for obtaining faster learning in a multilayer neural network. The key ingredient is the concept of floating positive/negative thresholds used in the output neurons to interpret the output states. In a traditional multilayer perceptron, the output state is 1 or 0, depending on whether the activation value exceeds the fixed target threshold or not. The proposed approach determines the state of an output activation by comparing the difference between the output activation and the two floating positive/negative thresholds. If the output activation is closer to the positive threshold or negative threshold, then the output state is 1 or 0 respectively. The iterative learning process completes whenever it is decided that an activation value is closer to its target threshold. Simulation results show that the learning iterations for the model are a hundred times fewer than for the traditional multilayer perceptron. Mapping this multilayer neural net onto a ring systolic array maximizes the strength of VLSI in terms of intensive and pipeline computing.
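The difference between the two output-state rules described in the abstract above can be sketched as follows. The function names, the default threshold of 0.5, and the tie-breaking choice (a tie yields state 0) are assumptions for illustration; the abstract does not say how the floating thresholds are updated, so they are simply passed in.

```python
def fixed_threshold_state(activation: float, threshold: float = 0.5) -> int:
    """Traditional rule: state 1 if the activation exceeds a fixed threshold."""
    return 1 if activation > threshold else 0


def floating_threshold_state(
    activation: float, pos_threshold: float, neg_threshold: float
) -> int:
    """Proposed rule: state 1 if the activation is closer to the positive
    threshold than to the negative one, otherwise 0 (ties give 0 here)."""
    closer_to_pos = abs(activation - pos_threshold) < abs(activation - neg_threshold)
    return 1 if closer_to_pos else 0


# Hypothetical threshold values, chosen only to exercise both branches.
print(floating_threshold_state(0.7, pos_threshold=0.9, neg_threshold=0.1))
print(floating_threshold_state(0.2, pos_threshold=0.9, neg_threshold=0.1))
```

Under the second rule, training for an output can stop as soon as its activation lies on the correct side of the midpoint between the two thresholds, which is consistent with the reduction in learning iterations the abstract reports.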
  • Extensions to linear mapping for regular arrays with complex processing elements

    Publication Year: 1990 , Page(s): 156 - 167
    Cited by:  Papers (10)

    The optimal architectural design of the processing elements (PEs) for an application-specific regular array (RA) is nontrivial if the application has a complex operation set. The authors present an approach that extends the conventional, linear time-space transformation for such cases. In application-specific integrated circuit (ASIC) architectures, one has the freedom to fine-tune all aspects of the architecture to optimize the throughput. Therefore, the PEs can be designed to match the throughput and to optimize the area-cost of an RA architecture. The method presented allows a free design of the PEs with internal pipelining of the data paths, hardware sharing of operators among operations, multicycle operators, and interleaving of the execution of different index points. Compared to methods that allow only parts of these experiments, the local area-time tradeoffs are now explicitly incorporated in the global space-time assignment problem.
  • Reconfiguration of FFT arrays: a flow-driven approach

    Publication Year: 1990 , Page(s): 401 - 413
    Cited by:  Papers (2)

    A new reconfiguration algorithm for defect and fault tolerance in fast Fourier transform (FFT) two-dimensional arrays is presented. The reconfiguration scheme is based on the data flow of the algorithm to minimize the overhead due to the re-routing of information in the reconfigured array. Evaluation of the effectiveness of this approach shows a significant increase in system robustness with respect to other, non-dedicated reconfiguration approaches. Moreover, the possibility of choosing between two reconfiguration algorithms characterized by different complexities and efficiencies results in both an optimal, host-driven reconfiguration (particularly suited for end-of-production yield enhancement) and a fast, self-performed reconfiguration (suited for on-line reliability enhancement).
  • Byte-serial convolvers

    Publication Year: 1990 , Page(s): 530 - 541
    Cited by:  Papers (2)

    It is shown that previously proposed bit-serial convolver schemes (with weights in parallel form), working with zero separation between samples, can be transformed into byte-serial input schemes with a comparable clock rate, thus affording an increase in sampling rate equal to the number of bits in each byte. This is achieved by adopting a modified carry-save circuit. The proposed schemes are based on a modified version of serial-parallel multipliers and on the use of pre-computed multiples of the weights. The case of 2-bit bytes is fully developed. It is shown that the use of samples represented in a biased binary number system leads to schemes that are only slightly more complex than the corresponding bit-serial schemes. The bit rate is determined by the delays of a full adder and a flip-flop. The schemes are composed of a number of bit-slices and appear to be easily partitionable into identical cascaded modules suitable for a fault-tolerant architecture and a WSI implementation.
  • Systolic VLSI compiler (SVC) for high performance vector quantisation chips

    Publication Year: 1990 , Page(s): 145 - 155
    Cited by:  Papers (2)

    An overview is given of a systolic VLSI compiler (SVC) tool currently under development for the automated design of high performance digital signal processing (DSP) chips. Attention is focused on the design of systolic vector quantization chips for use in both speech and image coding systems. The software in question consists of a cell library, silicon assemblers, simulators, test pattern generators, and a specially designed graphics shell interface which makes it expandable and user friendly. It allows very high performance digital coding systems to be rapidly designed in VLSI.
  • Designing specific systolic arrays with the API15C chip

    Publication Year: 1990 , Page(s): 505 - 517
    Cited by:  Papers (2)  |  Patents (24)

    The API15C processor, a building block for different systolic structures, is designed exclusively for single-instruction multiple-data (SIMD) execution mode. To support this mode, the instruction set includes special control instructions. Three parallel I/O ports are available for different interconnection schemes. The API15C chip is designed in a CMOS 2-μm technology. It contains 45,000 transistors on a 6-mm × 6.2-mm silicon area. The functionality of the circuit was tested successfully after the first run. It executes one instruction per clock phase of 100 ns, giving a global rate of 10 MIPS. To validate this processing element as a building block for systolic structures, a programmable interface and two single-board machines were developed. The first is an 18-processor linear structure able to support a wide range of applications. The second is a 28-processor bidimensional structure for a specific application of string comparison. The instruction set is particularly well suited for SIMD operation.