Proceedings of the 1997 IEEE International Conference on Application-Specific Systems, Architectures and Processors

14-16 July 1997

Displaying results 1-25 of 54
  • Proceedings IEEE International Conference on Application-Specific Systems, Architectures and Processors

    Publication Year: 1997
    Freely Available from IEEE
  • An approach for quantitative analysis of application-specific dataflow architectures

    Publication Year: 1997 , Page(s): 338 - 349
    Cited by:  Papers (62)  |  Patents (1)

    In this paper we present an approach for quantitative analysis of application-specific dataflow architectures. The approach allows the designer to rate design alternatives quantitatively and thus supports the search for better-performing architectures during the design process. The context of our work is video signal processing algorithms which are mapped onto weakly programmable, coarse-grain dataflow architectures. The algorithms are represented as Kahn graphs with the functionality of the nodes being coarse-grain functions. We have implemented an architecture simulation environment that permits the definition of dataflow architectures as a composition of architecture elements, such as functional units, buffer elements and communication structures. The abstract, clock-cycle accurate simulator has been built using a multi-threading package and employs object-oriented principles. This results in a configurable and efficient simulator. Algorithms can subsequently be executed on the architecture model, producing quantitative information for selected performance metrics. Results are presented for the simulation of a realistic application on several dataflow architecture alternatives, showing that many different architectures can be simulated in modest time on a modern workstation.
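The simulation style this abstract describes (coarse-grain nodes, buffer elements, one firing per clock cycle) can be illustrated in a few lines. This is a minimal toy model, not the paper's simulator; the `Fifo` and `run` names and the three-stage pipeline are assumptions for demonstration.

```python
from collections import deque

# Toy clock-cycle-driven dataflow simulation: nodes connected by bounded
# FIFO buffers, each node firing at most once per cycle if not blocked.

class Fifo:
    def __init__(self, capacity):
        self.capacity = capacity
        self.q = deque()
    def can_put(self):
        return len(self.q) < self.capacity
    def can_get(self):
        return len(self.q) > 0

def run(n_tokens, fifo_capacity=2):
    """Source -> double -> sink pipeline; returns (outputs, cycle count)."""
    a, b = Fifo(fifo_capacity), Fifo(fifo_capacity)
    produced, outputs, cycles = 0, [], 0
    while len(outputs) < n_tokens:
        cycles += 1
        if b.can_get():                          # sink node
            outputs.append(b.q.popleft())
        if a.can_get() and b.can_put():          # coarse-grain function: double
            b.q.append(2 * a.q.popleft())
        if produced < n_tokens and a.can_put():  # source node
            a.q.append(produced)
            produced += 1
    return outputs, cycles
```

Changing buffer capacities or adding node latencies shifts the cycle count without changing the outputs, which is exactly the kind of quantitative trade-off such a simulator is meant to expose.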

  • Index of authors

    Publication Year: 1997 , Page(s): 539 - 540
    Freely Available from IEEE
  • Architectural approaches for video compression

    Publication Year: 1997 , Page(s): 176 - 185
    Cited by:  Papers (1)

    An overview of architectures for implementations of current video compression schemes is given. Dedicated as well as programmable approaches are discussed. Examples of dedicated, function-specific implementations include architectures for DCT and block matching. For programmable video signal processors, a number of architectural measures to increase video compression performance are reviewed. Actual implementations of video compression schemes typically employ a variety of different architectural approaches. The detailed mix of approaches depends on the targeted application spectrum.

  • A logical framework to prove properties of ALPHA programs

    Publication Year: 1997 , Page(s): 187 - 198

    We present an assertional approach to prove properties of ALPHA programs. ALPHA is a functional language based on affine recurrence equations. We first present two kinds of operational semantics for ALPHA together with some equivalence and confluence properties of these semantics. We then present an attempt to provide ALPHA with an external logical framework. We therefore define a proof method based on invariants. We focus on a particular class of invariants, namely canonical invariants, that are a logical expression of the program's semantics. We finally show that this framework is well-suited to prove partial properties, equivalence properties between ALPHA programs and properties that we cannot express within the ALPHA language.

  • Three-dimensional orthogonal tile sizing problem: mathematical programming approach

    Publication Year: 1997 , Page(s): 209 - 218

    We discuss in this paper the problem of finding the optimal tiling transformation of three-dimensional uniform recurrences on a two-dimensional torus/grid of distributed-memory general-purpose machines. We show that even for the simplest case of recurrences that allows for such a transformation, the corresponding problem of minimizing the total running time is a non-trivial non-linear integer programming problem. For the latter we derive an O(1) algorithm for finding a good approximate solution. The theoretical evaluations and the experimental results show that the obtained solution approximates the original minimum sufficiently well in the context of the considered problem. Such analytical results are of obvious interest and can be successfully used in parallelizing compilers as well as in performance tuning of parallel codes.
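As an illustration of the kind of search such a closed-form tile-sizing result replaces, here is a brute-force sketch over 2-D tile shapes against a simple compute-plus-communication cost model. The model, the constants, and the per-tile memory cap are assumptions for demonstration, not the paper's objective function.

```python
# Illustrative only: exhaustive search for a 2-D tile shape minimising an
# assumed cost model. compute is proportional to tile area, communication
# to the tile perimeter; cap bounds the tile area (e.g. local memory).

def tile_cost(p, q, n=256, alpha=1.0, beta=50.0):
    tiles = (n // p) * (n // q)       # number of tiles covering an n x n domain
    compute = alpha * p * q           # work per tile (area)
    comm = beta * (p + q)             # boundary traffic per tile (perimeter)
    return tiles * (compute + comm)

def best_tile(n=256, cap=1024):
    candidates = [(p, q) for p in range(1, n + 1) for q in range(1, n + 1)
                  if n % p == 0 and n % q == 0 and p * q <= cap]
    return min(candidates, key=lambda pq: tile_cost(*pq, n=n))
```

With these arbitrary constants the search settles on square 32x32 tiles: the area term is constant over all tilings, so the perimeter-proportional communication term is what drives the choice toward square tiles at the memory cap.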

  • Efficient implementation of rotation operations for high performance QRD-RLS filtering

    Publication Year: 1997 , Page(s): 162 - 174
    Cited by:  Papers (9)  |  Patents (11)

    In this paper we present practical techniques for implementing Givens rotations based on the well-known CORDIC algorithm. Rotations are the basic operation in many high performance adaptive filtering schemes as well as numerous other advanced signal processing algorithms relying on matrix decompositions. To improve the efficiency of these methods, we propose to use “approximate rotations”, whereby only a few (i.e. r≪b, where b is the operand word length) elementary angles of the original CORDIC sequence are applied, so as to reduce the total number of required shift-add operations. This seemingly rather ad hoc and heuristic procedure constitutes a representative example of a very useful design concept termed “approximate signal processing”, recently introduced and formally exposed by S.H. Nawab et al. (1997), concerning the trade-off between system performance and implementation complexity, i.e. between accuracy and resources. This is a subject of increasing importance with respect to the efficient realization of demanding signal processing tasks. We present the application of the described rotation schemes to QRD-RLS filtering in wireless communications, specifically high speed channel equalization and beamforming, i.e. for intersymbol and co-channel/interuser interference suppression, respectively. It is shown via computer simulations that the convergence behavior of the scheme using approximate Givens rotations is insensitive to the value of r, and that the misadjustment error decreases as r is increased, opening up possibilities for “incremental refinement” strategies.
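The core idea, truncating the CORDIC angle sequence to its first r elementary angles and compensating the scale factor, can be sketched directly. This is a generic rotation-mode CORDIC, not the paper's hardware scheme; function names are illustrative.

```python
import math

# Approximate rotation: apply only the first r CORDIC micro-rotations
# atan(2^-i) instead of all b of them, then compensate the scale factor.

def cordic_rotate(x, y, theta, r):
    z = theta
    for i in range(r):
        d = 1.0 if z >= 0 else -1.0          # steer residual angle toward 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)
    k = 1.0
    for i in range(r):
        k *= math.cos(math.atan(2.0 ** -i))  # scale-factor compensation
    return x * k, y * k

def rotation_error(theta, r):
    """Distance between the approximate and the exact rotation of (1, 0)."""
    xa, ya = cordic_rotate(1.0, 0.0, theta, r)
    return math.hypot(xa - math.cos(theta), ya - math.sin(theta))
```

The residual angle after r micro-rotations is bounded by atan(2^-(r-1)), so the approximation error shrinks roughly by half per additional iteration, which is the accuracy-versus-resources dial the abstract describes.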

  • Determination of the processor functionality in the design of processor arrays

    Publication Year: 1997 , Page(s): 199 - 208
    Cited by:  Papers (3)

    In this paper the inclusion of hardware constraints into the design of massively parallel processor arrays is considered. We propose an algorithm which determines an optimal scheduling function as well as the selection of components which have to be implemented in one processor of a processor array. The arising optimization problem is formulated as an integer linear program which also takes the necessary chip area of a hardware implementation into consideration. We assume that an allocation function is given and that a partitioning of the processor array is required to match a limited chip area in silicon.

  • Heterogeneous multiprocessor scheduling and allocation using evolutionary algorithms

    Publication Year: 1997 , Page(s): 294 - 303
    Cited by:  Papers (1)

    We propose a novel stochastic approach for the problem of multiprocessor scheduling and allocation under timing and resource constraints using an evolutionary algorithm (EA). For composite schemes of DSP algorithms, a compact problem encoding has been developed with emphasis on the allocation/binding part of the problem, as well as an efficient problem transformation-decoding scheme in order to avoid infeasible solutions and therefore time-consuming repair mechanisms. Thus, the algorithm is able to handle even large-size problems within moderate computation time. Simulation results comparing the proposed EA with optimal results provided by mixed integer linear programming (MILP) show that the EA achieves the same or similar results, but in much less time as problem size increases.
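A drastically simplified version of the allocation part can be sketched as a toy EA: independent tasks are bound to heterogeneous processors and the makespan is minimized by elitist selection plus point mutation. The encoding, operators, and problem instance are assumptions for illustration; the paper's constrained scheduling problem is far richer.

```python
import random

# Toy evolutionary algorithm: an individual is a list mapping each task
# to a processor; fitness is the makespan on speed-heterogeneous processors.

def makespan(alloc, times, speeds):
    load = [0.0] * len(speeds)
    for task, proc in enumerate(alloc):
        load[proc] += times[task] / speeds[proc]
    return max(load)

def evolve(times, speeds, pop_size=20, generations=200, seed=1):
    rng = random.Random(seed)
    n, m = len(times), len(speeds)
    pop = [[rng.randrange(m) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda a: makespan(a, times, speeds))
        survivors = pop[:pop_size // 2]          # elitist selection
        children = []
        for parent in survivors:
            child = parent[:]
            child[rng.randrange(n)] = rng.randrange(m)  # point mutation
            children.append(child)
        pop = survivors + children
    return min(pop, key=lambda a: makespan(a, times, speeds))
```

Because the best half always survives, solution quality never degrades across generations; the result can be checked against the trivial lower bound (total work divided by total speed).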

  • A flexible VLSI architecture for variable block size segment matching with luminance correction

    Publication Year: 1997 , Page(s): 479 - 488
    Cited by:  Papers (1)  |  Patents (2)

    This paper describes a flexible 25.6 Giga operations per second exhaustive search segment matching VLSI architecture to support evolving motion estimation algorithms as well as block matching algorithms of established video coding standards. The architecture is based on a 16×16 processor element (PE) array and a 12 kbyte on-chip search area RAM and allows concurrent calculation of motion vectors for 32×32, 16×16, 8×8 and 4×4 blocks and partial quadtrees (called segments) for a +/-32 pel search range with 100% PE utilization. This architecture supports object based algorithms by excluding pixels outside of video objects from the segment matching process, as well as advanced algorithms like variable blocksize segment matching with luminance correction. A preprocessing unit is included to support halfpel interpolation and pixel decimation. The VLSI has been designed using VHDL synthesis and a 0.5 μm CMOS technology. The chip will have a clock rate of 100 MHz (min.), allowing real-time variable blocksize segment matching of 4CIF video (704×576 pel) at 15 fps or luminance corrected variable blocksize segment matching at above CIF (352×288), 15 fps resolution.
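The property that makes concurrent multi-blocksize matching cheap is that distortion measures like SAD compose over a quadtree: the SAD of a 2N×2N block is the sum of the SADs of its four N×N sub-blocks, so one pass over the PE array serves every block size. A small sketch (illustrative names, plain Python instead of a PE array):

```python
# SAD over a quadtree: the four sub-block SADs and the full-block SAD
# come from the same absolute differences, computed once.

def sad(cur, ref, x0, y0, n):
    """Sum of absolute differences over an n x n block at (x0, y0)."""
    return sum(abs(cur[y0 + j][x0 + i] - ref[y0 + j][x0 + i])
               for j in range(n) for i in range(n))

def quadtree_sads(cur, ref, n):
    """SADs of the four n/2 sub-blocks, plus the full n x n SAD as their sum."""
    h = n // 2
    subs = [sad(cur, ref, x, y, h) for y in (0, h) for x in (0, h)]
    return subs, sum(subs)
```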

  • PART: a partitioning tool for efficient use of distributed systems

    Publication Year: 1997 , Page(s): 328 - 337
    Cited by:  Papers (1)

    The interconnection of geographically distributed supercomputers via high-speed networks allows users to access the needed compute power for large-scale, complex applications. For efficient use of such systems, the variance in processor performance and network (i.e., interconnection network versus wide area network) performance must be considered. In this paper, we present a decomposition tool, called PART, for distributed systems. PART takes into consideration the variance in performance of the networks and processors as well as the computational complexity of the application. This is achieved via the parameters used in the objective function of simulated annealing. The initial version of PART focuses on finite element based problems. The results of using PART demonstrate a 30% reduction in execution time as compared to using conventional schemes that partition the problem domain into equal-sized subdomains.
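The idea of folding processor and network heterogeneity into an annealing objective can be shown with a toy version: mesh elements are assigned to processors, and the energy weights speed-scaled load imbalance against link-cost-scaled edge cut. All weights and names here are assumptions for illustration, not PART's actual objective.

```python
import math
import random

# Toy simulated annealing for heterogeneous partitioning.

def energy(assign, work, speed, edges, link_cost):
    loads = [0.0] * len(speed)
    for v, p in enumerate(assign):
        loads[p] += work[v] / speed[p]          # slower CPU -> larger load term
    cut = sum(link_cost for (u, v) in edges if assign[u] != assign[v])
    return max(loads) + cut

def anneal(work, speed, edges, link_cost=0.5, steps=2000, seed=3):
    rng = random.Random(seed)
    n, m = len(work), len(speed)
    assign = [rng.randrange(m) for _ in range(n)]
    e = energy(assign, work, speed, edges, link_cost)
    best, best_e, t = assign[:], e, 1.0
    for _ in range(steps):
        v = rng.randrange(n)
        old = assign[v]
        assign[v] = rng.randrange(m)            # move one element
        e2 = energy(assign, work, speed, edges, link_cost)
        if e2 <= e or rng.random() < math.exp((e - e2) / t):
            e = e2
            if e < best_e:
                best, best_e = assign[:], e
        else:
            assign[v] = old                     # reject the move
        t *= 0.995                              # cooling schedule
    return best, best_e
```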

  • A novel sequencer hardware for application specific computing

    Publication Year: 1997 , Page(s): 392 - 401
    Cited by:  Papers (7)

    This paper introduces a powerful novel sequencer for controlling computational machines and for structured DMA (direct memory access) applications. It is mainly focused on applications using 2-dimensional memory organization, from which most of the inherent speed-up is obtained. A classification scheme of computational sequencing patterns and storage schemes is derived. In the context of application-specific computing, the paper illustrates the sequencer's usefulness especially for data sequencing, recalling examples published earlier as far as needed for completeness. The paper also discusses how the new sequencer hardware provides substantial speed-up compared to the use of traditional sequencing hardware.

  • Mapping multirate dataflow to complex RT level hardware models

    Publication Year: 1997 , Page(s): 283 - 292
    Cited by:  Papers (7)

    The design of digital signal processing systems typically consists of an algorithm development phase carried out at a behavioral level and the selection of an efficient hardware architecture for implementation. In order to speed up the joint optimization of algorithms and architectures, a fast path to implementation must be provided. This can be achieved efficiently by directly mapping the data flow specification of the system to an RTL target architecture by means of HDL code generation. For algorithm design, communication systems are most easily modeled using multirate data flow graphs in which no notion of time is maintained. HDL code generation introduces a cycle based timing model and maps the data flow models to RTL implementations, which are usually taken from a library. Due to the increase in ASIC design complexity, these building blocks reach a high level of functionality and have complex interfacing properties. Therefore, it becomes necessary to generate additional interfacing and controlling hardware to synthesize an operable system. In this paper, we present a new approach to mapping multirate dataflow graphs to complex RTL hardware models and derive algorithms to synthesize these high-level RTL building blocks into a complete operable system.
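One standard step underlying any such multirate mapping is solving the balance equations of the graph: for every edge a→b producing p and consuming c tokens, the firing counts r must satisfy r[a]·p = r[b]·c. The sketch below is textbook synchronous-dataflow analysis under assumed names, not the paper's tool; it assumes a connected graph with consistent rates.

```python
import math
from fractions import Fraction

def repetition_vector(edges):
    """edges: list of (src, dst, produced, consumed) tuples.
    Returns the smallest integer firing counts balancing every FIFO."""
    rates = {}
    for a, b, _, _ in edges:
        rates.setdefault(a, None)
        rates.setdefault(b, None)
    rates[edges[0][0]] = Fraction(1)        # pick an arbitrary root
    changed = True
    while changed:                          # propagate relative rates
        changed = False
        for a, b, p, c in edges:
            if rates[a] is not None and rates[b] is None:
                rates[b] = rates[a] * p / c
                changed = True
            elif rates[b] is not None and rates[a] is None:
                rates[a] = rates[b] * c / p
                changed = True
            elif rates[a] is not None and rates[b] is not None:
                if rates[a] * p != rates[b] * c:
                    raise ValueError("inconsistent sample rates")
    # scale the fractional rates to the smallest integer firing counts
    lcm = 1
    for r in rates.values():
        lcm = lcm * r.denominator // math.gcd(lcm, r.denominator)
    counts = {n: int(r * lcm) for n, r in rates.items()}
    g = 0
    for v in counts.values():
        g = math.gcd(g, v)
    return {n: v // g for n, v in counts.items()}
```

For an edge A→B with rates (2, 3), A must fire three times for every two firings of B; the cycle-based RTL controller generated downstream has to honor exactly these ratios.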

  • Fast arithmetic and fault tolerance in the FERMI system

    Publication Year: 1997 , Page(s): 374 - 383
    Cited by:  Papers (1)

    FERMI is a data acquisition system for calorimetry experiments in high energy physics at the LHC, CERN. The system contains a large number of acquisition channels, with a precision of 16 bits and a sampling rate of 40 MHz. A large part of the information delivered by the channels is processed locally, to reduce the amount of data. This requires clustering several channels by summing them. The paper presents the design of a fast, low cost adder chip, based on the implementation of column compression techniques for the computation of integer addition. Since the system is operating in a radiation-hard environment, fault tolerance (namely fault detection) is implemented by means of arithmetic codes.
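The principle of fault detection with an arithmetic code can be shown with the classic AN code: operands are multiplied by a constant A before addition, and any single-bit fault in the sum changes it by ±2^k, which for A = 3 is never a multiple of A, so a residue check exposes it. This is the generic textbook code, not FERMI's specific design.

```python
# AN arithmetic code with A = 3: addition commutes with encoding
# (A*x + A*y == A*(x + y)), so a valid sum is always divisible by A.

A = 3

def encode(x):
    return A * x

def checked_add(cx, cy):
    """Add two encoded operands; raise if the residue check fails."""
    s = cx + cy
    if s % A != 0:
        raise ValueError("arithmetic-code check failed: fault detected")
    return s // A

def inject_bit_flip(value, bit):
    """Model a single-bit hardware fault."""
    return value ^ (1 << bit)
```

Powers of two are congruent to 1 or 2 modulo 3, never 0, which is why a single flipped bit can never slip through the divisibility check.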

  • Buffer size optimization for full-search block matching algorithms

    Publication Year: 1997 , Page(s): 76 - 85

    This paper presents how to find an optimized buffer size for VLSI architectures of full-search block matching algorithms. Starting from the DG (dependency graph) analysis, we focus on the problem of reducing the internal buffer size under a minimal I/O bandwidth constraint. As a result, a systematic design procedure for buffer optimization is derived to reduce realization cost.

  • A methodology for user-oriented scalability analysis

    Publication Year: 1997 , Page(s): 304 - 315

    Scalability analysis provides information about the effectiveness of increasing the number of resources of a parallel system. Several methods have been proposed which use different approaches to provide this information. This paper presents a family of analysis methods oriented to the user. The methods in this family should assist the user in estimating the benefits when increasing the system size. The key issue in the proposal is the appropriate combination of a scaling model, which reflects the way the users utilize an increasing number of resources, and a figure of merit that the user wants to improve with the larger system. Another important element in the proposal is the approach to characterize the scalability, which enables quick visual analyses and comparisons. Finally, three concrete examples of methods belonging to the proposed family are introduced in this paper.
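The simplest instance of pairing a scaling model with a figure of merit is the textbook fixed-size (Amdahl) model with speedup as the merit; this sketch is purely that textbook example, not one of the paper's proposed methods.

```python
# Fixed-size scaling model: a program with serial fraction f run on p
# processors. Speedup and efficiency as user-facing figures of merit.

def amdahl_speedup(f, p):
    return 1.0 / (f + (1.0 - f) / p)

def efficiency(f, p):
    return amdahl_speedup(f, p) / p
```

Even this toy pairing shows the shape a user-oriented analysis reports: speedup grows toward the asymptote 1/f while efficiency falls, so the useful system size depends on which figure of merit the user cares about.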

  • Scheduling in co-partitioned array architectures

    Publication Year: 1997 , Page(s): 219 - 228
    Cited by:  Papers (3)

    We consider a balanced combined application of the known LPGS- and LSGP-partitioning which we call co-partitioning. This approach allows a structural adjustment of the array design as well as a balancing of the size of the local memory and the I/O demand between the processing elements of the co-partitioned array. We determine the size of the LSGP-partitions such that there exists a sequential scheduling within the LSGP-partitions which is free of wait states. We give the proof for the existence of such a scheduling, and we give explicit formulas for the lower and upper bounds of the loops of a for-loop program which represents one of the possible sequential schedulings.

  • Conception and design of a RISC CPU for the use as embedded controller within a parallel multimedia architecture

    Publication Year: 1997 , Page(s): 412 - 421
    Cited by:  Papers (4)

    In this paper, the problem of defining a high performance control structure for a parallel motion estimation architecture for MPEG2 coding is addressed. Various design and architecture choices are discussed and the final architecture is described. It represents a combined MIMD-SIMD approach which is based on a small but efficient ASIP with subword parallelism.

  • Automatic data mapping of signal processing applications

    Publication Year: 1997 , Page(s): 350 - 362
    Cited by:  Papers (3)  |  Patents (8)

    This paper presents a technique to automatically map a complete digital signal processing (DSP) application onto a parallel machine with distributed memory. Unlike other applications where coarse or medium grain scheduling techniques can be used, DSP applications integrate several thousand tasks and hence necessitate fine-grain considerations. Moreover, finding an effective mapping requires taking into account both architectural resource constraints and real-time constraints. The main contribution of this paper is to show how it is possible to handle and solve data partitioning and fine-grain scheduling under the above operational constraints using concurrent constraint logic programming (CCLP) languages. Our concurrent resolution technique, handling linear and nonlinear constraints, takes advantage of the special features of signal processing applications and provides a solution equivalent to a manual solution for the representative panoramic analysis (PA) application.

  • An efficient architecture for the in place fast cosine transform

    Publication Year: 1997 , Page(s): 499 - 508
    Cited by:  Papers (1)

    The discrete cosine transform (DCT) is at the core of image encoding and compression applications. We present a new architecture to efficiently compute the fast direct and inverse cosine transform, based on reordering the butterflies after their computation. The designed architecture exploits locality, allowing pipelining between stages and saving memory (in place). The result is an efficient architecture for high speed computation of the DCT that significantly reduces the area required for VLSI implementation.

  • Low latency word serial CORDIC

    Publication Year: 1997 , Page(s): 124 - 131
    Cited by:  Papers (3)

    In this paper we present a modification of the CORDIC algorithm which reduces the number of iterations almost to half by merging two successive iterations of the basic algorithm. The two coefficients per iteration are obtained with only a small increase in the cycle time by estimating one of the coefficients. A correcting iteration method is used to correct the possible errors produced by the estimate. Moreover, the modified iteration permits the reduction of the number of cycles required for the compensation of the scaling factor. The resulting architecture is word serial, working both in rotation and vectoring operation modes, presenting a low latency in comparison with the classical CORDIC approach.

  • A multiprocessor system for real time high resolution image correlation

    Publication Year: 1997 , Page(s): 384 - 391
    Cited by:  Papers (3)

    In this paper a dedicated multiprocessor architecture for a real time implementation of the normalized cross correlation function (NCCF) on images up to 1024×1024 pixels is presented. The computational requirements are dramatically reduced by calculating this algorithm in the frequency domain. In contrast to a standard implementation of the NCCF, which inherently imposes rectangular templates, the proposed enhanced method also allows searching for free-form templates which may even include holes. The computation in the frequency domain is based on a single program multiple data (SPMD) architecture which includes a dedicated ASIC for the computation of the 1D complex FFT. Besides this specific part of the system, the image pre- and post-processing tasks are supported by general purpose DSPs. A system consisting of 4 ASICs and 2 Sharc DSPs is able to compute the enhanced NCCF of a free-form template on images of 1024×1024 pixels within 134 ms (8 frames/s).
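The frequency-domain shortcut rests on the correlation theorem: circular cross-correlation equals an inverse DFT of the pointwise conjugate product of the transforms. A 1-D sketch with a naive O(n²) DFT for clarity (the paper's system uses 2-D data and FFT hardware); names are illustrative.

```python
import cmath

# Cross-correlation via the DFT, checked against the direct definition.

def dft(x, inverse=False):
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
               for m in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out

def xcorr_freq(template, signal):
    """Circular cross-correlation via conj(DFT(t)) * DFT(s)."""
    T = dft([complex(v) for v in template])
    S = dft([complex(v) for v in signal])
    prod = [t.conjugate() * s for t, s in zip(T, S)]
    return [v.real for v in dft(prod, inverse=True)]

def xcorr_direct(template, signal):
    """Reference: r[k] = sum_m t[m] * s[(m + k) mod n]."""
    n = len(signal)
    return [sum(template[m] * signal[(m + k) % n] for m in range(n))
            for k in range(n)]
```

With an FFT replacing the naive DFT, the cost per correlation drops from O(n²) to O(n log n) per dimension, which is the reduction the abstract calls dramatic.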

  • A VLSI architecture for image geometrical transformations using an embedded core based processor

    Publication Year: 1997 , Page(s): 86 - 95
    Cited by:  Papers (1)

    This paper presents a circuit dedicated to real time geometrical transforms of pictures. The supported transforms are third degree polynomials of two variables. The post-processing is performed by a bilinear filter. An embedded DSP core is in charge of high level, low rate control tasks while a set of hard-wired units is in charge of computing-intensive low level tasks.

  • Design methodology for digital signal processing

    Publication Year: 1997 , Page(s): 468 - 477
    Cited by:  Papers (2)

    Improvements in semiconductor integration density, and the resulting problem of having to manage designs of increasing complexity, form an old but still current challenge. The new challenge lies in a new level of architecture heterogeneity, e.g. mixing hard-wired digital circuits with software-programmed signal processors on one die. Hence, we are moving by one level of abstraction from semi-custom standard cells to semi-custom `block cells'. This results in a new dimension in the gap between algorithm/system design and architecture/circuit design, not yet sufficiently addressed by any tools. This paper presents a method of analyzing the problem by orthogonalizing algorithms into data transfer and data manipulation, and carrying this over to the control and I/O design as well. This approach might be a promising basis for flexibly mapping algorithms onto future `block cell' designs, and furthermore for designing new system simulation tools which allow tools to be integrated for a flexible mapping of algorithms onto various different hardware architecture domains.

  • A datapath generator for full-custom macros of iterative logic arrays

    Publication Year: 1997 , Page(s): 438 - 447
    Cited by:  Papers (13)

    A new flexible datapath generator which allows the automated design of full-custom macros covering dedicated filter structures as well as programmable DSP cores is presented. The underlying concept combines the advantages of full-custom designs concerning power dissipation, silicon area, and throughput rate with a moderate design effort. In addition, the datapath generator can be easily included in existing semi-custom design flows. This enables highly efficient VLSI implementations of optimized full-custom macros (datapaths) embedded into synthesized standard cell designs covering uncritical structures in terms of area, power, and throughput (e.g. control paths) using common design flows. In order to demonstrate the datapath generator assisted design flow, the implementation of a time-shared correlator is presented as an example.
