Application-Specific Systems, Architecture Processors, 2005. ASAP 2005. 16th IEEE International Conference on

Date 23-25 July 2005

  • 16th International Conference on Application-Specific Systems, Architecture and Processors

  • 16th International Conference on Application-Specific Systems, Architecture and Processors - Title Page

    Page(s): i - iii
  • 16th International Conference on Application-Specific Systems, Architecture and Processors - Copyright Page

    Page(s): iv
  • 16th International Conference on Application-Specific Systems, Architecture and Processors - Table of contents

    Page(s): v - viii
  • Message from the Conference Chairs

    Page(s): ix
  • Conference Organizers

    Page(s): xi
  • Program Committee

    Page(s): xii
  • External referees

    Page(s): xiii
  • Area - time - power and design effort: the basic tradeoffs in application specific systems

    Page(s): 3 - 6

    Application-specific design is always a tradeoff among competing design goals (or design parameters). In addition to the well-established area (cost), time (performance), and power metrics, specific applications imply a relatively limited market, so design cost becomes an especially important consideration. As technology offers increasing transistor density at lower cost, power constraints limit frequency as the primary avenue to performance. The alternative is to use area (transistors) to recover performance, putting an additional strain on the design budget. The search for flexibility in design without paying a significant area - time - power cost remains the primary problem for application-specific and system-on-a-chip (SoC) design.
  • Using symbolic feasibility tests during design space exploration of heterogeneous multi-processor systems

    Page(s): 9 - 14

    The task of automatic design space exploration of heterogeneous multi-processor systems is often tackled with evolutionary algorithms. In this paper, we propose a novel approach that combines evolutionary algorithms with symbolic techniques in order to improve the convergence speed. The main idea is to guide the search towards the feasible region by utilizing symbolic techniques. We present experimental results showing the advantages of our approach, especially when the search space contains only a few feasible solutions, which is often the case when designing heterogeneous multi-processor systems.
  • Expression synthesis in process networks generated by LAURA

    Page(s): 15 - 21

    The COMPAAN/LAURA (Stefanov et al., 2004) tool chain maps nested-loop applications written in Matlab onto reconfigurable platforms, such as FPGAs. COMPAAN rewrites the original Matlab application as a process network in which the control is parameterized and distributed. This control is given as parameterized polytopes that are expressed in terms of pseudo-linear expressions. These expressions cannot always be mapped efficiently onto hardware because they contain multiplication and integer division operations, which obstructs the data flow through the processes. In this paper, we therefore present an expression compiler that efficiently maps pseudo-linear expressions onto a dedicated hardware data path in such a way that the distributed and parameterized control never obstructs the data flow through processors. This compiler employs techniques such as number theory axioms, the method of difference, and predicated static single assignment code.
  • Artificial deadlock detection in process networks for ECLIPSE

    Page(s): 22 - 27

    The Kahn process network (KPN) is a popular model of computation for describing streaming applications. In a KPN model, processes communicate through unbounded unidirectional FIFOs. When theoretically unbounded FIFOs are implemented using finite memory, artificial deadlocks can occur because one or more FIFOs are too small. Generally, a system designer must be able to make a design-time trade-off between execution time and memory usage, preferably using no more memory than required for obtaining a certain execution time. But it is practically impossible to decide at design time which FIFO sizes are sufficient to run the application without any artificial deadlocks. Hence, there is a need for a runtime mechanism for handling artificial deadlock situations in process networks. Existing mechanisms detect artificial deadlocks only after all KPN processes block. This results in excessive blocking of processes and an application that appears to hang. In this paper, we present an improved mechanism for early detection of artificial deadlocks and its implementation on ECLIPSE (extended CPU local irregular processing architecture), an application-domain-specific architecture.
  • Hardware/software interface for multi-dimensional processor arrays

    Page(s): 28 - 35

    On most recent systems on chip, the performance bottleneck is the on-chip communication medium, whether bus or network. Multimedia applications require a large communication bandwidth between the processor and graphic hardware accelerators, hence an efficient communication scheme using burst mode is mandatory. In the context of data-flow hardware accelerators, we approach this problem as a classical resource-constrained problem. We explain how to use recent optimization techniques to define a conflict-free schedule of input/output for multi-dimensional processor arrays (e.g. 2D grids). This schedule is static and allows us to perform further optimizations, such as grouping successive data in packets to operate in burst mode. We also present an effective VHDL implementation on FPGA and compare our approach to run-time congestion resolution, showing important gains in hardware area.
  • Casablanca II: implementation of a real-time RISC core for embedded systems

    Page(s): 36 - 42

    We extended a general-purpose RISC processor architecture and developed a new RISC core, Casablanca II, for supporting real-time processing in embedded systems. The processor core has multiple register sets and achieves fast context switching by automatically changing the active register set, which reduces the overhead of saving and restoring the contents of the registers when exceptions or interruptions occur. In addition, the core has mechanisms for explicit data cache control, enabling data prefetching and fast DMA, which is invoked by executing extended instructions. In this paper, we describe the organization of Casablanca II, developed using an ASIC process, and present a preliminary evaluation of the processor.
  • Behavioral specification of control interface for signal processing applications

    Page(s): 43 - 49

    Data-driven applications that are mapped onto large-scale systems need to be controlled for (re-)configuration, test and monitoring. Such systems consist of distributed and heterogeneous components. We specify the behavior of a control network independent of the system architecture and independent of the data-driven application. In particular, we address the key problem of interfacing the control model with the data-driven application model.
  • Speedups from partitioning critical software parts to coarse-grain reconfigurable hardware

    Page(s): 50 - 55

    In this paper, we propose a hardware/software partitioning method for improving application performance in embedded systems. Critical software parts are accelerated on the hardware of a single-chip generic system comprising an embedded processor and coarse-grain reconfigurable hardware. The reconfigurable hardware is realized by a 2D array of processing elements. The partitioning flow utilizes an analysis procedure at the basic-block level for detecting kernels in software. A list-based mapping algorithm has been developed for estimating the execution cycles of kernels on coarse-grain reconfigurable arrays. The proposed partitioning flow has been largely automated for a program description in the C language. Extensive hardware/software experiments on five real-life applications are presented. The benchmarks spend an average of 69% of their instruction count in the kernels, which account for an average of 11% of their code. The results illustrate that by mapping critical code onto coarse-grain reconfigurable hardware, speedups ranging from 1.2 to 3.7, with an average value of 2.2, are achieved.
  • A SW/configware codesign methodology for control dominated applications

    Page(s): 56 - 61

    In this paper, we present a partitioning methodology targeting a dynamically reconfigurable architecture. We first identify this class of architectures and point out the lack of underlying tools and compiler support for exploiting the potential task-level parallelism (TLP) of these architectures. Applications in today's and future embedded systems are increasingly control dominated, which makes it necessary to use a specification that handles treatment and control jointly. Our methodology starts from a system-level specification in safe state machines (SSM: the graphical formalism of ESTEREL) integrating task-level granularity treatments (as C function calls) in a control flow environment. After simulation and formal proof, we explicitly partition all the different configurations of the SSM, i.e. the different combinations of control and treatment that the system has to perform at each operational tick. We also develop a technique that contains the explosion in the number of these configurations, and we establish the efficiency of our method by applying it to a video supervision application and to the JPEG 2000 standard.
  • Towards a framework for system-level design of multiprocessor SoC platforms for media processing

    Page(s): 65 - 72

    Recently, a number of event-centric models have been proposed for analyzing multimedia applications running on multiprocessor system-on-chip (SoC) platforms. This has given shape to a general framework with which different timing and performance analysis questions can be answered in a single coherent manner. Central to this framework is a model for expressing the timing properties associated with different multimedia streams and a means for computing how these properties change as a stream is successively processed by the different processors of a platform. In contrast to standard event models such as periodic or sporadic, this model can accurately capture the data-dependent execution time variabilities associated with multimedia tasks and the burstiness of on-chip traffic resulting from multimedia processing. In this paper, we give a high-level view of this framework, describe setups that can currently be modelled using it, and identify possible directions in which this framework should be extended to make it more usable.
  • Communication-centric SoC design for nanoscale domain

    Page(s): 73 - 78

    In the realm of 35nm technology, it becomes possible to have thousands of IP blocks that need to communicate efficiently. Large-scale integration of these blocks onto a single chip makes the use of truly scalable network-on-chip (NoC) communication architectures inevitable. This paper provides an overview of the outstanding research issues involved in designing application-specific NoC architectures by explicitly considering the level of customization envisioned in the communication architecture. For each category of approaches, we discuss the significance of the problem, provide a problem statement and survey the relevant solutions to date.
  • Using TLM for exploring bus-based SoC communication architectures

    Page(s): 79 - 85

    As billion-transistor systems-on-chip (SoCs) become commonplace and design complexity continues to increase, designers are faced with the daunting task of meeting escalating design requirements in shrinking time-to-market windows, and have begun using an IP-based SoC design methodology that permits reuse of key SoC functional components. Since the communication architectures connecting components in these SoC designs significantly impact system performance, it is imperative that designers explore the communication design space efficiently, quickly and early in the design flow. Transaction level modeling (TLM) is an emerging abstraction that facilitates early exploration of SoC architectures. This paper outlines a typical IP-based SoC design flow, and presents the cycle count accurate at transaction boundaries (CCATB) modeling abstraction, which is a fast, efficient and flexible approach for exploring bus-based communication architectures in SoC designs. CCATB models not only take less time to model but are also faster to simulate than existing modeling abstractions for communication architecture exploration, such as pin-accurate BCA (PA-BCA) and transaction-based BCA (T-BCA). Experimental results on several industrial SoC subsystem case studies show that CCATB models are faster than PA-BCA models by as much as 120% on average, and faster than T-BCA models by 67% on average, demonstrating the advantages of the CCATB-based TLM abstraction for exploring bus-based SoC communication architectures.
  • Exploring design space of VLIW architectures

    Page(s): 86 - 91

    Architectures based on very long instruction word (VLIW) have found fertile ground in multimedia electronic appliances thanks to their ability to exploit high degrees of instruction level parallelism (ILP) with a reasonable tradeoff in complexity and silicon costs. Effective compiler support for predicated execution using the hyperblock drastically increases the ILP, even for control-dominated applications in which the branch instruction frequency is very high. The use of these techniques, however, is known to increase the instruction footprint, consequently putting pressure on the memory hierarchy. In this paper, we evaluate the performance/power trade-off in a system comprising a VLIW processor and a two-level hierarchical memory subsystem. Via simulation, we show that the efficiency of a compiler that is able to exploit predicated execution by hyperblock formation is greatly affected by the configuration of the memory subsystem as well as the configurable processor parameters. The enabling or disabling of hyperblock formation should therefore not be evaluated separately or independently, but seen as a further free parameter to be tuned in a strategy of design space exploration.
  • The midlifekicker microarchitecture evaluation metric

    Page(s): 92 - 97

    We introduce the midlifekicker metric for evaluating microarchitectures, mostly during the design process. We assume a microarchitecture designed at a time T-1 and estimate whether a new microarchitecture projected for time T has advantages over the microarchitecture designed at T-1 and remapped onto the same technology at time T. We consider that microarchitects minimize the product of cycles per instruction (CPI) and cycle time, and we estimate performance based on CPI with a soft threshold to include cycle time product effects. Some measurements are also reported.
  • Design of a hardware accelerator for density based clustering applications

    Page(s): 101 - 106

    Data mining is beginning to be widely used in various application fields. Density based clustering algorithms perform data mining by grouping high-density regions of points to form clusters. In recent years, the data sizes and the problem complexity of these algorithms have increased significantly, leading to slower execution of the applications. Faster engines that perform the application tasks quickly and efficiently are urgently needed. In this paper, we propose a hardware accelerator for density based clustering applications. This accelerator improves the execution speed of the core kernels of this application, which include density calculation and the migration of points to denser regions. We show that this accelerator, when integrated with general-purpose processors, speeds up the kernel execution times by at least 300X.
  • Complex fixed-point matrix inversion using transport triggered architecture

    Page(s): 107 - 112

    Fixed-point simulations for inverting matrices using transport triggered architectures are performed. Several methods are implemented in fixed-point: the Cholesky decomposition as a direct method, Newton iterations as an iterative method, and the Strassen-Newton algorithm as a combined recursive method. Fixed-point implementations of these matrix inversion algorithms are tested and analyzed. A division-free implementation is targeted.
  • A parallel automaton string matching with pre-hashing and root-indexing techniques for content filtering coprocessor

    Page(s): 113 - 118

    We propose a new parallel automaton string matching approach and its hardware architecture for a content filtering coprocessor. This new approach can improve the average matching time of the parallel automaton with pre-hashing and root-indexing techniques. The pre-hashing technique uses a hashing function to quickly verify the text against the partial patterns in the automaton, and the root-indexing technique matches multiple bytes for the root state in one single matching. A popular automaton algorithm, Aho-Corasick (AC), is chosen as the basis for the two techniques; we employ them in a memory-efficient version of AC, namely bitmap AC. For the average-case time, our approach achieves speedups of 494% and 224% over bitmap AC for URL and virus patterns, respectively. Since the pre-hashing and root-indexing techniques can be executed concurrently with bitmap AC in the hardware, our proposed approach has the same worst-case time as bitmap AC.
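
For readers unfamiliar with the Aho-Corasick (AC) automaton named in the last abstract above, the following is a minimal software sketch of the plain textbook algorithm. It is not the paper's method: it does not implement bitmap AC, pre-hashing, root-indexing, or any hardware parallelism, and the function names `build_automaton` and `find_all` are hypothetical names chosen for this illustration.

```python
from collections import deque


def build_automaton(patterns):
    """Build a plain Aho-Corasick automaton (illustrative, not bitmap AC)."""
    goto = [{}]   # goto[state][char] -> next state (trie edges)
    fail = [0]    # failure link for each state
    out = [[]]    # patterns that end at each state

    # Phase 1: insert every pattern into a trie.
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)

    # Phase 2: breadth-first pass to compute failure links.
    # Depth-1 states keep the root (0) as their failure link.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the matches that end at the failure state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every occurrence in text."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        # Follow failure links until a transition on ch exists (or root).
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            matches.append((i - len(pat) + 1, pat))
    return matches


if __name__ == "__main__":
    automaton = build_automaton(["he", "she", "his", "hers"])
    print(find_all("ushers", automaton))
```

The scan reads each input character once and follows failure links on a mismatch, which is why the number of patterns does not multiply the matching time. The abstract's pre-hashing and root-indexing techniques target the average case on top of such an automaton; they are not modelled here.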