2009 International Symposium on Systems, Architectures, Modeling, and Simulation (SAMOS '09)

Date: 20-23 July 2009

Displaying Results 1 - 25 of 33
  • [Front cover]

    Publication Year: 2009, Page(s): c1
  • [Title page]

    Publication Year: 2009, Page(s): 1
  • [Copyright notice]

    Publication Year: 2009, Page(s): 1
  • Preface

    Publication Year: 2009, Page(s): 1
  • IC-SAMOS organization

    Publication Year: 2009, Page(s): 1-2
  • List of reviewers

    Publication Year: 2009, Page(s): 1
  • Table of contents

    Publication Year: 2009, Page(s): 1-2
  • Mobile visual computing

    Publication Year: 2009, Page(s): i

    Summary form only given. I will talk about camera phones: how you can use the camera as a sensor that gives natural access to information about the real world around you (mobile augmented reality), and how you can use general computation capability to combine several input images into a better or more interesting output image (mobile computational photography). I will also discuss mobile graphics and the latest developments in hardware and APIs (OpenGL ES, OpenMAX IL, OpenCL) that allow these applications to use graphics hardware.

  • “Slower than you think” — The evolution of processor and SoC architectures

    Publication Year: 2009, Page(s): ii

    Research projects talk about thousands of processing elements on an SoC. Various commercial companies talk about their specialised homogeneous or heterogeneous processing arrays. Graphics devices are being applied to solve a variety of computing problems outside their design domain. Finally, it seems that everyone, and their brother, is offering a multicore or multiprocessor programming model to bring all this technology under some control - and if we don't use it, we should all panic. This talk will focus on where we are and where we are likely to go with the evolution of processors and SoC architectures for embedded applications. From a concrete industrial perspective, I will discuss some of the progress made in exploiting advances in processor technology and multiprocessor SoCs, as well as some possible future scenarios for evolution in these areas.

  • A mixed hardware-software approach to flexible Artificial Neural Network training on FPGA

    Publication Year: 2009, Page(s): 1-8
    Cited by: Papers (2)

    FPGAs offer a promising platform for the implementation of artificial neural networks (ANNs) and their training, combining the use of custom optimized hardware with low cost and fast development time. However, purely hardware realizations tend to focus on throughput, resorting to restrictions on applicable network topology or low-precision data representation, whereas flexible solutions allowing a wide variation of network parameters and training algorithms are usually restricted to software implementations. This paper proposes a mixed approach, introducing a system-on-chip (SoC) implementation where computations are carried out by a high-efficiency neural coprocessor with a large number of parallel processing elements. System flexibility is provided by on-chip software control and the use of floating-point arithmetic, and network parallelism is exploited through replicated logic and an application-specific coprocessor architecture, leading to fast training times. Performance results, design limitations, and trade-offs are discussed.

  • High-speed FPGA-based implementations of a Genetic Algorithm

    Publication Year: 2009, Page(s): 9-16
    Cited by: Papers (1)

    The Genetic Algorithm (GA) is a very promising approach to solving complex optimization and search problems. In this scheme, a population of abstract representations of candidate solutions to an optimization problem gradually evolves toward better solutions. The aim is the optimization of a given function, the so-called fitness function, which is evaluated on the initial population as well as on the solutions after successive generations. In this paper, we present the design of a GA and its implementation on state-of-the-art FPGAs. Our approach optimizes significantly more fitness functions than any other proposed solution. Several experiments on a platform with a Virtex-II Pro FPGA have been conducted. Implementations on a number of different high-end FPGAs outperform other reconfigurable systems with speedups ranging from 1.2x to 96.5x.

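The generic GA scheme this abstract summarizes, a population evolving toward better fitness through selection, crossover, and mutation, can be sketched in software. This Python sketch only illustrates the generic algorithm, not the paper's hardware design; the OneMax fitness function and all parameter values are hypothetical:

```python
import random

def genetic_algorithm(fitness, n_bits=16, pop_size=32, generations=100,
                      crossover_rate=0.9, mutation_rate=0.01, seed=0):
    """Evolve a population of bit-string candidates toward higher fitness
    via truncation selection, one-point crossover, and bit-flip mutation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        elite = ranked[: pop_size // 2]          # survivors, kept unchanged
        children = []
        while len(children) < pop_size - len(elite):
            a, b = rng.sample(elite, 2)
            if rng.random() < crossover_rate:    # one-point crossover
                cut = rng.randrange(1, n_bits)
                child = a[:cut] + b[cut:]
            else:
                child = a[:]
            for i in range(n_bits):              # bit-flip mutation
                if rng.random() < mutation_rate:
                    child[i] ^= 1
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness)

# Hypothetical fitness function: maximize the number of set bits (OneMax).
best = genetic_algorithm(fitness=sum)
```

The hardware implementations evaluated in the paper parallelize exactly the loop body above (fitness evaluation, selection, crossover, mutation), which is what makes the FPGA versions so much faster than software.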
  • OpenMP extensions for FPGA accelerators

    Publication Year: 2009, Page(s): 17-24
    Cited by: Papers (11)

    Reconfigurable computing is one of the paths to explore towards low-power supercomputing. However, programming these reconfigurable devices is not an easy task, and significant research and development effort is still required to make it really productive. In addition, the use of these devices as accelerators in multicore, SMP and ccNUMA architectures adds a further level of programming complexity: the offloading of tasks to reconfigurable devices must be specified, along with the interoperability with current shared-memory programming paradigms such as OpenMP. This paper presents extensions to OpenMP 3.0 that address this second challenge, together with an implementation in a prototype runtime system. With these extensions the programmer can easily express the offloading of an already existing reconfigurable binary code (bitstream), hiding all the complexities related to device configuration, bitstream loading, and data arrangement and movement to the device memory. Our current prototype implementation targets SGI Altix systems with RASC blades (based on the Virtex 4 FPGA). We analyze the overheads introduced in this implementation and propose a hybrid host/device operational mode to hide some of them, significantly improving application performance. A complete evaluation of the system is done with a matrix multiplication kernel, including an estimation considering different FPGA frequencies.

  • High-level synthesis for the design of FPGA-based signal processing systems

    Publication Year: 2009, Page(s): 25-32
    Cited by: Papers (1)

    High-level synthesis (HLS) is currently seen as a promising way to reduce design time substantially; HLS tools map algorithms to architectures. While such tools were originally developed targeting ASIC technologies, HLS now draws wide interest from FPGA designers. However, with most HLS techniques, traditional resource-sharing models are very inaccurate for FPGAs: multiplexers, for example, can be very expensive in these technologies. Resource-usage optimizations and dedicated resource binding have to be applied. In this paper, an HLS process is presented that takes data-width into account and combines scheduling and binding to carefully account for interconnect cost. Experimental results show that our approach achieves significant reductions in area (34%) and dynamic power (28%) compared to a traditional synthesis.

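The scheduling half of such an HLS flow can be illustrated with a minimal resource-constrained list scheduler. This Python sketch makes simplifying assumptions (acyclic dataflow graph, non-pipelined units, a fixed unit count per operation type) and ignores the binding and interconnect-cost modeling that are the paper's actual contribution; the example dataflow graph is hypothetical:

```python
def list_schedule(dag, latency, resources):
    """Resource-constrained list scheduling.
    dag:       {op: list of predecessor ops}; each op is a (name, type) tuple
    latency:   {type: cycles an op of that type occupies its unit}
    resources: {type: number of functional units of that type}
    Returns {op: start_cycle}."""
    start, finish = {}, {}
    unscheduled = set(dag)
    cycle = 0
    while unscheduled:
        for typ, units in resources.items():
            # Units of this type still occupied by earlier operations.
            busy = sum(1 for op in start
                       if op[1] == typ and start[op] <= cycle < finish[op])
            # Operations whose predecessors have all completed.
            ready = [op for op in unscheduled
                     if op[1] == typ and
                        all(p in finish and finish[p] <= cycle for p in dag[op])]
            for op in ready[: units - busy]:
                start[op] = cycle
                finish[op] = cycle + latency[typ]
                unscheduled.discard(op)
        cycle += 1
    return start

# Hypothetical dataflow graph for y = (a*b) + (c*d) + e, with one
# 2-cycle multiplier and one 1-cycle adder available.
m1, m2 = ("m1", "mul"), ("m2", "mul")
a1, a2 = ("a1", "add"), ("a2", "add")
dag = {m1: [], m2: [], a1: [m1, m2], a2: [a1]}
sched = list_schedule(dag, latency={"mul": 2, "add": 1},
                      resources={"mul": 1, "add": 1})
```

With a single multiplier, the two multiplications are serialized (cycles 0 and 2), so the first addition cannot start before cycle 4; an FPGA-aware binding step would then decide which physical units and multiplexers realize this schedule.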
  • Instruction scheduling for VLIW processors under variation scenario

    Publication Year: 2009, Page(s): 33-40
    Cited by: Papers (1)

    Process variations in components like adders and multipliers of different integer functional units (IFUs) in VLIW (very long instruction word) processors may cause these units to operate at various speeds, resulting in non-uniform latency IFUs. Worst-case techniques to deal with non-uniform latency IFUs may incur significant performance and/or leakage-energy loss. In this work, we propose two process variation-aware compile-time techniques to handle non-uniform latency IFUs. In the first technique, 'turn-off', we turn off all the high-latency IFUs affected by process variation. In the second technique, 'on-demand turn-on', we use some of the affected high-latency IFUs by turning them on whenever there is a requirement. Our experimental results show that with these techniques, non-uniform latency IFUs can be tackled without much performance penalty. The proposed techniques also achieve a significant reduction in leakage energy consumption because some of the IFUs are turned off.

  • A physical-level study of the compacted matrix instruction scheduler for dynamically-scheduled superscalar processors

    Publication Year: 2009, Page(s): 41-48
    Cited by: Papers (1)

    This work studies physical-level characteristics of the recently proposed compacted matrix instruction scheduler for dynamically-scheduled, superscalar processors. Previous work focused on the matrix scheduler's architecture and argued in support of its speed and scalability advantages; however, no physical-level implementation or models were reported for it. Using full-custom layouts in a commercial 90 nm fabrication technology, this work investigates the latency and energy variations of the compacted matrix and its accompanying logic as a function of the issue width, the window size, and the number of global recovery checkpoints. This work also proposes an energy optimization that throttles unnecessary pre-charges and evaluations, reducing energy by 10% to 18% depending on the scheduler size.

  • Instruction-based reuse-distance prediction for effective cache management

    Publication Year: 2009, Page(s): 49-58
    Cited by: Papers (7)

    The effect of caching is fully determined by program locality, or data reuse, and several cache management techniques try to base their decisions on predictions of temporal locality in programs. However, prior work reports only rough techniques which either try to predict when a cache block loses its temporal locality or try to categorize cache items as highly or poorly temporal. In this work, we quantify the temporal characteristics of cache blocks at run time by predicting their reuse distances (measured in intervening cache accesses), based on the access patterns of the instructions (PCs) that touch the cache blocks. We show that an instruction-based reuse-distance predictor is very accurate and allows approximation of optimal replacement decisions, since we can "see" the future. We experimentally evaluate our prediction scheme in L2 caches of various sizes using a subset of the most memory-intensive SPEC2000 benchmarks. Our proposal obtains a significant improvement in terms of IPC over traditional LRU of up to 130.6% (17.2% on average), and it also outperforms the previous state-of-the-art proposal (namely Dynamic Insertion Policy, or DIP) by up to 80.7% (15.8% on average).

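The core measurement the abstract relies on, attributing each cache block's reuse distance to the instruction (PC) that last touched it, can be sketched as follows. This Python sketch is only an illustration: the access trace is hypothetical, and the per-PC mean is a simple stand-in for the paper's predictor:

```python
from collections import defaultdict

def instruction_based_reuse_distances(trace):
    """For each access (pc, block), measure the block's reuse distance
    (number of intervening cache accesses since its previous use) and
    attribute it to the PC that last touched the block. Returns the mean
    observed reuse distance per PC."""
    last_access = {}              # block -> (index, pc) of its last access
    per_pc = defaultdict(list)    # pc -> observed reuse distances
    for i, (pc, block) in enumerate(trace):
        if block in last_access:
            prev_i, prev_pc = last_access[block]
            per_pc[prev_pc].append(i - prev_i - 1)   # intervening accesses
        last_access[block] = (i, pc)
    return {pc: sum(ds) / len(ds) for pc, ds in per_pc.items()}

# Hypothetical access trace: PC 0x10 touches blocks with short reuse,
# while PC 0x20 touches a block that is reused only much later.
trace = [(0x10, "A"), (0x20, "X"), (0x10, "A"),
         (0x10, "B"), (0x10, "B"), (0x20, "X")]
pred = instruction_based_reuse_distances(trace)
```

A replacement policy can then prefer to evict blocks whose touching PC historically produces long reuse distances, approximating the optimal "evict the block reused furthest in the future" decision.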
  • Adaptive simulation sampling using an Autoregressive framework

    Publication Year: 2009, Page(s): 59-66

    Software simulators remain several orders of magnitude slower than the modern microprocessor architectures they simulate. Although various reduced-time simulation tools are available to help accurately pick truncated benchmark simulations, they either require initial offline analysis of the benchmarks or many iterative runs of the benchmark. In this paper, we present a novel sampling simulation method which requires only a single run of the benchmark to achieve a desired confidence interval, needs no offline analysis, and gives results comparable in accuracy and sample size to current simulation methodologies. Our method is a novel configuration-independent approach that incorporates an autoregressive (AR) model using the squared coefficient of variation (SCV) of cycles per instruction (CPI). Using the sampled SCVs of past intervals of a benchmark, the model computes the required number of samples for the next interval through a derived relationship between the number of samples and the SCV of the CPI distribution. Our implementation of the AR model achieves an actual average error of only 0.76% on CPI with a 99.7% confidence interval of ±0.3% for all SPEC2K benchmarks, while simulating in detail an average of 40 million instructions per benchmark.

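The relationship between sample count and the SCV of the CPI distribution follows standard sampling theory. This Python sketch assumes the textbook bound n = (z / ε)² · SCV for relative error ε at a confidence level with z-score z; the paper additionally forecasts the next interval's SCV with an autoregressive model, which is not reproduced here, and the CPI values below are hypothetical:

```python
def samples_needed(cpi_values, rel_error=0.003, z=3.0):
    """Required sample count for a given relative-error bound, derived
    from the squared coefficient of variation (SCV = variance / mean^2)
    of observed CPI values, using n = (z / rel_error)^2 * SCV.
    Defaults match the abstract: z ~ 3 for 99.7% confidence, ±0.3% error."""
    n = len(cpi_values)
    mean = sum(cpi_values) / n
    var = sum((x - mean) ** 2 for x in cpi_values) / (n - 1)  # sample variance
    scv = var / mean ** 2
    return int((z / rel_error) ** 2 * scv) + 1

# Hypothetical CPI samples from past intervals of a benchmark.
n_next = samples_needed([1.0, 1.1, 0.9, 1.0])
```

Because the required n scales linearly with SCV, intervals with stable CPI need few samples while bursty intervals automatically get more, which is what lets the method hit its confidence target in a single run.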
  • An emulation-based real-time power profiling unit for embedded software

    Publication Year: 2009, Page(s): 67-73
    Cited by: Papers (7)

    The power consumption of battery-powered and energy-scavenging devices has become a major design metric for embedded systems. Increasingly complex software applications as well as rising demands in operating times while having restricted power budgets make power-aware system design indispensable. In this paper we present an emulation-based power profiling approach allowing for real-time power analysis of embedded systems. Power saving potential as well as power-critical events can be identified in much less time compared to power simulations. Hence, the designer can take countermeasures already in early design stages, which enhances development efficiency and decreases time-to-market. Accuracies achieved for a deep submicron smart-card controller are greater than 90% compared to gate-level simulations.

  • A timed HW/SW coemulation technique for fast yet accurate system verification

    Publication Year: 2009, Page(s): 74-81

    In system-on-chip (SoC) design, it is essential to verify the correctness of a design before the chip is fabricated. While conventional hardware emulators validate the functional correctness of hardware components quickly, little research exists on using hardware emulators for timing verification, since synchronization between the hardware emulator and the other parts easily overwhelms the gains of hardware emulation. In this paper we propose a novel hardware/software coemulation framework for fast yet accurate system verification based on the virtual synchronization technique. For virtual synchronization, an interface protocol and interface logic between a hardware emulator and the HW/SW coemulation kernel are proposed. Experiments with real-life examples prove the effectiveness of the proposed technique.

  • RETHROTTLE: Execution throttling in the REDEFINE SoC architecture

    Publication Year: 2009, Page(s): 82-91

    REDEFINE is a reconfigurable SoC architecture that provides a unique platform for high-performance and low-power computing by exploiting the synergistic interaction between a coarse-grain dynamic dataflow model of computation (to expose abundant parallelism in applications) and runtime composition of efficient compute structures (on the reconfigurable computation resources). We propose and study the throttling of execution in REDEFINE to maximize architecture efficiency. A feature-specific, fast hybrid (mixed-level) simulation framework for early design-phase study is developed and implemented to make the huge design space exploration practical. We perform performance modeling by selecting important performance criteria, ranking the explored throttling schemes, and investigating the effectiveness of the design space exploration using statistical hypothesis testing. We find throttling schemes which simultaneously give an appreciable (24.8%) overall performance gain in the architecture and a 37% resource usage gain in the throttling unit.

  • Generation and calibration of compositional performance analysis models for multi-processor systems

    Publication Year: 2009, Page(s): 92-99
    Cited by: Papers (6)

    The performance analysis of heterogeneous multi-processor systems is becoming increasingly difficult due to the steadily growing complexity of software and hardware components. To cope with these increasing requirements, analytic methods have been proposed. The automatic generation of analytic system models that faithfully represent real system implementations has received relatively little attention, however. In this paper, an approach is presented in which an analytic system model is automatically generated from the same specification that is also used for system synthesis. Analytic methods for performance analysis of a system can thus be seamlessly integrated into the multi-processor design flow, which lays a sound foundation for designing systems with a predictable performance.

  • Performance evaluation of concurrently executing parallel applications on multi-processor systems

    Publication Year: 2009, Page(s): 100-107

    Multiprocessors are increasingly being used in modern embedded systems for reasons of power and speed. These systems have to support a large number of applications and standards in different combinations, called use-cases. The key challenge is designing efficient systems that handle all these use-cases; this requires fast exploration of software and hardware alternatives with accurate performance evaluation. In this paper, we present a system-level FPGA-based simulation methodology for performance evaluation of applications on multiprocessor platforms. We observe that for multiple applications sharing an MPSoC platform, dynamic arbitration can cause deadlock in simulation, so we use conservative parallel discrete event simulation (PDES) to simulate these use-cases. We further note that conservative PDES is inefficient, so we present a new PDES methodology that avoids causality errors by detecting them in advance. We call our new approach smart conservative PDES. It is scalable in the number of use-cases and the number of simulated processors, and it is 15% faster than conservative PDES. We further present the results of a case study of two real-life applications, using our simulation technique for a design space exploration of the optimal buffer space for JPEG and H263 decoders.

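The conservative PDES baseline that the paper improves on can be illustrated with a toy sketch: a logical process (LP) may execute its next event only while no neighbour could still deliver an earlier one, which rules out causality errors by construction but makes LPs block, the inefficiency that motivates smarter schemes. This Python sketch is a generic illustration (the real simulator synchronizes simulated processors, not the hypothetical two-LP ping-pong used here):

```python
import heapq

def conservative_pdes(handlers, senders, lookahead, initial_events, end_time):
    """Minimal conservative PDES.
    handlers: {lp: f(time, msg) -> [(dest_lp, send_delay, out_msg)]},
              with every send_delay >= lookahead
    senders:  {lp: list of LPs that may send to lp}
    Returns the executed (lp, time, msg) tuples."""
    q = {lp: [] for lp in handlers}
    for lp, t, msg in initial_events:
        heapq.heappush(q[lp], (t, msg))
    executed = []
    progress = True
    while progress:
        progress = False
        # An LP's clock is its next event time; an idle LP implicitly
        # promises not to send anything before end_time.
        clock = {lp: (q[lp][0][0] if q[lp] else end_time) for lp in handlers}
        for lp in handlers:
            if not q[lp]:
                continue
            t, msg = q[lp][0]
            # Earliest time any neighbour could still send to this LP.
            horizon = min((clock[s] + lookahead for s in senders[lp]),
                          default=end_time)
            if t < horizon and t < end_time:   # safe: no earlier message possible
                heapq.heappop(q[lp])
                executed.append((lp, t, msg))
                for dest, delay, out in handlers[lp](t, msg):
                    heapq.heappush(q[dest], (t + delay, out))
                progress = True
    return executed

# Hypothetical two-LP ping-pong: each LP bounces a message back after one
# time unit; the link lookahead equals that delay.
handlers = {"A": lambda t, msg: [("B", 1.0, "pong")],
            "B": lambda t, msg: [("A", 1.0, "ping")]}
senders = {"A": ["B"], "B": ["A"]}
events = conservative_pdes(handlers, senders, lookahead=1.0,
                           initial_events=[("A", 0.0, "ping")], end_time=3.0)
```

Note how each LP repeatedly waits for its neighbour's clock to advance before executing; a "smart" conservative scheme in the paper's sense would let LPs run ahead and only intervene where a causality error is actually about to occur.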
  • Manycore performance analysis using timed configuration graphs

    Publication Year: 2009, Page(s): 108-117
    Cited by: Papers (1)

    The programming complexity of increasingly parallel processors calls for new tools to assist programmers in utilising the parallel hardware resources. In this paper we present a set of models that we have developed to form part of a tool which is intended for iteratively tuning the mapping of dataflow graphs onto manycores. One of the models is used for capturing the essentials of manycores that are identified as suitable for signal processing and which we use as target architectures. Another model is the intermediate representation in the form of a timed configuration graph, describing the mapping of a dataflow graph onto a machine model. Moreover, this IR can be used for performance evaluation using abstract interpretation. We demonstrate how the models can be configured and applied in order to map applications on the Raw processor. Furthermore, we report promising results on the accuracy of performance predictions produced by our tool. It is also demonstrated that the tool can be used to rank different mappings with respect to optimisation on throughput and end-to-end latency.

  • Multi-processor system-on-chip Design Space Exploration based on multi-level modeling techniques

    Publication Year: 2009, Page(s): 118-124

    Multi-processor systems-on-chip are currently designed using platform-based synthesis techniques. In this approach, a wide range of platform parameters are tuned to find the best trade-offs in terms of the selected system figures of merit (such as energy, delay, and area). This optimization phase is called design space exploration (DSE), and it generally consists of a multi-objective optimization (MOO) problem. The design space of a multi-processor architecture is too large to be evaluated comprehensively. So far, several heuristic techniques have been proposed to address the MOO problem, but they identify the Pareto set with low efficiency. In this paper we propose a methodology for heuristic platform-based design based on evolutionary algorithms and multi-level simulation techniques. In particular, we extend NSGA-II with an approximate neural-network meta-model for multiprocessor architectures in order to replace expensive platform simulations with fast meta-model evaluations. The model's accuracy and efficiency are improved by exploiting high-level platform simulation techniques, which allow us to reduce the overall complexity of the neural network and improve its prediction power. Experimental results show that the proposed technique is able to reduce the number of simulations needed for the optimization without decreasing the quality of the obtained Pareto set. Results are compared with state-of-the-art techniques to demonstrate that the optimization time due to simulation can be sped up by adopting multi-level simulation techniques.

  • Hardware-based synchronization framework for heterogeneous RISC/Coprocessor architectures

    Publication Year: 2009, Page(s): 125-132

    This paper presents a synchronization framework for heterogeneous processing elements computing in parallel under the control of a RISC processor. The communication delay between the RISC and the processing elements is a key issue if the RISC is not closely attached to the processing elements. Recent synchronization approaches neglect communication delays or require low communication delays, which results in a low synchronization rate between the RISC and the PEs. To overcome this delay, a special hardware-based synchronization approach is proposed that reduces the communication overhead and increases the number of executable tasks per time unit. Furthermore, it supports parallel execution of independent hardware tasks. The approach was evaluated for a modular coprocessor architecture containing several processing elements for image processing tasks. The coarse-grained parallel execution of independent tasks significantly improves the speed of an exemplary application for aerial-image-based vehicle detection on straight highways.
