
Embedded Computer Systems (SAMOS), 2010 International Conference on

Date: 19-22 July 2010

  • [Front cover]

    Page(s): c1
  • [Title page]

    Page(s): 1
  • [Copyright notice]

    Page(s): 1
  • Preface

    Page(s): 1
  • The 10 years of SAMOS

    Page(s): 1
  • IC-SAMOS organization

    Page(s): 1 - 3
  • List of reviewers

    Page(s): 1
  • Table of contents

    Page(s): 1 - 4
  • In memoriam Stamatis Vassiliadis (1951-2007)

    Page(s): 1
  • Technologies for reducing power

    Page(s): i

    With power and cooling becoming an increasingly costly part of the operating cost of a server, the old trend of striving for higher performance with little regard for power is over. Emerging semiconductor process technologies, multicore architectures, and new interconnect technology provide an avenue for future servers to become low power, compact, and possibly mobile. In our talk we examine three techniques for achieving low power: 1) near-threshold operation; 2) 3D die stacking; and 3) replacing DRAM with Flash memory.

  • VLSI challenges to more energy efficient devices

    Page(s): ii

    Delivering power efficiently to advanced-technology VLSI is a top challenge and priority for mobile platforms. In addition, nearly 50% of the real estate on a typical mobile platform, such as a handheld, is used to convert power to the low voltages required by the advanced technology. This is because, as voltage continues to scale down with technology, the tolerances required by these circuits for reliable and energy-efficient operation continue to tighten. In this presentation we review the state of the art and show a large gap between today's VLSI technology and the trends in form factor, energy-efficiency regulatory specifications, and battery-life requirements. We also show how integration of the power delivery circuitry presents an opportunity for non-linear improvements in the energy efficiency of mobile devices. We demonstrate the value of integration with models and experimentally with testchips. Finally, we present our view of the advancements in circuit technology, VLSI testing technology, and process technology needed to achieve such integration.

  • Cycle-accurate performance modelling in an ultra-fast just-in-time dynamic binary translation instruction set simulator

    Page(s): 1 - 10

    Instruction set simulators (ISS) are vital tools for compiler and processor architecture design space exploration and verification. State-of-the-art simulators using just-in-time (JIT) dynamic binary translation (DBT) techniques are able to simulate complex embedded processors at speeds above 500 MIPS. However, these functional ISS do not provide microarchitectural observability. In contrast, low-level cycle-accurate ISS are too slow to simulate full-scale applications, forcing developers to revert to FPGA-based simulations. In this paper we demonstrate that it is possible to run ultra-high-speed cycle-accurate instruction set simulations surpassing FPGA-based simulation speeds. We extend the JIT DBT engine of our ISS and augment JIT-generated code with a verified cycle-accurate processor model. Our approach can model any microarchitectural configuration, does not rely on prior profiling, instrumentation, or compilation, and works for all binaries targeting a state-of-the-art embedded processor implementing the ARCompact™ instruction set architecture (ISA). We achieve simulation speeds up to 63 MIPS on a standard x86 desktop computer, whilst the average cycle-count deviation is less than 1.5% for the industry-standard EEMBC and CoreMark benchmark suites.

  • A trace-based scenario database for high-level simulation of multimedia MP-SoCs

    Page(s): 11 - 19

    High-level simulation and design space exploration are nowadays key ingredients for system-level design of modern multimedia embedded systems. The majority of the work in this area evaluates systems under a single, fixed application workload. In reality, however, the application workload in such systems (i.e., the applications that are concurrently executing and contending for system resources), and therefore the intensity and nature of the application demands, can change dramatically over time. To facilitate the simulation and exploration of different workload scenarios, this paper presents the concept of a so-called scenario database, which has been integrated in our Sesame system-level simulation framework. This scenario database compactly stores application scenarios and allows for generating application workloads - in the form of event traces - belonging to the stored scenarios for the purpose of scenario-aware simulation in Sesame.

  • A library of dual-clock FIFOs for cost-effective and flexible MPSoC design

    Page(s): 20 - 27

    Customization of IP blocks in a multi-processor system-on-chip (MPSoC) is the historical approach to the cost-effective implementation of such systems. A recent trend consists of structuring an MPSoC into loosely coupled voltage and frequency islands to meet tight power budgets. In this context, synchronization between islands of synchronicity becomes a major design issue. Dual-clock FIFOs compare favorably with synchronizer-based designs and pausible clocking interfaces from a performance viewpoint, but incur a significant area, power, and latency overhead. This paper proposes a library of dual-clock FIFOs for cost-effective MPSoC design, where each architecture variant in the library has been designed to match well-defined operating conditions at the minimum implementation cost. Each FIFO synchronizer is suitable for plug-and-play insertion into the NoC architecture, and selection depends on the performance requirements of the synchronization interface at hand. Above all, components of our synchronization library have not been conceived in isolation, but have been tightly co-designed with the switching fabric of the on-chip interconnection network, thus making conscious use of power-hungry buffering resources and leading to affordable implementations in the resource-constrained MPSoC domain.

  • Transparent sampling

    Page(s): 28 - 36

    Low simulation speeds have a critical impact on the design process by limiting the number of design options which can be explored. Sampling is a popular fast simulation technique because it can achieve high simulation speed and high accuracy. However, state-of-the-art sampling techniques either consider warm-up as an orthogonal issue and leave the choice of a warm-up technique to the end user, or require cumbersome simulator modifications from the end user. Since the most user-friendly and efficient warm-up techniques are not easily compatible with the most efficient sampling techniques, the end user is left with a difficult choice, or runs the risk of misusing sampling techniques with poor warm-up. Transparent Sampling reconciles sampling and warm-up techniques by delivering state-of-the-art accuracy and simulation time, while remaining easily accessible to end users not proficient in, or not willing to delve into, fast simulation issues.

  • Design of a flexible high-speed FPGA-based flow monitor for next generation networks

    Page(s): 37 - 44

    The evolution of new services and the development of next generation networks is placing severe demands on networks. In order to support emerging services, it is clear that a higher degree of monitoring functionality is needed within networks. One approach is to use the programmability and performance of Field Programmable Gate Array (FPGA) technology to allow distributed monitoring, but this presents challenges around memory usage and highlights the need for a strategically different approach to how flows are monitored. A novel FPGA-based, programmable IP flow monitor is presented that permits various forms of network monitoring functionality, specifically monitoring of different classes of traffic, to be performed. A key aspect was to address the traditionally long design times through the use of a new, experimental, high-level design flow called Packet Xpress, which allowed functions to be quickly implemented, providing a twentyfold speed-up in design time. The monitor has been implemented on a Xilinx Virtex-5 based ML506 board and verified in hardware using real Internet traffic traces. The platform allowed us to experiment with various monitoring functions and establish their resource requirements.

  • A fully programmable FSM-based Processing Engine for Gigabytes/s header parsing

    Page(s): 45 - 54

    In this paper we discuss a new architecture, deployed for multi-standard packet inspection and basic network processing tasks in a high-performance network coprocessor. We present the concepts, architecture, compiler tool-chain, and VLSI area estimates for this programmable finite-state-machine-based (FSM-based) Processing Engine, the FPE. The microarchitecture comprises an FSM-controlled instruction sequencing mechanism, a novel register organization scheme, and a short pipeline instead of a typical multi-stage processor pipeline. This brings several advantages for the efficient handling of conditional branches and small look-ups, which can be exploited in packet classification applications. The FPE data path performance is compared to an ARM9-type processor on two exemplary header parsing kernels from the "CommBench" benchmark suite. According to the results, the presented engine provides a speed-up of 4 to 10 over the ARM9 in terms of required computation cycles. In a 65 nm VLSI technology, the FPE design is expected to run at clock frequencies up to 2 GHz and requires about 1.8 mm² of chip area. Based on the specific transition rule memory organization, which is an essential element of the programmable FSM, a memory utilization of around 95% can be achieved. However, the FPE micro-architecture requires a customized code translation chain in order to transfer high-level program code into an FSM representation. This is achieved in three steps: (1) generation of sequential, assembly-like macro-instructions, (2) scheduling and generation of FSM-based horizontal (parallel) micro-code, and (3) organization of the respective FSM rules in the "instruction" memory. Our studies confirm the advantages of the FPE as a fully programmable high-performance header parsing engine.

  • Empirical evaluation of data transformations for network infrastructure applications

    Page(s): 55 - 62

    It is estimated that the amount of data coming out of an optical fibre is doubling every nine months and, thus, the growth rate in network bandwidth by far exceeds that of transistor density stated by Moore's law. This causes excessive strain on network infrastructure nodes such as routers, which need to operate at line rate in order to keep up with the external bandwidth requirements. Consequently, manufacturers of network processors have developed a wide range of technologies, including highly parallel and specialised architectures, to cope with ever increasing processing demands. Software tool support, however, lags behind, and most research in compiling for network processors has focused on improved sequential and parallel code generation. In this paper we show that not code, but data organisation is the key obstacle to overcome in order to achieve high performance on network infrastructure applications. We evaluate three specialised data transformations (structure splitting, array regrouping, and software caching) against the industrial EEMBC networking benchmarks and real-world data sets. We demonstrate that speed-ups of up to 2.62 can be achieved, but at the same time no single solution performs equally well across all network traffic scenarios. This clearly indicates that adaptive data transformation schemes are necessary to ensure optimal performance under varying network loads.

  • Design environment for the support of configurable Network Interfaces in NoC-based platforms

    Page(s): 63 - 70

    IP-based platforms with a Network on Chip (NoC) are one solution to support complex telecommunication applications. In this context, NoC architectures targeting high-throughput applications tend to have configurable Network Interfaces (NI) and routers for reuse and performance purposes, and aim at providing advanced communication and computation services. Unfortunately, these Network Interfaces are increasingly complex to parameterize and to program, while deployment tools taking into account the low-level architectural details are still non-existent. This work focuses on providing methods and tools to easily and efficiently deploy applications on IP- and NoC-based platforms with configurable NIs. Configurable NIs offer primitives to synchronize and schedule the communication and the behaviour of IPs. Our code generation flow takes as inputs abstract models of the HW platform, the application, and the mapping, and generates most of the required configurations. The efficiency of the approach is illustrated by the deployment of a complex 4G telecommunication application on a heterogeneous IP-based platform.

  • An efficient realization of forward integer transform in H.264/AVC intra-frame encoder

    Page(s): 71 - 78

    The H.264/AVC intra-only frame encoder, for its excellent encoding performance, is well-suited for image/video compression applications such as Digital Still Camera (DSC), Digital Video Camera (DVC), television studio broadcast, and surveillance video. The forward integer transform is an integral part of the H.264/AVC video encoder. In this paper, for image compression applications running on battery-powered electronic devices (such as DSC), we propose a low-power, area-efficient realization of the forward integer transform. The proposed solution reduces the number of operations by more than 50% (30 vs. 64) and consumes significantly less dynamic power when compared with existing state-of-the-art designs for the forward integer transform. For video compression applications such as television studio broadcast or surveillance video, where throughput is more important, we propose a low-latency, area-efficient realization of the forward integer transform unit in the intra-frame processing chain. With the proposed solution, the effective latency for the forward integer transform is drastically reduced, as the processing unit is no longer on the critical path of the intra-frame processing chain. Moreover, the proposed solution requires half the number of operations for its hardware implementation, when compared with existing state-of-the-art designs for the forward integer transform.

  • SIMD performance in software based mobile video coding

    Page(s): 79 - 85

    Most video applications use specific application programming interfaces to achieve the desired functionalities. Implementing interface backends in hardware is often too expensive for low-end mobile devices, so most devices rely on highly optimized software implementations that employ special instruction sets. The most common approach is the use of SIMD processing units such as ARM NEON or Intel WMMX in mobile application processors. Fully utilizing the potential benefits of such instruction sets usually means tedious assembly coding, even if vectorizing compilers have improved lately. In addition, low-level APIs such as OpenMax DL have been made available to offer a standardized interface for accelerated codec functionalities. In this paper we present optimization methods and results from using the NEON instruction set and the OpenMax DL API for MPEG-4 and H.264 video encoding and decoding. Although these technologies provide significant speed-ups and reduce the burden on application designers, the serial bit stream processing bottleneck remains to be solved.

  • Fast Huffman decoding by exploiting data level parallelism

    Page(s): 86 - 92

    The frame rates and resolutions of digital video are rising, pushing the compression ratios of video coding standards to their limits and resulting in more complex, computationally demanding algorithms. Programmable solutions are gaining interest as a way to keep pace with evolving video coding standards by reducing the time-to-market of upcoming video products. However, to compete with hardwired solutions, parallelism needs to be exploited on as many levels as possible. In this paper the focus is on data-level parallelism. Huffman coding is proven to be very efficient and is therefore commonly applied in many coding standards. However, due to its inherently sequential nature, parallelization of Huffman decoding is considered hard. The proposed fully flexible and programmable acceleration exploits the available data-level parallelism in Huffman decoding. Our implementation achieves a decoding speed of 106 Mbit/s while running on a 250 MHz processor, a speed-up of 24× compared to our sequential reference implementation.

  • Real-time stereo vision system using semi-global matching disparity estimation: Architecture and FPGA-implementation

    Page(s): 93 - 101

    This paper describes a new architecture and the corresponding implementation of a stereo vision system that covers the entire stereo vision process, including noise reduction, rectification, disparity estimation, and visualization. Dense disparity estimation is performed using the non-parametric rank transform and semi-global matching (SGM), which is among the top-performing stereo matching methods and outperforms locally-based methods in terms of quality of disparity maps and robustness under difficult imaging conditions. Stream-based processing of the SGM, despite its non-scan-aligned, complex data dependencies, is achieved by a scalable, systolic-array-based architecture. This architecture fulfills the demands of real-world applications regarding frame rate, depth resolution, and low resource usage. The architecture is based on a novel two-dimensional parallelization concept for the SGM. An FPGA implementation on a Xilinx Virtex-5 generates disparity maps of VGA images (640×480 pixels) with a 128-pixel disparity range under real-time conditions (30 fps) at a clock frequency as low as 39 MHz.

  • Custom multi-threaded Dynamic Memory Management for Multiprocessor System-on-Chip platforms

    Page(s): 102 - 109

    We address the problem of custom Dynamic Memory Management (DMM) in Multi-Processor System-on-Chip (MPSoC) architectures. Customization is enabled through the definition of a design space that captures in a global, modular, and parameterized manner the primitive building blocks of multi-threaded DMM. A systematic exploration methodology is proposed to efficiently traverse the design space. Customized Pareto DMM configurations are automatically generated by software tools implementing the proposed methodology. Experimental evaluation based on a real-life multi-threaded dynamic network application shows that the proposed methodology delivers higher-quality (application-specific) solutions in comparison with state-of-the-art dynamic memory managers, together with 62% exploration runtime reductions.

  • Power aware heterogeneous MPSoC with dynamic task scheduling and increased data locality for multiple applications

    Page(s): 110 - 117

    A new heterogeneous multiprocessor system with dynamic memory and power management for improved performance and power consumption is presented. Increased data locality is revealed automatically, leading to enhanced memory access capabilities. Several applications can run in parallel, sharing processing elements, memories, and the interconnection network. Real-time constraints are respected through prioritization of processing element allocation, scheduling, and data transfers. Scheduling and allocation are done dynamically according to runtime data dependency checking. We show that execution times, bandwidth demands, and power consumption are decreased. A tool flow is introduced for easy generation of the hardware platform and software binaries for cycle-accurate simulations. Further newly developed tools are available for power analysis, data transfer observation, and task execution visualization.
