Embedded Computer Systems (SAMOS), 2012 International Conference on

Date 16-19 July 2012

  • [Front cover]

    Page(s): c1
  • [Title page]

    Page(s): 1
  • Proceedings 2012 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS) [Copyright notice]

    Page(s): 1
  • IC-SAMOS organization

    Page(s): 1 - 4
  • Preface

    Page(s): 1
  • Table of contents

    Page(s): 1 - 5
  • The homogeneity of architecture in a heterogeneous world

    Page(s): i

    Summary form only given. It has long been accepted within embedded computing that using a heterogeneous core focused on a specific task can deliver improved performance and, in turn, improved power efficiency. The challenge has always been how the software programmer can integrate this hardware diversity when workloads become generalized or are unknown at design time. Library abstraction permits specific tasks to benefit from heterogeneity, but how can general-purpose code benefit? This talk will describe the hardware and software techniques currently being developed in next-generation ARM-based SoCs to address the challenge of maintaining the homogeneity of the software architecture while extending the benefits of heterogeneity in hardware.

  • It's about time

    Page(s): ii

    Summary form only given. All widely used software abstractions lack temporal semantics. The notion of correct execution of a program written in every widely used programming language, and in nearly every processor instruction set today, does not depend on the timing of the execution. Computer architects exploit the fact that timing is irrelevant to correctness with aggressive performance-enhancing techniques such as speculative execution, branch prediction, dynamic dispatch, cache memories, and virtual memory. While these techniques improve average-case performance, they do so at the expense of controllability, repeatability, and predictability of timing. Yet temporal behavior matters in almost all systems, most particularly in networked embedded systems. Even in systems with no particular real-time requirements, the timing of programs affects the value they deliver and, in the case of concurrent and distributed programs, also affects their functionality. In systems with real-time requirements, including most embedded systems, temporal behavior affects not just the value delivered by a system but also its correctness. This talk will argue that time can and must become part of the semantics of programs and computer architectures. To illustrate that this is both practical and useful, we will describe recent efforts at Berkeley in the design and analysis of timing-centric software systems. In particular, we will focus on two projects: PRET, which seeks to provide computing platforms with repeatable timing, and PTIDES, which provides a programming model for distributed real-time systems.

  • Maximum performance computing for exascale applications

    Page(s): iii

    Summary form only given. Ever since Fermi, Pasta and Ulam conducted the first fundamentally important numerical experiments in 1953, science has been driven by the progress of available computational capability. In particular, computational quantum chemistry and computational quantum physics depend on ever-increasing amounts of computation. However, due to power density limitations at the chip, we have seen the end of single-CPU performance scaling. Now the challenge is to improve compute performance through some form of parallel processing without incurring power limits at the system level. One way to deal with the system “power wall” is to ask: what is the maximum amount of computation that can be achieved within a certain power budget? We argue that such Maximum Performance Computing needs to focus on end-to-end execution time of complete scientific applications and needs to take a multi-disciplinary approach, bringing together scientists and engineers to optimize the whole process from mathematics and algorithms all the way down to arithmetic and number representation. We have done a number of such multidisciplinary studies with our customers (Chevron, Schlumberger, and JP Morgan). Our current results with Maxeler Dataflow Engines for production PDE solver applications in Earth Sciences and Finance show an improvement of 20-40x in speed and/or watts per application run.

  • Just-in-Time Verification in ADL-based processor design

    Page(s): 1 - 6

    A novel verification methodology, combining the two new techniques of Live Verification and Processor State Transfer, is introduced to Architecture Description Language (ADL) based processor design. The proposed Just-in-Time Verification significantly accelerates the simulation-based equivalence check of the register-transfer and instruction-set level models generated from the ADL-based specification. This is accomplished by omitting redundant simulation steps occurring in the conventional architecture debug cycle. The potential speedup is demonstrated with a case study, achieving an acceleration of the debug cycle by 660x.

  • Interleaving methods for hybrid system-level MPSoC design space exploration

    Page(s): 7 - 14

    System-level design space exploration (DSE), which is performed early in the design process, is of eminent importance to the design of complex multi-processor embedded system architectures. During system-level DSE, system parameters such as the number and type of processors, the type and size of memories, or the mapping of application tasks to architectural resources are considered. Simulation-based DSE, in which different design instances are evaluated using system-level simulations, typically is computationally costly. Even with high-level simulations and efficient exploration algorithms, the simulation time needed to evaluate design points forms a real bottleneck in such DSE. Therefore, the vast design space that needs to be searched requires effective design space pruning techniques. This paper presents and studies different strategies for interleaving fast but less accurate analytical performance estimations with slower but more accurate simulations during DSE. By interleaving these analytical estimations with simulations, our hybrid approach significantly reduces the number of simulations needed during DSE. Experimental results demonstrate that such hybrid DSE is a promising technique that can yield solutions of similar quality to simulation-based DSE at only a fraction of the execution time.

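    The interleaving idea described in this abstract can be sketched with a toy example: a fast analytical cost model ranks all design points, and only the most promising candidates are handed to the (slower) simulator. This is a minimal illustration of the hybrid-DSE concept, not the paper's actual strategies; the parameter model and the names `analytic_estimate` and `simulate` are assumptions made for the sketch.

    ```python
    import random

    def analytic_estimate(point):
        # Fast, less accurate cost model: a simple weighted sum
        # of the design parameters (cores, cache size in KB).
        cores, cache_kb = point
        return 100.0 / cores + 0.5 * cache_kb

    def simulate(point):
        # Slow, accurate evaluation, stubbed here as the analytical
        # estimate perturbed by up to 10% "simulation detail".
        return analytic_estimate(point) * random.uniform(0.9, 1.1)

    def hybrid_dse(design_space, sim_budget):
        # Rank every point with the cheap model, then spend the
        # simulation budget only on the most promising candidates.
        ranked = sorted(design_space, key=analytic_estimate)
        return min(ranked[:sim_budget], key=simulate)

    space = [(c, kb) for c in (1, 2, 4, 8) for kb in (16, 32, 64)]
    best = hybrid_dse(space, sim_budget=3)
    ```

    Only `sim_budget` simulations run instead of one per design point, which is the source of the speedup the paper reports.
    
    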
  • A template-based methodology for efficient microprocessor and FPGA accelerator co-design

    Page(s): 15 - 22

    Embedded applications usually require Software/Hardware (SW/HW) designs to meet hard timing constraints and the required design flexibility. Exhaustive exploration of SW/HW designs is a very time-consuming task, while ad hoc approaches and the use of partially automatic tools usually lead to less efficient designs. To support a more efficient co-design process for FPGA platforms, we propose a systematic methodology to map an application to a SW/HW platform with a custom HW accelerator and a microprocessor core. The methodology's mapping steps are expressed through parametric templates for the SW/HW Communication Organization, the Foreground (FG) Memory Management and the Data Path (DP) Mapping. Several performance-area trade-off design Pareto points are produced by instantiating the templates. A real-time bioimaging application is mapped on an FPGA to evaluate the gains of our approach: 44.8% on performance compared with pure SW designs and 58% on area compared with pure HW designs.

  • Using OpenMP superscalar for parallelization of embedded and consumer applications

    Page(s): 23 - 32

    In recent years, research and industry have introduced several parallel programming models to simplify the development of parallel applications. A popular class among these models is task-based programming models, which promise ease of use, portability, and high performance. A novel model in this class, OpenMP Superscalar (OmpSs), combines advanced features such as automated runtime dependency resolution while maintaining simple pragma-based programming for C/C++. OpenMP Superscalar has proven effective in leveraging parallelism in HPC workloads. Embedded and consumer applications, however, are currently still mainly parallelized using traditional thread-based programming models. In this work, we investigate how effective OpenMP Superscalar is for embedded and consumer applications in terms of usability and performance. To determine the usability of OmpSs, we show in detail how to implement complex parallelization strategies such as the ones used in parallel H.264 decoding. To evaluate the performance, we created a collection of ten embedded and consumer benchmarks parallelized in both OmpSs and Pthreads.

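    The automated runtime dependency resolution that task-based models such as OmpSs provide can be illustrated with a minimal scheduler: each task declares the data it reads and writes, and the runtime derives the execution order from those annotations alone. This is a conceptual Python sketch of the idea, not OmpSs's pragma syntax or runtime; the task tuple format and the toy pipeline names are assumptions.

    ```python
    def run_tasks(tasks):
        """Execute tasks as soon as their input data has been produced.

        Each task is (name, fn, inputs, outputs). Ordering is inferred
        from data names, much as a task-based runtime infers it from
        in/out annotations on task arguments.
        """
        produced, order, pending = set(), [], list(tasks)
        while pending:
            ready = [t for t in pending if set(t[2]) <= produced]
            if not ready:
                raise RuntimeError("cyclic or unsatisfiable dependencies")
            for name, fn, _, outs in ready:
                fn()                      # ready tasks could run in parallel
                produced.update(outs)
                order.append(name)
            pending = [t for t in pending if t[0] not in order]
        return order

    # Toy H.264-like pipeline: decode a frame, deblock it, display it.
    # Tasks are deliberately listed out of order; the scheduler sorts it out.
    order = run_tasks([
        ("display", lambda: None, ["filtered"], []),
        ("decode",  lambda: None, [],           ["frame"]),
        ("deblock", lambda: None, ["frame"],    ["filtered"]),
    ])
    ```

    In a real runtime, every task in a `ready` wave would execute concurrently on worker threads; here they run sequentially to keep the ordering logic visible.
    
    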
  • Virtual prototyping for efficient multi-core ECU development of driver assistance systems

    Page(s): 33 - 40

    In recent years, road vehicles have experienced an enormous increase in driver assistance systems such as traffic sign recognition, lane departure warning, and pedestrian detection. Cost-efficient development of electronic control units (ECUs) for these systems is a complex challenge. The demand for shortened time to market makes development even more challenging and thus demands efficient design flows. This paper proposes a model-based design flow that permits simulation-based performance evaluation of multi-core ECUs for driver assistance systems in a pre-development stage. The approach is based on a system-level virtual prototype of a multi-core ECU and allows the evaluation of timing effects by mapping application tasks to different platforms. The results show that performance estimation of different parallel implementation candidates is possible with high accuracy even in a pre-development stage. By adopting the best-fitting parallelization strategy for the final ECU, a reduction in time to market is possible. The design flow is currently being evaluated by Daimler AG and applied to a pedestrian detection system. Results from this application illustrate the benefits of the proposed approach.

  • System modeling and multicore simulation using transactions

    Page(s): 41 - 50

    With the increasing complexity of digital systems that are becoming more and more parallel, a better abstraction to describe such systems has become a necessity. This paper shows how, by using the powerful mechanism of transactions as a concurrency model, and by taking advantage of .NET introspection and attribute programming capabilities, we were able to develop a system-level modeling and parallel simulation environment. We kept the same concepts to describe the architecture of high-level models, such as modules and communication channels. However, unlike SystemC, the behaviour is no longer described as processes and events but as transactions. We implemented scheduling algorithms to enable simulating transactional models in parallel by taking advantage of a multicore machine. These algorithms take into account the dependencies between transactions and the number of cores of the simulation machine. We studied two synchronisation strategies: one using locking and the other using partitioning. An experiment on a WiFi 802.11a transmitter achieved a speedup of about 1.9 using two threads. With 8 threads, although the workload of individual transactions was not significant, we could reach a 5.1 speedup. When the workload is significant, the speedup can reach 6.3.

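    The dependency-aware parallel scheduling this abstract describes can be sketched as follows: transactions are grouped into waves such that every transaction in a wave is independent, and each wave is then executed concurrently on the host's cores. The wave-by-wave grouping is one simple heuristic chosen for illustration; the paper's actual algorithms and its .NET-based infrastructure differ.

    ```python
    from concurrent.futures import ThreadPoolExecutor

    def schedule_waves(transactions, deps):
        """Group transaction names into waves so that a transaction runs
        only after all of its dependencies (deps[name]) have completed."""
        done, waves = set(), []
        remaining = set(transactions)
        while remaining:
            wave = sorted(n for n in remaining if deps.get(n, set()) <= done)
            if not wave:
                raise RuntimeError("dependency cycle between transactions")
            waves.append(wave)
            done.update(wave)
            remaining -= set(wave)
        return waves

    def run_parallel(transactions, deps, workers=2):
        results = {}
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for wave in schedule_waves(transactions, deps):
                # Transactions within a wave are independent by
                # construction, so they may run concurrently.
                for name, value in zip(wave, pool.map(lambda n: transactions[n](), wave)):
                    results[name] = value
        return results

    # Hypothetical transmitter stages, named for illustration only.
    txs = {"fetch": lambda: 1, "encode": lambda: 3, "modulate": lambda: 2}
    deps = {"encode": {"fetch"}, "modulate": {"encode"}}
    out = run_parallel(txs, deps)
    ```

    The achievable speedup is bounded by the widest wave, which mirrors the paper's observation that per-transaction workload determines how well extra threads pay off.
    
    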
  • HNOCS: Modular open-source simulator for Heterogeneous NoCs

    Page(s): 51 - 57

    We present HNOCS (Heterogeneous Network-on-Chip Simulator), an open-source NoC simulator based on OMNeT++. To the best of our knowledge, HNOCS is the first simulator to support modeling of heterogeneous NoCs with variable link capacities and number of VCs per unidirectional port. The HNOCS simulation platform provides an open-source, modular, scalable, extendible and fully parameterizable framework for modeling NoCs. It includes three types of NoC routers: synchronous, synchronous virtual output queue (VoQ) and asynchronous. HNOCS provides a rich set of statistical measurements at the flit and packet levels: end-to-end latencies, throughput, VC acquisition latencies, transfer latencies, etc. We describe the architecture, structure, available models and the features that make HNOCS suitable for advanced NoC exploration. We also evaluate several case studies which cannot be evaluated with any other existing NoC simulator.

  • BADCO: Behavioral Application-Dependent Superscalar Core model

    Page(s): 58 - 67

    Microarchitecture research and development rely heavily on simulators. The ideal simulator should be simple and easy to develop; it should be precise, accurate and very fast. But the ideal simulator does not exist, and microarchitects use different sorts of simulators at different stages of the development of a processor, depending on which is most important, accuracy or simulation speed. Approximate microarchitecture models, which trade accuracy for simulation speed, are very useful for research and design space exploration, provided the loss of accuracy remains acceptable. Behavioral superscalar core modeling is a possible way to trade accuracy for simulation speed in situations where the focus of the study is not the core itself. In this approach, a superscalar core is viewed as a black box emitting requests to the uncore at certain times. A behavioral core model can be connected to a cycle-accurate uncore model. Behavioral core models are built from cycle-accurate simulations. Once the time to build the model is amortized, important simulation speedups can be obtained. We describe and study a new method for defining behavioral models for modern superscalar cores. The proposed Behavioral Application-Dependent Superscalar Core model, BADCO, predicts the execution time of a thread running on a superscalar core with an error of less than 10% in most cases. We show that BADCO is qualitatively accurate, being able to predict how performance changes when we change the uncore. The simulation speedups we obtained are typically between one and two orders of magnitude.

  • An application-specific Network-on-Chip for control architectures in RF transceivers

    Page(s): 68 - 75

    This paper focuses on the design of an on-chip communication system for control architectures used in RF (Radio Frequency) transceivers. Continuous developments and enhancements of RF transceivers, especially of smart transceivers supporting multi-mode standards, have led to new and complex SoC (System-on-Chip) designs. These designs are defined by a distributed controlling concept using several processing modules connected over an advanced communication system. Based on the requirements and restrictions of this communication system, an application-specific NoC (Network-on-Chip) is presented and analyzed in this work.

  • A framework for efficient cache resizing

    Page(s): 76 - 85

    We present a novel framework to dynamically reconfigure on-chip memory resources according to the changing behavior of the running applications. Our framework enables smooth scaling (i.e., resizing) of the on-chip caches, targeting both performance and power efficiency. In contrast to previous approaches, the resizing decisions in our framework are not tainted by transient events (e.g., misses) caused by downsizing, while at the same time avoiding oscillations in cache size due to trial-and-error resizing decisions. This minimizes both the execution time penalty induced by resizing and the effective cache size. Furthermore, an inherent property of our approach is that the actual invalidation of cache blocks and the corresponding write-backs of dirty blocks are asynchronous to the resizing decisions, ensuring a smooth transition from one size to another. This makes it possible to apply our framework even to write-back caches. The proposed mechanism is simple to implement, requiring minimal additional hardware. Using cycle-accurate simulations, we evaluate our proposal against previously proposed techniques. In all cases, our experimental results show significant benefits in both power and performance.

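    The property this abstract emphasizes, that invalidations and write-backs are decoupled from the resizing decision itself, can be sketched with a toy model: on a downsize, displaced dirty blocks are merely queued, and the actual write-backs are drained later, off the critical path. This is a conceptual illustration under assumed simplifications (a fully associative cache, arbitrary victim choice), not the paper's hardware mechanism.

    ```python
    class ResizableCache:
        """Toy fully-associative write-back cache whose capacity can
        shrink at run time without eagerly flushing displaced blocks."""

        def __init__(self, capacity):
            self.capacity = capacity
            self.blocks = {}       # addr -> dirty flag
            self.lazy_wb = []      # dirty blocks awaiting write-back

        def resize(self, new_capacity):
            # The resize decision takes effect immediately, but dirty
            # victims are only queued: their write-backs happen
            # asynchronously, keeping the transition smooth.
            self.capacity = new_capacity
            while len(self.blocks) > self.capacity:
                addr, dirty = self.blocks.popitem()
                if dirty:
                    self.lazy_wb.append(addr)

        def drain_one(self):
            # Called in the background, e.g. when the memory bus is idle.
            if self.lazy_wb:
                self.lazy_wb.pop()  # perform one pending write-back

    c = ResizableCache(4)
    for a in range(4):
        c.blocks[a] = (a % 2 == 0)   # even addresses are dirty
    c.resize(2)
    ```

    Because `resize` never blocks on memory traffic, the decision logic sees no transient penalty, which is the behavior the framework relies on for write-back caches.
    
    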
  • OSR-Lite: Fast and deadlock-free NoC reconfiguration framework

    Page(s): 86 - 95

    Current and future on-chip networks will feature an enhanced degree of reconfigurability. Power management and virtualization strategies, as well as the need to survive the progressive onset of wear-out faults, are root causes for this. In all these cases, a non-intrusive and efficient reconfiguration method is needed to allow the network to function uninterruptedly over the course of the reconfiguration process while remaining deadlock-free. This paper is inspired by the overlapped static reconfiguration (OSR) protocol developed for off-chip networks. In its native form, however, its implementation in NoCs is out of reach. Therefore, we provide a careful engineering of the NoC switch architecture and of the system-level infrastructure to support a cost-effective, complete and transparent reconfiguration process. Performance during the reconfiguration process is not affected, and implementation costs (critical path and area overhead) prove fully affordable for a constrained system. Fewer than 250 cycles are needed for the reconfiguration of an 8×8 2D mesh, with marginal impact on system performance.

  • A tightly-coupled multi-core cluster with shared-memory HW accelerators

    Page(s): 96 - 103

    Tightly coupling hardware accelerators with processors is a well-known approach for boosting the efficiency of MPSoC platforms. The key design challenges in this area are: (i) streamlining accelerator definition and instantiation and (ii) developing architectural templates and run-time techniques for minimizing the cost of communication and synchronization between processors and accelerators. In this paper we present an architecture featuring tightly-coupled processors and hardware processing units (HWPU), with zero-copy communication. We also provide a simple programming API, which simplifies the process of offloading jobs to HWPUs.

  • Architecture-level fault-tolerance for biomedical implants

    Page(s): 104 - 112

    In this paper, we describe the design and implementation of a new fault-tolerant RISC-processor architecture suitable for a design framework targeting biomedical implants. The design targets both soft and hard faults and is original in efficiently combining as well as enhancing classic fault-tolerance techniques. The proposed architecture allows run-time tradeoffs between performance and fault tolerance by means of instruction-level configurability. The system design is synthesized for a UMC 90nm CMOS standard process and is evaluated in terms of fault coverage, area, average power consumption, total energy consumption and performance for various duplication policies and test-sequence schedules. It is shown that area and power overheads of approximately 25% and 32%, respectively, are required to implement our techniques on the baseline processor. The major overheads of the proposed architecture are performance (up to 107%) and energy consumption (up to 157%). It is observed that the average power consumption is often reduced when a higher degree of fault tolerance is targeted. It is shown that test sequences can effectively be scheduled during the available program stalls and that nearly all soft faults are tolerated by using instruction duplication. The main advantages of the proposed architecture are the high portability of the introduced architecture-level fault-tolerance techniques, the flexibility in trading processor overheads for the required fault-tolerance degree, and the affordable area and power consumption overheads.

  • Reconfigurable miniature sensor nodes for condition monitoring

    Page(s): 113 - 119

    Wireless sensor networks are being deployed at an escalating rate for various application fields. The ever-growing number of application areas requires a diverse set of algorithms with disparate processing needs. Wireless sensor networks also need to adapt to the prevailing energy conditions and processing requirements. These reasons rule out the use of a single fixed design; instead, a general-purpose design that can rapidly adapt to different conditions and requirements is desired. In lieu of the traditional inflexible wireless sensor node consisting of a micro-controller, radio transceiver, sensor array and energy storage, we propose a rapidly reconfigurable miniature sensor node, implemented with a transport triggered architecture processor on a low-power Flash FPGA. A comparison of power consumption and silicon area usage between 16-bit fixed-point, 16-bit floating-point and 32-bit floating-point implementations is also presented. The implemented processors and algorithms are intended for rolling bearing condition monitoring, but can be fully extended to other applications as well.

  • Counting stream registers: An efficient and effective snoop filter architecture

    Page(s): 120 - 127

    We introduce a counting stream register snoop filter, which improves the performance of existing snoop filters based on stream registers. Over time, this class of snoop filters loses the ability to filter memory addresses that have been loaded into, and then evicted from, the caches that are filtered; to recover, such filters include cache wrap detection logic, which resets the filter whenever the contents of the cache have been completely replaced. The counting stream register snoop filter introduced here replaces the cache wrap detection logic with a direct-mapped update unit and augments each stream register with a counter, which acts as a validity checker: loading new data into the cache increments the counter, while replacements, snoopy invalidations, and evictions decrement it. A cache wrap is detected whenever the counter reaches zero. Our experimental evaluation shows that the counting stream register snoop filter architecture improves accuracy compared to traditional stream register snoop filters for representative embedded workloads.

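    The counting mechanism from this abstract can be sketched in a few lines: a register tracks which addresses may be cached, a counter tracks the number of live lines, and when the counter hits zero the register is reset, restoring the filter's ability to filter previously seen addresses. This is a deliberately simplified single-register toy (real designs use multiple stream registers holding address ranges, with a direct-mapped update unit); the class and method names are assumptions.

    ```python
    class CountingStreamRegisterFilter:
        """Toy model of a counting snoop filter with one stream register.

        must_snoop() over-approximates: it may answer True for a line
        already evicted, which is safe but wastes a snoop. The counter
        detects the cache wrap and clears the register, recovering
        filtering ability for stale addresses.
        """

        def __init__(self):
            self.addrs = set()   # stand-in for the stream register contents
            self.count = 0       # number of lines currently live in the cache

        def load(self, addr):
            self.addrs.add(addr)
            self.count += 1

        def evict(self, addr):
            # Replacements, snoopy invalidations and evictions all decrement.
            self.count -= 1
            if self.count == 0:
                self.addrs.clear()   # cache wrap detected: reset the register

        def must_snoop(self, addr):
            return addr in self.addrs

    f = CountingStreamRegisterFilter()
    f.load(0x100)
    before = f.must_snoop(0x100)   # line is cached: snoop required
    f.evict(0x100)                 # counter reaches zero, register resets
    after = f.must_snoop(0x100)    # stale address is filtered again
    ```

    The key point the sketch captures is that the reset is triggered by the counter alone, replacing the explicit cache wrap detection logic of earlier designs.
    
    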
  • Design space exploration in application-specific hardware synthesis for multiple communicating nested loops

    Page(s): 128 - 135

    Application-specific MPSoCs are often used to implement high-performance data-intensive applications. MPSoC design requires a rapid and efficient exploration of the hardware architecture possibilities to adequately orchestrate the data distribution and the architecture of parallel MPSoC computing resources. Behavioral specifications of data-intensive applications are usually given in the form of loop-based sequential code, which requires parallelization and task scheduling for an efficient MPSoC implementation. Existing approaches in application-specific hardware synthesis use loop transformations to efficiently parallelize single nested loops and use Synchronous Data Flow to statically schedule and balance the data production and consumption of multiple communicating loops. This creates a separation between data and task parallelism analyses, which can reduce the possibilities for throughput optimization in high-performance data-intensive applications. This paper proposes a method for the concurrent exploration of data and task parallelism when using loop transformations to optimize data transfer and storage mechanisms for both single and multiple communicating nested loops. This method provides orchestrated application-specific decisions on communication architecture, memory hierarchy and computing resource parallelism. It is computationally efficient and produces high-performance architectures.
