Embedded Computer Systems: Architectures, Modeling and Simulation, 2007. IC-SAMOS 2007. International Conference on

Date 16-19 July 2007

  • [Front cover]

    Page(s): C1
    Freely Available from IEEE
  • [Title page]

    Page(s): C2
    Freely Available from IEEE
  • Copyright page

    Page(s): nil1
    Freely Available from IEEE
  • Preface

    Page(s): nil2
    Freely Available from IEEE
  • IC-SAMOS Organization

    Page(s): nil3 - nil5
    Freely Available from IEEE
  • Referees

    Page(s): nil6 - nil7
    Freely Available from IEEE
  • Table of contents

    Page(s): nil8 - nil10
    Freely Available from IEEE
  • In Memoriam Stamatis Vassiliadis (1951 - 2007)

    Page(s): nil11
    Freely Available from IEEE
  • Applying Data Mapping Techniques to Vector DSPs

    Page(s): 1 - 8

    Vector DSPs offer a good performance-to-power-consumption ratio. Therefore, they are suitable for mobile devices in software defined radio applications. These vector DSPs require input algorithms with vector operations. The performance of vectorized algorithms depends to a great extent on the distribution of data on vector elements. Traditional algorithms for vectorization focus on the extraction of parallelism from a program; we propose an analysis tool that focuses on the selection of an efficient dynamic data mapping for vector DSPs. We transferred Garcia's communication parallelism graph for distributed memory multiprocessor systems to vector DSPs. By alternating the representation of two-dimensional data distributions and the cost models, we are able to determine a dynamic mapping of data on vector elements on the EVP. Additionally, we propose a new efficient algorithm for processing the graph representation. We demonstrate our tool by describing the vectorization of two MIMO OFDM algorithms.
  • Instruction Set Encoding Optimization for Code Size Reduction

    Page(s): 9 - 17

    In an embedded system, the cost of storing a program on-chip can be as high as the cost of the microprocessor itself. We examine how much a given application's program size can be reduced when an instruction set is tailored to the application. We provide different algorithms for calculating an optimized instruction set and evaluate their impact on the size of several benchmark programs. Our results show that an average reduction of 11% is possible, and further improvement can be achieved by changing the instruction length of the given architecture. However, compiling other applications with such an optimized instruction set might produce larger code sizes.
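
    As a rough illustration of why tailoring an encoding to one application can shrink its code, the sketch below gives frequent opcodes shorter bit patterns using a plain Huffman code and compares the result with a fixed-width opcode field. This is not the algorithm from the paper, which the abstract does not describe; the instruction names and counts are hypothetical, and only opcode bits are counted.

```python
import heapq
from math import ceil, log2


def huffman_lengths(freq):
    """Return {symbol: code length in bits} for a Huffman code over freq."""
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap items: (total weight, unique tie-breaker, {symbol: depth so far})
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


def opcode_bits(freq, lengths):
    """Total opcode bits for a program with the given static instruction counts."""
    return sum(freq[s] * lengths[s] for s in freq)


# Hypothetical static instruction counts for one application
counts = {"load": 40, "add": 30, "store": 15, "branch": 10, "mul": 5}

fixed = {op: ceil(log2(len(counts))) for op in counts}  # 3-bit opcode field
variable = huffman_lengths(counts)

fixed_bits = opcode_bits(counts, fixed)
variable_bits = opcode_bits(counts, variable)
print(fixed_bits, variable_bits)
```

    With these assumed counts the variable-length opcodes need fewer bits than the fixed field. A program with a different instruction mix could come out larger under the same code, which is the caveat the abstract ends on.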
  • FlexCore: Utilizing Exposed Datapath Control for Efficient Computing

    Page(s): 18 - 25

    We introduce FlexCore, the first exemplar of an architecture based on the FlexSoC framework. Comprising the same datapath units found in a conventional five-stage pipeline, the FlexCore has an exposed datapath control and a flexible interconnect to allow the datapath to be dynamically reconfigured as a consequence of code generation. Additionally, the FlexCore allows specialized datapath units to be inserted and utilized within the same architecture and compilation framework. This study shows that, in comparison to a conventional five-stage general-purpose processor, the FlexCore is up to 40% more efficient in terms of cycle count on a set of benchmarks from the embedded application domain. We show that both the fine-grained control and the flexible interconnect contribute to the speedup. Furthermore, according to our VLSI implementation study, the FlexCore architecture offers both time and energy savings. The exposed FlexCore datapath requires a wide control word. The conducted evaluation confirms that this increases the instruction bandwidth and memory footprint. This calls for efficient instruction decoding as proposed in the FlexSoC
  • Prototyping Efficient Interprocessor Communication Mechanisms

    Page(s): 26 - 33

    Parallel computing systems are becoming widespread and growing in sophistication. Besides simulation, rapid system prototyping is becoming important in designing and evaluating their architecture. We present an efficient FPGA-based platform that we developed and use for research and experimentation on high speed interprocessor communication, network interfaces and interconnects. Our platform supports advanced communication capabilities such as remote DMA, remote queues, zero-copy data delivery and flexible notification mechanisms, as well as link bundling for increased performance. We report on the platform architecture, its design cost, complexity and performance (latency and throughput). We also report our experiences from implementing benchmarking kernels and a user-level benchmark application, and show how software can take advantage of the provided features, but also expose the weaknesses of the system.
  • Design Space Exploration of Configuration Manager for Network Processing Applications

    Page(s): 34 - 40

    Current FPGAs provide a powerful platform for network processing applications. The main challenge is the exploitation of the reconfiguration to increase the performance of the system. In this paper, a design space exploration framework is presented to design a reconfigurable platform for multi-service network processing applications. An integrated design flow is presented from the system level analytical design to the implementation level. Furthermore, the design of an efficient configuration manager is presented in which the platform adaptation is performed for optimum speedup with minimum overhead, taking into account the reconfiguration overhead and the network characteristics (packet type distribution, network stability). Finally, a case study is presented in which the platform is used to process three network flows with different processing requirements.
  • Design Space Exploration of Media Processors: A Parameterized Scheduler

    Page(s): 41 - 49

    This paper describes an enhanced list scheduling algorithm used in a parameterized assembler. The assembler, which is configurable in terms of architectural parameters, is used in a new environment for exploring and optimizing VLIW architectures for multimedia applications. A generic VLIW architecture with a novel register file structure is used as a base architecture. The proposed scheduling algorithm includes sophisticated features. A backtracking technique allows inappropriate scheduling decisions to be undone, while an advanced resource conflict function makes it possible to work with different VLIW architecture configurations. Moreover, local register allocation in conjunction with the instruction scheduling process is also implemented for obtaining better code compaction. Two different multimedia tasks are implemented to check the correctness of the generated code for different architecture configurations. The code compaction efficiency, when scheduling these applications for different VLIW architecture configurations with a partitioned register file and limited number of functional units, reaches up to 94% of the compaction efficiency for the same configuration with an unconstrained register file and unlimited number of functional units.
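
    For readers unfamiliar with the baseline, here is a minimal sketch of ordinary list scheduling, the technique the paper enhances. It has none of the paper's extensions (no backtracking, no register allocation, no configurable resource conflict function). It assumes an acyclic dependence graph, a machine that can issue `num_units` operations per cycle on identical pipelined units, and made-up operation names and latencies.

```python
def list_schedule(deps, latency, num_units):
    """Greedy cycle-by-cycle list scheduling.

    deps: {op: set of predecessor ops} (must be acyclic)
    latency: {op: cycles until the result is available}
    num_units: operations that can be issued per cycle
    Returns {op: start cycle}.
    """
    succs = {op: [] for op in deps}
    for op, preds in deps.items():
        for p in preds:
            succs[p].append(op)

    # Priority: length of the longest latency path from op to the end of the graph
    prio = {}

    def path_len(op):
        if op not in prio:
            prio[op] = latency[op] + max((path_len(s) for s in succs[op]), default=0)
        return prio[op]

    for op in deps:
        path_len(op)

    start, finish = {}, {}
    cycle = 0
    while len(start) < len(deps):
        # An op is ready once all of its predecessors have produced their results
        ready = [op for op in deps if op not in start
                 and all(p in finish and finish[p] <= cycle for p in deps[op])]
        ready.sort(key=lambda op: (-prio[op], op))  # critical path first, name as tie-break
        for op in ready[:num_units]:
            start[op] = cycle
            finish[op] = cycle + latency[op]
        cycle += 1
    return start


# Hypothetical dependence graph: c needs a and b, d needs c, e is independent
deps = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"c"}, "e": set()}
latency = {"a": 1, "b": 1, "c": 2, "d": 1, "e": 1}

two_wide = list_schedule(deps, latency, num_units=2)
one_wide = list_schedule(deps, latency, num_units=1)
print(two_wide, one_wide)
```

    Because the scheduler never revisits a decision, a poor early choice stays in the schedule; that is the kind of limitation a backtracking step is meant to address.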
  • Automatic Bus Matrix Synthesis based on Hardware Interface Selection for Fast Communication Design Space Exploration

    Page(s): 50 - 57

    In this paper, we present an automated bus matrix synthesis flow for efficient system-on-chip communication design space exploration at the transaction level. In particular, we consider hardware interface design, since it affects overall system cost and performance. Depending on the bus interface, a hardware block can be a master or a slave. We propose a method to solve this hardware interface selection problem by analyzing communication behavior statically. In addition, in order to explore the communication design space quickly, we automatically generate transaction level models for the hardware blocks according to the hardware interface selection. The synthesis result is verified by transaction level simulation with a commercial tool. We give experimental results with a JPEG encoder and an H.264 encoder to demonstrate the efficiency of the proposed method. The results show that with our automated synthesis flow, the designer can easily and quickly obtain better communication designs through fast design space exploration. More specifically, our hardware interface selection technique reduces the bus matrix area by 41.43% with 0.58% performance overhead on average compared to the case of maximum performance.
  • Systematic Data Structure Exploration of Multimedia and Network Applications realized Embedded Systems

    Page(s): 58 - 65

    In recent years, there has been a trend towards implementing network and multimedia applications in portable devices. These applications usually contain complex dynamic data structures. The appropriate selection of the dynamic data type (DDT) combination of an application affects the performance and the energy consumption of the whole system. Thus, a DDT exploration methodology is used to perform tradeoffs between design factors, such as performance and energy consumption. In this paper we provide a new approach to the DDT exploration procedure, based on a new library of DDTs which remedies the limitations of an existing one and allows the DDT optimization of a wide range of application domains. Using the new library, we performed DDT exploration in network and multimedia benchmarks and achieved performance and energy consumption improvements of up to 85% and 43% respectively.
  • On the Problem of Minimizing Workload Execution Time in SMT Processors

    Page(s): 66 - 73

    Most research work on simultaneous multithreading (SMT) processors focuses on improving throughput and/or fairness, or on prioritizing some threads over others in a workload. In this paper, we discuss a new problem not previously addressed in the SMT literature. We call this problem workload execution time (WET) minimization. It consists of reducing the total execution time of all threads in a workload. This problem arises in parallel applications, where it is common for a single master thread to spawn several child jobs. The master job cannot continue until all child jobs have finished. Reducing the overall execution time is important to speed up the application. This paper is a first step in analyzing this problem. First, we analyze the WET provided by the best fetch policies tuned for improving throughput/fairness. We demonstrate that these policies achieve less than optimum performance. We show that, on average, for the workloads evaluated in this paper, there is room for improvement of up to 18 percentage points. It follows that novel mechanisms trying to reduce WET are required to speed up parallel applications.
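
    The metric can be made concrete with a small sketch. Since the master waits for every child, WET is read here as the completion time of the slowest thread; that reading, the function names and the cycle counts are assumptions for illustration, not data from the paper.

```python
def workload_execution_time(finish_cycles):
    """WET: the master resumes only when the slowest child thread has finished."""
    return max(finish_cycles)


def mean_completion_time(finish_cycles):
    """Average per-thread completion time, a metric that ignores the straggler."""
    return sum(finish_cycles) / len(finish_cycles)


# Hypothetical completion cycles of three child threads under two fetch policies
policy_a = [100, 100, 180]  # favours two threads, leaves one straggler
policy_b = [130, 130, 130]  # balances progress across threads

for name, finish in (("A", policy_a), ("B", policy_b)):
    print(name, workload_execution_time(finish), round(mean_completion_time(finish), 1))
```

    In this made-up case policy A looks better on average completion time, yet the master would resume 50 cycles earlier under policy B. This is the sort of gap between per-thread metrics and WET that the abstract points to.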
  • Performance and Power Analysis of Parallelized Implementations on an MPCore Multiprocessor Platform

    Page(s): 74 - 81

    In this contribution, the potential of parallelized software that implements algorithms of digital signal processing on a multicore processor platform is analyzed. For this purpose various digital signal processing tasks have been implemented on a prototyping platform, i.e. an ARM MPCore featuring four ARM 11 processor cores. In order to analyze the effect of parallelization on the resulting performance-power ratio, influencing parameters such as the number of issued program threads have been studied. For parallelization the OpenMP programming model has been used, which can be efficiently applied at the C level. In order to develop power efficient code, a functional and instruction level power model of the MPCore has also been derived, which features a high estimation accuracy. Using this power model and exploiting the capabilities of OpenMP, a variety of exemplary tasks could be efficiently parallelized. The general efficiency potential of parallelization for multiprocessor architectures can be assembled.
  • An Interrupt Controller for FPGA-based Multiprocessors

    Page(s): 82 - 87

    Interrupt-based programming is widely used for interfacing a processor with peripherals and allowing software threads to interact. Many hardware/software architectures have been proposed in the past to support this kind of programming practice. In the context of FPGA-based multiprocessors this topic has not yet been thoroughly addressed. This paper presents the architecture of an interrupt controller for an FPGA-based multiprocessor composed of standard off-the-shelf softcores. The main feature of this device is to distribute multiple interrupts across the cores of a multiprocessor. In addition, our architecture supports several advanced features like booking, broadcasting and inter-processor interrupt. On top of this hardware layer, we provide a software library to effectively exploit this mechanism. We realized a prototype of this system. Our experiments show that our interrupt controller efficiently distributes multiple interrupts on the system.
  • Application Case Studies on HS-Scale, a MP-SOC for Embedded Systems

    Page(s): 88 - 95

    Scalability of architecture, programming model and task control management will be a major challenge for MP-SOC designs in the coming years. The contribution presented in this paper is HS-Scale, a hardware/software framework to study, define and experiment with scalable solutions for next generation MP-SOC. The hardware architecture, H-Scale, is a homogeneous MP-SOC based on RISC processors, distributed memories and a globally asynchronous/locally synchronous network on chip. S-Scale is the software support to program H-Scale. It is a multithreaded sequential programming model with dedicated communication primitives handled at run-time by a simple operating system we developed. The hardware validations on FPGA and CMOS 90 nm technology and the experimental case studies on several applications (FIR, DES and MJPEG) demonstrate the scalability of our approach and draw interesting perspectives to automate task placement and duplication.
  • A Hardware/Software Architecture for Tool Path Computation. An Application to Turning Lathe Machining

    Page(s): 96 - 102

    Tool path generation is one of the most complex problems in computer aided manufacturing. Although some efficient strategies have been developed, most of them are only useful for standard machining. The algorithm called virtual digitizing avoids this problem by its own definition, but its computing cost is high, which makes it difficult to integrate in standard machining in order to adopt the new ISO standard 14649. This paper presents a virtual digitizing hardware/software architecture that takes advantage of field programmable gate arrays (FPGAs) to improve the algorithm efficiency and, at the same time, to meet the actual restrictions of traditional computer numeric control systems. In order to evaluate the architecture, a prototype was implemented using a commercial reconfigurable platform integrated within a CNC lathe for shoe last machining. The performance of the system for tool path generation was measured for different trajectory and surface precisions using a database of real shoe models. The experiments show a significant speedup in all cases while keeping the error of the results below the maximum allowed.
  • Energy efficiency of mobile video decoding

    Page(s): 103 - 109

    In this paper, we consider the energy efficiency of implementations of video codecs for mobile devices in a top-down manner. We start from typical applications and analyse device architectures, codec implementations, and software platforms. The physical size of mobile devices limits their heat dissipation, while the battery capacity needs to be used sparingly to provide a satisfactory untethered active use time. Together with the required versatile capabilities of the devices, these are essential constraints that must be taken into account from hardware to application software design. In video decoding, additional constraints come from the need to support multiple digital video coding standards, and the platform oriented design regimes of the device manufacturers.
  • Instruction-Level Fault Tolerance Configurability

    Page(s): 110 - 117

    Fault tolerance (FT) is becoming increasingly important in computing systems. FT features are based on some form of redundancy, which adds a significant cost to a system, either increasing the required amount of hardware resources or degrading performance. To enable a user to choose between stronger FT or performance, some schemes have been proposed, which can be configured for each application to use the available redundancy to increase either reliability or performance. We propose to have an instruction-level, rather than application-level, configurability of this kind, since some applications (for example, multimedia) can have different reliability requirements for their different parts. We propose to apply weaker (or no) FT techniques to the less critical parts. This yields a certain time or resource gain, which can be used to apply stronger FT techniques to the more critical parts, thereby increasing the overall FT. We show how some existing FT techniques can be adapted to support instruction-level FT configurability, and how a programmer can specify the desired FT of particular instructions or blocks of instructions in assembly or in a high-level programming language. In some cases, the compiler can assign the FT level to instructions automatically. Experimental results demonstrate that reducing the FT of non-critical instructions can lead to significant performance gains compared to a redundant execution of all the instructions. The fault coverage of this scheme is also evaluated, demonstrating that it is very application-specific. For some applications the fault coverage is quite acceptable, but for others it is unacceptable.
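
    The trade-off can be pictured with a toy model of selective redundancy: operations marked critical are executed twice and compared, others run once. This is only a software analogy under assumed names (`execute`, `FaultDetected`, the `faulty` flag); the paper's actual mechanisms and annotation syntax are not given in the abstract.

```python
class FaultDetected(Exception):
    """Raised when two redundant executions of an operation disagree."""


def execute(op, args, critical, faulty=False):
    """Run op once, or twice with a comparison when it is marked critical.

    faulty=True flips the lowest bit of the first result to simulate a
    transient fault (integer results only). Returns (result, executions).
    """
    first = op(*args)
    if faulty:
        first ^= 1
    if not critical:
        return first, 1          # no redundancy: cheap, but a fault goes unnoticed
    second = op(*args)           # redundant execution
    if first != second:
        raise FaultDetected(f"mismatch: {first} != {second}")
    return first, 2


def add(a, b):
    return a + b


print(execute(add, (2, 3), critical=False))               # one execution
print(execute(add, (2, 3), critical=True))                # two executions, results agree
print(execute(add, (2, 3), critical=False, faulty=True))  # wrong result passes silently
try:
    execute(add, (2, 3), critical=True, faulty=True)
except FaultDetected as err:
    print("detected:", err)
```

    Marking fewer operations as critical lowers the execution count, which corresponds to the performance gain the abstract reports, at the price of faults that escape detection in the unprotected parts.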
  • The Weight-Watcher Service and its Lightweight Implementation

    Page(s): 118 - 127

    This paper presents the weight-watcher service. This service aims at providing resource consumption measurements and estimations for software executing on resource-constrained devices. By using the weight-watcher, software components can continuously adapt and optimize their quality of service with respect to resource availability. The interface of the service is composed of a profiler and a predictor. We present an implementation that is lightweight in terms of CPU and memory. We also performed various experiments that convey (a) the tradeoff between the memory consumption of the service and the accuracy of the prediction, as well as (b) a maximum overhead of 10% on the execution speed of the VM for the profiler to provide accurate measurements.
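
    One way to picture the profiler/predictor split and the memory-versus-accuracy trade-off is a profiler that keeps a bounded history of measurements and a predictor that averages it. The class names, the moving-average estimate and the sample values are all assumptions made for this sketch; the abstract does not say how the real service predicts.

```python
from collections import deque


class Profiler:
    """Keeps only the most recent measurements to bound memory use."""

    def __init__(self, history_size):
        self.samples = deque(maxlen=history_size)

    def record(self, cost):
        self.samples.append(cost)  # the oldest sample is dropped when full


class Predictor:
    """Estimates the next cost as the mean of the retained samples."""

    def __init__(self, profiler):
        self.profiler = profiler

    def estimate(self):
        samples = self.profiler.samples
        return sum(samples) / len(samples) if samples else None


# Hypothetical measured costs (e.g. CPU time units) of one component
measurements = (10, 20, 60)

small = Profiler(history_size=2)
large = Profiler(history_size=3)
for cost in measurements:
    small.record(cost)
    large.record(cost)

print(Predictor(small).estimate(), Predictor(large).estimate())
```

    A shorter history uses less memory and reacts faster to change, while a longer one smooths out spikes; which is more accurate depends on the workload.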
  • COSMOS: A System-Level Modelling and Simulation Framework for Coprocessor-Coupled Reconfigurable Systems

    Page(s): 128 - 136

    Dynamically reconfigurable systems demand complicated run-time management. Due to resource constraints and reconfiguration latencies, efficient reconfiguration strategies that can reduce the overhead cost of dynamic reconfiguration need to be studied. In this paper, we i) propose a reconfigurable task model which extends the classical real-time task model to support the additional states and latencies needed to capture dynamically reconfigurable behavior, ii) propose a coprocessor-coupled reconfigurable architecture which has hardware runtime support for task execution, task reallocation and resource management, and iii) present a SystemC based framework to model and simulate coprocessor-coupled reconfigurable systems. We illustrate how COSMOS may be used to capture the dynamic behavior of such systems and emphasize the need for capturing the system aspects of such systems in order to deal with future design challenges of dynamically reconfigurable systems.