
International Conference on Hardware/Software Codesign and System Synthesis (CODES + ISSS 2004)

Date: 8-10 Sept. 2004


Displaying Results 1 - 25 of 61
  • Dual-pipeline heterogeneous ASIP design

    Page(s): 12 - 17

    We demonstrate the feasibility of a dual-pipeline application-specific instruction set processor (ASIP). Starting from a C program, we create a target instruction set by compiling to a basic instruction set, from which some instructions are merged while others are discarded. Based on the target instruction set, the parallelism of the application program is analyzed and two unique instruction sets are generated for a heterogeneous dual-pipeline processor. The dual-pipeline processor is created by building two unique ASIPs (VHDL descriptions) with the ASIP-Meister Tool Suite and fusing the two VHDL descriptions into a dual-pipeline processor. Our results show that, in comparison with a single-pipeline application-specific instruction set processor, performance improves by 27.6% and switching activity is reduced by 6.1% for a number of benchmarks. These improvements come at the cost of increased area, which is 16.7% on average for the benchmarks considered.

  • Fast cosimulation of transformative systems with OS support on SMP computer

    Page(s): 164 - 169

    Transformative applications are a class of dataflow computation characterized by iterative behavior. The problem of partitioning a transformative application specification onto a set of available hardware (HW) and software (SW) processing elements (PEs) and deriving a job execution order (scheduling) on them has been well studied, but obtaining fast simulation of these applications poses different constraints. In this paper, we propose an efficient framework for a symmetric multi-processor (SMP) simulation host to achieve fast HW/SW co-simulation of transformative applications, given the partitioning solutions and the derived schedules. The framework overcomes limitations of the existing Linux SMP kernel and requires only a reasonable amount of modification to it. We also present a heuristic algorithm that effectively assigns simulation tasks to the processors on the simulation host, considering both the average job simulation time on each processor and other simulation overheads. Our experiments show that the algorithm finds satisfactory suboptimal solutions with very little computation time. Based on the task assignment solution, simulation time can be reduced by 25% to 50% compared with the obvious but naive approach.
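    As a rough illustration of the kind of assignment heuristic the abstract describes, the sketch below greedily places each simulation job on the host processor that would currently finish it earliest, using per-processor estimates of job simulation time plus a fixed overhead term. This is a generic greedy sketch under assumed inputs (the job_time table, the overhead constant), not the algorithm from the paper.

    ```c
    #include <stdio.h>

    #define NJOBS  6
    #define NPROCS 2

    /* Hypothetical inputs: estimated simulation time of each job on each host
     * processor, plus a fixed per-job scheduling overhead.  Values are made up. */
    static const double job_time[NJOBS][NPROCS] = {
        {3.0, 3.5}, {2.0, 1.8}, {4.0, 4.2}, {1.0, 1.1}, {2.5, 2.4}, {3.2, 3.0}
    };
    static const double overhead = 0.2;

    int main(void)
    {
        double load[NPROCS] = {0};
        for (int j = 0; j < NJOBS; j++) {
            /* Pick the processor on which this job would finish earliest. */
            int best = 0;
            double best_finish = load[0] + job_time[j][0] + overhead;
            for (int p = 1; p < NPROCS; p++) {
                double finish = load[p] + job_time[j][p] + overhead;
                if (finish < best_finish) { best_finish = finish; best = p; }
            }
            load[best] = best_finish;
            printf("job %d -> processor %d (finish %.1f)\n", j, best, best_finish);
        }
        return 0;
    }
    ```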

  • Fast cycle-accurate simulation and instruction set generation for constraint-based descriptions of programmable architectures

    Page(s): 18 - 23

    State-of-the-art architecture description languages have been successfully used to model application-specific programmable architectures that are limited to particular control schemes. We introduce a language and methodology that provide a framework for constructing and simulating a wider range of architectures. The framework exploits the fact that designers are often concerned only with data paths, not the instruction set and control. In the framework, each processing element is described in a structural language that requires only the specification of the data path and constraints on how it can be used. From such a description, the supported operations of the processing element are automatically extracted and a controller is generated. Various architectures are then realized by composing the processing elements. Furthermore, hardware descriptions and bit-true cycle-accurate simulators are generated automatically. Results show that our simulators are up to an order of magnitude faster than other reported simulators of this type and two orders of magnitude faster than equivalent Verilog simulations.

  • Power-aware communication optimization for networks-on-chips with voltage scalable links

    Page(s): 170 - 175

    Networks-on-chip (NoCs) are emerging as a practical platform for future systems-on-chip products. We propose an energy-efficient static algorithm that optimizes the energy consumption of task communications in NoCs with voltage-scalable links. In order to find optimal link speeds, the proposed algorithm, based on a genetic formulation, globally explores the design space of NoC-based systems, including task assignment, tile mapping, routing path allocation, task scheduling, and link speed assignment. Experimental results show that the proposed design technique reduces energy consumption by 28% on average compared with existing techniques.
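    For intuition only, the fragment below shows one way a chromosome for such a genetic formulation might be encoded and mutated: each gene carries a task-to-tile assignment and a discrete link-speed level. The struct, its field names, and the number of speed levels are hypothetical assumptions; the paper's actual encoding, crossover, and fitness evaluation are not reproduced here.

    ```c
    #include <stdlib.h>

    #define NTASKS       8
    #define NTILES       16
    #define SPEED_LEVELS 4   /* e.g. four voltage/frequency settings per link */

    /* Hypothetical chromosome: where each task is mapped and at which speed
     * level its outgoing link operates.  Illustrative encoding only. */
    typedef struct {
        int tile[NTASKS];        /* task -> tile mapping             */
        int link_speed[NTASKS];  /* task -> speed level of its link  */
    } chromosome_t;

    /* Mutate one randomly chosen gene: either remap a task or rescale a link. */
    void mutate(chromosome_t *c)
    {
        int t = rand() % NTASKS;
        if (rand() % 2)
            c->tile[t] = rand() % NTILES;
        else
            c->link_speed[t] = rand() % SPEED_LEVELS;
    }
    ```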

  • Future challenges in embedded systems


    Embedded systems will play a key role in driving technological evolution over the next 20 years. Their evolution is further accelerated by the diffusion of technologies that are deeply changing the landscape, among them nanotechnologies, bioelectronics, and photonics. The central role of embedded systems in the economy grows ever stronger: the starting point is the convergence between storage, security, video, audio, mobility, and connectivity. Systems are converging, and ICs are increasingly converging with systems; this poses a number of challenges for designers and technologists. A key issue is defining the right methodologies for translating system knowledge and competences into complex embedded systems, taking into account many system requirements and constraints. The key factor in winning this challenge is building the right culture: creating the right environment to exploit existing design, architectural, and technological solutions, and to favor the transfer of knowledge from one application field to another.

  • Transaction level modeling: flows and use models

    Page(s): 75 - 80

    Transaction-level models (TLMs) address the problems of designing increasingly complex systems by raising the level of design abstraction above RTL. However, TLM terminology is presently a subject of contentious debate, and a coherent set of TLM use models has not been proposed. In this work we propose a variety of TLM use models that reveal paths through the TLM abstraction levels for various types of system. We begin by stating the abstraction levels that comprise 'transaction level' and identify the roles and responsibilities that apply within the use models. We then take each use model and discuss the type of system it applies to, the TLM abstraction levels it supports, and the design activities applied at those levels. We also consider the distribution of modeling effort between the various design roles and apply that to descriptions of various use-model design flows.

  • Exploiting polymorphism in HW design: a case study in the ATM domain

    Page(s): 81 - 85

    The need to raise the level of abstraction and improve reuse in HW design suggests the adoption of an object-oriented (OO) design methodology based on SystemC-Plus (an enhanced SystemC). Such a methodology, developed during the ODETTE IST project, allows the key features of the OO paradigm (information hiding, inheritance, and polymorphism) to be exploited at the behavioral level of description while guaranteeing synthesizability. In this context, the goal of this work is to highlight the advantages and drawbacks of exploiting polymorphism in the design of an ATM component: the UTOPIA cells handler.

  • Current flattening in software and hardware for security applications

    Page(s): 218 - 223

    This work presents a new current flattening technique applicable in software and hardware. The technique is important in embedded cryptosystems, since power analysis attacks (which exploit the dependency of current variation on data and program) compromise the security of the system. The technique flattens the current internally by exploiting current consumption differences at the instruction level. Code transformations that reduce current variation due to program dependencies are presented. A real-time hardware architecture capable of reducing the current's dependency on data and program is also proposed. Measured and simulated current waveforms of cryptographic software are presented in support of these techniques.

  • Detecting overflow detection

    Page(s): 36 - 41

    Fixed-point saturating arithmetic is widely used in media and digital signal processing applications. A number of processor architectures provide instructions that implement saturating operations. However, standard high-level languages, such as ANSI C, provide no direct support for saturating arithmetic, so applications written in standard languages have to implement saturating operations in terms of basic two's-complement operations. To provide fast execution of such programs, it is important to have an optimizing compiler automatically detect and convert the appropriate code fragments to hardware instructions. We present a set of techniques for automatic recognition of saturating arithmetic operations. We show that in most cases the recognition problem is simply one of Boolean circuit equivalence. Given the expense of solving circuit equivalence, we develop a set of practical approximations based on abstract interpretation. Experiments show that our techniques, while reliably recognizing saturating arithmetic, have small compile-time overhead. We also demonstrate that our approach is not limited to saturating arithmetic but is directly applicable to recognizing other idioms, such as add-with-carry and absolute value.
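    To make the recognition target concrete, the sketch below shows a typical source-level idiom for a 32-bit signed saturating add written with ordinary two's-complement operations, which is the kind of fragment such a compiler pass would try to map onto a single hardware saturating-add instruction. The helper name sat_add32 is hypothetical, and the paper's recognition algorithm itself is not shown.

    ```c
    #include <stdint.h>

    /* Saturating add expressed with plain two's-complement arithmetic:
     * compute the exact sum in a wider type, then clamp on overflow. */
    int32_t sat_add32(int32_t a, int32_t b)
    {
        int64_t wide = (int64_t)a + (int64_t)b;   /* exact sum            */
        if (wide > INT32_MAX) return INT32_MAX;   /* positive overflow    */
        if (wide < INT32_MIN) return INT32_MIN;   /* negative overflow    */
        return (int32_t)wide;                     /* no overflow: as-is   */
    }
    ```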

  • Energy-efficient flash-memory storage systems with an interrupt-emulation mechanism

    Page(s): 134 - 139

    One of the emerging critical issues for flash-memory storage systems, especially in many embedded-system implementations, is their programmed-I/O nature for data transfers. Programmed-I/O-based data transfers not only waste valuable CPU cycles but may also consume considerably more battery energy than necessary. This work presents an interrupt-emulation mechanism for flash-memory storage systems together with an energy-efficient management strategy. We propose to revise the waiting function in the memory-technology-device (MTD) layer to relieve the microprocessor from busy waiting and to reduce the energy consumption of the system. Our experiments show that energy consumption can be significantly reduced, with substantial savings in CPU cycles and only a minor increase in average response time.
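    As a loose illustration of the contrast the abstract draws, the sketch below places a busy-waiting wait function next to one that yields the CPU until an (emulated) completion event before re-checking device status. flash_ready() and sleep_until_flash_event() are hypothetical placeholder stubs standing in for the MTD-layer details; they are not Linux kernel APIs or code from the paper.

    ```c
    /* Hypothetical device-side helpers; trivial stubs stand in for real
     * MTD-layer code so the fragment compiles. */
    static int  flash_ready(void)                        { return 1; }
    static void sleep_until_flash_event(unsigned max_us) { (void)max_us; }

    /* Programmed-I/O style: the CPU spins on the status bit, wasting cycles
     * and energy for the whole duration of the flash operation. */
    void wait_busy(void)
    {
        while (!flash_ready())
            ; /* busy wait */
    }

    /* Interrupt-emulation style: the caller sleeps until a completion event
     * (or a timeout sized to the expected operation time), then re-checks. */
    void wait_interrupt_emulated(unsigned expected_op_us)
    {
        do {
            sleep_until_flash_event(expected_op_us);
        } while (!flash_ready());
    }
    ```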

  • A loop accelerator for low power embedded VLIW processors

    Page(s): 6 - 11

    The high transistor density afforded by modern VLSI processes has enabled the design of embedded processors that use clustered execution units to deliver high levels of performance. However, delivering data to the execution resources in a timely manner remains a major problem that limits ILP, and it is particularly significant for embedded systems, where memory and power budgets are limited. A distributed address generation and loop acceleration architecture for VLIW processors is presented. This decentralized on-chip memory architecture uses multiple SRAMs to provide high intra-processor bandwidth. Each SRAM has an associated stream address generator capable of implementing a variety of addressing modes in conjunction with a shared loop accelerator. The architecture is especially useful for generating application-specific embedded processors, particularly for processing input data organized as a stream. The idea is evaluated in the context of a fine-grained VLIW architecture executing complex perception algorithms such as speech and visual feature recognition. Transistor-level Spice simulations demonstrate a 159x improvement in the energy-delay product compared with conventional architectures executing the same applications.
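    To make the idea of a stream address generator concrete, the sketch below models one simple addressing mode (base + index * stride) in software: stepped once per access, it yields the next SRAM address of a stream. The struct and field names are illustrative assumptions, not the hardware described in the paper, and real generators typically support further modes such as circular or bit-reversed addressing.

    ```c
    #include <stdint.h>

    /* Software model of a strided stream address generator. */
    typedef struct {
        uint32_t base;    /* first address of the stream        */
        uint32_t stride;  /* distance between consecutive items */
        uint32_t count;   /* number of items in the stream      */
        uint32_t index;   /* current position                   */
    } stream_gen_t;

    /* Return the next address, or 0xFFFFFFFF once the stream is exhausted. */
    uint32_t stream_next(stream_gen_t *g)
    {
        if (g->index >= g->count)
            return 0xFFFFFFFFu;
        return g->base + g->stride * g->index++;
    }
    ```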

  • Organic computing - on the feasibility of controlled emergence

    Page(s): 2 - 5

    This work gives an introduction to the research area of organic computing and shows the chances, opportunities, and problems currently tackled by researchers. First, the visions that led to this research area are briefly discussed. It is shown that emergence, a central phenomenon in organic computing, is a typical bottom-up effect with the interesting property of generating order from randomness. Classical design, however, is a top-down process. This apparent contradiction can be overcome by introducing so-called observer/controller architectures, which open the possibility of controlled emergence. The paper concludes with a description of current research problems in organic computing.

  • CPU scheduling for statistically-assured real-time performance and improved energy efficiency

    Page(s): 110 - 115

    We present a CPU scheduling algorithm, called the energy-efficient utility accrual algorithm (EUA), for battery-powered, embedded real-time systems. We consider an embedded software application model in which repeatedly occurring application activities are subject to deadline constraints specified using step time/utility functions. For battery-powered embedded systems, system-level energy consumption is also a primary concern. We consider CPU scheduling that (1) provides assurances on individual and collective application timeliness behaviors and (2) maximizes system-level timeliness and energy efficiency. Since the scheduling problem is intractable, EUA heuristically computes CPU schedules at polynomial-time cost. Several properties of EUA are analytically established, including timeliness optimality during under-load situations and statistical assurances on timeliness behavior. Furthermore, our simulation results confirm EUA's superior performance.
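    A step time/utility function of the kind the abstract refers to can be stated very compactly: an activity accrues a fixed utility if it completes by its deadline and zero afterwards. The sketch below encodes that shape and sums the accrued utility over a few completed activities; the struct, names, and numbers are hypothetical and only illustrate the model, not the EUA algorithm itself.

    ```c
    #include <stdio.h>

    /* Step time/utility function: full utility up to the deadline, zero after. */
    typedef struct {
        double utility;   /* utility accrued if the activity meets its deadline */
        double deadline;  /* absolute deadline (seconds)                        */
    } step_tuf_t;

    static double accrued(const step_tuf_t *t, double finish_time)
    {
        return (finish_time <= t->deadline) ? t->utility : 0.0;
    }

    int main(void)
    {
        /* Hypothetical activities and their finish times. */
        step_tuf_t tufs[3]   = {{10.0, 1.0}, {5.0, 2.0}, {8.0, 3.0}};
        double     finish[3] = {0.8, 2.5, 2.9};

        double total = 0.0;
        for (int i = 0; i < 3; i++)
            total += accrued(&tufs[i], finish[i]);
        printf("total accrued utility = %.1f\n", total);  /* 10 + 0 + 8 = 18 */
        return 0;
    }
    ```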

  • Multi-objective mapping for mesh-based NoC architectures

    Page(s): 182 - 187

    We present an approach to multi-objective exploration of the mapping space of a mesh-based network-on-chip architecture. Based on evolutionary computing techniques, the approach is an efficient and accurate way to obtain the Pareto mappings that optimize performance and power consumption. Integrating the approach into an exploration framework whose kernel is an event-driven, trace-based simulator makes it possible to take into account important dynamic effects that have a great impact on mapping. Validation on both synthesized traffic and real applications (an MPEG-2 encoder/decoder system) confirms the efficiency, accuracy, and scalability of the approach.
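    For readers unfamiliar with Pareto mappings, the check below captures what "Pareto" means for the two objectives mentioned (performance and power): one mapping dominates another if it is no worse in both objectives and strictly better in at least one, and the Pareto set is the set of non-dominated mappings. This is a generic two-objective dominance test, not code from the exploration framework.

    ```c
    #include <stdbool.h>

    /* Evaluation of one candidate mapping; both objectives are minimized
     * (e.g. average communication latency and communication energy). */
    typedef struct {
        double latency;
        double energy;
    } mapping_eval_t;

    /* True if mapping a dominates mapping b in the Pareto sense. */
    bool dominates(const mapping_eval_t *a, const mapping_eval_t *b)
    {
        bool no_worse = (a->latency <= b->latency) && (a->energy <= b->energy);
        bool better   = (a->latency <  b->latency) || (a->energy <  b->energy);
        return no_worse && better;
    }
    ```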

  • Efficient search space exploration for HW-SW partitioning

    Page(s): 122 - 127

    Hardware/software (HW-SW) partitioning is a key problem in the codesign of embedded systems and has been studied extensively in the past. One major open challenge for traditional partitioning approaches, as we move to more complex and heterogeneous SoCs, is the lack of efficient exploration of the large space of possible HW/SW configurations, coupled with the inability to scale efficiently to larger problem sizes. We make two contributions for HW-SW partitioning of applications represented as procedural call graphs: 1) we prove that during partitioning, the execution-time metric for moving a vertex needs to be updated only for the immediate neighbours of the vertex, rather than for all ancestors along paths to the root vertex; consequently, we observe faster run-times for move-based partitioning algorithms such as simulated annealing (SA), allowing call graphs with thousands of vertices to be processed in less than a second, and 2) we devise a new cost function for SA that allows frequent discovery of better partitioning solutions by searching spaces overlooked by traditional SA cost functions. We present experimental results on a very large design space, where several thousand configurations are explored in minutes, compared with several hours or days using a traditional SA formulation. Furthermore, our approach frequently locates better design points, with over 10% improvement in application execution time compared to the solutions generated by a Kernighan-Lin partitioning algorithm starting from an all-SW partitioning.
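    The skeleton below shows the move-based simulated-annealing search that such partitioners are built around: flip one call-graph vertex between HW and SW, evaluate a cost function, and accept or reject the move with the usual annealing rule. The cost function here is a trivial placeholder; the paper's actual contributions (the neighbour-only timing update and the new cost function) are not reproduced.

    ```c
    #include <stdlib.h>
    #include <math.h>

    #define NVERT 1000

    /* Placeholder cost: pretend HW vertices cost one unit and SW vertices two.
     * A real partitioner would evaluate execution time and HW area properly. */
    static double cost(const int assign[NVERT])
    {
        double c = 0.0;
        for (int v = 0; v < NVERT; v++)
            c += assign[v] ? 1.0 : 2.0;
        return c;
    }

    /* Generic move-based simulated annealing over a HW(1)/SW(0) assignment. */
    void anneal(int assign[NVERT], double temp, double t_min, double alpha)
    {
        double cur = cost(assign);
        while (temp > t_min) {
            int v = rand() % NVERT;      /* propose: flip one vertex           */
            assign[v] ^= 1;
            double cand  = cost(assign);
            double delta = cand - cur;
            if (delta <= 0.0 || exp(-delta / temp) > (double)rand() / RAND_MAX)
                cur = cand;              /* accept the move                    */
            else
                assign[v] ^= 1;          /* reject: undo the flip              */
            temp *= alpha;               /* cool down                          */
        }
    }
    ```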

  • Analyzing heap error behavior in embedded JVM environments

    Page(s): 230 - 235

    Recent studies have shown that transient hardware errors caused by external factors such as alpha particles and cosmic ray strikes can be responsible for a large percentage of system downtime. Denser processing technologies, increasing clock speeds, and the low supply voltages used in embedded systems can worsen this problem. In many embedded environments, one may not want to provision extensive error protection in hardware because of (i) form-factor or power-consumption limitations, and/or (ii) the need to keep costs low. Also, the mismatch between the hardware protection granularity and the field access granularity can lead to false alarms and error cancellations. Consequently, software-based approaches to identify and possibly rectify these errors seem promising. Towards this goal, this work specifically aims to enhance the software's ability to detect heap memory errors in a Java-based embedded system. Using several embedded Java applications, we first study the tradeoffs between reliability, performance, and memory space overhead for two schemes that perform error checks at object and field granularities. We also study the impact of object characteristics (e.g., lifetime, reuse intervals, access frequency) on error propagation. Considering the pros and cons of these two schemes, we then investigate two hybrid strategies that attempt to strike a balance between memory space and performance overheads and reliability. Our experimental results clearly show that the granularity of error protection and its frequency can significantly impact static/dynamic overheads and error detection ability.

  • Efficient exploration of on-chip bus architectures and memory allocation

    Page(s): 248 - 253

    Separating computation from communication in system design allows the system designer to explore the communication architecture independently of component selection and mapping. We present an iterative two-step exploration methodology for bus-based on-chip communication architectures and memory allocation, assuming that memory traces from the processing elements are given from the mapping stage. The proposed method uses a static performance estimation technique to reduce the large design space drastically and quickly, and then applies a trace-driven simulation technique to the reduced set of design candidates for accurate performance estimation. Since local memory traffic as well as shared memory traffic is involved in bus contention, memory allocation is treated as an important axis of the design space in our technique. The viability and efficiency of the proposed methodology are validated with two real-life examples: a 4-channel digital video recorder (DVR) and an equalizer for an OFDM DVB-T receiver.

  • A timing-accurate HW/SW cosimulation of an ISS with SystemC

    Page(s): 152 - 157

    The paper presents a system-level co-simulation methodology for modeling, validating, and analyzing the performance of embedded systems. The proposed solution relies on the integration of an instruction set simulator (ISS) with the SystemC simulation kernel. In this way, the ISS abstracts the model of the real programmable device on which the SW will run, while SystemC is used to model the HW components that interact with the SW. Correct validation of such an architecture is infeasible without taking timing information into account. Thus, the paper proposes an effective timing synchronization mechanism that uses timing information from an ISS (or a board) to synchronize the SystemC simulation.

  • Modeling operation and microarchitecture concurrency for communication architectures with application to retargetable simulation

    Page(s): 66 - 71

    In multiprocessor-based SoCs, optimizing the communication architecture is often as important as, if not more important than, optimizing the computation architecture. While there are mature platforms and techniques for modeling and evaluating computation architectures, the same is not true for communication architectures. A major challenge in modeling the communication architecture is managing concurrency at multiple levels: at the operation level, multiple communication operations may be active at any time; at the microarchitecture level, several microarchitectural components may be operating in parallel. Further, it is important to be able to specify clearly how operation-level concurrency maps to microarchitecture-level concurrency. This work presents a modeling methodology and a retargetable simulation framework that fill this gap. The framework facilitates design space exploration of the communication sub-system through a rigorous modeling approach based on a formal concurrency model, the operation state machine (OSM). We first introduce the basic notions and concepts of the OSM and show by example how this model can represent the inherent concurrency in the architecture and microarchitecture of processors. We then demonstrate the applicability of the OSM to modeling on-chip communication architectures (OCAs) by walking through a router-based packet-switching network example and a bus example. Because the OSM model is naturally suited to handling the operation- and microarchitecture-level concurrencies of OCAs as well, our OSM-based modeling methodology enables the entire system, including both the computation and communication architectures, to be modeled in a single OSM framework. This allows us to develop a tool set that can synthesize cycle-accurate system simulators for multi-PE SoC prototypes. To demonstrate the flexibility of this methodology, we choose two distinct system configurations with different types of OCA: a 4×4 mesh network of 16 PEs, and a cluster of 4 PEs connected by a bus. We show that, by simulation, critical system information such as timing and communication patterns can be obtained and evaluated. Consequently, system-level design choices regarding the communication architecture can be made with high confidence in early stages of design. In addition to improving design quality, this methodology also results in significantly shortened design time.

  • Memory system design space exploration for low-power, real-time speech recognition

    Page(s): 140 - 145

    The recent proliferation of computing technology has generated new interest in natural I/O interface technologies such as speech recognition. Unfortunately, the computational and memory demands of such applications currently prohibit their use on low-power portable devices in anything more than their simplest forms. Previous work has demonstrated that the thread-level concurrency inherent in this application domain can be used to dramatically improve performance with minimal impact on overall system energy consumption, but that such benefits are severely constrained by memory system bandwidth. This work presents a design space exploration of potential memory system architectures. A range of low-power memory organizations is considered, from conventional caching to more advanced system-on-chip implementations. We find that, given architectures able to exploit concurrency in this domain, large L2-based cache hierarchies and high-bandwidth memory systems employing data stream partitioning and on-chip embedded DRAM and ROM technologies can provide much of the performance of idealized memory systems without violating the power constraints of the low-power domain.

  • Benchmark-based design strategies for single chip heterogeneous multiprocessors

    Page(s): 54 - 59

    Single-chip heterogeneous multiprocessors are arising to meet the computational demands of portable and handheld devices. These computing systems are neither the fully custom designs traditionally targeted by the design automation (DA) community, the general-purpose designs traditionally targeted by the computer architecture (CA) community, nor the pure embedded designs traditionally targeted by the real-time (RT) community. An entirely new design philosophy will be needed for this hybrid class of computing. The programming of the device will be drawn from a narrower set of applications, with execution that persists in the system over a longer period of time than in general-purpose programming. But the devices will still be programmable, not only at the level of the individual processing element, but across multiple processing elements and even the entire chip. The design of other programmable single-chip computers has enjoyed an era in which design trade-offs could be captured in simulators such as SimpleScalar and performance could be evaluated against the SPEC benchmarks. Motivated by this, we describe new benchmark-based design strategies for single-chip heterogeneous multiprocessors, and we include an example and results.

  • System-on-chip validation using UML and CWL

    Page(s): 92 - 97

    A novel method for high-level specification and validation of SoC designs using UML is proposed. UML is introduced as a formal specification model for SoC design. The consistency and completeness of the specification are validated based on the formal UML model. The implementation is validated by systematically deriving test scenarios and specification-based coverage metrics from the UML model. The method has been applied to the design of a new media-processing chip for mobile devices. The application of the method shows that it is not only effective for finding logical errors in the implementation, but also eliminates errors due to inconsistency and incompleteness of the specification.

  • Power-performance trade-offs for reconfigurable computing

    Page(s): 116 - 121

    We explore the system-level power-performance trade-offs available when implementing streaming embedded applications on fine-grained reconfigurable architectures. We show that an efficient hardware-software partitioning algorithm is required when targeting low power; if the application objective is performance, however, we propose the use of dynamically reconfigurable architectures. This work presents a configuration-aware data-size partitioning approach, and we propose a design methodology that adapts the architecture and the algorithms used to the application requirements. The methodology has been demonstrated on a real research platform based on Xilinx devices. Finally, we have applied our methodology and algorithms to the case study of image sharpening, a common requirement in today's digital cameras and mobile phones.
