Computers, IEEE Transactions on

Issue 10 • Date Oct. 2005

  • [Front cover]

    Page(s): c1
    Freely Available from IEEE
  • [Inside front cover]

    Page(s): c2
    Freely Available from IEEE
  • Guest Editors' Introduction

    Page(s): 1185 - 1187
  • Supporting demanding hard-real-time systems with STI

    Page(s): 1188 - 1202

    Software thread integration (STI) is a compilation technique that enables the efficient use of an application's fine-grain idle time on generic processors without special hardware support. With STI, a primary function is automatically interleaved with a secondary function to create a single implicitly multithreaded function, which minimizes context switching and hence both improves performance and offers very fine-grain concurrency. In this work, we extend STI techniques to address two challenges. First, we reduce response time for interrupts or other high-priority threads by introducing polling servers into integrated threads. Second, we enable integration with long host threads, expanding the domain of STI. We derive methods to evaluate the response time for threads in systems with and without these new integration methods. We demonstrate these concepts with the integration of various threads in a sample hard-real-time system on a highly constrained microcontroller. We use an inexpensive 20 MHz AVR 8-bit microcontroller to generate monochrome NTSC video while servicing a high-speed (115.2 kbaud) serial communication link. We have built and tested this system, achieving graphics rendering speed-ups of 3.99× to 13.5×.
  • Frequent loop detection using efficient nonintrusive on-chip hardware

    Page(s): 1203 - 1215

    Dynamic software optimization methods are becoming increasingly popular for improving software performance and power. The first step in dynamic optimization consists of detecting frequently executed code, or "critical regions." Most previous critical region detectors have been targeted to desktop processors. We introduce a critical region detector targeted to embedded processors, with the unique features of being very size and power efficient and being completely nonintrusive to the software's execution, features needed in timing-sensitive embedded systems. Our detector not only finds the critical regions, but also determines their relative frequencies, a potentially important feature for selecting among alternative dynamic optimization methods. Our detector uses a tiny cache-like structure coupled with a small amount of logic. We provide results of extensive explorations across 19 embedded system benchmarks. We show that highly accurate results can be achieved with only a 0.02 percent power overhead, acceptable size overhead, and zero runtime overhead. Our detector is currently being used as part of a dynamic hardware/software partitioning approach, but is applicable to a wide variety of situations.
  • Code size reduction in heterogeneous-connectivity-based DSPs using instruction set extensions

    Page(s): 1216 - 1226

    Current processor trends show progress toward customizable and reconfigurable architectures. In this paper, we study the benefit of combining the architectural design of a VLIW DSP and the concepts of modern customizable processors like ASIPs (application specific instruction set processors) for code size reduction. VLIW DSP architectures exhibit heterogeneous connections between functional units and register files for speeding up special tasks. Such architectural characteristics can be effectively exploited through the use of complex instruction set extensions (ISEs). Although VLIWs are increasingly being used for DSP applications to achieve very high performance, such architectures are known to suffer from increased code size. This paper also addresses how to generate and use ISEs that can result in significant code size reduction in VLIW DSPs without degrading performance. Unfortunately, contemporary techniques for generation of ISEs, when applied before resource binding, fail to generate legal ISEs for VLIW architectures with heterogeneous connectivity between the functional units and register files. We propose a heuristic-based approach to generate ISEs for a generalized heterogeneous-connectivity-based VLIW DSP architecture. We achieve an average code size reduction of 25 percent on the MiBench suite with no penalty in performance by applying our ISE generation algorithms on the TI TMS320C6xx, a representative VLIW DSP. We also show that the overhead of the required architectural assists for our approach is minimal: The TMS320C6xx pipeline meets the required timing with only a limited overhead in area.
  • Distributed data cache designs for clustered VLIW processors

    Page(s): 1227 - 1241

    Wire delays are a major concern for current and forthcoming processors. One approach to deal with this problem is to divide the processor into semi-independent units referred to as clusters. A cluster usually consists of a local register file and a subset of the functional units, while the L1 data cache typically remains centralized in what we call partially distributed architectures. However, as technology evolves, the relative latency of such a centralized cache will increase, leading to a significant impact on performance. In this paper, we propose partitioning the L1 data cache among clusters for clustered VLIW processors. We refer to this kind of design as fully distributed processors. In particular, we propose and evaluate three different configurations: a snoop-based cache coherence scheme, a word-interleaved cache, and flexible L0-buffers managed by the compiler. For each alternative, instruction scheduling techniques targeted to cyclic code are developed. Results for the Mediabench suite show that the performance of such fully distributed architectures is always better than the performance of a partially distributed one with the same amount of resources. In addition, the key aspects of each fully distributed configuration are explored.
  • Lattice-based memory allocation

    Page(s): 1242 - 1257

    We investigate the problem of memory reuse in order to reduce the memory needed to store an array variable. We develop techniques that can lead to smaller memory requirements in the synthesis of dedicated processors or to more effective use by compiled code of software-controlled scratchpad memory. Memory reuse is well understood for allocating registers to hold scalar variables. Its extension to arrays has been studied recently for multimedia applications, for loop parallelization, and for circuit synthesis from recurrence equations. In all such studies, the introduction of modulo operations to an otherwise affine mapping (of loop or array indices to memory locations) achieves the desired reuse. We develop here a new mathematical framework, based on critical lattices, that subsumes the previous approaches and provides new insight. We first consider the set of indices that conflict, i.e., those that cannot be mapped to the same memory cell. Next, we construct the set of differences of conflicting indices. We establish a correspondence between a valid modular mapping and a strictly admissible integer lattice, that is, one having no nonzero element in common with the set of conflicting index differences. The memory required by an optimal modular mapping is equal to the determinant of the corresponding lattice. The memory reuse problem is thus reduced to the (still interesting and nontrivial) problem of finding a strictly admissible integer lattice of least determinant. We then propose and analyze several practical strategies for finding strictly admissible integer lattices, either optimal or optimal up to a multiplicative factor, and, hence, memory-saving modular mappings. We explain and analyze previous approaches in terms of our new framework.
  • Automated custom instruction generation for domain-specific processor acceleration

    Page(s): 1258 - 1270

    Application-specific extensions to the computational capabilities of a processor provide an efficient mechanism to meet the growing performance and power demands of embedded applications. Hardware, in the form of new function units (or coprocessors), and the corresponding instructions are added to a baseline processor to meet the critical computational demands of a target application. In this paper, the design of a system to automate the instruction set customization process is presented. A dataflow graph design space exploration engine efficiently identifies computation subgraphs to create custom hardware and a compiler subgraph matching framework seamlessly exploits this hardware. We demonstrate the effectiveness of this system across a range of application domains and study the applicability of the custom hardware across an entire application domain. Generalization techniques are presented which enable the application-specific hardware to be more effectively used across a domain.
  • Some optimizations of hardware multiplication by constant matrices

    Page(s): 1271 - 1282

    This paper presents some improvements on the optimization of hardware multiplication by constant matrices. We focus on the automatic generation of circuits that involve constant matrix multiplication, i.e., multiplication of a vector by a constant matrix. The proposed method, based on number recoding and dedicated common subexpression factorization algorithms, was implemented in a VHDL generator. Our algorithms and generator have been extended to the case of some digital filters based on multiplication by a constant matrix and delay operations. The circuits obtained for several applications have been implemented on FPGAs and compared to previous solutions. Up to 40 percent area and speed savings are achieved.
  • FIFO-based multicast scheduling algorithm for virtual output queued packet switches

    Page(s): 1283 - 1297

    Many networking/computing applications require high speed switching for multicast traffic at the switch/router level to save network bandwidth. However, existing queuing-based packet switches and scheduling algorithms cannot perform well under multicast traffic. While the speedup requirement makes the output queued switch difficult to scale, the single input queued switch suffers from head of line (HOL) blocking, which severely limits the network throughput. An efficient yet simple buffering strategy to remove the HOL blocking is to use the virtual output queued (VOQ) switch structure, which has been shown to perform well under unicast traffic. However, the traditional VOQ switch is impractical for multicast traffic because a VOQ switch for multicast traffic has to maintain an exponential number of queues in each input port (i.e., 2^N - 1 queues for a switch with N output ports). In this paper, we give a novel queue structure for the input buffers of a multicast VOQ switch by separately storing the address information and data information of a packet so that an input port only needs to manage a linear number (N) of queues. In conjunction with the multicast VOQ switch, we present a first-in-first-out based multicast scheduling algorithm, FIFO multicast scheduling (FIFOMS), and conduct extensive simulations to compare FIFOMS with other popular scheduling algorithms. Our results fully demonstrate the superiority of FIFOMS in both multicast latency and queue space requirement.
  • Concurrent detection of control flow errors by hybrid signature monitoring

    Page(s): 1298 - 1313

    In this paper, we present a new concurrent error-detection scheme that uses hybrid signatures for the online detection of program memory and control flow errors caused by transient and intermittent faults. The proposed hybrid signature-monitoring technique combines the vertical signature scheme with the horizontal signature scheme. We first develop a new vertical signature based on a linear additive code whose signature length can be easily adjusted. The adjustable length of the vertical signature makes it feasible to integrate the vertical signature, horizontal signature, and block length into a single signature word. The horizontal signature mechanism can compensate for the coverage degradation due to the reduction of vertical signature length and significantly decrease the error-detection latency as well. Extensive block-based bit-error simulation and a hardware-based simulated fault injection experiment are conducted to validate the effectiveness of the proposed technique. Compared to the continuous signature monitoring (CSM) scheme, our work accomplishes several notable enhancements. First, the fault model used in our work is more realistic than the model employed in CSM. Second, hardware-based experiments are performed so as to measure the design parameters more accurately. Finally, our scheme does not require program memory to be equipped with an SEC-DED code in order to obtain the horizontal signatures if instruction bit correction is not an essential demand; as a result, our scheme is more flexible than CSM.
  • Systems support for preemptive disk scheduling

    Page(s): 1314 - 1326

    Allowing higher-priority requests to preempt ongoing disk IOs is of particular benefit to delay-sensitive and real-time systems. In this paper, we present semi-preemptible IO, which divides disk IO requests into small temporal units of disk commands to improve the preemptibility of disk access. We first lay out the main design strategies that allow preemption of each component of a disk access (seek, rotation, and data transfer), namely, seek-splitting, JIT-seek, and chunking. We then present the preemption mechanisms for single and multidisk systems: JIT-preemption and JIT-migration. The evaluation of our prototype system showed that semi-preemptible IO substantially improved the preemptibility of disk access with little loss in disk throughput and that preemptive disk scheduling could improve the response time for high-priority interactive requests.
  • [Advertisement]

    Page(s): 1327
    Freely Available from IEEE
  • [Advertisement]

    Page(s): 1328
    Freely Available from IEEE
  • TC Information for authors

    Page(s): c3
    Freely Available from IEEE
  • [Back cover]

    Page(s): c4
    Freely Available from IEEE

Aims & Scope

The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field.

Meet Our Editors

Editor-in-Chief
Albert Y. Zomaya
School of Information Technologies
Building J12
The University of Sydney
Sydney, NSW 2006, Australia
http://www.cs.usyd.edu.au/~zomaya
albert.zomaya@sydney.edu.au