Proceedings of the 32nd International Symposium on Computer Architecture (ISCA '05)

Date: 4-8 June 2005


Results 1-25 of 54
  • Proceedings. 32nd International Symposium on Computer Architecture

    Publication Year: 2005
  • 32nd International Symposium on Computer Architecture - Title Page

    Publication Year: 2005 , Page(s): i - iii
  • 32nd International Symposium on Computer Architecture - Copyright Page

    Publication Year: 2005 , Page(s): iv
  • 32nd International Symposium on Computer Architecture - Table of contents

    Publication Year: 2005 , Page(s): v - viii
  • General Chair's message

    Publication Year: 2005 , Page(s): ix
  • Program Chairs' message

    Publication Year: 2005 , Page(s): x - xv
  • Committees

    Publication Year: 2005 , Page(s): xvi
  • List of reviewers

    Publication Year: 2005 , Page(s): xvii - xviii
  • Architecture for protecting critical secrets in microprocessors

    Publication Year: 2005 , Page(s): 2 - 13
    Cited by:  Papers (21)  |  Patents (1)

    We propose "secret-protected (SP)" architecture to enable secure and convenient protection of critical secrets for a given user in an on-line environment. Keys are examples of critical secrets, and key protection and management is a fundamental problem - often assumed but not solved nderlying the use of cryptographic protection of sensitive files, messages, data and programs. SP-processors contain a minimalist set of architectural features that can be built into a general-purpose microprocessor to provide protection of critical secrets and their computations, without expensive or inconvenient auxiliary hardware. SP-architecture also requires a trusted software module, a few modifications to the operating system, a secure I/O path to the user, and a secure installation process. Unique aspects of our architecture include: decoupling of user secrets from the devices, enabling users to securely access their keys from different networked computing devices; the use of symmetric master keys rather than more costly public-private key pairs; and the avoidance of any permanent or factory-installed device secrets. View full abstract»

  • High efficiency counter mode security architecture via prediction and precomputation

    Publication Year: 2005 , Page(s): 14 - 24
    Cited by:  Papers (2)  |  Patents (5)

    Encrypting data in unprotected memory has gained much interest lately for digital rights protection and security reasons. Counter mode is a well-known encryption scheme: a symmetric-key scheme based on any block cipher, e.g. AES. The scheme's encryption algorithm uses a block cipher, a secret key and a counter (or a sequence number) to generate an encryption pad which is XORed with the data stored in memory. Like other memory encryption schemes, this method suffers from the inherent latency of decrypting encrypted data when loading it into the on-chip cache. In this paper, we present a novel technique to hide the latency overhead of decrypting counter-mode encrypted memory by predicting the sequence number and pre-computing the encryption pad, which we call a one-time pad or OTP. In contrast to prior techniques of sequence number caching, our mechanism solves the latency issue by using idle decryption engine cycles to speculatively predict and pre-compute OTPs before the corresponding sequence number is loaded. This technique incurs very little area overhead. In addition, a novel adaptive OTP prediction technique is presented to further improve our regular OTP prediction and precomputation mechanism. This adaptive scheme is able to predict encryption pads associated not only with static and infrequently updated cache lines but also with frequently updated ones. Experimental results using the SPEC2000 benchmarks show an 82% prediction rate. Moreover, we explore several optimization techniques for improving the prediction accuracy. Two specific techniques, two-level prediction and context-based prediction, are presented and evaluated.
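
    The pad-ahead-of-data idea is easy to see in a few lines. The sketch below uses a hash as a stand-in for the block cipher (the paper assumes a real cipher such as AES) and a trivially correct "prediction"; all names and sizes are invented for illustration.

    ```python
    # Sketch of counter-mode memory encryption with pad precomputation.
    import hashlib

    def pad(key: bytes, addr: int, ctr: int) -> bytes:
        # encryption pad ("OTP") = PRF_K(address, counter); AES in the real scheme
        msg = key + addr.to_bytes(8, "big") + ctr.to_bytes(8, "big")
        return hashlib.sha256(msg).digest()[:16]

    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    key, addr, ctr = b"\x01" * 16, 0x1000, 7
    cipher = xor(b"sixteen byte blk", pad(key, addr, ctr))   # written to DRAM

    # On a miss, the pad can be computed from a *predicted* counter while the
    # ciphertext is still in flight; if the guess is right, decryption is one XOR.
    predicted_pad = pad(key, addr, 7)             # prediction happened to be correct
    assert xor(cipher, predicted_pad) == b"sixteen byte blk"
    ```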

  • Design and implementation of the AEGIS single-chip secure processor using physical random functions

    Publication Year: 2005 , Page(s): 25 - 36
    Cited by:  Papers (17)  |  Patents (12)

    Secure processors enable new applications by ensuring private and authentic program execution even in the face of physical attack. In this paper, we present the AEGIS secure processor architecture and evaluate its RTL implementation on FPGAs. By using physical random functions, we propose a new way of reliably protecting and sharing secrets that is more secure than existing solutions based on non-volatile memory. Our architecture gives applications the flexibility of trusting and protecting only a portion of a given process, unlike prior proposals which require a process to be protected in its entirety. We also put forward a specific model of how secure applications can be programmed in a high-level language and compiled to run on our system. Finally, we evaluate a fully functional FPGA implementation of our processor, assess the implementation tradeoffs, compare performance, and demonstrate the benefits of partially protecting a program.
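
    A "physical random function" (PUF) can be modeled in a few lines under the standard additive-delay abstraction of an arbiter PUF. This toy model only conveys the idea of a device-specific secret that is never stored in non-volatile memory; the AEGIS design additionally needs error correction and careful challenge management.

    ```python
    # Toy arbiter-PUF model: per-chip random delays decide challenge responses.
    import random

    class ToyArbiterPUF:
        def __init__(self, stages=64, seed=None):
            rng = random.Random(seed)        # seed stands in for process variation
            self.delta = [rng.gauss(0, 1) for _ in range(stages)]

        def response(self, challenge):       # challenge: sequence of 0/1
            total, sign = 0.0, 1.0
            for c, d in zip(challenge, self.delta):
                if c:
                    sign = -sign             # a set bit swaps the racing paths
                total += sign * d
            return 1 if total > 0 else 0     # arbiter: which path won the race

    chip_a, chip_b = ToyArbiterPUF(seed=1), ToyArbiterPUF(seed=2)
    challenge = random.Random(7).choices([0, 1], k=64)
    print(chip_a.response(challenge), chip_b.response(challenge))  # device-specific
    ```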

  • Disk drive roadmap from the thermal perspective: a case for dynamic thermal management

    Publication Year: 2005 , Page(s): 38 - 49
    Cited by:  Papers (15)  |  Patents (37)

    The importance of pushing the performance envelope of disk drives continues to grow, not just in the server market but also in numerous consumer electronics products. One of the most fundamental factors in disk drive design is heat dissipation and its effect on drive reliability, since high temperatures can cause off-track errors or even head crashes. Until now, drive manufacturers have met the 40% annual growth target for internal data rate (IDR) by increasing RPMs and shrinking platter sizes, both of which have counteracting effects on the heat dissipation within a drive. As this paper shows, we are getting to a point where it is becoming very difficult to stay on this roadmap. This paper presents an integrated disk drive model that captures the close relationships between capacity, performance and thermal characteristics over time. Using this model, we quantify the drop-off in IDR growth rates over the next decade if we are to adhere to the thermal envelope of drive design. We present two mechanisms for buying back some of this IDR loss with dynamic thermal management (DTM). The first DTM technique exploits any available thermal slack, between what the drive was designed to support and the currently lower operating temperature, to ramp up the RPM. The second DTM technique assumes that the drive is designed only for average-case behavior, thus allowing higher RPMs than the thermal envelope, and employs dynamic throttling of disk drive activities to remain within this envelope.
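
    The second DTM policy reduces to a simple control loop: run above the worst-case thermal design point and throttle activity whenever the modeled temperature nears the envelope. The sketch below is a toy thermal model; every constant is invented for illustration.

    ```python
    # Sketch of throttling-based dynamic thermal management for a disk drive.
    T_ENV, T_AMB = 45.0, 28.0            # deg C: thermal envelope, ambient
    HEAT_PER_IO, COOL_RATE = 0.4, 0.02   # made-up heating/cooling coefficients

    temp, deferred = T_AMB, 0
    for _ in range(300):                 # fixed-length simulation, 1 tick = 1 s
        offered = 120                    # I/O requests arriving this tick
        ios = offered if temp < T_ENV - 0.5 else offered // 2  # throttle near limit
        deferred += offered - ios
        temp += HEAT_PER_IO * ios / 100.0        # heating from seek/rotation activity
        temp -= COOL_RATE * (temp - T_AMB)       # Newtonian cooling toward ambient
    print(f"final temp {temp:.1f} C, {deferred} I/Os deferred")
    ```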

  • Direct cache access for high bandwidth network I/O

    Publication Year: 2005 , Page(s): 50 - 59
    Cited by:  Papers (14)  |  Patents (6)

    Recent I/O technologies such as PCI-Express and 10 Gb Ethernet enable unprecedented levels of I/O bandwidth in mainstream platforms. However, in traditional architectures, memory latency alone can prevent processors from matching 10 Gb inbound network I/O traffic. We propose a platform-wide method called direct cache access (DCA) to deliver inbound I/O data directly into processor caches. We demonstrate that DCA provides a significant reduction in memory latency and memory bandwidth for receive-intensive network I/O applications. Analysis of benchmarks such as SPECweb99, TPC-W and TPC-C shows that the overall benefit depends on the relative volume of I/O to memory traffic as well as the spatial and temporal relationship between processor and I/O memory accesses. A system-level perspective for the efficient implementation of DCA is presented.

  • Deconstructing commodity storage clusters

    Publication Year: 2005 , Page(s): 60 - 71
    Cited by:  Papers (5)  |  Patents (1)

    The traditional approach for characterizing complex systems is to run standard workloads and measure the resulting performance as seen by the end user. However, unique opportunities exist when characterizing a system that is itself constructed from standardized components: one can also look inside the system by instrumenting each of the components. In this paper, we show how intra-box instrumentation can help one understand the behavior of a large-scale storage cluster, the EMC Centera. In our analysis, we leverage standard tools for tracing both the disk and network traffic emanating from each node of the cluster. By correlating this traffic with the running workload, we are able to infer the structure of the software system (e.g., its write update protocol) as well as its policies (e.g., how it performs caching, replication, and load-balancing). Further, by imposing variable intra-box delays on network and disk traffic, we can confirm the causal relationships between network and disk events. Thus, we are able to infer the semantics of the messages between nodes without examining a single line of source code.
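
    The trace-correlation technique can be miniaturized: pair each network receive with the disk writes that follow it within a short window and read structure off the counts. The sketch below invents timestamps and event shapes purely for illustration.

    ```python
    # Sketch of inferring policy (e.g., replication factor) from correlated traces.
    net = [(0.010, "recv", "objA"), (0.200, "recv", "objB")]
    disk = [(0.013, "write", "objA"), (0.014, "write", "objA"),
            (0.205, "write", "objB")]

    WINDOW = 0.05   # seconds: assumed causal window between receive and writes

    def correlate(net, disk, window=WINDOW):
        inferred = {}
        for t_n, _, obj in net:
            follows = [e for e in disk if e[2] == obj and 0 <= e[0] - t_n <= window]
            inferred[obj] = len(follows)   # write fan-out observed per object
        return inferred

    print(correlate(net, disk))   # {'objA': 2, 'objB': 1} -> objA looks replicated
    ```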

  • A robust main-memory compression scheme

    Publication Year: 2005 , Page(s): 74 - 85
    Cited by:  Papers (16)  |  Patents (1)

    Lossless data compression techniques can potentially free up more than 50% of memory resources. However, previously proposed schemes suffer from high access costs. The proposed main-memory compression scheme practically eliminates the performance losses of previous schemes by exploiting a simple yet effective compression scheme, a highly efficient structure for locating a compressed block in memory, and a hierarchical memory layout that allows the compressibility of blocks to vary with low fragmentation overhead. We have evaluated an embodiment of the proposed scheme in detail using 14 integer and floating-point applications from the SPEC2000 suite along with two server applications, and we show that the scheme robustly frees up 30% of memory resources, on average, with a negligible average performance impact of only 0.2%.
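
    The "structure for locating a compressed block" can be sketched with size classes: store each block in one of a few fixed sizes and record the class in a small table, so one lookup finds and sizes any block. The compressor and constants below are stand-ins, not the paper's scheme.

    ```python
    # Sketch of size-class placement for a compressed main memory.
    import zlib                            # stand-in compressor

    BLOCK = 64
    SIZE_CLASSES = [16, 32, BLOCK]         # last class stores the block raw

    class CompressedMemory:
        def __init__(self):
            self.cls, self.data = {}, {}   # addr -> size class / stored bytes

        def store(self, addr, block):
            comp = zlib.compress(block)
            c = next((c for c in SIZE_CLASSES[:-1] if len(comp) <= c), BLOCK)
            self.cls[addr] = c             # the small locating structure
            self.data[addr] = comp if c < BLOCK else block

        def load(self, addr):
            c = self.cls[addr]             # one table lookup locates the block
            raw = self.data[addr]
            return zlib.decompress(raw) if c < BLOCK else raw

    mem = CompressedMemory()
    mem.store(0x00, bytes(64))             # highly compressible -> 16-byte class
    mem.store(0x40, bytes(range(64)))      # incompressible -> raw 64-byte class
    assert mem.load(0x00) == bytes(64) and mem.load(0x40) == bytes(range(64))
    ```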

  • Continuous optimization

    Publication Year: 2005 , Page(s): 86 - 97
    Cited by:  Papers (8)  |  Patents (1)

    This paper presents a hardware-based dynamic optimizer that continuously optimizes an application's instruction stream. In continuous optimization, dataflow optimizations are performed using simple, table-based hardware placed in the rename stage of the processor pipeline. The continuous optimizer reduces dataflow height by performing constant propagation, reassociation, redundant load elimination, store forwarding, and silent store removal. To enhance the impact of the optimizations, the optimizer integrates values generated by the execution units back into the optimization process. Continuous optimization allows instructions with input values known at optimization time to be executed in the optimizer, leaving less work for the out-of-order portion of the pipeline. Continuous optimization can also detect branch mispredictions earlier and thus reduce the misprediction penalty. In this paper, we present a detailed description of a hardware optimizer and evaluate it in the context of a contemporary microarchitecture running current workloads. Our analysis of SPECint, SPECfp, and MediaBench workloads reveals that a hardware optimizer can directly execute 33% of instructions, resolve 29% of mispredicted branches, and generate addresses for 76% of memory operations. These positive effects combine to provide speedups in the range of 0.99 to 1.27.
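
    The table-based mechanism can be caricatured in software: track which registers currently hold known values and resolve simple ALU operations during renaming. The instruction format below is invented, and real hardware must still deliver folded values to consumers.

    ```python
    # Sketch of constant propagation/folding at the rename stage.
    known = {}                              # architectural reg -> known constant

    def rename_step(instr):
        op, dst, *src = instr
        if op == "li":                      # load-immediate: value becomes known
            known[dst] = src[0]
            return "resolved"
        if op == "add" and all(s in known for s in src):
            known[dst] = known[src[0]] + known[src[1]]
            return "resolved"               # folded at rename; skips the OoO core
        known.pop(dst, None)                # result no longer statically known
        return "issued"

    prog = [("li", "r1", 4), ("li", "r2", 6),
            ("add", "r3", "r1", "r2"),      # folded: r3 = 10
            ("add", "r4", "r3", "r9")]      # r9 unknown -> must really issue
    print([rename_step(i) for i in prog], known["r3"])
    ```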

  • RENO: a rename-based instruction optimizer

    Publication Year: 2005 , Page(s): 98 - 109
    Cited by:  Papers (8)  |  Patents (2)

    RENO is a modified MIPS R10000 register renamer that uses map-table "short-circuiting" to implement dynamic versions of several well-known static optimizations: move elimination, common subexpression elimination, register allocation, and constant folding. Because it implements these optimizations dynamically, RENO can apply optimizations in certain situations where static compilers cannot. Cycle-level simulation shows that RENO dynamically eliminates (i.e. optimizes away) 22% of the dynamic instructions in both SPECint2000 and MediaBench. RENO-CF, the constant-folding component, is responsible for 12% and 17% of the eliminations, respectively. Because dataflow dependences are collapsed around eliminated instructions, performance improves by 8% and 13%, respectively. Alternatively, because eliminated instructions do not consume issue queue entries, physical registers, or issue, bypass, register file, and execution bandwidth, RENO can be used to absorb the performance impact of a significantly scaled-down execution core.
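
    Map-table short-circuiting itself fits in a few lines: renaming a move copies the source's physical mapping instead of allocating a new register. The toy renamer below omits the reference counting a real design needs to free shared physical registers.

    ```python
    # Sketch of move elimination via map-table "short-circuiting".
    free_pregs = [f"p{i}" for i in range(8)]
    map_table = {"r1": "p100", "r2": "p101"}

    def rename(instr):
        op, dst, src = instr
        if op == "mov":
            map_table[dst] = map_table[src]   # dst aliases src's physical register
            return "eliminated"               # no preg, no issue slot consumed
        map_table[dst] = free_pregs.pop(0)    # ordinary op allocates a new preg
        return f"issued into {map_table[dst]}"

    print(rename(("mov", "r3", "r1")))        # eliminated
    print(map_table["r3"] == map_table["r1"]) # True: dependence collapsed
    print(rename(("add", "r4", "r3")))        # issued into p0
    ```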

  • A high throughput string matching architecture for intrusion detection and prevention

    Publication Year: 2005 , Page(s): 112 - 122
    Cited by:  Papers (41)  |  Patents (27)

    Network intrusion detection and prevention systems have emerged as one of the most effective ways of providing security to those connected to the network, and at the heart of almost every modern intrusion detection system is a string matching algorithm. String matching is one of the most critical elements because it allows the system to make decisions based not just on the headers, but on the actual content flowing through the network. Unfortunately, checking every byte of every packet against a set of ten thousand strings becomes a computationally intensive task as network speeds grow into the tens, and eventually hundreds, of gigabits/second. To keep up with these speeds a specialized device is required, one that can maintain tight bounds on worst-case performance, that can be updated with new rules without interrupting operation, and that is efficient enough to be included on-chip with existing network chips or even in wireless devices. We have developed an approach that relies on a special-purpose architecture executing novel string matching algorithms specially optimized for implementation in our design. We show how the problem can be solved by converting the large database of strings into many tiny state machines, each of which searches for a portion of the rules and a portion of the bits of each rule. Through the careful co-design and optimization of our architecture with a new string matching algorithm, we show that it is possible to build a system that is 10 times more efficient than the best currently known approaches.
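
    In spirit, the "many tiny state machines" decomposition looks like this: split the rule database into groups, build a small automaton per group, and scan the stream with all of them in parallel (modeled sequentially below). The paper's bit-split machines go further by also splitting each byte; the patterns here are invented.

    ```python
    # Sketch: grouped Aho-Corasick automata standing in for tiny state machines.
    from collections import deque

    def build(patterns):
        goto, fail, out = [{}], [0], [set()]
        for p in patterns:
            s = 0
            for ch in p:
                if ch not in goto[s]:
                    goto.append({}); fail.append(0); out.append(set())
                    goto[s][ch] = len(goto) - 1
                s = goto[s][ch]
            out[s].add(p)
        q = deque(goto[0].values())          # BFS fills failure links
        while q:
            s = q.popleft()
            for ch, t in goto[s].items():
                q.append(t)
                f = fail[s]
                while f and ch not in goto[f]:
                    f = fail[f]
                fail[t] = goto[f].get(ch, 0)
                out[t] |= out[fail[t]]
        return goto, fail, out

    def scan(machines, data):
        states, hits = [0] * len(machines), []
        for i, b in enumerate(data):
            for m, (goto, fail, out) in enumerate(machines):
                s = states[m]
                while s and b not in goto[s]:
                    s = fail[s]
                states[m] = s = goto[s].get(b, 0)
                hits += [(i, p) for p in out[s]]
        return hits

    groups = [["evil", "evade"], ["attack", "tack"]]   # database split in two
    machines = [build(g) for g in groups]
    print(scan(machines, "an evil attack"))   # matches end at offsets 6 and 13
    ```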

  • A tree based router search engine architecture with single port memories

    Publication Year: 2005 , Page(s): 123 - 133
    Cited by:  Papers (20)  |  Patents (2)

    Pipelined forwarding engines are used in core routers to meet speed demands. Tree-based searches are pipelined across a number of stages to achieve high throughput, but this results in unevenly distributed memory. To address this imbalance, conventional approaches use either complex dynamic memory allocation schemes or over-provision each of the pipeline stages. This paper describes the microarchitecture of a novel network search processor which provides both high execution throughput and balanced memory distribution by dividing the tree into subtrees and allocating each subtree separately, allowing searches to begin at any pipeline stage. The architecture is validated by implementing and simulating state-of-the-art solutions for IPv4 lookup, VPN forwarding and packet classification. The new pipeline scheme and memory allocator can provide searches with a memory allocation efficiency that is within 1% of non-pipelined schemes.
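
    Reduced to its simplest form, pipelining a tree search means giving each trie level its own stage memory and visiting one stage per cycle. The toy longest-prefix match below illustrates only that structure; the paper's contribution is mapping whole subtrees, not levels, to stages to balance memory.

    ```python
    # Sketch of level-per-stage pipelined longest-prefix match (toy table).
    PREFIXES = {"0": "A", "01": "B", "0110": "C", "1": "D"}

    def lookup(addr_bits, depth=5):
        best = None
        for level in range(depth):          # one pipeline stage per trie level
            p = addr_bits[:level + 1]
            if p in PREFIXES:               # probe of this stage's memory
                best = PREFIXES[p]          # remember longest match so far
        return best

    print(lookup("01101"))                  # -> "C" (longest matching prefix 0110)
    ```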

  • An integrated memory array processor architecture for embedded image recognition systems

    Publication Year: 2005 , Page(s): 134 - 145
    Cited by:  Papers (16)  |  Patents (3)

    Embedded processors for video image recognition must address both the trade-off between cost (die size and power) and real-time performance, and the need for high flexibility, given the immense diversity of recognition targets, situations, and applications. This paper describes IMAP, a highly parallel SIMD linear processor and memory array architecture that addresses these competing requirements. By using parallel and systolic algorithmic techniques, IMAP, despite its simple architecture, exploits not only the straightforward per-row data-level parallelism (DLP) of images, but also the inherent DLP of other memory access patterns frequently found in various image recognition tasks, using an explicitly parallel C language (IDC). We describe and evaluate IMAP-CE, the latest IMAP processor, which integrates 128 100 MHz 8-bit 4-way VLIW PEs, 128 2-KByte RAMs, and one 16-bit RISC control processor on a single chip. The PE instruction set is enhanced to support IDC code. IMAP-CE is evaluated mainly by comparing its performance running IDC code with that of a 2.4 GHz Intel P4 running optimized C code. Benchmark results show a speedup of up to 20 for image filter kernels and of 4 for a full image recognition application.
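
    The per-row DLP the abstract refers to looks like this: a linear array of PEs, one per image column, executes the same instruction on a whole row each step, with neighbor transfers supplying stencil inputs. The pure-Python stand-in below invents a small image and filter.

    ```python
    # Sketch of row-parallel SIMD execution on a linear PE array.
    W, H = 8, 4
    img = [[(x + y) % 256 for x in range(W)] for y in range(H)]

    def row_parallel_blur(img):
        out = [row[:] for row in img]
        for y in range(H):                  # rows stream through the array
            row = img[y]
            # all W "PEs" execute in lockstep; row[x-1]/row[x+1] model
            # left/right neighbor transfers on the linear interconnect
            out[y] = [(row[max(x - 1, 0)] + row[x] + row[min(x + 1, W - 1)]) // 3
                      for x in range(W)]
        return out

    print(row_parallel_blur(img)[0])
    ```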

  • Design and evaluation of hybrid fault-detection systems

    Publication Year: 2005 , Page(s): 148 - 159
    Cited by:  Papers (19)

    As chip densities and clock rates increase, processors are becoming more susceptible to transient faults that can affect program correctness. Up to now, system designers have primarily considered hardware-only and software-only fault-detection mechanisms to identify and mitigate the deleterious effects of transient faults. These two fault-detection systems, however, are extremes in the design space, representing sharp trade-offs between hardware cost, reliability, and performance. In this paper, we identify hybrid hardware/software fault-detection mechanisms as promising alternatives to hardware-only and software-only systems. These hybrid systems offer designers more options to fit their reliability needs within their hardware and performance budgets. We propose and evaluate CRAFT, a suite of three such hybrid techniques, to illustrate the potential of the hybrid approach. For fair, quantitative comparisons among hardware, software, and hybrid systems, we introduce a new metric, mean work to failure, which is able to compare systems for which machine instructions do not represent a constant unit of work. Additionally, we present a new simulation framework which rapidly assesses reliability and does not depend on manual identification of failure modes. Our evaluation illustrates that CRAFT, and hybrid techniques in general, offer attractive options in the fault-detection design space.
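
    The metric can be stated compactly. Reconstructed from the abstract's description (the paper's exact formulation may differ in detail), failures are normalized to useful work rather than wall-clock time:

    ```latex
    \[
    \mathrm{MWTF} \;=\; \frac{\text{amount of work completed}}{\text{number of errors encountered}}
    \]
    % so a technique that runs slower but masks more faults and one that runs
    % faster but detects less can be compared on equal footing, even when their
    % machine instructions represent different amounts of work.
    ```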

  • Rescue: a microarchitecture for testability and defect tolerance

    Publication Year: 2005 , Page(s): 160 - 171
    Cited by:  Papers (14)  |  Patents (3)

    Scaling feature size improves processor performance but increases each device's susceptibility to defects (i.e., hard errors). As a result, fabrication technology must improve significantly to maintain yields. Redundancy techniques in memory have been successful at improving yield in the presence of defects. Apart from core sparing, which disables faulty cores in a chip multiprocessor, little has been done to target the core logic. While previous work has proposed that either inherent or added redundancy in the core logic can be used to tolerate defects, the key issues of realistic testing and fault isolation have been ignored. This paper is the first to consider testability and fault isolation in designing modern high-performance, defect-tolerant microarchitectures. We define intra-cycle logic independence (ICI) as the condition needed for conventional scan test to isolate faults quickly to the microarchitectural-block granularity. We propose logic transformations to redesign the conventional superscalar microarchitecture to comply with ICI. We call our novel, testable, and defect-tolerant microarchitecture Rescue.

  • Opportunistic transient-fault detection

    Publication Year: 2005 , Page(s): 172 - 183
    Cited by:  Papers (25)

    CMOS scaling increases the susceptibility of microprocessors to transient faults. Most current proposals for transient-fault detection use full redundancy to achieve perfect coverage while incurring significant performance degradation. However, most commodity systems do not need or provide perfect coverage. A recent paper exploits this leniency to reduce the soft-error rate of the issue queue during L2 misses while incurring minimal performance degradation. Whereas that paper reduces soft-error rate without using any redundancy, we target better coverage while incurring similarly minimal performance degradation by opportunistically using redundancy. We propose two semi-complementary techniques, called partial explicit redundancy (PER) and implicit redundancy through reuse (IRTR), to explore the trade-off between soft-error rate and performance. PER opportunistically exploits low-ILP phases and L2 misses to introduce explicit redundancy with minimal performance degradation. Because PER covers the entire pipeline and exploits not only L2 misses but all low-ILP phases, PER achieves better coverage than the previous work. To achieve coverage in high-ILP phases as well, we propose implicit redundancy through reuse (IRTR). Previous work exploits the phenomenon of instruction reuse to avoid redundant execution while falling back on redundant execution when there is no reuse. IRTR takes reuse to the extreme of the performance-coverage trade-off and completely avoids explicit redundancy by exploiting reuse's implicit redundancy within the main thread for fault detection with virtually no performance degradation. Using simulations with SPEC2000, we show that PER and IRTR achieve a better trade-off between soft-error rate and performance degradation than previous schemes.

  • An evaluation framework and instruction set architecture for ion-trap based quantum micro-architectures

    Publication Year: 2005 , Page(s): 186 - 196
    Cited by:  Papers (10)

    The theoretical study of quantum computation has yielded efficient algorithms for some traditionally hard problems. Correspondingly, experimental work on the underlying physical implementation technology has progressed steadily. However, almost no work has yet been done that explores the architecture design space of large-scale quantum computing systems. In this paper, we present a set of tools that enable the quantitative evaluation of architectures for quantum computers. The infrastructure we created comprises a complete compilation and simulation system for computers containing thousands of quantum bits. We begin by compiling complete algorithms into a quantum instruction set. This ISA enables the simple manipulation of quantum state. Another tool we developed automatically transforms quantum software into an equivalent, fault-tolerant version required to operate on real quantum devices. Next, our infrastructure transforms the ISA into a set of low-level, microarchitecture-specific control operations. In the future, these operations can be used to directly control a quantum computer. For now, our simulation framework quickly uses them to determine the reliability of the application on the target microarchitecture. Finally, we propose a simple, regular architecture for ion-trap based quantum computers. Using our software infrastructure, we evaluate the design trade-offs of this microarchitecture.
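
    An ISA-level view of "simple manipulation of quantum state" can be sketched as a tiny interpreter: a handful of instructions applied to a state vector. The instruction names and encoding below are generic inventions, not the paper's actual ISA, and no fault-tolerance transformation is modeled.

    ```python
    # Minimal quantum-ISA interpreter: H, CNOT, then measure all qubits.
    import random
    from math import sqrt

    def run(n, program, seed=0):
        amp = [0j] * (1 << n)
        amp[0] = 1 + 0j                              # start in |00...0>
        for instr in program:
            if instr[0] == "H":                      # Hadamard on qubit q
                q = instr[1]
                for i in range(1 << n):
                    if not i & (1 << q):
                        a, b = amp[i], amp[i | (1 << q)]
                        amp[i] = (a + b) / sqrt(2)
                        amp[i | (1 << q)] = (a - b) / sqrt(2)
            elif instr[0] == "CNOT":                 # control c flips target t
                c, t = instr[1], instr[2]
                for i in range(1 << n):
                    if i & (1 << c) and not i & (1 << t):
                        amp[i], amp[i | (1 << t)] = amp[i | (1 << t)], amp[i]
        r, acc = random.Random(seed).random(), 0.0   # sample a measurement
        for i, a in enumerate(amp):
            acc += abs(a) ** 2
            if r < acc:
                return format(i, f"0{n}b")
        return format((1 << n) - 1, f"0{n}b")

    # Bell pair: H on qubit 0, then CNOT 0 -> 1; measuring yields 00 or 11.
    print(run(2, [("H", 0), ("CNOT", 0, 1)]))
    ```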

  • Energy optimization of subthreshold-voltage sensor network processors

    Publication Year: 2005 , Page(s): 197 - 207
    Cited by:  Papers (29)  |  Patents (2)

    Sensor network processors and their applications are a growing area of focus in computer system research and design. Inherent to this design space are a reduced processing performance requirement and extremely tight energy constraints: sensor network processors must execute low-performance tasks for long durations on small energy supplies. In this paper, we demonstrate that subthreshold-voltage circuit design (400 mV and below) lends itself well to the performance and energy demands of sensor network processors. Moreover, we show that the landscape for microarchitectural energy optimization changes dramatically in the subthreshold domain. The dominance of leakage power in the subthreshold regime demands architectures that (i) reduce overall area, (ii) increase the utility of transistors, and (iii) maintain acceptable CPI efficiency. We confirm these observations by performing SPICE-level analysis of 21 sensor network processor and memory architectures. Our best sensor platform, implemented in 130nm CMOS and operating at 235 mV, consumes only 1.38 pJ/instruction, nearly an order of magnitude less energy than previously published sensor network processor results. This design, accompanied by bulk-silicon solar cells for energy scavenging, has been manufactured by IBM and is currently being tested.
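
    The shift the abstract describes follows from standard CMOS energy relations (textbook formulas, not figures from the paper): lowering supply voltage shrinks dynamic energy quadratically, but in subthreshold operation frequency falls roughly exponentially, so leakage energy per instruction grows and the minimum-energy point rewards small, highly utilized designs with acceptable CPI.

    ```latex
    \[
    E_{\mathrm{inst}} \;=\; \underbrace{\alpha\, C\, V_{dd}^{2}}_{\text{dynamic}}
    \;+\; \underbrace{I_{\mathrm{leak}}\, V_{dd}\,\frac{\mathrm{CPI}}{f}}_{\text{leakage}},
    \qquad
    f \;\propto\; e^{\,(V_{dd}-V_{th})/(n\,kT/q)} \quad \text{(subthreshold)}
    \]
    ```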
