IEEE Transactions on Computers

Issue 2 • Feb. 1999

  • Guest Editors' Introduction: Cache Memory and Related Problems: Enhancing and Exploiting the Locality

    Publication Year: 1999, Page(s): 97 - 99
    Cited by: Papers (2)
  • Automatic compiler-inserted prefetching for pointer-based applications

    Publication Year: 1999, Page(s): 134 - 141
    Cited by: Papers (9)

    As the disparity between processor and memory speeds continues to grow, memory latency is becoming an increasingly important performance bottleneck. While software-controlled prefetching is an attractive technique for tolerating this latency, its success has been limited thus far to array-based numeric codes. In this paper, we expand the scope of automatic compiler-inserted prefetching to also include the recursive data structures commonly found in pointer-based applications. We propose three compiler-based prefetching schemes, and automate the most widely applicable scheme (greedy prefetching) in an optimizing research compiler. Our experimental results demonstrate that compiler-inserted prefetching can offer significant performance gains on both uniprocessors and large-scale shared-memory multiprocessors.
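    The flavor of the transformation is easy to show in C. The sketch below is hand-written output of the kind the compiler pass would generate, not the paper's implementation, and it assumes the GCC/Clang __builtin_prefetch intrinsic:

```c
#include <stddef.h>

/* Hypothetical binary-tree node; field names are illustrative. */
struct node {
    struct node *left, *right;
    int payload;
};

/* Greedy prefetching: as soon as a node is available, issue non-binding
 * prefetches for every successor pointer, so those loads overlap with the
 * work on the current node. */
int tree_sum(struct node *n)
{
    if (n == NULL)
        return 0;
    __builtin_prefetch(n->left);    /* safe even if the child is NULL */
    __builtin_prefetch(n->right);
    return n->payload + tree_sum(n->left) + tree_sum(n->right);
}
```

    Because both children are requested as soon as the node arrives, their load latency overlaps with the work on the current node; the paper's contribution is having the compiler find and insert such prefetches automatically.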

  • Improving cache locality by a combination of loop and data transformations

    Publication Year: 1999, Page(s): 159 - 167
    Cited by: Papers (31) | Patents (2)

    Exploiting locality of reference is key to realizing high levels of performance on modern processors. This paper describes a compiler algorithm for optimizing cache locality in scientific codes on uniprocessor and multiprocessor machines. A distinctive characteristic of our algorithm is that it considers loop and data layout transformations in a unified framework. Our approach is very effective at reducing cache misses and can optimize some nests for which optimization techniques based on loop transformations alone are not successful. An important special case is one in which data layouts of some arrays are fixed and cannot be changed. We show how our algorithm can accommodate this case and demonstrate how it can be used to optimize multiple loop nests. Experiments on several benchmarks show that the techniques presented in this paper result in substantial improvement in cache performance.
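    A hand-worked example of why the unified framework matters (this illustrates the kind of transformation the algorithm derives, not the algorithm itself): when one array must be walked by rows and another by columns, no loop transformation alone gives both arrays stride-1 access, but changing the second array's memory layout does:

```c
#define N 1024
static double a[N][N], b[N][N], bt[N][N];

/* b is walked by columns: stride-N accesses in a row-major layout, so each
 * touched cache line is not reused soon. No loop transformation fixes this
 * without breaking a's row-wise walk. */
double sum_products_naive(void)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j] * b[j][i];      /* stride-N access to b */
    return s;
}

/* Data-layout transformation: keep b transposed (bt[i][j] == b[j][i]).
 * Both arrays are now walked with stride 1, restoring spatial locality. */
double sum_products_transformed(void)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j] * bt[i][j];
    return s;
}
```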

  • An executable specification and verifier for relaxed memory order

    Publication Year: 1999, Page(s): 227 - 235
    Cited by: Papers (7)

    The Murφ description language and verification system for finite-state concurrent systems is applied to the problem of specifying a family of multiprocessor memory models described in the SPARC Version 9 architecture manual. The description language allows for a straightforward operational description of the memory model which can be used as a specification for programmers and machine architects. The automatic verifier can be used to generate all possible outcomes of small assembly language multiprocessor programs in a given memory model, which is very helpful for understanding the subtleties of the model. The verifier can also check the correctness of assembly language programs, including synchronization routines. This paper describes the memory models and their encoding in the Murφ description language. We describe how synchronization routines can be verified and how finite-state programs can be analyzed. We also present some interesting findings from the verification and the analysis.
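    The brute-force core of such a verifier, enumerating every interleaving of a small program and collecting the observable outcomes, fits in a short C sketch. This toy models only sequential consistency for the classic store-buffering litmus test; the paper's Murφ descriptions additionally model the store buffers and ordering rules of the relaxed SPARC models:

```c
#include <stdio.h>

/* Two-thread store-buffering litmus test:
 *   T0: x = 1; r0 = y;      T1: y = 1; r1 = x;
 * Enumerate every sequentially consistent interleaving and record the
 * observable (r0, r1) outcomes. Modeling a relaxed order such as SPARC RMO
 * would add explicit store buffers to the state; this sketch omits them. */

static int x, y, r0, r1;
static int seen[2][2];

static void run_op(int thread, int step)
{
    if (thread == 0) { if (step == 0) x = 1; else r0 = y; }
    else             { if (step == 0) y = 1; else r1 = x; }
}

static void interleave(int done0, int done1)
{
    if (done0 == 2 && done1 == 2) {
        seen[r0][r1] = 1;                  /* terminal state reached */
        return;
    }
    int sx = x, sy = y, s0 = r0, s1 = r1;  /* save state for backtracking */
    if (done0 < 2) {
        run_op(0, done0);
        interleave(done0 + 1, done1);
        x = sx; y = sy; r0 = s0; r1 = s1;
    }
    if (done1 < 2) {
        run_op(1, done1);
        interleave(done0, done1 + 1);
        x = sx; y = sy; r0 = s0; r1 = s1;
    }
}

int main(void)
{
    interleave(0, 0);
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            if (seen[i][j])
                printf("reachable: r0=%d r1=%d\n", i, j);
    return 0;
}
```

    Under sequential consistency the outcome (r0=0, r1=0) never appears; a relaxed model such as RMO permits it, and exposing such differences exhaustively is exactly what the verifier is for.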

  • Excel-NUMA: toward programmability, simplicity, and high performance

    Publication Year: 1999, Page(s): 256 - 264
    Cited by: Papers (1)

    While hardware-coherent scalable shared-memory multiprocessors are relatively easy to program, they still require substantial programming effort to deliver high performance. Specifically, to minimize remote accesses, data must be carefully laid out in memory for locality and application working sets carefully tuned for caches. It has been claimed that this programming effort is less necessary in hardware COMA machines like Flat-COMA, thanks to automatic line-based data migration. Unfortunately, Flat-COMA is complex to design. Consequently, we would like a machine as programmable as Flat-COMA, as simple as plain CC-NUMA, and that outperforms both. This paper presents our proposal: Excel-NUMA (EX-NUMA). The idea is to exploit the fact that, after a memory line is written and cached, the storage that kept the line in memory goes unused. We use that storage to temporarily hold remote data displaced from the local caches. This enables automatic data migration, as in Flat-COMA, enhancing programmability. The hardware required to manage the system is a simple, local module added to a CC-NUMA; the global cache coherence protocol is not changed. Simulations of SPLASH-2 applications show that EX-NUMA outperforms CC-NUMA and Flat-COMA in every single application and eliminates most of the conflict misses.
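    A heavily simplified sketch of the observation EX-NUMA exploits (all structures and names are invented for illustration, and the linear scans stand in for the module's real lookup hardware): once a local line's only valid copy is cached remotely, its backing storage is dead space that can hold displaced remote lines:

```c
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64
#define MEM_LINES  1024

enum home_state { HOME_VALID, HOME_DIRTY_REMOTE };

struct mem_line {
    enum home_state state;   /* state of the line this slot backs */
    int      holds_victim;   /* slot currently reused as victim storage? */
    uint64_t victim_tag;     /* address of the stashed remote line */
    uint8_t  data[LINE_BYTES];
};

static struct mem_line mem[MEM_LINES];

/* Called when a remote line is displaced from the local caches: stash it in
 * any local slot whose own line is dirty in a remote cache. */
void stash_victim(uint64_t addr, const uint8_t *line)
{
    for (int i = 0; i < MEM_LINES; i++)
        if (mem[i].state == HOME_DIRTY_REMOTE && !mem[i].holds_victim) {
            mem[i].holds_victim = 1;
            mem[i].victim_tag = addr;
            memcpy(mem[i].data, line, LINE_BYTES);
            return;
        }
    /* No dead slot free: fall back to a normal writeback or drop. */
}

/* Called on a local miss to a remote address; avoids a remote fetch when the
 * line is still stashed here. Returns 1 on a victim hit. */
int lookup_victim(uint64_t addr, uint8_t *line_out)
{
    for (int i = 0; i < MEM_LINES; i++)
        if (mem[i].holds_victim && mem[i].victim_tag == addr) {
            memcpy(line_out, mem[i].data, LINE_BYTES);
            mem[i].holds_victim = 0;
            return 1;
        }
    return 0;
}
```

    A later miss that hits in lookup_victim is satisfied from local memory instead of a remote node, which is the COMA-style migration benefit the paper measures.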

  • Effects of multithreading on cache performance

    Publication Year: 1999, Page(s): 176 - 184
    Cited by: Papers (5) | Patents (15)

    As the performance gap between processor and memory grows, memory latency becomes a major bottleneck in achieving high processor utilization. Multithreading has emerged as one of the most promising and exciting techniques used to tolerate memory latency by exploiting thread-level parallelism. The question, however, remains as to how effective multithreading is at tolerating memory latency. The performance of multithreading is not only affected by the overlapping of memory latency with useful computation, but also strongly depends on the cache behavior and the overhead of multithreading (e.g., thread management and context-switch costs). In particular, multithreading affects the behavior of caches, and, thus, the overall performance, in a nontrivial fashion. To study these issues, this paper presents the Multithreaded Virtual Processor (MVP) model. MVP integrates the multithreaded programming paradigm and a modern superscalar processor with support for fast context switching and thread scheduling. Our studies with MVP show that, in general, performance improvements come not only from tolerating memory latency but also from lower cache miss rates due to the exploitation of data locality. However, multithreading creates additional stress on the memory hierarchy, caused by interference among threads. Also, the dynamic behavior of multithreaded execution hinders instruction locality, resulting in a high number of misses in the L1 instruction cache.

  • Coherence controller architectures for scalable shared-memory multiprocessors

    Publication Year: 1999, Page(s): 245 - 255
    Cited by: Papers (1)

    Scalable distributed shared-memory architectures rely on coherence controllers on each processing node to synthesize cache-coherent shared memory across the entire machine. The coherence controllers execute coherence protocol handlers that may be hardwired in custom hardware or programmed in a protocol processor within each coherence controller. Although custom hardware runs faster, a protocol processor allows the coherence protocol to be tailored to specific application needs and may shorten hardware development time. Previous research shows minimal increase in application execution time due to protocol processors over custom hardware. With the advent of SMP nodes and faster processors and networks, the trade-off between custom hardware and protocol processors needs to be reexamined. This paper studies the performance of custom hardware and protocol-processor-based coherence controllers in SMP-node-based CC-NUMA systems on applications from the SPLASH-2 suite. Using realistic parameters and detailed models of state-of-the-art system components, it shows that the occupancy of coherence controllers can limit the performance of applications with high communication requirements, where the execution time using commodity protocol processors can be twice as long as using custom hardware. We also investigate the effect of varying several architectural parameters that influence the communication characteristics of the applications and the underlying system on coherence controller performance. We identify measures of applications' communication requirements and their impact on performance. We also study the potential of improving the performance of coherence controllers by separating or duplicating critical components.

  • The impact of exploiting instruction-level parallelism on shared-memory multiprocessors

    Publication Year: 1999, Page(s): 218 - 226
    Cited by: Papers (3)

    Current microprocessors incorporate techniques to aggressively exploit instruction-level parallelism (ILP). This paper evaluates the impact of such processors on the performance of shared-memory multiprocessors, both without and with the latency-hiding optimization of software prefetching. Our results show that, while ILP techniques substantially reduce CPU time in multiprocessors, they are less effective in removing memory stall time. Consequently, despite the inherent latency tolerance features of ILP processors, we find memory system performance to be a larger bottleneck and parallel efficiencies to be generally poorer in ILP-based multiprocessors than in previous generation multiprocessors. The main reasons for these deficiencies are insufficient opportunities in the applications to overlap multiple load misses and increased contention for resources in the system. We also find that software prefetching does not change the memory-bound nature of most of our applications on our ILP multiprocessor, mainly due to a large number of late prefetches and resource contention. Our results suggest the need for additional latency hiding or reducing techniques for ILP systems, such as software clustering of load misses and producer-initiated communication.

  • A quantitative analysis of the performance and scalability of distributed shared memory cache coherence protocols

    Publication Year: 1999, Page(s): 205 - 217
    Cited by: Papers (6) | Patents (1)

    Scalable cache coherence protocols have become the key technology for creating moderate to large-scale shared-memory multiprocessors. Although the performance of such multiprocessors depends critically on the performance of the cache coherence protocol, little comparative performance data is available. Existing commercial implementations use a variety of different protocols, including bit-vector/coarse-vector protocols, SCI-based protocols, and COMA protocols. Using the programmable protocol processor of the Stanford FLASH multiprocessor, we provide a detailed, implementation-oriented evaluation of four popular cache coherence protocols. In addition to measurements of the characteristics of protocol execution (e.g., memory overhead, protocol execution time, and message count) and of overall performance, we examine the effects of scaling the processor count from 1 to 128 processors. Surprisingly, the optimal protocol changes for different applications and can change with processor count even within the same application. These results help identify the strengths of specific protocols and illustrate the benefits of providing flexibility in the choice of cache coherence protocol.
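    The space/precision trade-off between the bit-vector and coarse-vector protocols compared here can be made concrete in a few lines (parameters are invented, the popcount builtin assumes GCC/Clang, and these are not FLASH's actual structures):

```c
#include <stdint.h>
#include <stdio.h>

#define NODES       128
#define COARSE_BITS 16
#define GROUP       (NODES / COARSE_BITS)      /* 8 nodes per coarse bit */

typedef struct { uint64_t bits[2]; } bitvec_dir;  /* one bit per node  */
typedef struct { uint16_t groups;  } coarse_dir;  /* one bit per group */

static void bv_add(bitvec_dir *d, int node) { d->bits[node / 64] |= 1ULL << (node % 64); }
static void cv_add(coarse_dir *d, int node) { d->groups |= (uint16_t)(1u << (node / GROUP)); }

/* Invalidations sent on a write: exact for the bit vector ... */
static int bv_invals(const bitvec_dir *d)
{
    return __builtin_popcountll(d->bits[0]) + __builtin_popcountll(d->bits[1]);
}

/* ... but every node of each marked group for the coarse vector. */
static int cv_invals(const coarse_dir *d)
{
    int n = 0;
    for (int g = 0; g < COARSE_BITS; g++)
        if (d->groups & (1u << g)) n += GROUP;
    return n;
}

int main(void)
{
    bitvec_dir bv = {{0, 0}};
    coarse_dir cv = {0};
    bv_add(&bv, 3);  cv_add(&cv, 3);               /* a single sharer */
    printf("bit-vector: %d invalidation(s), coarse-vector: %d\n",
           bv_invals(&bv), cv_invals(&cv));        /* prints 1 vs 8 */
    return 0;
}
```

    The coarse vector needs 16 directory bits instead of 128, but a write to a line with one sharer still invalidates all 8 nodes of the marked group, one source of the protocol-dependent message counts the paper measures.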

  • Prefetching using Markov predictors

    Publication Year: 1999, Page(s): 121 - 133
    Cited by: Papers (35) | Patents (4)

    Prefetching is one approach to reducing the latency of memory operations in modern computer systems. In this paper, we describe the Markov prefetcher. This prefetcher acts as an interface between the on-chip and off-chip cache and can be added to existing computer designs. The Markov prefetcher is distinguished by prefetching multiple reference predictions from the memory subsystem, and then prioritizing the delivery of those references to the processor. This design results in a prefetching system that provides good coverage, is accurate, and produces timely results that can be effectively used by the processor. We also explored a range of techniques that can be used to reduce the bandwidth demands of prefetching, leading to improved memory system performance. In our cycle-level simulations, the Markov prefetcher reduces the overall execution stalls due to instruction and data memory operations by an average of 54 percent for various commercial benchmarks while using only two-thirds the memory of a demand-fetch cache organization.
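    A toy version of the prediction table (parameters and replacement policy are invented, not the paper's configuration) shows the mechanism: each miss address maps to an entry listing its recently observed successor misses, most recent first, and those successors become prioritized prefetch candidates on the next miss to that address:

```c
#include <stdint.h>
#include <string.h>

#define ENTRIES 1024
#define WAYS    4

/* One prediction-table entry: the successor miss addresses recently seen
 * after a miss to `tag`, most recent (highest priority) first. */
struct markov_entry {
    uint64_t tag;
    uint64_t next[WAYS];
};

static struct markov_entry table[ENTRIES];
static uint64_t last_miss;
static int      warmed;

static struct markov_entry *lookup(uint64_t addr)
{
    return &table[(addr >> 6) % ENTRIES];   /* index by 64-byte line */
}

/* Call on every cache miss. Returns up to WAYS prioritized prefetch
 * candidates for this miss (zero entries mean "no prediction"), or NULL. */
static const uint64_t *markov_miss(uint64_t addr)
{
    if (warmed) {
        /* Learn: addr is the newest observed successor of the last miss. */
        struct markov_entry *prev = lookup(last_miss);
        if (prev->tag != last_miss) {       /* (re)allocate the entry */
            prev->tag = last_miss;
            memset(prev->next, 0, sizeof prev->next);
        }
        memmove(&prev->next[1], &prev->next[0],
                (WAYS - 1) * sizeof prev->next[0]);
        prev->next[0] = addr;
    }
    warmed = 1;
    last_miss = addr;

    /* Predict: candidates for the current miss, index 0 first. */
    struct markov_entry *e = lookup(addr);
    return (e->tag == addr) ? e->next : NULL;
}
```

    A real implementation sits between the on-chip and off-chip caches and issues prefetches for the returned candidates in priority order, which is where the paper's prioritized-delivery design comes in.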

  • Exploiting the benefits of multiple-path network in DSM systems: architectural alternatives and performance evaluation

    Publication Year: 1999, Page(s): 236 - 244
    Cited by: Papers (1)

    Modern high-performance networks being used for scalable distributed shared-memory (DSM) systems support multiple paths to increase bandwidth and/or reduce contention. Such networks violate the constraint of pairwise in-order message delivery implicitly required by many existing directory-based cache coherence protocols. To solve this problem, two alternative strategies are currently used by computer architects. The first strategy, used in the SGI Origin series, is to employ an intelligent cache coherence protocol which detects and resolves all race conditions caused by out-of-order (OoO) events. The second strategy, used in the HAL Mercury series, is to use a sophisticated network interface (NI) which detects and remedies every OoO event before the messages are fed to the cache coherence controllers. Both strategies involve complicated hardware logic, either at the cache coherence controller level or at the NI level. In this paper, we propose a new strategy that uses block-correlated FIFO channels. This new strategy detects all potential race conditions and prevents them from occurring. It allows the use of a simple cache coherence protocol and an inexpensive NI. We also present an efficient implementation of this strategy based on current technology. Detailed simulations are performed using benchmark applications to evaluate the performance of our new strategy. The results indicate that, compared to the existing strategies, our new strategy always provides either the best or close to the best overall performance. This study also provides valuable insights into the design trade-offs in incorporating modern networks into DSM systems.
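    The core of a block-correlated FIFO channel is a deterministic channel-selection function (the channel count and send primitive below are invented for illustration): messages for the same coherence block always take the same in-order channel, so pairwise ordering is preserved per block while different blocks still spread across the network's paths:

```c
#include <stdint.h>
#include <stdio.h>

#define CHANNELS    8    /* assumed number of independent in-order paths */
#define BLOCK_SHIFT 6    /* 64-byte coherence blocks */

struct message {
    uint64_t block_addr;
    /* ... message type, payload ... */
};

/* Stand-in for the network interface's per-channel send queue. */
static void ni_send_on_channel(int channel, const struct message *m)
{
    printf("block 0x%llx -> channel %d\n",
           (unsigned long long)m->block_addr, channel);
}

/* Block-correlated selection: same block, same FIFO channel, always. */
void send_coherence_msg(const struct message *m)
{
    int channel = (int)((m->block_addr >> BLOCK_SHIFT) % CHANNELS);
    ni_send_on_channel(channel, m);
}

int main(void)
{
    struct message m1 = {0x1000}, m2 = {0x1008}, m3 = {0x2040};
    send_coherence_msg(&m1);   /* same block as m2 ...              */
    send_coherence_msg(&m2);   /* ... same channel, order preserved */
    send_coherence_msg(&m3);   /* different block, another path     */
    return 0;
}
```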

  • Randomized cache placement for eliminating conflicts

    Publication Year: 1999, Page(s): 185 - 192
    Cited by: Papers (18)

    Applications with regular patterns of memory access can experience high levels of cache conflict misses. In shared-memory multiprocessors, conflict misses can be increased significantly by the data transpositions required for parallelization. Techniques such as blocking, which are introduced within a single thread to improve locality, can result in yet more conflict misses. The tension between minimizing cache conflicts and the other transformations needed for efficient parallelization leads to complex optimization problems for parallelizing compilers. This paper shows how the introduction of a pseudorandom element into the cache index function can effectively eliminate repetitive conflict misses and produce a cache whose miss ratio depends solely on working-set behavior. We examine the impact of pseudorandom cache indexing on processor cycle times and present practical solutions to some of the major implementation issues for this type of cache. Our conclusions are supported by simulations of a superscalar out-of-order processor executing the SPEC95 benchmarks, as well as by cache simulations of individual loop kernels that illustrate specific effects. We present measurements of instructions committed per cycle (IPC) when comparing the performance of different cache architectures on whole-program benchmarks such as the SPEC95 suite.
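    One member of the XOR-folding family of index functions (this particular function and its parameters are illustrative, not the ones evaluated in the paper):

```c
#include <stdint.h>
#include <stdio.h>

#define SET_BITS 9                        /* 512 sets, 64-byte lines */
#define SET_MASK ((1u << SET_BITS) - 1)

static uint32_t conventional_index(uint64_t addr)
{
    return (uint32_t)(addr >> 6) & SET_MASK;
}

/* XOR-fold three disjoint slices of the block address into the index, so
 * power-of-two strides no longer map every reference to the same set. */
static uint32_t randomized_index(uint64_t addr)
{
    uint64_t block = addr >> 6;
    return (uint32_t)(block ^ (block >> SET_BITS) ^ (block >> (2 * SET_BITS)))
           & SET_MASK;
}

int main(void)
{
    /* Two addresses one conventional cache period (512 lines) apart: */
    uint64_t a = 0, b = 512 * 64;
    printf("conventional: %u vs %u\n",
           conventional_index(a), conventional_index(b));   /* 0 vs 0 */
    printf("randomized:   %u vs %u\n",
           randomized_index(a), randomized_index(b));       /* 0 vs 1 */
    return 0;
}
```

    Addresses that collide under the conventional modulo index land in different sets once the upper block-address bits are folded in, which is what breaks up repetitive stride conflicts.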

  • An algorithm for optimally exploiting spatial and temporal locality in upper memory levels

    Publication Year: 1999, Page(s): 150 - 158
    Cited by: Papers (8)

    In this study, we present an extension of Belady's MIN algorithm that optimally and simultaneously exploits spatial and temporal locality. Thus, this algorithm provides a performance upper bound of upper memory levels. The purpose of this algorithm is to assess current memory optimizations and to evaluate the potential benefits of future optimizations. We formally prove the optimality of this new algorithm with respect to minimizing misses and we show experimentally that the algorithm produces nearly minimum memory traffic on the SPEC95 benchmarks.
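    For reference, the classical temporal-only MIN policy that the paper extends can be simulated in a few lines (the trace and capacity below are made up):

```c
#include <stdio.h>

#define CAP 3   /* fully associative cache of CAP blocks */

/* Position of the next reference to `block` after `pos`, or n+1 if none. */
static int next_use(const int trace[], int n, int pos, int block)
{
    for (int i = pos + 1; i < n; i++)
        if (trace[i] == block) return i;
    return n + 1;
}

int main(void)
{
    int trace[] = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};
    int n = sizeof trace / sizeof trace[0];
    int cache[CAP], filled = 0, misses = 0;

    for (int pos = 0; pos < n; pos++) {
        int hit = 0;
        for (int i = 0; i < filled; i++)
            if (cache[i] == trace[pos]) hit = 1;
        if (hit) continue;
        misses++;
        if (filled < CAP) { cache[filled++] = trace[pos]; continue; }
        /* Belady's MIN: evict the block whose next use is farthest away. */
        int victim = 0, far = -1;
        for (int i = 0; i < filled; i++) {
            int d = next_use(trace, n, pos, cache[i]);
            if (d > far) { far = d; victim = i; }
        }
        cache[victim] = trace[pos];
    }
    printf("MIN misses: %d of %d references\n", misses, n);
    return 0;
}
```

    The paper's extension additionally accounts for spatial locality (which neighboring words travel with each block), which plain MIN, as sketched here, ignores.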

  • Augmenting loop tiling with data alignment for improved cache performance

    Publication Year: 1999, Page(s): 142 - 149
    Cited by: Papers (23)

    Loop blocking (tiling) is a well-known compiler optimization that helps improve cache performance by dividing the loop iteration space into smaller blocks (tiles); reuse of array elements within each tile is maximized by ensuring that the working set for the tile fits into the data cache. Padding is a data alignment technique that involves the insertion of dummy elements into a data structure for improving cache performance. In this work, we present DAT, a technique that augments loop tiling with data alignment, achieving improved efficiency (by ensuring that the cache is never under-utilized) as well as improved flexibility (by eliminating self-interference cache conflicts independent of the tile size). This results in a more stable and better cache performance than existing approaches, in addition to maximizing cache utilization, eliminating self-interference, and minimizing cross-interference conflicts. Further, while all previous efforts are targeted at programs characterized by the reuse of a single array, we also address the issue of minimizing conflict misses when several tiled arrays are involved. To validate our technique, we ran extensive experiments using both simulations and actual measurements on Sun Sparc5 and Sparc10 workstations. The results on benchmarks exhibiting varying memory access patterns demonstrate the effectiveness of our technique through consistently high hit ratios and improved performance across varying problem sizes.
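    The shape of the resulting code (tile size and pad amount below are illustrative constants; DAT derives them from the cache and array parameters):

```c
#define N   512
#define PAD 8     /* assumed pad; DAT computes this from cache parameters */
#define T   64    /* tile size; assumed, not derived */

/* Statics are zero-initialized, so C starts out cleared. */
static double A[N][N + PAD], B[N][N + PAD], C[N][N + PAD];

void matmul_tiled(void)
{
    for (int jj = 0; jj < N; jj += T)         /* tile the j loop ... */
        for (int kk = 0; kk < N; kk += T)     /* ... and the k loop  */
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + T; j++) {
                    double s = C[i][j];
                    for (int k = kk; k < kk + T; k++)
                        s += A[i][k] * B[k][j];
                    C[i][j] = s;
                }
}
```

    The PAD columns change how successive rows map onto cache sets, so a tile's rows do not evict one another (self-interference) regardless of the problem size N.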

  • A trace cache microarchitecture and evaluation

    Publication Year: 1999, Page(s): 111 - 120
    Cited by: Papers (25) | Patents (7)

    As the instruction issue width of superscalar processors increases, instruction fetch bandwidth requirements will also increase. It will eventually become necessary to fetch multiple basic blocks per clock cycle. Conventional instruction caches hinder this effort because long instruction sequences are not always in contiguous cache locations. Trace caches overcome this limitation by caching traces of the dynamic instruction stream, so instructions that are otherwise noncontiguous appear contiguous. In this paper, we present and evaluate a microarchitecture incorporating a trace cache. The microarchitecture provides high instruction fetch bandwidth with low latency by explicitly sequencing through the program at the higher level of traces, both in terms of (1) control flow prediction and (2) instruction supply. For the SPEC95 integer benchmarks, trace-level sequencing improves performance from 15 percent to 35 percent over an otherwise equally sophisticated, but contiguous, multiple-block fetch mechanism. Most of this performance improvement is due to the trace cache. However, for one benchmark whose performance is limited by branch mispredictions, the performance gain is almost entirely due to improved prediction accuracy.
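    What a trace cache line must record can be sketched as a C struct (field names and sizes are invented; the paper evaluates a specific organization): a trace is identified by its starting address plus the outcomes of the branches inside it, so basic blocks that are noncontiguous in memory are stored, and later fetched, as one contiguous unit:

```c
#include <stdint.h>

#define TRACE_INSNS    16   /* max instructions per trace */
#define TRACE_BRANCHES 3    /* max conditional branches per trace */

struct trace_line {
    uint64_t start_pc;            /* fetch address the trace begins at */
    uint8_t  branch_count;        /* conditional branches inside the trace */
    uint8_t  branch_outcomes;     /* bit i = taken/not-taken for branch i */
    uint32_t insns[TRACE_INSNS];  /* the trace, stored contiguously */
    uint8_t  valid;
};

/* Hit test: the next fetch address and the multiple-branch prediction must
 * both match for the trace to supply this cycle's instructions. */
static int trace_hit(const struct trace_line *t, uint64_t pc, uint8_t pred)
{
    uint8_t mask = (uint8_t)((1u << t->branch_count) - 1);
    return t->valid && t->start_pc == pc
        && ((t->branch_outcomes ^ pred) & mask) == 0;
}
```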

  • Functional implementation techniques for CPU cache memories

    Publication Year: 1999, Page(s): 100 - 110
    Cited by: Papers (4)

    As the performance gap between processors and main memory continues to widen, increasingly aggressive implementations of cache memories are needed to bridge the gap. In this paper, we consider some of the issues that are involved in the implementation of highly optimized cache memories and survey the techniques that can be used to help achieve the increasingly stringent design targets and constraints of modern processors. In particular, we consider techniques that enable the cache to be accessed quickly and still achieve a good hit ratio. We also consider issues such as area cost and bandwidth requirements. Trace-driven simulations of a TPC-C-like workload and selected applications from the SPEC95 benchmark suite are used in the paper to compare the performance of some of the techniques.

  • Evaluation of design options for the trace cache fetch mechanism

    Publication Year: 1999, Page(s): 193 - 204
    Cited by: Papers (10) | Patents (5)

    In this paper, we examine some critical design features of a trace cache fetch engine for a 16-wide issue processor and evaluate their effects on performance. We evaluate path associativity, partial matching, and inactive issue, all of which are straightforward extensions to the trace cache. We examine features such as the fill unit and branch predictor design. In our final analysis, we show that the trace cache mechanism attains a 28 percent performance improvement over an aggressive single block fetch mechanism and a 15 percent improvement over a sequential multiblock mechanism.

  • Analysis of temporal-based program behavior for improved instruction cache performance

    Publication Year: 1999, Page(s): 168 - 175
    Cited by: Papers (5)

    In this paper, we examine temporal-based program interaction in order to improve layout by reducing the probability that program units will conflict in an instruction cache. In that context, we present two profile-guided procedure reordering algorithms. Both techniques use cache line coloring to arrive at a final program layout and target the elimination of first-generation cache conflicts (i.e., conflicts between caller/callee pairs). The first algorithm builds a call graph that records local temporal interaction between procedures. We describe how the call graph is used to guide the placement step and present methods that accelerate cache line coloring by exploring aggressive graph pruning techniques. In the second approach, we capture global temporal program interaction by constructing a Conflict Miss Graph (CMG). The CMG estimates the worst-case number of misses two competing procedures can inflict upon one another and is used to reduce higher-generation cache conflicts. We use a pruned CMG to guide cache line coloring. Using several C and C++ benchmarks, we show the benefits of letting both types of graphs guide procedure reordering to improve instruction cache hit rates. To contrast the differences between these two forms of temporal interaction, we also develop new characterization streams based on the Inter-Reference Gap (IRG) model.
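    A greedy toy version of conflict-aware placement (not the paper's coloring algorithm; the sizes, hot caller/callee pair, and 64-line direct-mapped cache are all invented) conveys the idea: a procedure's start line is advanced until its cache footprint no longer overlaps its hottest partner's:

```c
#include <stdio.h>

#define CACHE_LINES 64   /* direct-mapped instruction cache, in lines */

struct proc {
    const char *name;
    int lines;     /* procedure size in cache lines */
    int start;     /* assigned start line in the layout, -1 = unplaced */
    int partner;   /* index of hottest caller/callee, -1 = none */
};

/* Do two placed procedures share any cache line (footprints mod cache size)? */
static int conflicts(const struct proc *a, const struct proc *b)
{
    for (int i = 0; i < a->lines; i++)
        for (int j = 0; j < b->lines; j++)
            if ((a->start + i) % CACHE_LINES == (b->start + j) % CACHE_LINES)
                return 1;
    return 0;
}

int main(void)
{
    struct proc p[] = {
        {"main",   40, -1,  2},   /* main and lookup are the hot pair */
        {"report", 20, -1, -1},
        {"lookup", 20, -1,  0},
    };
    int cursor = 0;               /* next free line in the linear layout */
    for (int i = 0; i < 3; i++) {
        p[i].start = cursor;
        /* Slide forward until the footprint avoids the hot partner's. */
        while (p[i].partner >= 0 && p[p[i].partner].start >= 0 &&
               conflicts(&p[i], &p[p[i].partner]))
            p[i].start++;
        cursor = p[i].start + p[i].lines;
        printf("%-6s -> layout lines %3d..%3d\n",
               p[i].name, p[i].start, p[i].start + p[i].lines - 1);
    }
    return 0;
}
```

    Running it places lookup at layout lines 104..123 (cache lines 40..59), past a gap, so it never collides with main. The toy only separates the one hot pair; the paper's coloring algorithms handle all first-generation (and, with the CMG, higher-generation) conflicts at once.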


Aims & Scope

The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field.


Meet Our Editors

Editor-in-Chief
Paolo Montuschi
Politecnico di Torino
Dipartimento di Automatica e Informatica
Corso Duca degli Abruzzi 24 
10129 Torino - Italy
e-mail: pmo@computer.org