
Proceedings of the 22nd Annual International Symposium on Computer Architecture

Date: 22-24 June 1995


Displaying Results 1 - 25 of 39
  • Proceedings 22nd Annual International Symposium on Computer Architecture

    Publication Year: 1995
    PDF (169 KB)
    Freely Available from IEEE
  • The MIT Alewife machine: architecture and performance

    Publication Year: 1995 , Page(s): 2 - 13
    Cited by:  Papers (4)  |  Patents (18)
    PDF (1425 KB)

    Alewife is a multiprocessor architecture that supports up to 512 processing nodes connected over a scalable and cost-effective mesh network at a constant cost per node. The MIT Alewife machine, a prototype implementation of the architecture, demonstrates that a parallel system can be both scalable and programmable. Four mechanisms combine to achieve these goals: software-extended coherent shared memory provides a global, linear address space; integrated message passing allows compiler and operating system designers to provide efficient communication and synchronization; support for fine-grain computation allows many processors to cooperate on small problem sizes; and latency tolerance mechanisms, including block multithreading and prefetching, mask unavoidable delays due to communication. Microbenchmarks, together with over a dozen complete applications running on the 32-node prototype, help to analyze the behavior of the system. Analysis shows that integrating message passing with shared memory enables a cost-efficient solution to the cache coherence problem and provides a rich set of programming primitives. Block multithreading and prefetching improve performance by up to 25% individually and 35% together. Finally, language constructs that allow programmers to express fine-grain synchronization can improve performance by over a factor of two.

  • The EM-X parallel computer: architecture and basic performance

    Publication Year: 1995 , Page(s): 14 - 23
    Cited by:  Patents (1)
    PDF (1015 KB)

    Latency tolerance is essential in achieving high performance on parallel computers for remote function calls and fine-grained remote memory accesses. EM-X supports interprocessor communication on an execution pipeline with small and simple packets. It can create a packet in one cycle, and receive a packet from the network into the on-chip buffer without interruption. EM-X invokes threads on packet arrival, minimizing the overhead of thread switching. It can tolerate communication latency through efficient multithreading and by optimizing the packet flow of fine-grain communication. EM-X also supports the synchronization of two operands, direct remote memory read/write operations, and flexible packet scheduling with priority. The paper describes distinctive features of the EM-X architecture and reports the performance of small synthetic programs and larger, more realistic programs.

  • The SPLASH-2 programs: characterization and methodological considerations

    Publication Year: 1995 , Page(s): 24 - 36
    Cited by:  Papers (91)  |  Patents (7)
    PDF (1666 KB)

    The SPLASH-2 suite of parallel applications has recently been released to facilitate the study of centralized and distributed shared-address-space multiprocessors. In this context, the paper has two goals. One is to quantitatively characterize the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understanding them well. The properties we study include the computational load balance, communication-to-computation ratio and traffic needs, important working set sizes, and issues related to spatial locality, as well as how these properties scale with problem size and the number of processors. The other, related goal is methodological: to assist people who will use the programs in architectural evaluations to prune the space of application and machine parameters in an informed and meaningful way. For example, by characterizing the working sets of the applications, we describe which operating points in terms of cache size and problem size are representative of realistic situations, which are not, and which are redundant. Using SPLASH-2 as an example, we hope to convey the importance of understanding the interplay of problem size, number of processors, and working sets in designing experiments and interpreting their results.

  • Efficient strategies for software-only directory protocols in shared-memory multiprocessors

    Publication Year: 1995 , Page(s): 38 - 47
    Cited by:  Papers (1)  |  Patents (3)
    PDF (1250 KB)

    The cost, complexity, and inflexibility of hardware-based directory protocols motivate us to study the performance implications of protocols that emulate directory management using software handlers executed on the compute processors. An important performance limitation of such software-only protocols is that the software latency associated with directory management ends up on the critical memory access path for read miss transactions. We propose five strategies that support efficient data transfers in hardware while directory management is handled at a slower pace in the background by software handlers. Simulations show that this approach can remove the directory-management latency from the memory access path. Although the directory is managed in software, the hardware mechanisms must access the memory state in order to enable high-speed data transfers. Overall, our strategies reach between 60% and 86% of the hardware-based protocol performance.

  • Dynamic self-invalidation: reducing coherence overhead in shared-memory multiprocessors

    Publication Year: 1995 , Page(s): 48 - 59
    Cited by:  Papers (17)  |  Patents (6)
    PDF (1327 KB)

    The paper introduces dynamic self-invalidation (DSI), a new technique for reducing cache coherence overhead in shared-memory multiprocessors. DSI eliminates invalidation messages by having a processor automatically invalidate its local copy of a cache block before a conflicting access by another processor. Eliminating invalidation overhead is particularly important under sequential consistency, where the latency of invalidating outstanding copies can increase a program's critical path. DSI is applicable to software, hardware, and hybrid coherence schemes. We evaluate DSI in the context of hardware directory-based write-invalidate coherence protocols. Our results show that DSI reduces execution time of a sequentially consistent full-map coherence protocol by as much as 41%. This is comparable to an implementation of weak consistency that uses a coalescing write-buffer to allow up to 16 outstanding requests for exclusive blocks. When used in conjunction with weak consistency, DSI can exploit tear-off blocks, which eliminate both invalidation and acknowledgment messages, for a total reduction in messages of up to 26%.

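    As a rough illustration of the self-invalidation idea, here is a minimal sketch in C, assuming a directory that piggybacks a "likely future conflict" hint on each fill reply and a cache that drops flagged blocks at synchronization points; the names and flat block array are illustrative, not the paper's hardware design.

        #include <stdbool.h>
        #include <stddef.h>

        #define CACHE_BLOCKS 1024

        struct block {
            bool valid;
            bool self_invalidate;   /* directory hint captured at fill time */
        };

        static struct block cache[CACHE_BLOCKS];

        /* On a fill, remember whether the directory expects a future
         * conflicting access to this block. */
        void on_fill(size_t idx, bool directory_hint)
        {
            cache[idx].valid = true;
            cache[idx].self_invalidate = directory_hint;
        }

        /* At a synchronization point (the weak-consistency variant), drop
         * flagged blocks locally; the directory then never needs to send
         * an explicit invalidation message for them. */
        void at_sync_point(void)
        {
            for (size_t i = 0; i < CACHE_BLOCKS; i++) {
                if (cache[i].self_invalidate) {
                    cache[i].valid = false;
                    cache[i].self_invalidate = false;
                }
            }
        }
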
  • Boosting the performance of hybrid snooping cache protocols

    Publication Year: 1995 , Page(s): 60 - 69
    Cited by:  Papers (2)  |  Patents (4)
    PDF (1196 KB)

    Previous studies of bus-based shared-memory multiprocessors have shown hybrid write-invalidate/write-update snooping protocols to be incapable of providing consistent performance improvements over write-invalidate protocols. We analyze the deficiencies of hybrid snooping protocols under release consistency, and show how these deficiencies can be dramatically reduced by using write caches and read snarfing. Our performance evaluation is based on program-driven simulation and a set of five scientific applications with different sharing behaviors, including migratory sharing as well as producer-consumer sharing. We show that a hybrid protocol, extended with write caches as well as read snarfing, manages to reduce the number of coherence misses by between 83% and 95% as compared to a write-invalidate protocol for all five applications in this study. In addition, the number of bus transactions is reduced by between 36% and 60% for four of the applications and by 9% for the fifth application. Because of the small implementation cost of the hybrid protocol and the two extensions, we believe that this combination is an effective approach to boost the performance of bus-based multiprocessors.

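    The read-snarfing extension lends itself to a short sketch, assuming a simple MSI-style snooping cache: when a data reply for a tracked block appears on the bus, a cache holding an invalidated copy captures the data instead of missing on it later. The structure and names are illustrative.

        #include <stdbool.h>
        #include <stdint.h>

        enum state { INVALID, SHARED, MODIFIED };

        struct line {
            uint32_t   tag;
            enum state st;
            uint32_t   data;
        };

        /* Invoked in every snooping cache when a read reply for 'tag' is
         * observed on the shared bus. */
        void snoop_data_reply(struct line *l, uint32_t tag, uint32_t data)
        {
            if (l->tag != tag)
                return;               /* not a block this cache tracks  */
            if (l->st == INVALID) {   /* snarf: refresh the stale copy  */
                l->data = data;       /* for free off the bus, avoiding */
                l->st   = SHARED;     /* a later coherence miss         */
            }
        }
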
  • S-Connect: from networks of workstations to supercomputer performance

    Publication Year: 1995 , Page(s): 71 - 82
    Cited by:  Papers (5)  |  Patents (4)
    PDF (1342 KB)

    S-Connect is a new high-speed, scalable interconnect system developed to let networks of workstations efficiently share computing resources. It uses off-the-shelf CMOS technology to directly drive fiber-optic systems at speeds greater than 1 Gbit/sec and can realize bisection bandwidths comparable to high-end MPP systems while being >10x more cost-effective. S-Connect systems do not rely on centralized switches, but rather are composed of adaptive, topology-independent routing elements that are integrated into each node. The S-Connect routing algorithm is optimized for fine-grained, irregular traffic and is designed to support high traffic loads that can utilize most of the physically available bandwidth. Such traffic is typical of a distributed shared memory system, which is one of the intended applications. S-Connect innovations include a novel distributed phase locking method that allows global synchronization, hardware support for multiple message priorities, in-band monitoring and control facilities, and a low-overhead channel protocol that supports multiple in-transit messages on the same fiber. The first version of the S-Connect switching element has been successfully implemented in a commercial 0.65 μm CMOS process.

  • Destage algorithms for disk arrays with non-volatile caches

    Publication Year: 1995 , Page(s): 83 - 95
    Cited by:  Papers (2)  |  Patents (13)
    PDF (1556 KB)

    In a disk array with a nonvolatile write cache, destages from the cache to the disk are performed in the background asynchronously while read requests from the host system are serviced in the foreground. We study a number of algorithms for scheduling destages in a RAID-5 system. We introduce a new scheduling algorithm, called linear threshold scheduling, that adaptively varies the rate of destages to disks based on the instantaneous occupancy of the write cache. The performance of the algorithm is compared with that of a number of alternative scheduling approaches such as least-cost scheduling and high/low mark. The algorithms are evaluated in terms of their effectiveness in making destages transparent to the servicing of read requests from the host, their disk utilization, and their ability to tolerate bursts in the workload without causing an overflow of the write cache. Our results show that linear threshold scheduling provides the best read performance of all the algorithms compared, while still maintaining a high degree of burst tolerance. An approximate implementation of the linear-threshold scheduling algorithm is also described. The approximate algorithm can be implemented with much lower overhead, yet its performance is virtually identical to that of the ideal algorithm.

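    A minimal sketch of the linear threshold policy follows, assuming a scheduler invoked at fixed intervals; the 10%/90% thresholds and the linear rate mapping are illustrative parameters, not the paper's.

        #include <stddef.h>

        /* Map write-cache occupancy to the number of destages to issue in
         * this interval: idle while nearly empty, ramping linearly to the
         * maximum rate as the cache approaches full. */
        size_t destages_to_issue(size_t dirty_blocks, size_t capacity,
                                 size_t max_rate)
        {
            size_t low  = capacity / 10;        /* below: stay idle        */
            size_t high = (capacity * 9) / 10;  /* above: destage flat out */

            if (dirty_blocks <= low)
                return 0;
            if (dirty_blocks >= high)
                return max_rate;
            /* Linear interpolation between the two thresholds. */
            return max_rate * (dirty_blocks - low) / (high - low);
        }
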
  • Evaluating multi-port frame buffer designs for a mesh-connected multicomputer

    Publication Year: 1995 , Page(s): 96 - 105
    Cited by:  Papers (2)  |  Patents (1)
    PDF (961 KB)

    Multicomputers can be effectively used for interactive graphics rendering only if there are mechanisms available to rapidly composite and transfer images to an external display device. One method for achieving the necessary bandwidth for this operation is to provide multiple high-bandwidth ports into a frame buffer. In this paper, we evaluate the design space of a multi-port frame buffer for the Intel Paragon mesh routing network. We use an instrumented rendering system to capture the graphics operations needed for rendering a number of three-dimensional scenes; we then use those workloads as input to test programs running on the Paragon to estimate the performance of our hardware. Our experiments consider three major design questions: how many network ports the frame buffer needs, whether Z-buffering should be done in hardware on the frame buffer or in software on the computing nodes, and whether the design alternatives are scalable.

  • Are crossbars really dead? The case for optical multiprocessor interconnect systems

    Publication Year: 1995 , Page(s): 106 - 115
    Cited by:  Papers (4)  |  Patents (25)
    PDF (1130 KB)

    Crossbar switches are rarely considered for large, scalable multiprocessor interconnect systems because they require O(n^2) switching elements, are difficult to control efficiently, and are hard to implement once their size becomes too large to fit on one integrated circuit. However, these problems are technology-dependent, and a recent innovation in fiber-optic devices has led to a new implementation of crossbar switches that does not share these problems while retaining the full advantages of a crossbar switch: low latency, high throughput, complete connectivity, and multicast capability. Moreover, this new technology has several characteristics that allow a distributed control system which scales linearly in the number of attached nodes. The innovation that led to this research is an optical AND gate that can be used to demultiplex multiple high-speed data streams carried on one common optical medium. Optical time-domain multiplexing (OTDM) can combine the data from many nodes and broadcast the result back to all nodes. This paper discusses OTDM technology only to the extent necessary to understand its characteristics and capabilities. The main contribution lies in the description and analysis of interconnect architectures that utilize OTDM to achieve a level of performance that is beyond electronic means. It is expected that cost-reduced OTDM systems will become competitive with the next generation of interconnect systems.

  • Exploring configurations of functional units in an out-of-order superscalar processor

    Publication Year: 1995 , Page(s): 117 - 125
    Cited by:  Papers (3)  |  Patents (1)
    PDF (902 KB)

    This study was carried out to determine cost-effective configurations of functional units for multiple-issue out-of-order superscalar processors. Trace-driven simulations were performed on the six integer and fourteen floating-point programs from the SPEC 92 suite. We first evaluate the number of instructions allowed to be concurrently processed by the execution stages of the pipeline. We then apply restrictions on the execution issue of different instruction classes in order to define these configurations. We conclude that five to nine functional units are necessary to exploit instruction-level parallelism. An important point is that several data cache ports are required in a processor of degree 4 or more. Finally, we report complementary results on the utilization rate of the functional units.

  • Unconstrained speculative execution with predicated state buffering

    Publication Year: 1995 , Page(s): 126 - 137
    Cited by:  Papers (1)  |  Patents (24)
    PDF (1420 KB)

    Speculative execution is the execution of instructions before it is known whether these instructions should be executed. Compiler-based speculative execution has the potential to achieve both a high instructions-per-cycle rate and a high clock rate. Pure compiler-based approaches, however, have greatly limited instruction scheduling due to a limited ability to handle side effects of speculative execution. Significant performance improvement is thus difficult in non-numerical applications. This paper proposes a new architectural mechanism, called predicating, which provides unconstrained speculative execution. Predicating removes restrictions that limit the compiler's ability to schedule instructions. With our hardware support, the compiler is allowed to move instructions past multiple basic block boundaries from any succeeding control path. Predicating buffers the side effects of speculative execution with its predicate, and the buffered predicate efficiently commits or squashes the side effects. The mechanism also provides a speculative exception handling scheme. The scheme, called the future condition, properly postpones speculative exceptions and efficiently restarts the process. We show that our mechanism can be implemented with a modest amount of hardware and little complexity. The evaluation results show that our mechanism significantly improves performance, achieving a 2.45x speedup over scalar machines.

  • A comparison of full and partial predicated execution support for ILP processors

    Publication Year: 1995 , Page(s): 138 - 149
    Cited by:  Papers (15)  |  Patents (22)
    PDF (1406 KB)

    One can effectively utilize predicated execution to improve branch handling in instruction-level parallel processors. Although the potential benefits of predicated execution are high, the tradeoffs involved in the design of an instruction set to support predicated execution can be difficult. On one end of the design spectrum, architectural support for full predicated execution requires increasing the number of source operands for all instructions. Full predicate support provides for the most flexibility and the largest potential performance improvements. On the other end, partial predicated execution support, such as conditional moves, requires very little change to existing architectures. This paper presents a preliminary study to qualitatively and quantitatively address the benefit of full and partial predicated execution support. With our current compiler technology, we show that the compiler can use both partial and full predication to achieve speedup in large control-intensive programs. Some details of the code generation techniques are shown to provide insight into the benefit of going from partial to full predication. Preliminary experimental results are very encouraging: partial predication provides an average of 33% performance improvement over an 8-issue processor with no predicate support, while full predication provides an additional 30% improvement.

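    The contrast between partial and full predication can be sketched at the source level, with the caveat that the paper concerns instruction-set support rather than C; the guarded fragment in the final comment is illustrative pseudo-assembly, not any shipping ISA.

        #include <stdio.h>

        /* Original control flow: a hard-to-predict branch. */
        int branchy(int a, int b, int x)
        {
            int r;
            if (x > 0) r = a + 1;
            else       r = b - 1;
            return r;
        }

        /* Partial predication: compute both values, select one with what
         * a compiler can map to a conditional move; no extra source
         * operands are needed on other instructions. */
        int cmov_style(int a, int b, int x)
        {
            int t1 = a + 1;
            int t2 = b - 1;
            return (x > 0) ? t1 : t2;
        }

        /* Full predication guards every instruction with a predicate, so
         * side effects are squashed rather than selected:
         *     p1, p2 = cmp.gt x, 0    ; set complementary predicates
         *     (p1) r = a + 1          ; executes only if p1 is true
         *     (p2) r = b - 1          ; executes only if p2 is true   */

        int main(void)
        {
            printf("%d %d\n", branchy(3, 7, 1), cmov_style(3, 7, -1));
            return 0;
        }
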
  • Implementation trade-offs in using a restricted data flow architecture in a high performance RISC microprocessor

    Publication Year: 1995 , Page(s): 151 - 162
    Cited by:  Papers (3)  |  Patents (3)
    PDF (1028 KB)

    The implementation of a superscalar, speculative-execution SPARC-V9 microprocessor incorporating restricted data flow principles required many design trade-offs. Consideration was given to both performance and cost. Performance is largely a function of cycle time and instructions executed per cycle, while cost is primarily a function of die area. Here we describe our restricted data flow implementation and the means by which we arrived at its configuration. Future semiconductor technology advances will allow these trade-offs to be relaxed and higher-performance restricted data flow machines to be built.

  • Performance evaluation of the PowerPC 620 microarchitecture

    Publication Year: 1995 , Page(s): 163 - 174
    Cited by:  Papers (7)  |  Patents (5)
    PDF (1288 KB)

    The PowerPC 620 microprocessor is the most recent and performance-leading member of the PowerPC family. The 64-bit PowerPC 620 microprocessor employs a two-phase branch prediction scheme, dynamic renaming for all the register files, distributed multi-entry reservation stations, true out-of-order execution by six execution units, and a completion buffer for ensuring precise exceptions. This paper presents an instruction-level performance evaluation of the 620 microarchitecture. A performance simulator is developed using the VMW (visualization-based microarchitecture workbench) retargetable framework. The VMW-based simulator accurately models the microarchitecture down to the machine cycle level. Extensive trace-driven simulation is performed using the SPEC92 benchmarks. Detailed quantitative analyses of the effectiveness of all key microarchitecture features are presented.

  • Reducing TLB and memory overhead using online superpage promotion

    Publication Year: 1995 , Page(s): 176 - 187
    Cited by:  Papers (5)  |  Patents (18)
    PDF (1347 KB)

    Modern microprocessors contain small translation lookaside buffers (TLBs) that maintain a cache of recently used translations. A TLB's coverage is the sum of the number of bytes mapped by each entry. Applications with working sets larger than the TLB coverage will perform poorly due to high TLB miss rates. Superpages have been proposed as a mechanism for increasing TLB coverage. A superpage is a virtual memory page whose size and alignment are a power-of-two multiple of the system's base page size. In this paper, we describe online policies for superpage management that monitor TLB miss traffic to decide when a superpage should be constructed. Our policies take into account both the benefit of a superpage promotion (potential for preventing future misses) and the cost (page copying). Although our approach increases the cost of each TLB miss, the net effect is to improve total execution time by eliminating a large number of misses without significantly increasing memory usage, thereby improving system performance.

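    An online promotion policy of this flavor can be sketched as below, assuming a software TLB-miss handler that counts misses per aligned candidate region; the cost model and constants are illustrative assumptions, not the paper's parameters.

        #include <stdbool.h>
        #include <stdint.h>

        #define BASE_PAGE  4096u
        #define SUPERPAGE  (BASE_PAGE * 16u)  /* power-of-two multiple */

        struct region {
            uint32_t tlb_misses;   /* misses seen in this aligned region */
            bool     promoted;
        };

        /* Called from the TLB-miss handler: promote once the cycles we
         * expect to save on future misses exceed the one-time cost of
         * copying the base pages into an aligned superpage. */
        void on_tlb_miss(struct region *r,
                         uint64_t miss_cost, uint64_t copy_cost)
        {
            if (r->promoted)
                return;
            r->tlb_misses++;
            if ((uint64_t)r->tlb_misses * miss_cost > copy_cost)
                r->promoted = true;   /* build the superpage (copy pages) */
        }
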
  • Speeding up irregular applications in shared-memory multiprocessors: memory binding and group prefetching

    Publication Year: 1995 , Page(s): 188 - 199
    Cited by:  Papers (3)  |  Patents (2)
    PDF (1666 KB)

    While many parallel applications exhibit good spatial locality, other important codes in areas like graph problem-solving or CAD do not. Often, these irregular codes contain small records accessed via pointers. Consequently, while the former applications benefit from long cache lines, the latter prefer short lines. One good solution is to combine short lines with prefetching. In this way, each application can exploit the amount of spatial locality that it has. However, prefetching, if provided, should also work for the irregular codes. This paper presents a new prefetching scheme that, while usable by regular applications, is specifically targeted at irregular ones: memory binding and group prefetching. The idea is to hardware-bind and prefetch together groups of data that the programmer suggests are strongly related to each other. Examples are the different fields in a record or two records linked by a permanent pointer. This prefetching scheme, combined with short cache lines, results in a memory hierarchy design that can be exploited by both regular and irregular applications. Overall, it is better to use a system with short lines (16-32 bytes) and our prefetching than a system with long lines (128 bytes) with or without our prefetching. The former system runs six out of seven SPLASH-class applications faster. In particular, some of the most irregular applications run 25-40% faster.

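    In software terms the grouping idea can be approximated as follows; here the GCC/Clang __builtin_prefetch intrinsic stands in for the paper's hardware binding support, which this sketch does not model, and the record layout is illustrative.

        /* Records linked by a "permanent pointer": the fields of the next
         * record form one group that is prefetched together, even though
         * the cache lines themselves are short. */
        struct node {
            int          key;
            int          weight;
            struct node *next;
        };

        long walk(const struct node *n)
        {
            long sum = 0;
            while (n) {
                if (n->next)
                    __builtin_prefetch(n->next);  /* fetch the group early */
                sum += n->key + n->weight;
                n = n->next;
            }
            return sum;
        }
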
  • An efficient, fully adaptive deadlock recovery scheme: DISHA

    Publication Year: 1995 , Page(s): 201 - 210
    Cited by:  Papers (11)
    PDF (1110 KB)

    This paper presents a simple, efficient, and cost-effective routing strategy that considers deadlock recovery as opposed to prevention. Performance is optimized in the absence of deadlocks by allowing maximum flexibility in routing. DISHA supports true fully adaptive routing, where all virtual channels at each node are available to packets without regard for deadlocks. Deadlock cycles, upon forming, are efficiently broken by progressively routing one of the blocked packets through a deadlock-free lane. This lane is implemented using a central "floating" deadlock buffer resource in routers, which is accessible to all neighboring routers along the path. Simulations show that the DISHA scheme results in superior performance and is extremely simple, ensuring quick recovery from deadlocks and enabling the design of fast routers.

  • Analysis and implementation of hybrid switching

    Publication Year: 1995 , Page(s): 211 - 219
    Cited by:  Papers (2)  |  Patents (1)
    PDF (950 KB)

    The switching scheme of a point-to-point network determines how packets flow through each node, and is a primary element in determining the network's performance. In this paper, we present and evaluate a new switching scheme called hybrid switching. Hybrid switching dynamically combines both virtual cut-through and wormhole switching to provide higher achievable throughput than wormhole alone, while significantly reducing the buffer space required at intermediate nodes when compared to virtual cut-through. This scheme is motivated by a comparison of virtual cut-through and wormhole switching through cycle-level simulations, and then evaluated using the same methods. To show the feasibility of hybrid switching, as well as to provide a common base for simulating and implementing a variety of switching schemes, we have designed SPIDER, a communication adapter built around a custom ASIC, the Programmable Routing Controller (PRC).

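    The per-node decision that makes the scheme hybrid can be sketched as follows, assuming a router that knows its free buffer space; the decision rule and names are illustrative, not SPIDER's actual logic.

        #include <stdbool.h>
        #include <stddef.h>

        enum action { FORWARD, BUFFER_WHOLE_PACKET, BLOCK_IN_NETWORK };

        /* Cut through when the output is free; otherwise absorb the packet
         * like virtual cut-through if it fits, and fall back to
         * wormhole-style blocking in the network only when it does not. */
        enum action route_packet(bool out_channel_free,
                                 size_t free_buffer, size_t packet_len)
        {
            if (out_channel_free)
                return FORWARD;
            if (free_buffer >= packet_len)
                return BUFFER_WHOLE_PACKET;   /* virtual cut-through case */
            return BLOCK_IN_NETWORK;          /* wormhole case            */
        }
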
  • Configurable flow control mechanisms for fault-tolerant routing

    Publication Year: 1995 , Page(s): 220 - 229
    Cited by:  Papers (4)
    PDF (1096 KB)

    Fault-tolerant routing protocols in modern interconnection networks rely heavily on the network flow control mechanisms used. Optimistic flow control mechanisms such as wormhole routing (WR) realize very good performance, but are prone to deadlock in the presence of faults. Conservative flow control mechanisms such as pipelined circuit switching (PCS) ensure the existence of a path to the destination prior to message transmission, but incur increased overhead. Existing fault-tolerant routing protocols are designed with one or the other, and must accommodate their associated constraints. This paper proposes the use of configurable flow control mechanisms. Routing protocols can then be designed such that, in the vicinity of faults, protocols use a more conservative flow control mechanism, while the majority of messages, which traverse fault-free portions of the network, utilize WR-like flow control to maximize performance. Such protocols are referred to as two-phase protocols, in which routing decisions are given some control over the operation of the virtual channels. This ability provides new avenues for optimizing message-passing performance in the presence of faults. A fully adaptive two-phase protocol is proposed and compared via simulation to protocols based on WR and PCS. The architecture of a network router supporting configurable flow control is described, and the paper concludes with avenues for future research.

  • NIFDY: a low overhead, high throughput network interface

    Publication Year: 1995 , Page(s): 230 - 241
    Cited by:  Papers (3)
    PDF (1670 KB)

    In this paper we present NIFDY, a network interface that uses admission control to reduce congestion and ensures that packets are received by a processor in the order in which they were sent, even if the underlying network delivers the packets out of order. The basic idea behind NIFDY is that each processor is allowed to have at most one outstanding packet to any other processor unless the destination processor has granted the sender the right to send multiple unacknowledged packets. Further, there is a low upper limit on the number of outstanding packets to all processors. We present results from simulations of a variety of networks (meshes, tori, butterflies, and fat trees) and traffic patterns to verify NIFDY's efficacy. Our simulations show that NIFDY increases throughput and decreases overhead. The utility of NIFDY increases as a network's bisection bandwidth decreases. When combined with the increased payload allowed by in-order delivery, NIFDY increases total bandwidth delivered for all networks. The resources needed to implement NIFDY are small and constant with respect to network size.

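    The sender-side admission rule reduces to a compact check, assuming per-destination counters; the window-granting protocol is omitted here, and the global cap of 4 is an illustrative constant, not NIFDY's actual limit.

        #include <stdbool.h>
        #include <stddef.h>

        #define NODES      64
        #define GLOBAL_CAP  4    /* low limit across all destinations */

        struct sender {
            size_t outstanding[NODES];  /* unacked packets per destination */
            size_t window[NODES];       /* granted window; 0 means default */
            size_t total;               /* packets in flight, all dests    */
        };

        /* One outstanding packet per destination unless that destination
         * has granted a larger window, and never more than the global cap. */
        bool may_send(const struct sender *s, size_t dest)
        {
            size_t w = s->window[dest] ? s->window[dest] : 1;
            return s->outstanding[dest] < w && s->total < GLOBAL_CAP;
        }

        void on_send(struct sender *s, size_t dest)
        {
            s->outstanding[dest]++; s->total++;
        }

        void on_ack(struct sender *s, size_t dest)
        {
            s->outstanding[dest]--; s->total--;
        }
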
  • Vector multiprocessors with arbitrated memory access

    Publication Year: 1995 , Page(s): 243 - 252
    Cited by:  Papers (1)
    PDF (881 KB)

    The high latency of memory accesses is one of the factors that reduces the performance of current vector supercomputers. Conflicts in the memory modules, together with collisions in the interconnection network in the multiprocessor case, significantly increase the execution time of applications. In this work we propose a memory access method for vector uniprocessors and multiprocessors that allows stream accesses to be performed with the smallest possible latency in the majority of cases. The basic idea is to arbitrate the memory access by defining the order in which the memory modules are visited; the stream elements are then requested out of order. In addition, the access method also reduces the cost of the interconnection network.

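    The visit-order idea can be illustrated with a short program, assuming a simple interleaved mapping of addresses to modules; note how the elements of a stride-3 stream are requested out of order, grouped by the module currently being visited.

        #include <stdio.h>

        #define MODULES 8

        int main(void)
        {
            unsigned base = 0, stride = 3, length = 16;

            /* Visit the modules in a fixed arbitration order; at each
             * visit, request every stream element that maps there. */
            for (unsigned m = 0; m < MODULES; m++)
                for (unsigned i = 0; i < length; i++)
                    if ((base + i * stride) % MODULES == m)
                        printf("module %u <- element %u (addr %u)\n",
                               m, i, base + i * stride);
            return 0;
        }
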
  • Design of cache memories for multi-threaded dataflow architecture

    Publication Year: 1995 , Page(s): 253 - 264
    Cited by:  Papers (3)
    PDF (1133 KB)

    Cache memories have proven their effectiveness in the von Neumann architecture when localities of reference govern the execution loci of programs. A pure dataflow program, in contrast, contains no locality of reference, since the execution sequence is enforced only by the availability of arguments. Instruction locality may be enhanced if dataflow programs are reordered. Enhancing the locality of data references in the dataflow architecture is a more challenging problem. In this paper we report our approaches to the design of instruction, data (operand), and I-Structure cache memories using the Explicit Token Store (ETS) model of dataflow systems. We present the performance results obtained using various benchmark programs.

  • Skewed associativity enhances performance predictability

    Publication Year: 1995 , Page(s): 265 - 274
    Cited by:  Papers (5)
    PDF (951 KB)

    Performance tuning becomes harder as computer technology advances. One of the factors is the increasing complexity of memory hierarchies. Most modern machines now use at least one level of cache memory. To reduce execution stalls, cache misses must be very rare. Software techniques to improve locality, such as loop blocking and copying, have been developed for numerical codes. Unfortunately, the behavior of direct-mapped and set-associative caches is still erratic when large numerical data sets are accessed: execution time can vary drastically for the same loop kernel depending on uncontrolled factors such as the leading dimension of arrays. The only software method available to improve execution-time stability is the copying of frequently used data, which is costly in execution time. Users are not usually cache organization experts; they are not aware of such phenomena and have no control over them. In this paper, we show that the recently proposed four-way skewed-associative cache yields very stable execution times and good average miss ratios on blocked algorithms. As a result, execution time is faster and much more predictable than with conventional caches. This more stable behavior also makes it possible to use larger block sizes with blocked algorithms, further reducing blocking overhead.

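    A minimal sketch of a skewed-associative lookup follows, with XOR-based index functions standing in for the proposal's actual skewing functions; block size, cache geometry, and names are illustrative.

        #include <stdbool.h>
        #include <stdint.h>

        #define WAYS      4
        #define SETS_LOG2 8
        #define SETS      (1u << SETS_LOG2)

        struct line { bool valid; uint32_t tag; };
        static struct line bank[WAYS][SETS];

        /* Each bank indexes with a different hash of the block address, so
         * two addresses that conflict in one bank rarely conflict in the
         * others; this is what stabilizes blocked-kernel execution times. */
        static uint32_t skew(uint32_t addr, unsigned way)
        {
            uint32_t a = addr >> 6;   /* strip the 64-byte block offset */
            return (a ^ (a >> (SETS_LOG2 + way))) & (SETS - 1);
        }

        bool lookup(uint32_t addr)
        {
            uint32_t tag = addr >> 6;
            for (unsigned w = 0; w < WAYS; w++) {
                struct line *l = &bank[w][skew(addr, w)];
                if (l->valid && l->tag == tag)
                    return true;
            }
            return false;
        }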