
Proceedings of the Sixth Symposium on the Frontiers of Massively Parallel Computing (Frontiers '96)

Date: 27-31 Oct. 1996


Displaying results 1-25 of 42
  • The Sixth Symposium on the Frontiers of Massively Parallel Computing [front matter]

    Page(s): iii - viii
  • Gang scheduling for highly efficient, distributed multiprocessor systems

    Page(s): 4 - 12

    We have implemented a job scheduling system for workstation clusters and massively parallel systems with highly efficient message-passing interconnects that supports space and time sharing through multiuser gang scheduling of parallel jobs. The system is available on the IBM SP-2 cluster. It is highly modular and scalable, and can easily be adapted to a variety of other MPP systems. The system supports various scheduling policies. We architect the system so that the time-sharing of processors avoids any significant serialization and extra resource consumption, yet preserves the reliability and efficiency of the high-performance communication subsystem that characterizes a dedicated, non-time-shared system.

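    The entry above describes space and time sharing via multiuser gang scheduling. As a hedged illustration of the core idea only (not the paper's implementation), the sketch below packs jobs into an Ousterhout-style scheduling matrix: rows are time slices, columns are processors, and every thread of a job is placed in the same row so the gang runs simultaneously. The job names and the first-fit placement rule are illustrative assumptions.

```python
# Minimal gang-scheduling sketch: a (time slice x processor) matrix.
def gang_schedule(jobs, num_procs):
    """jobs: list of (name, width); returns rows of length num_procs."""
    matrix = []                                  # each row is one time slice
    for name, width in jobs:
        assert width <= num_procs, "a gang must fit in one slice"
        for row in matrix:                       # first-fit over existing slices
            free = [i for i, slot in enumerate(row) if slot is None]
            if len(free) >= width:
                for i in free[:width]:
                    row[i] = name                # whole gang shares this slice
                break
        else:                                    # no slice fits: open a new one
            row = [None] * num_procs
            for i in range(width):
                row[i] = name
            matrix.append(row)
    return matrix

if __name__ == "__main__":
    # Three jobs space- and time-share four processors.
    for t, row in enumerate(gang_schedule([("A", 4), ("B", 2), ("C", 2)], 4)):
        print(f"slice {t}: {row}")
```
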
  • Integrating polling, interrupts, and thread management

    Page(s): 13 - 22

    Many user-level communication systems receive network messages by polling the network adapter from user space. While polling avoids the overhead of interrupt-based mechanisms, it is not suited for all parallel applications. This paper describes a general-purpose, multithreaded, communication system that uses both polling and interrupts to receive messages. Users need not insert polls into their code; through a careful integration of the user-level communication software with a user-level thread scheduler, the system can automatically switch between polling and interrupts. We have evaluated the performance of this integrated system on Myrinet, using a synthetic benchmark and a number of applications that have very different communication requirements. We show that the integrated system achieves robust performance: in most cases, it performs as well as or better than systems that rely exclusively on interrupts or polling.

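    As a hedged sketch of the integration described above (not the paper's Myrinet system), the loop below polls the network on every scheduling pass while threads are runnable, and falls back to a blocking, interrupt-style receive only when the processor would otherwise idle. The Network class and its poll/blocking_recv methods are illustrative stand-ins.

```python
import collections
import queue

class Network:
    """Stand-in for a user-level network interface."""
    def __init__(self):
        self.inbox = queue.Queue()
    def poll(self):                        # cheap, non-blocking check
        try:
            return self.inbox.get_nowait()
        except queue.Empty:
            return None
    def blocking_recv(self, timeout=0.1):  # models an interrupt-driven receive
        try:
            return self.inbox.get(timeout=timeout)
        except queue.Empty:
            return None

def scheduler_loop(runnable, net, handle_msg, steps=100):
    """Interleave thread quanta with opportunistic message receipt."""
    for _ in range(steps):
        msg = net.poll()                   # poll on every scheduler pass
        if msg is not None:
            handle_msg(msg)
        if runnable:
            thread = runnable.popleft()
            if thread():                   # run one quantum; True = still runnable
                runnable.append(thread)
        else:                              # idle: block instead of spinning
            msg = net.blocking_recv()
            if msg is None:
                break                      # quiescent: no threads, no messages
            handle_msg(msg)

if __name__ == "__main__":
    net = Network()
    for i in range(3):
        net.inbox.put(f"msg{i}")
    work = iter(range(5))
    worker = lambda: next(work, None) is not None
    scheduler_loop(collections.deque([worker]), net,
                   lambda m: print("received", m), steps=20)
```
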
  • Pursuing a petaflop: point designs for 100 TF computers using PIM technologies

    Page(s): 88 - 97

    This paper is a summary of a proposal submitted to the NSF 100 TeraFLOPS Point Design Study. Its main thesis is that the use of Processing-In-Memory (PIM) technology can provide an extremely dense and highly efficient base on which such computing systems can be constructed. The paper describes a strawman organization of one potential PIM chip, along with how multiple such chips might be organized into a real system, what the software supporting such a system might look like, and several applications that we will be attempting to place onto such a system.

  • The Illinois Aggressive Coma Multiprocessor project (I-ACOMA)

    Page(s): 106 - 111

    Although scalable shared-memory multiprocessors with hardware-assisted cache coherence are relatively easy to program, they still require substantial programmer effort if truly high performance is desired. For example, data must be allocated close to the processors that will use them, and the application must be tuned so that the working set fits in the caches. This is unfortunate because the most important obstacle to widespread use of parallel computing is the hardship of programming parallel machines. The goal of the I-ACOMA project is to explore how to design a highly programmable, high-performance multiprocessor. The authors focus on a flat-COMA scalable multiprocessor supported by a parallelizing compiler. The main issues that they are studying are advanced processor organizations, techniques to handle long memory access latencies, and support for important classes of workloads like databases and scientific applications with loops that cannot be analyzed by the compiler. The project also involves building a prototype that includes some of the features discussed.

  • Particle-mesh techniques on the MasPar

    Page(s): 154 - 161

    The authors investigate the most efficient implementations of the charge (mass) assignment and force interpolation tasks of a particle-in-cell code on the SIMD architecture of the MasPar MP-2. Three different approaches were tested. The first emphasized uniform computational (not necessarily communication) load balance and ease of programming. The second exploited the speed of the Xnet interprocessor communication network using a particle data migration strategy. The third used sorting and vector scan-add operations on the particle dataset to minimize the communication traffic required between the particle and mesh data structures. Algorithm efficiencies were measured as a function of the degree of spatial clustering of the particles, and as a function of the total number of particles. The sort/scan-add strategy gave the best performance over a broad range of degrees of spatial clustering; it was beaten only by the migration strategy in the regime of weak clustering. Their results indicate how a hybrid algorithm combining the migration and sort/scan-add strategies can set an upper limit on the performance degradation associated with the spatial clustering of particles.

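    The sort/scan-add strategy can be illustrated with a hedged 1-D sketch (the paper works on the MasPar MP-2's mesh, and the names below are invented): sorting particles by cell index groups colliding updates together, so a segmented scan-add accumulates each cell's total charge without concurrent writes.

```python
def assign_charge(positions, charges, num_cells, cell_size):
    """Nearest-cell charge assignment via the sort/scan-add idea."""
    cells = [int(x / cell_size) for x in positions]
    order = sorted(range(len(cells)), key=lambda i: cells[i])  # sort by cell
    mesh = [0.0] * num_cells
    running = 0.0
    for k, i in enumerate(order):        # sequential stand-in for a scan-add
        running += charges[i]
        at_segment_end = (k + 1 == len(order)
                          or cells[order[k + 1]] != cells[i])
        if at_segment_end:               # segment boundary: flush the cell total
            mesh[cells[i]] += running
            running = 0.0
    return mesh

if __name__ == "__main__":
    print(assign_charge([0.1, 0.2, 1.7, 3.9], [1.0, 2.0, 0.5, 1.0],
                        num_cells=4, cell_size=1.0))
    # -> [3.0, 0.5, 0.0, 1.0]
```
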
  • Intelligent, adaptive file system policy selection

    Page(s): 172 - 179

    Traditionally, maximizing input/output performance has required tailoring application input/output patterns to the idiosyncrasies of specific input/output systems. The authors show that one can achieve high application input/output performance via a low overhead input/output system that automatically recognizes file access patterns and adaptively modifies system policies to match application requirements. This approach reduces the application developer's input/output optimization effort by isolating input/output optimization decisions within a retargetable file system infrastructure. To validate these claims, they have built a lightweight file system policy testbed that uses a trained learning mechanism to recognize access patterns. The file system then uses these access pattern classifications to select appropriate caching strategies, dynamically adapting file system policies to changing input/output demands throughout application execution. The experimental data show dramatic speedups on both benchmarks and input/output intensive scientific applications.

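    As a hedged miniature of the approach above, the sketch below classifies a window of recent file offsets as sequential, strided, or random and maps that class to a caching/prefetching policy. The three-way classifier and the policy table are invented stand-ins for the paper's trained learning mechanism.

```python
def classify(offsets):
    """Label a window of file offsets: sequential, strided, or random."""
    if len(offsets) < 3:
        return "unknown"
    deltas = [b - a for a, b in zip(offsets, offsets[1:])]
    if all(d == deltas[0] for d in deltas):
        return "sequential" if deltas[0] == 1 else "strided"
    return "random"

POLICY = {                        # pattern -> (prefetch depth, eviction)
    "sequential": (8, "MRU"),     # deep read-ahead; blocks are rarely reused
    "strided":    (4, "MRU"),
    "random":     (0, "LRU"),     # no prefetch; protect the working set
    "unknown":    (1, "LRU"),
}

if __name__ == "__main__":
    for trace in ([1, 2, 3, 4], [10, 20, 30, 40], [5, 17, 2, 40]):
        pattern = classify(trace)
        print(trace, "->", pattern, POLICY[pattern])
```
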
  • An abstract-device interface for implementing portable parallel-I/O interfaces

    Page(s): 180 - 187

    We propose a strategy for implementing parallel I/O interfaces portably and efficiently. We have defined an abstract device interface for parallel I/O, called ADIO. Any parallel I/O API can be implemented on multiple file systems by implementing the API portably on top of ADIO, and implementing only ADIO on different file systems. This approach simplifies the task of implementing an API and yet exploits the specific high performance features of individual file systems. We have used ADIO to implement the Intel PFS interface and subsets of MPI-IO and IBM PIOFS interfaces on PFS, PIOFS, Unix, and NFS file systems. Our performance studies indicate that the overhead of using ADIO as an implementation strategy is very low.

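    The layering above can be sketched, with heavy hedging, as a small abstract-device class that each file system implements once, plus a toy user-level API routine written purely in terms of it. Method names (read_at, write_at) are illustrative; the real ADIO interface is far richer (collective operations, noncontiguous accesses, hints).

```python
import abc
import os

class ADIOFile(abc.ABC):
    """Abstract device: the only layer each file system must implement."""
    @abc.abstractmethod
    def read_at(self, offset, nbytes): ...
    @abc.abstractmethod
    def write_at(self, offset, data): ...

class UnixADIOFile(ADIOFile):
    """One backend per file system; this one targets plain Unix files."""
    def __init__(self, path):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    def read_at(self, offset, nbytes):
        return os.pread(self.fd, nbytes, offset)
    def write_at(self, offset, data):
        return os.pwrite(self.fd, data, offset)

def api_write_strided(f, start, stride, blocks):
    """A toy 'portable API' routine implemented purely on top of ADIO."""
    for i, block in enumerate(blocks):
        f.write_at(start + i * stride, block)

if __name__ == "__main__":
    f = UnixADIOFile("/tmp/adio_demo.dat")
    api_write_strided(f, 0, 8, [b"aaaa", b"bbbb"])
    print(f.read_at(0, 12))
```
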
  • Preliminary insights on shared memory PIC code performance on the Convex Exemplar SPP1000

    Page(s): 214 - 222

    We implement a 3D electrostatic particle-in-cell (PIC) code on the HP-Convex Exemplar SPP1000. Our principal goals are to identify the best PIC algorithm for this architecture and characterize its performance, and to explore whether the architecture and system software achieve an efficient and scalable shared-memory programming environment. We show that PIC codes can achieve good performance on the Exemplar. However, to achieve this performance, great care is required in minimizing long latencies to remote memory and in maximizing cache reuse. Combined, these two requirements for avoiding latency-induced performance degradation resulted in a complex programming task and significantly diminished many of the advantages that the shared-memory hardware provided towards ease of use. Our best-performing code avoided stressing the cache-coherency hardware. Best performance was achieved by storing the particle data in 'processor-local' memory blocks, and by intermittently sorting the particle data to improve processor cache reuse.

  • An interprocedural framework for determining efficient data redistributions in distributed memory machines

    Page(s): 233 - 240

    This paper presents a framework to find good distributions for the global arrays at different program points in the presence of procedure calls. The distributions are chosen for their ability to offset the redistribution overheads by contributing significantly towards increasing the performance gains. The algorithm uses interprocedural analysis and dynamic programming techniques. The working of the algorithm is demonstrated on a CFD kernel.

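    A hedged sketch of the dynamic-programming core suggested by the abstract: choose one distribution per program phase so that execution cost plus the redistribution cost between adjacent phases is minimized. The costs and the two candidate distributions are made-up illustrations, and the interprocedural part of the paper's analysis is omitted.

```python
def best_distributions(exec_cost, redist_cost):
    """exec_cost[p][d]: cost of phase p under distribution d.
    redist_cost[e][d]: cost of switching e -> d between phases."""
    ndist = len(exec_cost[0])
    best = list(exec_cost[0])       # best[d]: cheapest way to end phase 0 in d
    choices = []
    for p in range(1, len(exec_cost)):
        new, arg = [], []
        for d in range(ndist):
            cands = [best[e] + redist_cost[e][d] for e in range(ndist)]
            e = min(range(ndist), key=cands.__getitem__)
            new.append(cands[e] + exec_cost[p][d])
            arg.append(e)
        best = new
        choices.append(arg)
    d = min(range(ndist), key=best.__getitem__)
    total, seq = best[d], [d]
    for arg in reversed(choices):   # backtrack the chosen distributions
        d = arg[d]
        seq.append(d)
    return total, list(reversed(seq))

if __name__ == "__main__":
    # Three phases, two candidate distributions (say BLOCK=0, CYCLIC=1).
    exec_cost = [[4, 1], [1, 5], [1, 6]]
    redist = [[0, 2], [2, 0]]
    print(best_distributions(exec_cost, redist))   # -> (5, [1, 0, 0])
```
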
  • A fair fast distributed concurrent-reader exclusive-writer synchronization

    Page(s): 246 - 254

    Distributed synchronization is needed to arbitrate access to a shared resource in a message passing system. Reader/writer synchronization can improve efficiency and throughput if a large fraction of accesses to the shared resource are queries. In this paper, we present a highly efficient distributed algorithm that provides FCFS concurrent-reader exclusive-writer synchronization with an amortized O(log n) messages per critical section entry and O(log n) bits of storage per processor. We evaluate the new algorithm with a simulation study, comparing it to fast and low-overhead distributed mutual exclusion algorithms. We find that when the request load contains a large fraction of read locks, our algorithm provides higher throughput and lower acquisition latency than is possible with the distributed mutual exclusion algorithms, with a small increase in the number of messages passed per critical section entry. The low space and message-passing overhead and high efficiency make the algorithm scalable and practical to implement. The algorithm we present can easily be extended to give preference to readers or writers.

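    The paper's algorithm is distributed (message passing, amortized O(log n) messages per entry); as a heavily hedged shared-memory analogue of its FCFS semantics only, the sketch below serves requests in arrival order and lets consecutive readers at the head of the queue proceed together.

```python
import threading
from collections import deque

class FCFSRWLock:
    """FCFS concurrent-reader / exclusive-writer lock (shared-memory toy)."""
    def __init__(self):
        self.mutex = threading.Lock()
        self.queue = deque()          # pending ('r' or 'w', event), FCFS order
        self.active_readers = 0
        self.writer_active = False

    def _admit(self):                 # admit the queue head; batch readers
        while self.queue:
            kind, ev = self.queue[0]
            if kind == 'r' and not self.writer_active:
                self.queue.popleft(); self.active_readers += 1; ev.set()
            elif (kind == 'w' and not self.writer_active
                  and self.active_readers == 0):
                self.queue.popleft(); self.writer_active = True; ev.set()
                break                 # a writer holds the lock exclusively
            else:
                break

    def acquire(self, kind):
        ev = threading.Event()
        with self.mutex:
            self.queue.append((kind, ev))
            self._admit()
        ev.wait()

    def release(self, kind):
        with self.mutex:
            if kind == 'w':
                self.writer_active = False
            else:
                self.active_readers -= 1
            self._admit()

if __name__ == "__main__":
    lock = FCFSRWLock()
    def reader(i):
        lock.acquire('r'); print("reader", i, "in"); lock.release('r')
    def writer():
        lock.acquire('w'); print("writer in"); lock.release('w')
    threads = [threading.Thread(target=reader, args=(i,)) for i in (1, 2)]
    threads.append(threading.Thread(target=writer))
    for t in threads: t.start()
    for t in threads: t.join()
```
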
  • Lock improvement technique for release consistency in distributed shared memory systems

    Page(s): 255 - 262

    Distributed shared memory allows processes to view the physically distributed memory as a globally shared virtual memory. Lazy release consistency (LRC), among known techniques, is an efficient software model proposed for distributed shared memory. It relies heavily on lock synchronization to maintain data coherency. The lock scheme used in LRC, however, triggers many interrupt invocations on the remote processors, which in turn steal effective CPU cycles from those processors, thus prolonging the lock acquisition time and the total elapsed time of application programs. In this paper, a lock improvement technique is proposed to reduce the interrupt invocations caused by lock acquire operations, leading to reductions in the lock acquisition time and the overall program execution time. Our improvement technique was evaluated within the TreadMarks framework using four applications, where TreadMarks is a distributed shared memory system based on LRC. The experimental results indicate that our technique improves the lock acquisition time over TreadMarks on a network of workstations by more than 14% for one application.

  • A quasi-barrier technique to improve performance of an irregular application

    Page(s): 263 - 270

    A technique is developed to improve the performance of irregularly structured Gröbner basis computations on distributed memory machines. In parallel Gröbner basis computation, at every step many tasks are executed in parallel by relaxing the dependencies present in the sequential computation. In this relaxation approach, the idle time spent by processors at the kth step can be reduced by synchronizing the p processors when r out of NT_k tasks (r ≤ NT_k ≤ p) are complete, instead of all NT_k. The analysis presented in this paper shows that, in theory, the improvement in speedup can be as much as ln p when the task distribution is close to exponential. In 70-75% of the experiments carried out on the IBM SP2 and Intel Paragon, this quasi-barrier technique improved speedup.

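    A hedged sketch of the quasi-barrier itself: rather than waiting for all NT_k tasks of step k, waiting processors are released once r of them have completed. The threading-based implementation and the choice of r = 6 out of 8 below are illustrative assumptions; the paper derives the threshold from the task-time distribution.

```python
import random
import threading
import time

class QuasiBarrier:
    """Release all waiters once `threshold` of `total` tasks finish."""
    def __init__(self, total, threshold):
        assert 0 < threshold <= total
        self.threshold = threshold
        self.done = 0
        self.cv = threading.Condition()

    def task_done(self):
        with self.cv:
            self.done += 1
            if self.done >= self.threshold:
                self.cv.notify_all()

    def wait(self):
        with self.cv:
            self.cv.wait_for(lambda: self.done >= self.threshold)

if __name__ == "__main__":
    nt_k, r = 8, 6                    # release after 6 of 8 tasks at this step
    qb = QuasiBarrier(nt_k, r)
    def task():
        time.sleep(random.uniform(0.0, 0.05))   # simulate uneven task times
        qb.task_done()
    threads = [threading.Thread(target=task) for _ in range(nt_k)]
    for t in threads: t.start()
    qb.wait()                         # proceed without the slowest stragglers
    print("quasi-barrier released after", qb.done, "completions")
    for t in threads: t.join()
```
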
  • Performing BMMC permutations in two passes through the expanded delta network and MasPar MP-2

    Page(s): 282 - 289

    This paper examines routing of BMMC (bit-matrix-multiply/complement) permutations on two types of multistage interconnection networks: the expanded delta network and the global router of the MasPar MP-2. BMMC permutations are an important class of permutations that has been well-studied on various multistage networks. The class of BMMC permutations includes as subclasses Gray-code and inverse Gray-code permutations and the entire subclass of bit-permute/complement (BPC) permutations, which in turn includes matrix transpose (with power-of-2 dimensions), bit reversal, vector reversal, hypercube, and matrix reblocking permutations. There are four results in this paper. First, we use linear-algebraic techniques to derive an algorithm to perform any BMMC permutation in at most two passes on the expanded delta network. Second, we use linear-algebraic techniques to derive an algorithm to perform any BMMC permutation in at most two passes on the global router of the MasPar MP-2. Third, we use linear-algebraic and combinatorial analysis to determine the distribution of all BMMC permutations when routed naively through the MP-2 global router and show that most, but not all, BMMC permutations require only one or two passes anyway. We can apply our two-pass algorithms in those cases when naive routing requires more than two passes. Fourth, we present experimental evidence to support our analysis.

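    To make the permutation class concrete (none of the paper's routing algorithms are reproduced here), the hedged sketch below applies the defining map of a BMMC permutation: each n-bit source address x goes to Ax XOR c over GF(2), for a nonsingular n x n bit matrix A and a complement vector c. The example A, chosen purely for illustration, swaps the two halves of a 4-bit address, i.e. a small matrix transpose.

```python
def bmmc_target(x, A, c, n):
    """y = A x (+) c over GF(2); A is a list of n row bitmasks, MSB row first."""
    y = c
    for i, row in enumerate(A):
        bit = bin(row & x).count("1") & 1    # GF(2) dot product of row i with x
        y ^= bit << (n - 1 - i)
    return y

if __name__ == "__main__":
    n = 4
    # Swap the high and low 2-bit halves of each address: a 4x4 transpose.
    A = [0b0010, 0b0001, 0b1000, 0b0100]     # rows are distinct unit vectors,
    c = 0b0000                               # so A is nonsingular over GF(2)
    print([bmmc_target(x, A, c, n) for x in range(2 ** n)])
    # -> a permutation of 0..15
```
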
  • Macro-star networks: efficient low-degree alternatives to star graphs for large-scale parallel architectures

    Page(s): 290 - 297

    We propose a new class of interconnection networks called macro-star networks, which belong to the class of Cayley graphs and use the star graph as a basic building module. A macro-star network can have a node degree that is considerably smaller than that of a star graph of the same size, and a diameter that is asymptotically within a factor of 1.25 of a universal lower bound (given its node degree). We show that algorithms developed for star graphs can be emulated on suitably constructed macro-stars with asymptotically optimal slowdown. In particular, we obtain asymptotically optimal algorithms to execute the multinode broadcast and total exchange communication tasks in a macro-star network, under both the single-port and the all-port communication models.

  • Tools-supported HPF and MPI parallelization of the NAS parallel benchmarks

    Page(s): 309 - 318

    High Performance Fortran (HPF) compilers and communication libraries with the standardized Message Passing Interface (MPI) are becoming widely available, easing the development of portable parallel applications. The Annai tool environment supports programming, debugging and tuning of both HPF- and MPI-based applications. Considering code development time to be as important as final performance, we address how sequential versions of the familiar NAS parallel benchmark kernels can be expediently parallelized with appropriate tool support. While automatic parallelization of scientific applications written in traditional sequential languages remains largely impractical, Annai provides users with high-level language extensions and integrated program engineering support tools. Respectable performance and scalability in most cases are obtained with this straightforward parallelization strategy on the NEC Cenju-3 distributed-memory parallel processor even without recourse to platform-specific optimizations or major program transformations.

  • Morphological image processing on three parallel machines

    Page(s): 327 - 334

    To ensure that the parallel implementation of an algorithm performs to its maximum potential, knowledge of the specific parallel machine being used is required. The mapping of gray-scale morphological operators and a morphological filter in SIMD, MIMD, and mixed-mode environments is analyzed. The matching of several algorithmic techniques and machine features is examined analytically and experimentally. Issues considered include concurrent execution of subtasks, data layout, choice of data transfer protocols, and the mode of parallelism used. Experiments are performed using the MIMD Intel Paragon, the SIMD MasPar MP-1, and the mixed-mode PASM prototype. The analytical results and experimental procedures can be applied to other systems as well.

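    As a hedged reference point for the operators being mapped (the parallel SIMD/MIMD mappings themselves are not reproduced), the sketch below gives the sequential definitions of flat gray-scale erosion and dilation over a 3x3 window; composing them yields an opening filter that removes bright features smaller than the window.

```python
def _window_op(img, op, se=1):
    """Apply min (erosion) or max (dilation) over a (2*se+1)^2 window."""
    h, w = len(img), len(img[0])
    return [[op(img[j][i]
                for j in range(max(0, y - se), min(h, y + se + 1))
                for i in range(max(0, x - se), min(w, x + se + 1)))
             for x in range(w)] for y in range(h)]

def erode(img, se=1):
    return _window_op(img, min, se)

def dilate(img, se=1):
    return _window_op(img, max, se)

if __name__ == "__main__":
    img = [[0, 0, 0, 0],
           [0, 9, 9, 0],
           [0, 9, 9, 0],
           [0, 0, 0, 0]]
    print(dilate(erode(img)))   # opening removes the small bright blob
```
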
  • MORPH: a system architecture for robust high performance using customization (an NSF 100 TeraOps point design study)

    Page(s): 336 - 345

    Achieving 100 TeraOps performance within a ten-year horizon will require massively-parallel architectures that exploit both commodity software and hardware technology for cost efficiency. Increasing clock rates and system diameter in clock periods will make efficient management of communication and coordination increasingly critical. Configurable logic presents a unique opportunity to customize bindings, mechanisms, and policies which comprise the interaction of processing, memory, I/O and communication resources. This programming flexibility, or customizability, can provide the key to achieving robust high performance. The Multiprocessor with Reconfigurable Parallel Hardware (MORPH) uses reconfigurable logic blocks integrated with the system core to control policies, interactions, and interconnections. This integrated configurability can improve the performance of local memory hierarchy, increase the efficiency of interprocessor coordination, or better utilize the network bisection of the machine. MORPH provides a framework for exploring such integrated application-specific customizability. Rather than complicate the situation, MORPH's configurability supports component software and interoperability frameworks, allowing direct support for application-specified patterns, objects, and structures. This paper reports the motivation and initial design of the MORPH system.

  • Architecture, algorithms and applications for future generation supercomputers

    Page(s): 346 - 354

    We outline a hierarchical architecture for machines capable of over 100 TeraOps in a 10 year time-frame. The motivating factors for the design are technological feasibility and economic viability. The envisioned architecture can be built largely from commodity components. The development costs of the machine will therefore be shared by the market. To obtain sustained performance from the machine, we propose a heterogeneous programming environment for the machine. The programming environment optimally uses the power of the hierarchy. Programming models for the stronger machine models existing at the lower levels are tuned for ease of programming. Higher levels of the hierarchy place progressively greater emphasis on locality of data reference. The envisioned machine architecture requires new algorithm design methodologies. We propose to develop hierarchical parallel algorithms and scalability metrics for evaluating such algorithms. We identify three important application areas: large scale numerical simulations, problems in particle dynamics and boundary element methods, and emerging large-scale applications such as data-mining. We briefly outline the process of hierarchical algorithm design for each of these application areas.

  • Hierarchical processors-and-memory architecture for high performance computing

    Page(s): 355 - 362

    This paper outlines a cost-effective multiprocessor architecture that takes into consideration the importance of hardware and software costs as well as delivered performance in the context of real applications. The proposed architecture, HPAM, is organized as a hierarchy of processors-and-memory subsystems. Each subsystem contains a homogeneous parallel machine. Across the levels of the hierarchy, processor speeds and interconnection technology vary. The HPAM design is driven by several considerations: the observed characteristics of real applications, cost-efficiency considerations, and the need for ease of use. Rationales and the results of a preliminary study that motivated the design of this architecture are presented. These results include benchmark data that expose the advantages of HPAM over other architectures. Technology trends that support the desirability and viability of the proposed machine organization are also presented. Two classes of applications that demand 100 TeraOps computation rates and that will drive future HPAM work are discussed. Furthermore, a flexible software environment is proposed for this architecture, which facilitates several programming scenarios: automatic program translation, library-based programming, and performance-guided coding by expert programmers.

  • A low-complexity parallel system for gracious scalable performance. Case study for near PetaFLOPS computing

    Page(s): 363 - 370

    This paper presents a "point design" for an MIMD distributed shared-memory parallel computer capable of achieving gracious 100 TeraFLOPS performance with technology expected to become feasible and viable in less than a decade. Its scalability guarantees a lifetime extending well into the next century. The design takes advantage of free-space optical technologies, with simple guided-wave concepts, to produce a 1-D building block (BB) that efficiently implements a large, fully-connected system of processors. Designing fully-connected, large systems of electronic processors could be an immediate impact of optics on massively-parallel processing. A 2-D structure is proposed for the complete system, where the aforementioned 1-D BB is extended into two dimensions. This architecture behaves like a 2-D generalized hypercube, which is characterized by outstanding performance and extremely high wiring complexity that prohibits its electronic implementation. With readily available technology, a mesh of clear plastic bars in our design facilitates bit-parallel transmissions that utilize wavelength-division multiplexing and follow dedicated optical paths. Each processor is mounted on a card. Each card contains eight processors interconnected locally via an electronic crossbar. Taking advantage of higher-speed optical technologies, all eight processors share the same interface to the optical medium. Encouraging preliminary results indicate that our conservative design could have a tremendous, positive impact on massively-parallel computing in the near future.

  • Proceedings of 6th Symposium on the Frontiers of Massively Parallel Computation (Frontiers '96)

  • Largest-job-first-scan-all scheduling policy for 2D mesh-connected systems

    Page(s): 118 - 125

    In parallel computer systems generally, the scheduling of incoming jobs has been identified as an important factor in overall performance, in addition to the allocation of processors to the scheduled jobs. The authors propose an efficient scheduling scheme for two-dimensional meshes. By employing the largest-job-first and scan-all policy along with a waiting-time limit, the proposed scheme alleviates the fragmentation problem. Contrary to previous largest-job-first scheduling schemes, large jobs do not block small jobs under this scheme. As a result, the mean response time is significantly reduced, as shown by comprehensive computer simulation.

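    A hedged sketch of the policy described above: keep the queue ordered largest-first, scan the entire queue so smaller jobs can start when the largest does not fit, and move any job whose wait exceeds the limit to the front so large jobs are not starved. The simple processor-count test below stands in for the 2-D submesh allocation the paper actually performs.

```python
def schedule(queue, free_procs, now, wait_limit):
    """queue: list of dicts {'size', 'arrival'}; returns the jobs to start."""
    overdue = [j for j in queue if now - j['arrival'] > wait_limit]
    rest = sorted((j for j in queue if j not in overdue),
                  key=lambda j: -j['size'])     # largest job first
    started = []
    for job in overdue + rest:                  # scan all, not just the head
        if job['size'] <= free_procs:           # stand-in for submesh allocation
            started.append(job)
            free_procs -= job['size']
    for job in started:
        queue.remove(job)
    return started

if __name__ == "__main__":
    q = [{'size': 6, 'arrival': 0},
         {'size': 3, 'arrival': 5},
         {'size': 2, 'arrival': 6}]
    print(schedule(q, free_procs=5, now=8, wait_limit=10))
    # -> the 3- and 2-processor jobs start; the 6-processor job waits
```
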
  • Scheduling for large-scale parallel video servers

    Page(s): 126 - 133

    Parallel video servers are necessary for large-scale video-on-demand and other multimedia systems. The paper addresses the scheduling problem of parallel video servers. The authors discuss scheduling requirements. Optimal algorithms are presented for conflict-free scheduling, delay minimization, load balancing, and admission control.

  • Effect of variation in compile time costs on scheduling tasks on distributed memory systems

    Page(s): 134 - 141

    One of the major limitations of compile-time scheduling schemes is the inability to precisely determine the computation and communication costs prior to generating the schedule. The authors address the sensitivity of a given scheduling algorithm to variations in imprecisely known compile-time costs. Variations in the compile-time costs can affect the schedule in one of two ways: (i) the original schedule found by the algorithm using estimated compile-time costs does not change (the schedule is invariant), or (ii) the original schedule changes when the costs change. For the first scenario, they derive the conditions under which the schedule found by the algorithm is invariant. For cases where the schedule length changes, they introduce a measure of the sensitivity of the schedule or the scheduling algorithm, defined as the ratio of the percentage change in schedule length to the maximum allowable percentage change in a node computation cost or an edge communication cost. Through an experimental study they show that the proposed algorithm is extremely insensitive and can be used in practical scheduling situations where the compile-time costs are known only imprecisely.

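    The sensitivity measure defined above can be computed directly; the hedged sketch below uses invented numbers purely for illustration.

```python
def sensitivity(base_len, perturbed_len, cost_change_pct):
    """Ratio of % change in schedule length to the % change in one cost."""
    length_change_pct = 100.0 * (perturbed_len - base_len) / base_len
    return length_change_pct / cost_change_pct

if __name__ == "__main__":
    # A 20% error in one task's estimated cost stretched the schedule from
    # 100 to 103 time units: sensitivity 0.15, i.e. quite insensitive.
    print(sensitivity(100.0, 103.0, 20.0))
```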