
Proceedings of the Fourth International Conference on High-Performance Computing, 1997

Date: 18-21 Dec. 1997

  • Proceedings Fourth International Conference on High-Performance Computing

    Freely Available from IEEE
  • Conference Organization

    Page(s): xviii - xxiii
    Freely Available from IEEE
  • Author index

    Page(s): 539 - 541
    Freely Available from IEEE
  • FP-map-an approach to the functional pipelining of embedded programs

    Page(s): 415 - 420

    Practice shows that increasing the amount of instruction-level parallelism offered by an architecture (such as adding instruction slots to VLIW instructions) does not necessarily lead to significant performance gains; instead, high hardware costs and inefficient use of this hardware may result. Mapping embedded applications onto multiprocessor systems forms a very interesting extension to ILP. We propose a functional pipelining approach to the mapping of embedded programs written in ANSI C onto a pipeline of application-specific processors. Our novel functional pipelining algorithm has low computational complexity and was developed specifically to form the parallelization engine of a (semi-)automatic system for multiprocessor embedded system design. The paper explains the proposed algorithm and demonstrates its applicability.
  • A tight layout of the cube-connected cycles

    Page(s): 422 - 427

    F.P. Preparata and J. Vuillemin (1981) proposed the cube-connected cycles (CCC) and, in the same paper, gave an asymptotically optimal layout scheme for the CCC. We give a new layout scheme for the CCC which requires less than half the area of the Preparata-Vuillemin layout. We also give a nontrivial lower bound on the layout area of the CCC. A constant factor of 2 remains between the new layout and the lower bound. We conjecture that the new layout is optimal (minimal).
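The structure whose layout the paper optimises is easy to construct. A minimal sketch, assuming the standard Preparata-Vuillemin definition in which node (x, p) is position p on the small cycle replacing hypercube vertex x:

```python
from itertools import product

def cube_connected_cycles(d):
    """Adjacency structure of the d-dimensional CCC: node (x, p)
    is position p on the cycle replacing hypercube vertex x, and
    has two cycle neighbours plus one cube neighbour (degree 3
    for d >= 3)."""
    nodes = product(range(2 ** d), range(d))
    adj = {v: set() for v in nodes}
    for (x, p) in adj:
        adj[(x, p)].add((x, (p + 1) % d))    # cycle edge
        adj[(x, p)].add((x, (p - 1) % d))    # cycle edge
        adj[(x, p)].add((x ^ (1 << p), p))   # cube edge across dimension p
    return adj

ccc = cube_connected_cycles(3)
assert len(ccc) == 3 * 2 ** 3                          # d * 2^d nodes
assert all(len(nbrs) == 3 for nbrs in ccc.values())    # degree 3
```

The constant degree is what makes a compact VLSI layout possible at all; the function name here is illustrative, not from the paper.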
  • Probabilistic routing in wavelength-routed multistage, hypercube, and de Bruijn networks

    Page(s): 310 - 315

    Optical networks based on wavelength division multiplexing (WDM) and wavelength routing are considered to be potential candidates for the next generation of wide area networks. One of the main issues in these networks is the development of efficient routing algorithms which require a minimum number of wavelengths. We focus on the permutation routing problem in multistage WDM networks which we call 2-multinets. We present a simple, oblivious probabilistic approach which solves the permutation routing problem on 2-multinets with very high probability (in the usual theoretical sense) using O(log^2 N / log log N) wavelengths, where N is the number of nodes in the network, thereby improving the previous result due to Pankaj and Gallager (1995) that requires O(log^3 N) wavelengths. Our approach is advantageous and practical as it is simple, oblivious, and suitable for centralized as well as distributed implementations. We also note that O(log N) wavelengths suffice with good probabilistic guarantee for the case of dynamic permutation routing, where requests arrive and terminate without any relation to each other. The above results are for networks with wavelength converters, and we show that the use of converters can be eliminated at the expense of a factor of log N more wavelengths. We also show how our approach can be used to solve the dynamic permutation routing problem well (in practice), using O(1) wavelengths on the hypercube and O(log N) wavelengths on the de Bruijn network. These improve the previously known bounds of O(log N) and O(log^2 N), respectively.
  • Applying Time Warp to CPU design

    Page(s): 290 - 295

    This paper examines the similarities between Time Warp and computer architecture concepts and terminology, and the continued trend toward convergence of ideas in these two areas. Time Warp can provide a means to describe the complex mechanisms being used to enlarge the instruction execution window. Furthermore, it can extend the current mechanisms, which do not scale, in a scalable manner. The issues involved in implementing Time Warp in a CPU design are also examined and illustrated with reference to the Wisconsin Multiscalar machine and the Waikato WarpEngine. Finally, the potential performance gains of such a system are briefly discussed.
  • Compact and flexible linear-array-based implementations of a pipeline of multiprocessor modules (PMMLA) for high throughput applications

    Page(s): 296 - 301

    High throughput is required in many tasks, especially real-time applications. A logical structure of a parallel computing system for such applications is a pipeline of multiprocessor modules, referred to as a PMM. In this paper, a linear-array-based realization of the PMM, referred to as PMMLA (PMM based on Linear Array), is proposed. The main design objective is to achieve uncompromised performance with a compact and flexible hardware structure. This paper describes the organization and operation of a PMMLA, analyzes its performance in detail, and compares it to other possible implementations both theoretically and via emulation on the nCUBE/2.
  • Simultaneous multithreaded vector architecture: merging ILP and DLP for high performance

    Page(s): 350 - 357

    We show that instruction-level parallelism (ILP) and data-level parallelism (DLP) can be merged in a single simultaneous vector multithreaded architecture to execute regular vectorizable code at a performance level that cannot be achieved using either paradigm on its own. We show that the combination of the two techniques yields very high performance at a low cost and a low complexity. We show that this architecture achieves a sustained performance on numerical regular codes that is 20 times the performance that can be achieved with today's superscalar microprocessors. Moreover, we show that the architecture can tolerate very large memory latencies, of up to 100 cycles, with a relatively small performance degradation. This high performance is independent of working set size and locality considerations, since the DLP paradigm allows very efficient exploitation of a high-performance flat memory bandwidth.
  • Concurrency control of nested cooperative transactions in active DBMS

    Page(s): 4 - 9

    Active database management systems (ADBMSs) use event-condition-action (ECA) rules. Each ECA rule specifies what action is to be taken when an event occurs and the specified condition is satisfied. In this paper, we introduce a concurrency control scheme for handling nested cooperative transactions using detached-mode ECA rules of an ADBMS. A state transition model is proposed to specify different kinds of nested cooperative transactions using detached-mode ECA rules. The correctness criterion for concurrent execution of such nested cooperative transactions is stated formally. The problem of verifying correct schedules and a concurrency control mechanism are also addressed.
  • Sparse matrix decomposition with optimal load balancing

    Page(s): 224 - 229

    Optimal load balancing in sparse matrix decomposition without disturbing the row/column ordering is investigated. Exact algorithms that are both asymptotically and run-time efficient are proposed and implemented for one-dimensional (1D) striping and two-dimensional (2D) jagged partitioning. The binary search method is successfully adapted to 1D striped decomposition by deriving and exploiting a good upper bound on the value of an optimal solution. A binary search algorithm is proposed for 2D jagged partitioning by introducing a new 2D probing scheme. A new iterative refinement scheme is proposed for both 1D and 2D partitioning. The proposed algorithms are also space efficient since they only need the conventional compressed storage scheme for the given matrix, avoiding the need for a dense workload matrix in 2D decomposition. Experimental results on a wide set of test matrices show that considerably better decompositions can be obtained by using optimal load balancing algorithms instead of heuristics. The proposed algorithms are 100 times faster than a single sparse matrix-vector multiplication (SpMxV) in the 64-way 1D decompositions, on the overall average. Our jagged partitioning algorithms are only 60% slower than a single SpMxV computation in the 8×8-way 2D decompositions, on the overall average.
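The 1D striping problem above (cut rows into contiguous stripes so the most loaded stripe is as light as possible) can be illustrated with the textbook probe-and-bisect scheme. This is a minimal sketch of the binary-search idea only, not the paper's exact algorithm, which adds a derived upper bound and iterative refinement:

```python
def probe(loads, P, B):
    """Can the rows be cut into at most P contiguous stripes,
    each with total load at most B?  Greedy packing suffices."""
    stripes, current = 1, 0
    for w in loads:
        if w > B:
            return False
        if current + w > B:
            stripes, current = stripes + 1, 0
        current += w
    return stripes <= P

def optimal_striping(loads, P):
    """Minimise the bottleneck (maximum stripe load) by binary
    search on the candidate bottleneck value B."""
    lo, hi = max(loads), sum(loads)
    while lo < hi:
        mid = (lo + hi) // 2
        if probe(loads, P, mid):
            hi = mid
        else:
            lo = mid + 1
    return lo

# per-row nonzero counts of a toy sparse matrix, split into 3 stripes
assert optimal_striping([4, 1, 3, 2, 5, 2], 3) == 7
```

Each probe is O(n), and the search range [max(loads), sum(loads)] explains why a good upper bound on the optimum, as derived in the paper, shrinks the number of probes.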
  • Applications of BSR model of computation for subsegment problems

    Page(s): 126 - 131

    We investigate the BSR (Broadcast with Selective Reduction) model of computation for problems related to subsegments (basically, the inputs are arrays and the outputs are some contiguous part of the inputs), namely the MSSP (Maximal Sum Subsegment Problem, 1D and 2D versions) and LIS (Longest Increasing Sequence). The BSR model of computation, introduced by Akl et al. (1989), is more powerful than any CRCW PRAM and yet requires no more resources for implementation than an EREW PRAM. In our solution, we need to add a new reduction operator to the basic brick of the interconnection unit. The implementation of this operator is straightforward and does not require more resources than the original BSR implementation; no fundamental architectural piece of the basic model is changed.
  • An object oriented system for developing distributed applications

    Page(s): 192 - 197

    With the recent advances in communication technology and the availability of powerful desktop computers, networking has gained popularity and many applications are being moved onto the Internet. To ease the development of distributed applications, software support to facilitate coordination and communication is needed. The paper describes an object-oriented system for structured design and development of distributed applications. The basic system consists of a set of multithreaded servers, one for each site in the network, which provide some basic communication facilities. The system has been developed using Java as the programming language. We use the software support provided by this basic system to define commonly used patterns of interaction in distributed applications. We also identify several techniques for systematic composition of patterns to develop different applications. We illustrate the use of our system by defining some patterns and using them to build a sample application.
  • A high performance two dimensional scalable parallel algorithm for solving sparse triangular systems

    Page(s): 137 - 143

    Solving a system of equations of the form Tx=y, where T is a sparse triangular matrix, is required after the factorization phase in direct methods of solving systems of linear equations. A few parallel formulations have been proposed recently. The common belief in parallelizing this problem is that a parallel formulation utilizing a two-dimensional distribution of T is unscalable. We propose the first known efficient scalable parallel algorithm which uses a two-dimensional block cyclic distribution of T. The algorithm is shown to be applicable to dense as well as sparse triangular solvers. Since most of the known highly scalable algorithms employed in the factorization phase yield a two-dimensional distribution of T, our algorithm avoids the redistribution cost incurred by one-dimensional algorithms. We present the parallel runtime and scalability analyses of the proposed two-dimensional algorithm. The dense triangular solver is shown to be scalable. The sparse triangular solver is shown to be at least as scalable as the dense solver, and we show that it is optimal for one class of sparse systems. The experimental results of the sparse triangular solver show that it has good speedup characteristics and yields high performance for a variety of sparse systems.
  • Highly accurate data value prediction

    Page(s): 358 - 363

    Data dependences (data flow constraints) present a stubborn resistance to the amount of instruction-level parallelism that can be exploited from a program. Recent work has suggested that the limits imposed by data dependences can be overcome to some extent with the use of data value prediction. That is, when an instruction is fetched, its result can be predicted so that subsequent instructions that depend on the result can use the predicted value. When the correct result becomes available, all instructions that are data-dependent on that prediction can be validated. This paper investigates a variety of techniques to carry out highly accurate data value predictions. The first technique investigates the potential of using correlation in data value predictions. The second technique investigates the potential of monitoring the strides by which the results produced by different instances of an instruction change. The third technique investigates the potential of pattern-based two-level prediction schemes. The paper also presents the results of a simulation study that we conducted to verify the potential of the investigated prediction schemes. The results show that highly accurate data value predictions are possible with two of the investigated schemes.
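The second technique (monitoring strides) is the easiest to make concrete. A minimal, hypothetical stride predictor in the spirit the abstract describes, keeping a last-value/stride pair per instruction address (table organisation and names are illustrative, not from the paper):

```python
class StridePredictor:
    """Minimal per-instruction stride value predictor: for each
    instruction address remember the last result and the last
    stride, and predict last_value + stride for the next instance."""

    def __init__(self):
        self.table = {}  # pc -> (last_value, stride)

    def predict(self, pc):
        """Predicted result for the next instance, or None if
        the instruction has not been seen yet."""
        if pc not in self.table:
            return None
        last, stride = self.table[pc]
        return last + stride

    def update(self, pc, actual):
        """Called when the real result becomes available."""
        if pc in self.table:
            last, _ = self.table[pc]
            self.table[pc] = (actual, actual - last)
        else:
            self.table[pc] = (actual, 0)

p = StridePredictor()
hits = 0
for v in range(0, 40, 4):        # a load sweeping an array with stride 4
    hits += p.predict(0x400) == v
    p.update(0x400, v)
assert hits == 8                 # correct from the third instance onward
```

Address-generating instructions in loops exhibit exactly this behaviour, which is why stride schemes reach high accuracy on them.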
  • A hierarchical processor scheduling policy for distributed-memory multicomputer systems

    Page(s): 218 - 223

    Processor scheduling policies for distributed memory systems can be divided into space sharing and time sharing policies. In space sharing, the set of processors in the system is partitioned and each partition is assigned for the exclusive use of a job. In time sharing policies, on the other hand, none of the processors is given exclusively to jobs; instead, several jobs share the processors (for example, in a round-robin fashion). There are advantages and disadvantages associated with each type of policy. Typically, space sharing policies are good at low to moderate system loads and when job parallelism does not vary much, whereas at high system loads and with widely varying job parallelism, time sharing policies provide better performance. We propose a new policy, based on a hierarchical organization, that incorporates the merits of these two types of policies. The new policy is a hybrid that uses both space sharing and time sharing to achieve better performance. We demonstrate that, at most system loads of interest, the proposed policy outperforms both space sharing and time sharing policies by a wide margin.
  • Design of a parallel C language for distributed systems

    Page(s): 174 - 179

    The performance of a distributed system depends upon the efficiency of job distribution among processing nodes, as well as that of its system architecture and operating system. The paper presents an extended C language, ParaC, that supports efficient parallel programming on distributed systems. ParaC is designed to reduce the effort of job distribution in distributed programming environments. Our design includes the description of design goals for the parallel language, the definition of a programming model, and the design of the ParaC constructs. The paper also addresses the detailed design issues related to translation and finally presents our prototype.
  • A different approach to high performance computing

    Page(s): 22 - 27

    A common approach to enhance the performance of processors is to increase the number of function units which operate concurrently. We observe this development in all recent superscalar and VLIW (very long instruction word) processors. VLIWs are more easily extended to high-performance ranges because they lack much of the superscalar hardware required for dependence checking and hardware resource allocation; instead they rely on a compiler to perform these tasks. In this paper, we proceed along this line and go one step further in replacing hardware by software complexity: a new architecture is proposed which requires the scheduling and allocation of transports at compile time, instead of performing this at run time. This reduces hardware complexity and creates several new compile-time optimizations. The paper illustrates the compilation steps required, explains the concept and characteristics of the proposed architecture, and shows several measurements which confirm our belief that, especially for high-performance embedded applications, this architecture is very attractive.
  • Mapping of neural network models onto massively parallel hierarchical computer systems

    Page(s): 42 - 47

    We investigate the implementation of neural networks on massively parallel hierarchical computer systems with the hypernet topology. The proposed mapping scheme takes advantage of the inherent structure of hypernets to process multiple copies of the neural network in the different subnets, each executing a portion of the training set. Finally, the weight changes in all the subnets are accumulated to adjust the synaptic weights in all the copies. An expression is derived to estimate the time for all-to-all broadcasting, the principal mode of communication in implementing neural networks on parallel computers. This is later used to estimate the time required for the various execution phases of the neural network algorithm, and thus the speedup performance of the hypernet in implementing neural networks.
  • LIFE: a limited injection, fully adaptive, recovery-based routing algorithm

    Page(s): 316 - 321

    Networks using wormhole switching have traditionally relied upon deadlock avoidance strategies for the design of deadlock-free algorithms. The past few years have seen a rise in popularity of deadlock recovery strategies, which are based on the observation that deadlocks are quite rare in practice and happen only at or beyond the network saturation point. In fact, recovery-based routing algorithms have a higher potential performance than deadlock-avoidance-based ones, which allow less routing freedom. We present a recovery-based fully adaptive routing algorithm, LIFE, which is based on an innovative injection policy that reduces the probability of deadlocks to negligible values, under both uniform and non-uniform traffic patterns. The experimental results, obtained on an 8-ary 3-cube with 512 nodes, show that it is possible to implement true fully adaptive routing using only two virtual channels. Also, LIFE outperforms state-of-the-art avoidance- and recovery-based algorithms of the same cost in terms of both throughput and message latency under uniform traffic, and provides stable throughput under non-uniform traffic patterns.
  • Building reliable distributed programs with file operations

    Page(s): 380 - 385

    We describe a new protocol that helps the user build reliable distributed applications with file operations. Our file checkpointing and recovery protocol is designed to consistently checkpoint and recover user files with respect to the volatile state of the distributed program. Based on the protocol, a file I/O interface has been implemented as part of our Libra library for supporting fault tolerance in distributed applications. File operations are done through this interface, whereas the complexity of checkpointing and recovering user files is hidden from the application level: the checkpointing and recovery of user files are done automatically.
  • Load balancing sequences of unstructured adaptive grids

    Page(s): 212 - 217

    Mesh adaption is a powerful tool for efficient unstructured grid computations but causes load imbalance on multiprocessor systems. To address this problem, we have developed PLUM, an automatic portable framework for performing adaptive large-scale numerical computations in a message passing environment. The paper makes several important additions to our previous work. First, a new remapping cost model is presented and empirically validated on an SP2. Next, our load balancing strategy is applied to sequences of dynamically adapted unstructured grids. Results indicate that our framework is effective on many processors for both steady and unsteady problems with several levels of adaption. Additionally, we demonstrate that a coarse starting mesh produces high quality load balancing at a fraction of the cost required for a fine initial mesh. Finally, we show that the data remapping overhead can be significantly reduced by applying our heuristic processor reassignment algorithm.
  • Fast reductions on a network of workstations

    Page(s): 468 - 473

    Reduction operations are very useful in parallel and distributed computing, with applications in barrier synchronization, distributed snapshots, termination detection, global virtual time computation, etc. In the context of parallel discrete event simulations, we have previously introduced a class of adaptive synchronization algorithms based on fast reductions. We explore the implementation of fast reductions on a popular high-performance computing platform: a network of workstations. The specific platform is a set of Pentium Pro PCs running the Linux operating system, interconnected by Myrinet, a Gbps network. The general reduction model on which our synchronization algorithms are based is introduced first, followed by a description of how this model can be implemented. We discuss several design trade-offs that must be made in order to achieve the driving goal of high-speed reductions, and provide innovative algorithms to meet the correctness and performance requirements of the reduction model.
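A common way to realise such reductions so that every workstation ends up holding the global value is recursive doubling. A minimal sketch of that general pattern, simulated over a list of local values (the function name and list-based simulation are illustrative, not the authors' implementation):

```python
def butterfly_reduce(values, op):
    """Recursive-doubling all-reduce over n = 2^k processes:
    in round r, process i combines its partial value with that
    of partner i XOR r.  After log2(n) rounds every process
    holds the reduction of all n inputs."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    partial = list(values)
    r = 1
    while r < n:
        partial = [op(partial[i], partial[i ^ r]) for i in range(n)]
        r <<= 1
    return partial

# a GVT-style global minimum over eight local simulation clocks
assert butterfly_reduce([7, 3, 9, 1, 5, 8, 2, 6], min) == [1] * 8
```

The log2(n) communication rounds are what make such reductions fast enough to drive adaptive synchronization in a discrete event simulation.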
  • Parallel domain decomposition and load balancing using space-filling curves

    Page(s): 230 - 235

    Partitioning techniques based on space-filling curves have received much recent attention due to their low running time and good load balance characteristics. The basic idea underlying these methods is to order the multidimensional data according to a space-filling curve and partition the resulting one-dimensional order. However, space-filling curves are defined for points that lie on a uniform grid of a particular resolution. It is typically assumed that the coordinates of the points are representable using a fixed number of bits, and the run times of the algorithms depend upon the number of bits used. We present a simple and efficient technique for ordering arbitrary and dynamic multidimensional data using space-filling curves, and its application to parallel domain decomposition and load balancing. Our technique is based on a comparison routine that determines the relative position of two points in the order induced by a space-filling curve. The comparison routine can then be used in conjunction with any parallel sorting algorithm to effect parallel domain decomposition.
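Such a comparison routine can be sketched for the simplest space-filling curve, the Z-order (Morton) curve: two points are ordered by the dimension holding the most significant differing bit, with no interleaved key ever built. This Morton version is only an illustration; the paper's technique handles other curves and arbitrary, dynamic coordinates:

```python
import functools

def morton_less(p, q):
    """Order two integer points along the Z-order (Morton) curve
    without building the interleaved key: the outcome is decided
    in the dimension containing the most significant differing bit."""
    best_dim, best_bits = 0, 0
    for d in range(len(p)):
        diff = p[d] ^ q[d]
        if diff.bit_length() > best_bits:
            best_dim, best_bits = d, diff.bit_length()
    return p[best_dim] < q[best_dim]

def z_order(points):
    """Sort points in Z-order using only the comparison routine,
    exactly as any comparison-based (parallel) sort could."""
    cmp = lambda a, b: -1 if morton_less(a, b) else (1 if morton_less(b, a) else 0)
    return sorted(points, key=functools.cmp_to_key(cmp))

assert z_order([(3, 5), (7, 1), (0, 0), (2, 6), (1, 1)]) == \
    [(0, 0), (1, 1), (3, 5), (2, 6), (7, 1)]
```

Because only pairwise comparisons are needed, the decomposition reduces to a parallel sort, which is the point the abstract makes.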
  • Parallel program design in a visual environment

    Page(s): 198 - 203

    The great challenge in parallel computing is to make the task of programming parallel machines easy while not sacrificing the efficiency of the target code. One successful methodology is to start from a high-level specification of the functional behaviour of a program and apply a sequence of optimising transformations, tuned for a particular architecture, to generate a specification of the operational behaviour of a parallel program. We believe that visualisation is an excellent way to bring this methodology to a wider programming community. We describe an interactive visual system which integrates 3D graphics, animation, and direct manipulation techniques into a parallel programming environment.