
Parallel and Distributed Systems, IEEE Transactions on

Issue 5 • Date May 2000


  • Processor scheduling and allocation for 3D torus multicomputer systems

    Page(s): 475 - 484

    Multicomputer systems achieve high performance by utilizing a number of computing nodes. Recently, by achieving significant reductions in communication delay, the three-dimensional (3D) torus has emerged as a new candidate interconnection topology for message-passing multicomputer systems. In this paper, we propose an efficient processor allocation scheme, the scan search scheme, for the 3D torus based on a first-fit approach. The scan search scheme minimizes the average allocation time for an incoming task by effectively manipulating the 3D information on a torus as 2D information using a data structure called the CST (Coverage Status Table). Comprehensive computer simulation reveals that the allocation time of the scan search scheme is always smaller than that of the earlier scheme based on a best-fit approach. The difference gets larger as the input load increases, and it is as much as a factor of 3 for high load. To investigate the performance of the proposed scheme in different scheduling environments, we also consider a non-FCFS scheduling policy along with the typical FCFS policy. The allocation time complexity of the scan search scheme is O(LW^2H^2). This is significantly smaller than that of the existing scheme, which is O(L^4W^4H^4). Here, L, W, and H represent the length, width, and height of the 3D torus, respectively.
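
To make the first-fit idea concrete, here is a minimal brute-force sketch in Python. It is not the scan search scheme from the abstract and does not use the CST; it simply tries every base node of an L x W x H torus in order and takes the first free l x w x h block, with wraparound on all three axes. The function name, the `busy` set, and the tuple layouts are assumptions made for illustration.

```python
from itertools import product


def first_fit_allocate(busy, dims, request):
    """Return the base node of the first free l x w x h block, or None.

    busy    : set of (x, y, z) nodes already allocated (updated in place)
    dims    : (L, W, H) size of the torus
    request : (l, w, h) size of the requested block
    """
    L, W, H = dims
    l, w, h = request
    if l > L or w > W or h > H:
        return None  # the request cannot fit in this torus at all
    for bx, by, bz in product(range(L), range(W), range(H)):
        # Enumerate the block's nodes; the modulo gives torus wraparound.
        nodes = [((bx + dx) % L, (by + dy) % W, (bz + dz) % H)
                 for dx, dy, dz in product(range(l), range(w), range(h))]
        if not any(node in busy for node in nodes):
            busy.update(nodes)  # first fit: take the first free block found
            return (bx, by, bz)
    return None
```

This exhaustive search checks up to L*W*H base nodes and l*w*h nodes per base, which is the kind of cost a table-driven scheme such as the one described above aims to avoid.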

    Full text access may be available. Click article title to sign in or learn about subscription options.
  • An efficient recognition-complete processor allocation strategy for k-ary n-cube multiprocessors

    Page(s): 485 - 490

    Composed of various topologies, the k-ary n-cube system is desirable for accepting and executing topologically different tasks. To utilize its large amount of processor resources, several allocation strategies have been reported, each with certain restrictions that affect performance. For improvement, we propose a new allocation strategy for the k-ary n-cubes. The proposed strategy is an extension of the TC strategy for hypercubes and is able to recognize all subcubes with different topologies requested by tasks. Complexity analysis and performance comparison between related strategies are provided to demonstrate their advantages and disadvantages. Simulation results show that with full subcube recognition ability and no internal fragmentation, our strategy always exhibits better performance.

  • Quantitative characterization and analysis of the I/O behavior of a commercial distributed-shared-memory machine

    Page(s): 509 - 526

    This paper presents a unified evaluation of the I/O behavior of a commercial clustered DSM machine, the HP Exemplar. Our study has the following objectives: 1) To evaluate the impact of different interacting system components, namely, architecture, operating system, and programming model, on the overall I/O behavior and identify possible performance bottlenecks, and 2) To provide hints to the users for achieving high out-of-box I/O throughput. We find that for the DSM machines that are built as a cluster of SMP nodes, integrated clustering of computing and I/O resources, both hardware and software, is not advantageous for two reasons. First, within an SMP node, the I/O bandwidth is often restricted by the performance of the peripheral components and cannot match the memory bandwidth. Second, since the I/O resources are shared as a global resource, the file-access costs become nonuniform and the I/O behavior of the entire system, in terms of both scalability and balance, degrades. We observe that the buffered I/O performance is determined not only by the I/O subsystem, but also by the programming model, global-shared memory subsystem, and data-communication mechanism. Moreover, programming-model support can be used effectively to overcome the performance constraints created by the architecture and operating system. For example, on the HP Exemplar, users can achieve high I/O throughput by using features of the programming model that balance the sharing and locality of the user buffers and file systems. Finally, we believe that at present, the I/O subsystems are being designed in isolation, and there is a need for mending the traditional memory-oriented design approach to address this problem.

  • Minimizing communication in the bitonic sort

    Page(s): 459 - 474

    This paper presents bitonic sorting schemes for special-purpose parallel architectures such as sorting networks and for general-purpose parallel architectures such as SIMD and/or MIMD computers. First, bitonic sorting algorithms for shared-memory SIMD and/or MIMD computers are developed. Shared-memory accesses through the interconnection network of shared-memory SIMD and/or MIMD computers can be very time consuming. A scheme is introduced which reduces the number of such accesses. This scheme is based on the parity strategy, which is the main idea of the paper. By reducing the communication through the network, a performance improvement is achieved. Second, a recirculating bitonic sorting network is presented, which is composed of one level of N/2 comparators plus an Ω-network of (log N - 1) switch levels. This network reduces the cost complexity to O(N log N) compared with the O(N log^2 N) of the original bitonic sorting network, while preserving the same time complexity. Finally, a simplified multistage bitonic sorting network is presented. For simplifying the interlevel wiring, the parity strategy is used, so N/2 keys are wired straight through the network.
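
For reference, the original bitonic sorting network that the abstract compares against can be simulated sequentially. The Python sketch below is the textbook network only; it does not implement the parity strategy or the recirculating network from the paper, and the function name and returned comparator count are illustrative assumptions. Counting one comparator per compare-exchange gives (N/2) * log N * (log N + 1) / 2, which is the O(N log^2 N) cost quoted above.

```python
def bitonic_sort(keys):
    """Sort a power-of-two-length sequence with the classic bitonic network.

    Returns (sorted_list, compare_exchanges), where compare_exchanges is
    the number of comparators the network would use.
    """
    a = list(keys)
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    compare_exchanges = 0
    k = 2                      # size of the bitonic sequences being merged
    while k <= n:
        j = k // 2             # distance between compared elements
        while j >= 1:
            for i in range(n):
                partner = i ^ j
                if partner > i:            # visit each pair once
                    ascending = (i & k) == 0
                    # Swap when the pair is out of order for its direction.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
                    compare_exchanges += 1
            j //= 2
        k *= 2
    return a, compare_exchanges
```

For N = 8 the network has 6 levels of 4 comparators each, so the count returned is 24.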

  • On performance prediction of parallel computations with precedent constraints

    Page(s): 491 - 508

    Performance analysis of concurrent executions in parallel systems has been recognized as a challenging problem. The aim of this research is to study approximate but efficient solution techniques for this problem. We model the structure of a parallel machine and the structure of the jobs executing on such a system. We investigate rich classes of jobs, which can be expressed by series, parallel-and, parallel-or, and probabilistic-fork. We propose an efficient performance prediction method for these classes of jobs running on a parallel environment which is modeled by a standard queueing network model. The proposed prediction method is computationally efficient: it has polynomial complexity in both time and space. The time complexity is O(C^2N^2K) and the space complexity is O(C^2N^2K), where C is the number of job classes in the system, the number of tasks in each job class is O(N), and K is the number of service centers in the queueing model. The accuracy of the approximate solution is validated via simulation.

  • A class of highly scalable optical crossbar-connected interconnection networks (SOCNs) for parallel computing systems

    Page(s): 444 - 458

    A class of highly scalable interconnect topologies called the Scalable Optical Crossbar-Connected Interconnection Networks (SOCNs) is proposed. This proposed class of networks combines the use of tunable Vertical Cavity Surface Emitting Lasers (VCSEL's), Wavelength Division Multiplexing (WDM) and a scalable, hierarchical network architecture to implement large-scale optical crossbar based networks. A free-space and optical waveguide-based crossbar interconnect utilizing tunable VCSEL arrays is proposed for interconnecting processor elements within a local cluster. A similar WDM optical crossbar using optical fibers is proposed for implementing intercluster crossbar links. The combination of the two technologies produces large-scale optical fan-out switches that could be used to implement relatively low cost, large scale, high bandwidth, low latency, fully connected crossbar clusters supporting up to hundreds of processors. An extension of the crossbar network architecture is also proposed that implements a hybrid network architecture that is much more scalable. This could be used to connect thousands of processors in a multiprocessor configuration while maintaining a low latency and high bandwidth. Such an architecture could be very suitable for constructing relatively inexpensive, highly scalable, high bandwidth, and fault-tolerant interconnects for large-scale, massively parallel computer systems. This paper presents a thorough analysis of two example topologies, including a comparison of the two topologies to other popular networks. In addition, an overview of a proposed optical implementation and power budget is presented, along with analysis of proposed media access control protocols and corresponding optical implementation.

  • A systolic image difference algorithm for RLE-compressed images

    Page(s): 433 - 443

    A new systolic algorithm which computes image differences in run-length encoded (RLE) format is described. The binary image difference operation is commonly used in many image processing applications including automated inspection systems, character recognition, fingerprint analysis, and motion detection. The efficiency of these operations can be improved significantly with the availability of a fast systolic system that computes the image difference as described in this paper. It is shown that for images with a high similarity measure, the time complexity of the systolic algorithm is small and, in some cases, constant with respect to the image size. A formal proof of correctness for the algorithm is also given.
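
The operation itself, a binary image difference computed directly on run-length data, can be illustrated with a short sequential sketch in Python. This is not the systolic algorithm from the paper. It assumes, for illustration only, that one image row is stored as a list of (start, length) runs of foreground pixels, and it treats the difference as a pixelwise XOR. The function name `rle_xor` is hypothetical.

```python
from collections import Counter


def rle_xor(row_a, row_b):
    """XOR two binary image rows given as lists of (start, length) runs.

    Each run toggles the pixel value at its start and again just past its
    end. A position toggled an even number of times across both rows
    cancels out, so only odd-count positions bound runs of the result.
    """
    toggles = Counter()
    for start, length in list(row_a) + list(row_b):
        toggles[start] += 1
        toggles[start + length] += 1
    edges = sorted(pos for pos, count in toggles.items() if count % 2 == 1)
    # Consecutive pairs of surviving edges delimit the runs of the result.
    return [(edges[i], edges[i + 1] - edges[i])
            for i in range(0, len(edges), 2)]
```

Runs that are identical in both rows cancel entirely, so the output size depends on how much the rows differ rather than on the row width, which is in the spirit of the similarity-dependent cost described above.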


Aims & Scope

IEEE Transactions on Parallel and Distributed Systems (TPDS) is published monthly. It publishes a range of papers, comments on previously published papers, and survey articles that deal with the parallel and distributed systems research areas of current importance to our readers.


Meet Our Editors

Editor-in-Chief
David Bader
College of Computing
Georgia Institute of Technology