IEEE Transactions on Parallel and Distributed Systems

Issue 2 • February 2003

  • Symbolic performance modeling of parallel systems

    Page(s): 154 - 165

    Performance prediction is an important engineering tool that provides valuable feedback on design choices in program synthesis and machine architecture development. We present an analytic performance modeling approach aimed at minimizing prediction cost while providing sufficient prediction accuracy to enable major code and data mapping decisions. Our approach is based on a performance simulation language called PAMELA. Apart from simulation, PAMELA features a symbolic analysis technique that enables PAMELA models to be compiled into symbolic performance models that trade prediction accuracy for the lowest possible solution cost. We demonstrate our approach through a large number of theoretical and practical modeling case studies, including six parallel programs and two distributed-memory machines. The average prediction error of our approach is less than 10 percent, while the average worst-case error is limited to 50 percent. This accuracy is sufficient to correctly select the best coding or partitioning strategy. For programs expressed in a high-level, structured programming model, such as data-parallel programs, symbolic performance modeling can be entirely automated. We report on experiments with a PAMELA model generator built within a data-parallel compiler for distributed-memory machines. Our results show that, with negligible program annotation, symbolic performance models are automatically compiled in seconds, while their solution cost is on the order of milliseconds.

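The flavor of such symbolic models can be conveyed with a toy closed-form cost function. The sketch below is hand-written Python rather than PAMELA or its compiled output, and every machine constant in it is hypothetical:

```python
import math

def block_partition_time(N, P, t_flop=1e-8, t_setup=1e-4, t_word=1e-7):
    """Toy symbolic cost model: predicted runtime of an N-element
    data-parallel loop on P processors with one halo exchange per step.
    The cost terms (t_flop, t_setup, t_word) are hypothetical machine
    parameters, not values from the paper."""
    compute = math.ceil(N / P) * t_flop                   # balanced work
    exchange = 0.0 if P == 1 else 2 * (t_setup + t_word)  # boundary traffic
    return compute + exchange

# Evaluating the closed form costs microseconds, which is what lets a
# compiler compare mappings that a simulator would take seconds to assess.
N = 1_000_000
best_P = min((2 ** k for k in range(8)), key=lambda P: block_partition_time(N, P))
print(best_P, block_partition_time(N, best_P))
```
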
  • Nonblocking k-fold multicast networks

    Page(s): 131 - 141

    Multicast communication involves transmitting information from a single source to multiple destinations and is a requirement in high-performance networks. Current trends in networking applications indicate an increasing demand for multicast capability in future networks. Many multicast applications require not only multicast capability, but also predictable communication performance, such as guaranteed multicast latency and bandwidth. In this paper, we present a design for a nonblocking k-fold multicast network, in which any destination node can be involved in up to k simultaneous multicast connections in a nonblocking manner. We also develop an efficient routing algorithm for the network. We show that a k-fold multicast network has significantly lower network cost than k copies of an ordinary 1-fold multicast network and is a cost-effective choice for supporting arbitrary multicast communication.

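The defining k-fold property, that no destination participates in more than k concurrent multicasts, can be captured as a simple admission invariant. A minimal bookkeeping sketch, which models only the invariant and not the paper's switching fabric or routing algorithm:

```python
from collections import defaultdict

class KFoldMulticastState:
    """Toy admission bookkeeping for the k-fold property: any destination
    may appear in at most k simultaneous multicast connections."""

    def __init__(self, k):
        self.k = k
        self.fanin = defaultdict(int)  # destination -> active connections

    def can_admit(self, destinations):
        return all(self.fanin[d] < self.k for d in destinations)

    def admit(self, destinations):
        if not self.can_admit(destinations):
            raise RuntimeError("would exceed the k-fold bound")
        for d in destinations:
            self.fanin[d] += 1

    def release(self, destinations):
        for d in destinations:
            self.fanin[d] -= 1

net = KFoldMulticastState(k=2)
net.admit({1, 2, 3})
net.admit({2, 3})
print(net.can_admit({3}))  # False: node 3 is already in 2 multicasts
```
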
  • A pipeline-based approach for scheduling video processing algorithms on NOW

    Page(s): 119 - 130

    Network of Workstations (NOW) platforms built from off-the-shelf workstations and networking hardware have become a cost-effective, scalable, and flexible platform for video processing applications. Still, one has to manually schedule an algorithm onto the available processors of the NOW to make efficient use of the resources. This approach is time-consuming and impractical for a video processing system that must perform a variety of different algorithms, with new algorithms being constantly developed. Improved support for program development is necessary before the full benefits of parallel architectures can be realized for video processing applications. Toward this goal, an automatic compile-time scheduler has been developed to schedule input tasks of video processing applications with precedence constraints onto available processors. The scheduler exploits both spatial (parallelism) and temporal (pipelining) concurrency to make the best use of machine resources. Two scheduling problems are addressed. First, given a task graph and a desired throughput, a schedule is constructed to achieve the desired throughput with the minimum number of processors. Second, given a task graph and a finite set of available resources, a schedule is constructed such that the throughput is maximized while meeting the resource constraints. Results from simulations show that the scheduler and the proposed optimization techniques effectively tackle these problems by maximizing processor utilization. A code generator has been developed to generate parallel programs automatically. The tools developed in this paper make it much easier for a programmer to develop video processing applications on these parallel architectures.

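The first problem, meeting a target throughput with as few processors as possible, reduces for a linear task chain to packing consecutive tasks into stages whose total time fits within the period 1/throughput. A greedy sketch under that simplifying chain assumption (the paper schedules general precedence graphs and can also split work across processors within a stage):

```python
def greedy_pipeline(task_times, target_throughput):
    """Pack a linear chain of tasks into pipeline stages so that each
    stage's total service time fits within the period 1/throughput,
    using one processor per stage. A sketch for chains only."""
    period = 1.0 / target_throughput
    if max(task_times.values()) > period:
        raise ValueError("a single task already exceeds the target period")
    stages, current, load = [], [], 0.0
    for name, t in task_times.items():
        if load + t > period:          # stage is full: start a new one
            stages.append(current)
            current, load = [], 0.0
        current.append(name)
        load += t
    stages.append(current)
    return stages

# Hypothetical per-frame task times in milliseconds; period = 8 ms.
tasks = {"decode": 4.0, "filter": 3.0, "motion": 6.0, "encode": 5.0}
print(greedy_pipeline(tasks, target_throughput=1 / 8.0))
# -> [['decode', 'filter'], ['motion'], ['encode']]  (3 processors)
```
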
  • Analytic evaluation of shared-memory architectures

    Page(s): 166 - 180

    This paper develops and validates an efficient analytical model for evaluating the performance of shared-memory architectures with ILP processors. First, we instrument the SimOS simulator to measure the parameters for such a model and find a surprisingly high degree of processor memory-request heterogeneity in the workloads. Examining the model parameters provides insight into application behaviors and how they interact with the system. Second, we create a model that captures such heterogeneous processor behavior, which is important for analyzing memory system design tradeoffs. Highly bursty memory request traffic and lock contention are also modeled in a significantly more robust way than in previous work. With these features, the model is applicable to a wide range of architectures and applications. Although the features increase the model's complexity, it is a useful design tool because the size of the model's input parameter set remains manageable, and the model is still several orders of magnitude faster to solve than detailed simulation. Validation results show that the model is highly accurate, producing heterogeneous per-processor throughputs that are generally within 5 percent and, for the workloads validated, always within 13 percent of the values measured by detailed simulation with SimOS. Several examples illustrate applications of the model to studying architectural design issues and the interactions between the architecture and the application workloads.

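As a rough illustration of how such a model is solved, the sketch below iterates a toy fixed point with per-processor CPIs and miss rates (the heterogeneity the paper highlights) and an M/M/1-style contention term. All parameter values are hypothetical, and the paper's actual model is substantially richer:

```python
def per_processor_ipc(cpis, miss_rates, t_mem0=100.0, service=20.0, iters=200):
    """Toy fixed-point contention model, far simpler than the paper's:
    each processor i has its own core CPI and miss rate (heterogeneous
    behavior), and memory latency grows with the aggregate miss traffic
    through an M/M/1-style queueing term. All units are cycles."""
    latency = t_mem0
    for _ in range(iters):
        # Instruction throughput of each processor at the current latency.
        ipc = [1.0 / (cpis[i] + miss_rates[i] * latency)
               for i in range(len(cpis))]
        # Aggregate misses per cycle determine the queueing delay.
        lam = sum(ipc[i] * miss_rates[i] for i in range(len(cpis)))
        rho = min(lam * service, 0.99)
        latency = t_mem0 + service / (1.0 - rho)
    return ipc

# Four hypothetical processors with heterogeneous CPIs and miss rates.
print(per_processor_ipc(cpis=[1.0, 1.2, 0.9, 1.5],
                        miss_rates=[0.02, 0.05, 0.01, 0.03]))
```
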
  • A comparison of three artificial life techniques for reporting cell planning in mobile computing

    Page(s): 142 - 153

    Location management is an important and complex problem in today's mobile computing environments. There is a need for algorithms that capture this complexity yet can be easily implemented and used to solve a wide range of location management scenarios. Artificial life techniques have recently been used to solve a wide range of complex problems. Their power stems from their ability to search very efficiently the large search spaces that arise in many combinatorial optimization problems. This paper compares several well-known artificial life techniques to gauge their suitability for solving location management problems. Due to their popularity and robustness, a genetic algorithm (GA), tabu search (TS), and an ant colony algorithm (ACA) are used to solve the reporting cell planning problem. In the reporting cell location management scheme, some cells in the network are designated as reporting cells; mobile terminals update their positions (location update) upon entering one of these reporting cells. To create such a planner, a GA, TS, and several different ant colony algorithms are implemented. The effectiveness of each algorithm is shown for a number of test problems.

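All three metaheuristics search over the same 0/1 decision vector (which cells are reporting cells) and score candidates with the same objective. A common formulation in the reporting-cells literature, hedged here because it is not necessarily the paper's exact definition, charges a weighted cost for location updates into reporting cells plus a paging cost proportional to each cell's vicinity:

```python
def reporting_cells_cost(is_reporting, w_move, w_call, vicinity, beta=0.4):
    """One common reporting-cells objective (an illustration, not
    necessarily the paper's exact formulation).
    is_reporting[i] -- 1 if cell i is a reporting cell
    w_move[i]       -- movement weight (entries into cell i)
    w_call[i]       -- call-arrival weight of cell i
    vicinity[i]     -- number of cells paged for a call arriving in cell i
    beta            -- cost of a location update relative to a page"""
    n = len(is_reporting)
    update = sum(w_move[i] for i in range(n) if is_reporting[i])
    paging = sum(w_call[i] * vicinity[i] for i in range(n))
    return beta * update + paging

# A GA, tabu search, or ant colony algorithm would each minimize this
# function over the is_reporting vector. Example values are made up.
print(reporting_cells_cost([1, 0, 0, 1], [5, 2, 3, 4], [1, 1, 2, 1], [2, 3, 3, 2]))
```
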
  • When the herd is smart: Aggregate behavior in the selection of job request

    Page(s): 181 - 192

    In most parallel supercomputers, submitting a job for execution involves specifying how many processors are to be allocated to the job. When the job is moldable (i.e., there is a choice of how many processors the job uses), an application scheduler called SA can significantly improve job performance by automatically selecting how many processors to use. Since most jobs are moldable, this result has great impact on the current state of practice in supercomputer scheduling. However, the widespread use of SA can change the nature of the workload processed by supercomputers. When many SAs are scheduling jobs on one supercomputer, the decision made by one SA affects the state of the system, thereby impacting the other instances of SA. In this case, the global behavior of the system emerges from the aggregate behavior of all SAs. In particular, it is reasonable to expect the competition for resources to become tougher with multiple SAs, and this tougher competition to decrease the performance improvement attained by each SA individually. This paper investigates this very issue. We found that the increased competition indeed makes it harder for each individual instance of SA to improve job performance. Nevertheless, two other aggregate behaviors override the increased competition when the system load is moderate to heavy. First, as load goes up, SA chooses smaller requests, which increases efficiency and effectively decreases the offered load, mitigating long wait times. Second, better job packing and fewer jobs in the system make it easier for incoming jobs to fit into the supercomputer schedule, reducing wait times further. As a result, under moderate to heavy load, a single instance of SA benefits from the fact that other jobs are also using SA.

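SA's core decision, picking the request size, can be caricatured as minimizing estimated turnaround: the predicted wait time for n processors plus the job's runtime on n. A toy sketch with made-up wait and speedup models (not SA's actual estimators):

```python
def choose_request(work, max_n, est_wait, speedup):
    """Toy SA-style choice: request the processor count n that minimizes
    estimated turnaround = wait(n) + runtime(n). The est_wait and speedup
    models are caller-supplied and hypothetical here."""
    def turnaround(n):
        return est_wait(n) + work / speedup(n)
    return min(range(1, max_n + 1), key=turnaround)

# Made-up models: waiting grows with request size (large slots are
# scarcer), speedup is Amdahl-style with a 5 percent serial fraction.
est_wait = lambda n: 10.0 * n
speedup = lambda n: 1.0 / (0.05 + 0.95 / n)
print(choose_request(work=10_000.0, max_n=64, est_wait=est_wait, speedup=speedup))
# As est_wait steepens (heavier load), the chosen n shrinks, which is the
# "smaller requests under load" aggregate behavior the abstract describes.
```
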
  • Algorithms for supporting compiled communication

    Page(s): 107 - 118

    We investigate compiler algorithms to support compiled communication in multiprocessor environments and study the benefits of compiled communication, assuming that the underlying network is an all-optical time-division-multiplexing (TDM) network. We present an experimental compiler, E-SUIF, that supports compiled communication for High Performance Fortran (HPF)-like programs on all-optical TDM networks, and we describe and evaluate the compiler algorithms used in E-SUIF. We further demonstrate the effectiveness of compiled communication on all-optical TDM networks by comparing its performance with that of a traditional communication method on a number of application programs.

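The premise of compiled communication is that communications whose endpoints are known at compile time can have their network state, here TDM connections, established once ahead of time rather than per message. A schematic sketch of that phase split (a hypothetical representation, not E-SUIF's actual analysis or IR):

```python
def split_communications(comms):
    """Schematic compiled-communication split (hypothetical format):
    messages whose (src, dst) pair is statically known share one
    pre-established TDM connection; the rest fall back to dynamic
    communication at run time."""
    static_phase, dynamic_phase = [], []
    for c in comms:
        if c["src"] is not None and c["dst"] is not None:  # known at compile time
            static_phase.append(c)
        else:
            dynamic_phase.append(c)
    # One connection set-up, amortized over every message in the phase.
    schedule = sorted({(c["src"], c["dst"]) for c in static_phase})
    return schedule, dynamic_phase

comms = [{"src": 0, "dst": 1}, {"src": 0, "dst": 1}, {"src": None, "dst": 3}]
print(split_communications(comms))
# -> ([(0, 1)], [{'src': None, 'dst': 3}])
```
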
  • Accountable Web-computing

    Page(s): 97 - 106

    Web-based computing (WBC) is a modality of collaborative computing in which "volunteers" register at a Web site, receive one (usually compute-intensive) task to compute at each visit, and return the results from that task at the subsequent visit. The security of a WBC project is enhanced if the owner of the Web site can easily keep track of which "volunteer" computed which tasks, thereby endowing the project with accountability. We develop a framework for constructing computationally lightweight schemes for endowing WBC projects with accountability. The framework is built around the notion of a directly computed task allocation function (TAF) that reserves a dedicated subset of the Web site's tasks for each "volunteer." We show how TAFs simplify the data structures needed to link "volunteers" with their tasks, even when "volunteers" are allowed to join and leave the WBC project dynamically. We then design a methodology for constructing easily computed TAFs that enhance the efficiency of the accountability scheme.

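A directly computed TAF gives each "volunteer" a disjoint, cheaply computable slice of the task set. One simple interleaving scheme, shown as an illustration rather than the paper's construction:

```python
def taf(volunteer_id, j, num_slots):
    """Toy directly computed task-allocation function: the j-th task of
    volunteer v is task j*num_slots + v, so the slices are pairwise
    disjoint and membership is checkable with a single modulus. This
    interleaving is an illustration; the paper designs more flexible and
    efficient TAF families."""
    return j * num_slots + volunteer_id

def owner_of(task_id, num_slots):
    """Accountability check: recover which volunteer slot a task belongs to."""
    return task_id % num_slots

# Volunteer 3's first tasks, with 8 volunteer slots:
print([taf(3, j, 8) for j in range(4)])  # [3, 11, 19, 27]
print(owner_of(19, 8))                   # 3
```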

Aims & Scope

IEEE Transactions on Parallel and Distributed Systems (TPDS) is published monthly. It publishes a range of papers, comments on previously published papers, and survey articles that deal with the parallel and distributed systems research areas of current importance to our readers.

Meet Our Editors

Editor-in-Chief
David Bader
College of Computing
Georgia Institute of Technology