Parallel and Distributed Systems, IEEE Transactions on

Issue 4 • Date Jul 1992

  • SNAP: a marker-propagation architecture for knowledge processing

    Page(s): 397 - 410

    The semantic network array processor (SNAP), a highly parallel architecture targeted to artificial intelligence applications, and in particular natural language understanding, is presented. The knowledge is represented in the form of a semantic network. The knowledge base is distributed among the elements of the SNAP array, and the processing is performed locally where the knowledge is stored. A set of powerful instructions specific to knowledge processing is implemented directly in hardware. SNAP is packaged into 256 custom-designed chips assembled on four printed circuit boards and can store a 16 K node semantic network. SNAP is a marker propagation architecture in which the movement of markers between cells is controlled by propagation rules. Various reasoning mechanisms are implemented with these marker propagation rules.

  • Partitioning and labeling of loops by unimodular transformations

    Page(s): 465 - 476

    A general method for the identification of the independent subsets in loops with constant dependence vectors is presented. It is shown that the dependence relation remains invariant under a unimodular transformation. Then a unimodular transformation is used to bring the dependence matrix into a form where the independent subsets are obtained by a direct and inexpensive partitioning algorithm. This leads to a procedure for the automatic conversion of a serial loop into a nest of parallel DO-ALL loops. Another unimodular transformation results in an algorithm to label the dependent iterations of an n-fold nested loop in O(n^2) time. This provides a multithreaded dynamic scheduling scheme requiring only one fork and one join primitive.

  • Performance measurement intrusion and perturbation analysis

    Page(s): 433 - 450

    The authors study the instrumentation perturbations of software event tracing on the Alliant FX/80 vector multiprocessor in sequential, vector, concurrent, and vector-concurrent modes. Based on experimental data, they derive a perturbation model that can approximate true performance from instrumented execution. They analyze the effects of instrumentation coverage (i.e., the ratio of instrumented to executed statements), source level instrumentation, and hardware interactions. The results show that perturbations in execution times for complete trace instrumentations can exceed three orders of magnitude. With appropriate models of performance perturbation, these perturbations in execution time can be reduced to less than 20% while retaining the additional information from detailed traces. In general, it is concluded that it is possible to characterize perturbations through simple models. This permits more detailed, accurate instrumentation than traditionally believed possible.

  • Reliable distributed sorting through the application-oriented fault tolerance paradigm

    Page(s): 411 - 420

    A fault-tolerant parallel sorting algorithm developed using the application-oriented fault tolerance paradigm is presented. The algorithm is tolerant of one processor/link failure in an n-cube. The addition of reliability to the sorting algorithm results in a performance penalty. Asymptotically, the fault-tolerant algorithm is less costly than host sorting. Experimentally it is shown that fault-tolerant sorting quickly becomes more efficient than host sorting when the bitonic sort/merge is considered. The main contribution is the demonstration that the application-oriented fault tolerance paradigm is applicable to problems of a noniterative-convergent nature.

  • Performance measurement and trace driven simulation of parallel CAD and numeric applications on a hypercube multicomputer

    Page(s): 451 - 464

    The performance evaluation, workload characterization, and trace-driven simulation of a hypercube multicomputer running realistic workloads are presented. Eleven representative parallel applications were selected as benchmarks. Software monitoring techniques were then used to collect execution traces. Based on the measurement results, both the computation and communication behavior of these parallel programs were investigated. The various time interval distributions were modeled by statistical functions which were verified by a nonlinear regression technique using the empirical data. The temporal and spatial localities of message destinations were also studied. A model for the temporal locality of message length was introduced and used to analyze the communication traces. A trace-driven simulation environment, which uses the communication patterns of the parallel programs as inputs, was developed to study the behavior of the communication hardware under real workloads. Simulation results on DMA and link utilizations are reported.

  • Implementation of production systems on message-passing computers

    Page(s): 477 - 487

    The authors examine the suitability of message-passing computers for parallel implementations of production systems. Two mappings for production systems on these computers, one targeted toward fine-grained message-passing machines and the other targeted toward medium-grained machines, are presented. Simulation results for the medium-grained mapping are presented, and it is shown that it is possible to exploit the available parallelism and to obtain reasonable speedups. The authors perform a detailed analysis of the results and suggest solutions for some of the problems.

  • Synchronization and communication costs of loop partitioning on shared-memory multiprocessor systems

    Page(s): 505 - 512

    The author presents strategies for static loop decomposition and scheduling as well as computer-assisted run-time scheduling that take into account, in addition to the cost of performing operations, the overhead costs associated with a decomposition and schedule. An algorithm for static decomposition of multidimensional loops based on the operation execution costs, communication costs, and synchronization costs is discussed. Synchronization instructions are introduced to ensure correct program execution following program decomposition. An algorithm for determining the explicit synchronization instruction that should be introduced to ensure correct execution of a program with arbitrarily nested loops is presented. Techniques for reducing run-time scheduling and communication and synchronization costs due to self-scheduling of multidimensional loops are also presented. Experiments performed on the Encore multiprocessor system demonstrate that the techniques developed can reduce overhead costs.

  • Optimal broadcasting on the star graph

    Page(s): 389 - 396

    The star graph has been shown to be an attractive alternative to the widely used n-cube. Like the n-cube, the star graph possesses rich structure and symmetry as well as fault tolerant capabilities, but has a smaller diameter and degree. However, very few algorithms exist to show its potential as a multiprocessor interconnection network. Many fast and efficient parallel algorithms require broadcasting as a basic step. An optimal algorithm for one-to-all broadcasting in the star graph is proposed. The algorithm can broadcast a message to N processors in O(log_2 N) time. The algorithm exploits the rich structure of the star graph and works by recursively partitioning the original star graph into smaller star graphs. In addition, an optimal all-to-all broadcasting algorithm is developed.

  • An analysis of cache performance for a hypercube multicomputer

    Page(s): 421 - 432

    Multicomputer cache simulation results derived from address traces collected from an Intel iPSC/2 hypercube multicomputer are presented. The primary emphasis is on examining how increasing the number of processor nodes executing a parallel application affects the overall multicomputer cache performance. The effects on multicomputer direct-mapped cache performance of application-specific data partitioning, data access patterns, communication distribution, and communication frequency are illustrated. The effects of system accesses on total cache performance are explored, as well as the reasons for application-specific differences in cache behavior for system and user accesses. Comparing user code results with full user and system code analysis reveals the significant effect of system accesses, and this effect increases with multicomputer size. The time distribution of an application's message-passing operations is found to more strongly affect cache performance than the total amount of time spent in message-passing code.

  • Reduction operations on a distributed memory machine with a reconfigurable interconnection network

    Page(s): 500 - 505

    Performing reduction operations with distributed memory machines whose interconnection networks are reconfigurable is considered. The focus is on machines whose interconnection graph can be configured as any graph of maximum degree d. The best way of interconnecting the p processors as a function of p, d, and some problem- and machine-dependent parameters that characterize the ratio communication/arithmetic for the reduction operation is discussed. Experiments on transputer-based networks are in good accordance with the theoretical results.

  • Efficient task migration algorithm for distributed systems

    Page(s): 488 - 499

    The objective of the study was to achieve balanced load among processors, reduce the communication overhead of the load balancing algorithm, and improve resource utilization, which results in better average response time. A communication protocol and a fully distributed algorithm for dynamic load balancing through task migration in a connected N-processor network are presented. Each processor communicates its load directly with only a subset (of size √N) of processors, reducing communication traffic and average response time. It is proved that the given algorithm will perform task migration even if there is only one light load processor and one heavy load processor in the system. Simulation results show that the proposed scheme can save up to 60% of the protocol messages used by the broadcast algorithms and can reduce the average response time.

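Illustration for "Partitioning and labeling of loops by unimodular transformations": the idea of independent subsets is easiest to see in one dimension. The sketch below is not the paper's method (which handles n-fold nests through unimodular transformations); it covers only the assumed special case of a single loop with constant positive dependence distances, where the iterations split into g residue classes, g being the gcd of the distances. The function names are hypothetical.

```python
from functools import reduce
from math import gcd


def independent_subsets(n_iters, distances):
    """Split iterations 0..n_iters-1 of a 1-D loop into independent subsets.

    Assumes every dependence has the form i -> i + d for a constant d > 0.
    """
    g = reduce(gcd, distances)
    # g divides every distance, so i and i + d always share a residue mod g.
    return [list(range(r, n_iters, g)) for r in range(g)]


def crosses_subsets(subsets, distances, n_iters):
    """Return True if some dependence i -> i + d links two different subsets."""
    owner = {i: k for k, subset in enumerate(subsets) for i in subset}
    return any(
        owner[i] != owner[i + d]
        for d in distances
        for i in range(n_iters - d)
    )


# A loop of 12 iterations whose body reads a[i - 4] and a[i - 6].
subsets = independent_subsets(12, [4, 6])
print(subsets)
print(crosses_subsets(subsets, [4, 6], 12))
```

Each subset can run as its own sequential thread, with no synchronization between subsets.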
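Illustration for "Reliable distributed sorting through the application-oriented fault tolerance paradigm": the abstract refers to the bitonic sort/merge. As a reminder of that building block only, here is a minimal sequential sketch of plain bitonic sort. It contains none of the paper's fault-tolerance checks and does not model the n-cube; the power-of-two length restriction is an assumption of this sketch.

```python
def bitonic_merge(values, ascending):
    """Sort a bitonic sequence with recursive compare-exchange steps."""
    n = len(values)
    if n <= 1:
        return list(values)
    half = n // 2
    lo, hi = list(values[:half]), list(values[half:])
    for i in range(half):
        # Compare-exchange element i of each half in the requested direction.
        if (lo[i] > hi[i]) == ascending:
            lo[i], hi[i] = hi[i], lo[i]
    return bitonic_merge(lo, ascending) + bitonic_merge(hi, ascending)


def bitonic_sort(values, ascending=True):
    """Sort a list whose length is a power of two."""
    n = len(values)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    if n <= 1:
        return list(values)
    half = n // 2
    # An ascending half followed by a descending half forms a bitonic sequence.
    first = bitonic_sort(values[:half], True)
    second = bitonic_sort(values[half:], False)
    return bitonic_merge(first + second, ascending)


print(bitonic_sort([7, 3, 6, 1, 8, 2, 5, 4]))
```

On a hypercube, each compare-exchange at distance `half` would correspond to an exchange between nodes that differ in one address bit, which is why the algorithm suits the n-cube.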
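Illustration for "Optimal broadcasting on the star graph": the sketch below builds the topology only, not the paper's broadcasting algorithm. It assumes the usual definition of the star graph on n symbols, in which nodes are the permutations of 1..n and two nodes are adjacent when one is obtained from the other by swapping the first symbol with the i-th. The function names are hypothetical.

```python
from collections import deque


def star_neighbors(node):
    """Yield the neighbors of a permutation: swap position 0 with position i."""
    for i in range(1, len(node)):
        swapped = list(node)
        swapped[0], swapped[i] = swapped[i], swapped[0]
        yield tuple(swapped)


def bfs_from(start):
    """Breadth-first search; returns (largest distance found, nodes reached)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in star_neighbors(u):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values()), len(dist)


def star_graph_stats(n):
    """Node count, degree, and diameter of the star graph on n symbols."""
    start = tuple(range(1, n + 1))
    # The graph is vertex-symmetric, so one BFS is enough for the diameter.
    diameter, size = bfs_from(start)
    return {"nodes": size, "degree": n - 1, "diameter": diameter}


for n in (3, 4, 5):
    print(n, star_graph_stats(n))
```

For comparison, a 7-cube connects 128 nodes with degree 7 and diameter 7, while the star graph on 5 symbols connects 120 nodes with a smaller degree and diameter.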

Aims & Scope

IEEE Transactions on Parallel and Distributed Systems (TPDS) is published monthly. It publishes a range of papers, comments on previously published papers, and survey articles that deal with the parallel and distributed systems research areas of current importance to our readers.

Meet Our Editors

Editor-in-Chief
David Bader
College of Computing
Georgia Institute of Technology