Compiler supported coarse-grained pipelined parallelism: why and how

Authors: Wei Du and G. Agrawal, Dept. of Computer and Information Science, Ohio State University, Columbus, OH, USA

Abstract:

The emergence of grid computing and a new class of data-driven applications is making a new form of parallelism desirable, which we refer to as coarse-grained pipelined parallelism. Here, the computations associated with an application are carried out in several stages, which are executed on a pipeline of computing units. This paper reports on a compilation system developed to exploit this form of parallelism. We use a dialect of Java that exposes both pipelined and data parallelism to the compiler. Our compiler is responsible for selecting a set of candidate filter boundaries, determining the volume of communication required if a particular boundary is chosen, performing the decomposition, and generating code in which each filter unpacks data from a received buffer, iterates over its elements, and packs and forwards a buffer to the next stage. We have developed a one-pass algorithm for determining the required communication between consecutive filters, a cost model for estimating the execution time of a given decomposition, and a greedy algorithm for performing the decomposition. We have carried out a detailed evaluation of our current compiler using four data-driven applications. Our experimental results show the following: (1) the compiler-decomposed versions achieve an improvement of between 10% and 150% over versions that use pipelined parallelism in a default fashion; (2) in most cases, increasing the width of the pipeline results in near-linear speedups; and (3) for the two applications where we could compare against a manual version, the compiler-generated versions were generally quite close to the manual versions.
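To illustrate the filter structure the abstract describes, the following is a minimal, hypothetical Java sketch (not the paper's generated code): two filter stages connected by bounded queues, where each stage unpacks a received buffer, iterates over its elements, and packs and forwards a buffer to the next stage. The class and method names, buffer layout, and two-stage decomposition are all assumptions for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of coarse-grained pipelined parallelism:
// each filter receives a buffer, iterates over its elements,
// and forwards a new buffer (or a result) to the next stage.
public class PipelineSketch {
    static final int[] POISON = new int[0]; // end-of-stream marker

    // Stage 1: unpack each buffer, square its elements, pack and forward.
    static void squareStage(BlockingQueue<int[]> in, BlockingQueue<int[]> out)
            throws InterruptedException {
        int[] buf;
        while ((buf = in.take()) != POISON) {
            int[] next = new int[buf.length];
            for (int i = 0; i < buf.length; i++) next[i] = buf[i] * buf[i];
            out.put(next); // forward packed buffer to the next filter
        }
        out.put(POISON);   // propagate end-of-stream
    }

    // Stage 2: unpack each buffer and accumulate a running sum.
    static long sumStage(BlockingQueue<int[]> in) throws InterruptedException {
        long sum = 0;
        int[] buf;
        while ((buf = in.take()) != POISON) {
            for (int v : buf) sum += v;
        }
        return sum;
    }

    // Drive the two-stage pipeline: the first stage runs on its own
    // thread (its own "computing unit"), overlapping with the second.
    public static long run(int[][] buffers) throws Exception {
        BlockingQueue<int[]> q1 = new ArrayBlockingQueue<>(4);
        BlockingQueue<int[]> q2 = new ArrayBlockingQueue<>(4);
        Thread stage1 = new Thread(() -> {
            try { squareStage(q1, q2); } catch (InterruptedException e) { }
        });
        stage1.start();
        for (int[] b : buffers) q1.put(b);
        q1.put(POISON);
        long result = sumStage(q2);
        stage1.join();
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(new int[][] { {1, 2}, {3, 4} })); // 30
    }
}
```

In the compiler's setting, the boundary between the two stages corresponds to a chosen filter boundary, and the size of the forwarded buffers is the communication volume its cost model estimates.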

Published in:

Proceedings of the International Parallel and Distributed Processing Symposium, 2003

Date of Conference:

22-26 April 2003