Run-time optimization of sends, receives and file I/O

2 Author(s)
Thorvald Natvig; Norwegian University of Science and Technology (NTNU), Sem Sælands vei 9, NO-7491 Trondheim, Norway; Anne C. Elster

Abstract:

On today's commodity clusters, achieving near-optimal speedup is very hard. We have previously shown that parallel applications using synchronous sequential MPI calls can be optimized by runtime replacement with corresponding asynchronous MPI operations. This may be achieved by protecting memory used by the MPI call from application writes until the operation is complete. However, changing the protection bits of memory pages, which alters the page table and flushes the corresponding TLB entries, may introduce a significant overhead. In the case of overlapping requests, which are common for applications with domain decompositions using border exchanges of ghost cells, this would have to be done multiple times for each communication phase. In this paper, we overcome this problem by mapping the same memory twice into the virtual address space, but with different protection bits set for each view. This allows us to optimize overlapping MPI requests without changing the page protection bits until all the requests are finished. The same method is also extended to cover basic file I/O, overlapping file operations with computation without rewriting the original application. We also add distributed locking to I/O operations, which allows aggressive read-ahead and write-merging for parallel applications, reducing the wall-clock time of the file I/O phases that commonly surround the computation of the solution.
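
The double-mapping idea described in the abstract can be illustrated with standard POSIX calls: the same shared-memory object is mapped twice, once read-only and once read-write, so the two views alias the same physical pages while carrying different protection bits. This is only a minimal sketch of the mechanism under that assumption, not the authors' implementation; the object name /dualview_demo and all identifiers are placeholders, and how a runtime would obtain a second view of an arbitrary application buffer is outside the sketch.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 4096;  /* one page, for illustration */

    /* Shared-memory object that will back both views of the same pages. */
    int fd = shm_open("/dualview_demo", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    shm_unlink("/dualview_demo");          /* keep the fd, drop the name */
    if (ftruncate(fd, (off_t)len) < 0) { perror("ftruncate"); return 1; }

    /* View 1: read-only. If the application is handed this view, any write
     * issued while an operation is still in flight raises a fault, without
     * the page protection bits ever being changed again. */
    char *app_view = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

    /* View 2: read-write. A runtime layer (for example an interposed MPI or
     * I/O library) can keep filling or draining the buffer through it. */
    char *rt_view = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    if (app_view == MAP_FAILED || rt_view == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Both views alias the same physical memory. */
    strcpy(rt_view, "written through the writable view");
    printf("read-only view sees: %s\n", app_view);

    munmap(app_view, len);
    munmap(rt_view, len);
    close(fd);
    return 0;
}

On Linux this builds with a plain C compiler (older glibc may additionally need -lrt for shm_open); other routes to a second mapping, such as memfd_create, would serve the same purpose.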

Published in:

2010 IEEE International Conference on Cluster Computing Workshops and Posters (CLUSTER WORKSHOPS)

Date of Conference:

20-24 Sept. 2010