Abstract
As high-end computing systems continue to scale in both per-CPU computational power and overall node count, optimization techniques that reduce communication overhead have become increasingly important. We present a loop optimization framework designed to achieve both efficient communication/computation overlap and performance portability. The framework is implemented in the Berkeley UPC compiler and combines compile-time analysis with runtime mechanisms. We extend the compiler to perform message vectorization and message strip mining optimizations. At compile time, loop nests are analyzed, their communication requirements are determined, and the computation overhead is estimated. The compiler passes this analysis information to the runtime, and performance portability is achieved by decoupling data movement from local computation. We generate template code that uses the transferred data without making any assumptions about the communication mechanism.
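To make the two named optimizations concrete, here is a minimal sketch in plain C rather than UPC. It is not the framework's generated code: `remote_get` is a hypothetical stand-in for a one-sided remote read and is simulated with a blocking `memcpy`, and the function names, the reduction loop, and the `strip` parameter are all assumptions chosen for illustration. The sketch contrasts a fine-grained loop (one message per element), message vectorization (one bulk transfer hoisted out of the loop), and message strip mining (the bulk transfer split into strips so that, with a nonblocking transfer, one strip could be in flight while the previous one is consumed).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a one-sided remote read. A real runtime would
   issue a network transfer here; this sketch just copies local memory. */
static void remote_get(double *dst, const double *src, size_t n) {
    memcpy(dst, src, n * sizeof *dst);
}

/* Unoptimized form: one small message per loop iteration. */
double sum_fine_grained(const double *remote, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double x;
        remote_get(&x, remote + i, 1);
        sum += x;
    }
    return sum;
}

/* Message vectorization: the per-element reads are hoisted out of the
   loop and replaced by a single bulk transfer into a local buffer. */
double sum_vectorized(const double *remote, double *buf, size_t n) {
    remote_get(buf, remote, n);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += buf[i];
    }
    return sum;
}

/* Message strip mining: the bulk transfer is divided into strips of
   `strip` elements. With a nonblocking transfer, the next strip could be
   fetched while the current one is processed; here remote_get blocks, so
   only the loop structure is shown, not actual overlap. */
double sum_strip_mined(const double *remote, double *buf,
                       size_t n, size_t strip) {
    double sum = 0.0;
    if (strip == 0) {
        strip = (n > 0) ? n : 1;  /* degenerate case: a single strip */
    }
    for (size_t lo = 0; lo < n; lo += strip) {
        size_t len = (n - lo < strip) ? (n - lo) : strip;
        remote_get(buf + lo, remote + lo, len);
        for (size_t i = lo; i < lo + len; i++) {
            sum += buf[i];
        }
    }
    return sum;
}
```

All three functions consume the data in the same order, so they return the same result; only the number and size of the transfers differ. In this sketch the strip size is a plain parameter, which loosely mirrors the idea of leaving such decisions to the runtime instead of fixing them in the generated code.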