Abstract:
Stencil computations sweep over a spatial grid for multiple time steps, performing nearest-neighbor computations at each grid point. The bandwidth-to-compute requirement for a large class of stencil kernels is very high, so their performance is bound by the available memory bandwidth. Since memory bandwidth grows more slowly than compute, the performance of stencil kernels will not scale with increasing compute density. We present a novel 3.5D-blocking algorithm that performs 2.5D spatial blocking and temporal blocking of the input grid into on-chip memory for both CPUs and GPUs. The resulting algorithm is amenable to both thread-level and data-level parallelism, and scales near-linearly with SIMD width and the number of cores. Our performance is comparable to or better than state-of-the-art stencil implementations on CPUs and GPUs. Our implementation of the 7-point stencil is 1.5X faster on CPUs and 1.8X faster on GPUs for single-precision floating-point inputs than previously reported numbers. For Lattice Boltzmann methods, the corresponding speedup on CPUs is 2.1X.
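For readers unfamiliar with the kernel, the following is a minimal, unblocked sketch of what a 7-point stencil sweep over multiple time steps might look like. It is not the 3.5D-blocked implementation described in the abstract. The function names `stencil7_step` and `sweep`, the weights `alpha` and `beta`, and the choice to leave boundary points unchanged are all assumptions made for illustration.

```python
def stencil7_step(src, alpha=0.4, beta=0.1):
    """One time step of a 7-point stencil on a 3D grid stored as src[z][y][x].

    Each interior point becomes a weighted sum of itself and its six
    face neighbors. Boundary points are copied unchanged (an assumption).
    alpha and beta are illustrative weights, not values from the paper.
    """
    nz, ny, nx = len(src), len(src[0]), len(src[0][0])
    # Write into a separate output grid so every read sees the old time step.
    dst = [[row[:] for row in plane] for plane in src]
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                neighbors = (src[z - 1][y][x] + src[z + 1][y][x]
                             + src[z][y - 1][x] + src[z][y + 1][x]
                             + src[z][y][x - 1] + src[z][y][x + 1])
                dst[z][y][x] = alpha * src[z][y][x] + beta * neighbors
    return dst


def sweep(grid, steps, alpha=0.4, beta=0.1):
    """Apply the stencil for a number of time steps, one full grid pass each."""
    for _ in range(steps):
        grid = stencil7_step(grid, alpha, beta)
    return grid
```

In this naive form, every time step makes a full pass over the grid while performing only eight floating-point operations per point, which illustrates why such kernels tend to be limited by memory bandwidth once the grid no longer fits in cache.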
Date of Conference: 13-19 November 2010
Date Added to IEEE Xplore: 29 November 2010
- IEEE Keywords
- Index Terms
- Graphics Processing Unit
- Modern Graphics Processing Units
- Stencil Computations
- Time Step
- Parallel Data
- Single Precision
- Memory Bandwidth
- Lattice Boltzmann Method
- On-chip Memory
- Multiple Time Steps
- Computational Resources
- Parallelization
- Spatial Dimensions
- Grid Points
- Block Size
- Time Instants
- Core i7
- Modern Architecture
- Data Reuse
- Double Precision
- Spatial Block
- Grid Elements
- External Memory
- Shared Memory
- Bandwidth Requirements
- Cache Size
- Edge Points
- Data Cache
- Effective Bandwidth
- L2 Cache