Bilateral filtering is a ubiquitous tool in many kinds of image processing applications. This work explores multicore and many-core acceleration of the embarrassingly parallel yet compute-intensive bilateral filtering kernel. For many-core architectures, we have created a novel pair-symmetric algorithm that avoids redundant calculations. For multicore architectures, we improve the algorithm by using low-level single instruction multiple data (SIMD) parallelism across multiple threads. We propose architecture-specific optimizations, such as exploiting the unique capabilities of special registers available in modern multicore architectures and rearranging data access patterns to match the computations so that special-purpose instructions can be used. We also propose optimizations pertinent to Nvidia's Compute Unified Device Architecture (CUDA), including the use of CUDA's implicit synchronization capability and the maximization of single-instruction-multiple-thread efficiency. We present empirical data on the performance gains achieved across a variety of hardware architectures, including Nvidia GTX 280, AMD Barcelona, AMD Shanghai, Intel Harpertown, AMD Phenom, Intel Core i7 quad-core, and Intel Nehalem 32-core machines. The best performance achieved was (i) a 169-fold speedup by the CUDA-based implementation of our pair-symmetric algorithm running on Nvidia's GTX 280 GPU, compared to the compiler-optimized sequential code on Intel Core i7, and (ii) a 38-fold speedup using 16 cores of AMD Barcelona, each equipped with a 4-stage vector pipeline, compared to the compiler-optimized sequential code running on the same machine.
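The abstract does not spell out the pair-symmetric algorithm, so the following is only a minimal sequential sketch of one plausible reading: the bilateral weight between two pixels p and q is the same in both directions, so it can be computed once per unordered pair and accumulated into both pixels. The function names, the Gaussian form of the weights, and the parameters `sigma_s` and `sigma_r` are assumptions made for illustration and are not taken from the paper. A real many-core version would also have to manage concurrent updates to the accumulators, which this sketch ignores.

```python
import math


def weight(dy, dx, diff, sigma_s, sigma_r):
    # Assumed Gaussian spatial and range terms; symmetric in (p, q).
    spatial = (dx * dx + dy * dy) / (2.0 * sigma_s ** 2)
    rng = (diff * diff) / (2.0 * sigma_r ** 2)
    return math.exp(-spatial - rng)


def half_window_offsets(radius):
    # One offset per unordered pair: the "forward" half of the window,
    # excluding the center pixel itself.
    return [(dy, dx)
            for dy in range(0, radius + 1)
            for dx in range(-radius, radius + 1)
            if dy > 0 or dx > 0]


def bilateral_naive(img, radius, sigma_s, sigma_r):
    # Reference version: every pixel evaluates its full window.
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            num = den = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        wgt = weight(dy, dx, img[y][x] - img[yy][xx],
                                     sigma_s, sigma_r)
                        num += wgt * img[yy][xx]
                        den += wgt
            out[y][x] = num / den
    return out


def bilateral_pair_symmetric(img, radius, sigma_s, sigma_r):
    # Each weight is evaluated once per unordered pixel pair and
    # scattered to both pixels, roughly halving the exp() calls.
    h, w = len(img), len(img[0])
    # The center pixel contributes to itself with weight 1.
    num = [[float(img[y][x]) for x in range(w)] for y in range(h)]
    den = [[1.0] * w for _ in range(h)]
    offsets = half_window_offsets(radius)
    for y in range(h):
        for x in range(w):
            for dy, dx in offsets:
                yy, xx = y + dy, x + dx
                if 0 <= yy < h and 0 <= xx < w:
                    wgt = weight(dy, dx, img[y][x] - img[yy][xx],
                                 sigma_s, sigma_r)
                    num[y][x] += wgt * img[yy][xx]
                    den[y][x] += wgt
                    num[yy][xx] += wgt * img[y][x]
                    den[yy][xx] += wgt
    return [[num[y][x] / den[y][x] for x in range(w)] for y in range(h)]
```

Both functions should agree up to floating-point rounding, since they sum the same terms in a different order; the symmetric version simply visits about half as many neighbor offsets per pixel.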
Published in:
Parallel Processing (ICPP), 2012 41st International Conference on
Date of Conference: 10-13 Sept. 2012