By Topic

Acceleration of Bilateral Filtering Algorithm for Manycore and Multicore Architectures

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$33 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

4 Author(s)
Agarwal, D. ; Dept. of Comput. Sci., Georgia State Univ., Atlanta, GA, USA ; Wilf, S. ; Dhungel, A. ; Prasad, S.K.

Bilateral filtering is an ubiquitous tool for several kinds of image processing applications. This work explores multicore and many core accelerations for the embarrassingly parallel yet compute-intensive bilateral filtering kernel. For many core architectures, we have created a novel pair-symmetric algorithm to avoid redundant calculations. For multicore architectures, we improve the algorithm by use of low-level single instruction multiple data (SIMD) parallelism across multiple threads. We propose architecture specific optimizations, such as exploiting the unique capabilities of special registers available in modern multicore architectures and the rearrangement of data access patterns as per the computations to exploit special purpose instructions. We also propose optimizations pertinent to Nvidia's Compute Unified Device Architecture (CUDA), including utilization of CUDA's implicit synchronization capability and the maximization of single-instruction-multiple-thread efficiency. We present empirical data on the performance gains we achieved over a variety of hardware architectures including Nvidia GTX 280, AMD Barcelona, AMD Shanghai, Intel Harper town, AMD Phenom, Intel Core i7 quad core, and Intel Nehalem 32 core machines. The best performance achieved was (i) 169-fold speedup by the CUDA-based implementation of our pair-symmetric algorithm running on Nvidia's GTX 280 GPU compared to the compiler-optimized sequential code on Intel Core i7, and (ii) 38-fold speedup using 16 cores of AMD Barcelona each equipped with a 4-stage vector pipeline compared to the compiler-optimized sequential code running on the same machine.

Published in:

Parallel Processing (ICPP), 2012 41st International Conference on

Date of Conference:

10-13 Sept. 2012