Parallelizing the Hamiltonian Computation in DQMC Simulations: Checkerboard Method for Sparse Matrix Exponentials on Multicore and GPU

Authors: Che-Rung Lee, Zhi-Hung Chen, and Quey-Liang Kao (Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan)

Abstract:

Determinant Quantum Monte Carlo (DQMC) simulation is one of the few numerical methods that can explore the microscopic properties of fermions, which have many technically important applications in chemistry and materials science. Conventionally, its parallelization relies on the parallel Monte Carlo method, whose speedup is limited by the thermalization process and the underlying matrix computation. To achieve better performance, fine-grained parallelization of its numerical kernels is essential to utilize massively parallel processing units, such as multicore CPUs and/or GPUs interconnected by a high-performance network. In this paper, we address the parallelization of one of the matrix kernels in DQMC simulations: the multiplication of matrix exponentials. The matrix is derived from the kinetic Hamiltonian, which is highly sparse. We approximate its exponential by the checkerboard method, which decomposes the matrix exponential into a product of a sequence of block-sparse matrices. We analyze the block-sparse matrices of two commonly used lattice geometries, the 2D torus and the 3D cubic lattice, and parallelize the computational kernel that multiplies them by a general matrix. The parallel algorithm is designed for multicore CPUs and GPUs. Experimental results show that a 3x speedup can be observed on average on a quad-core processor, and a speedup of up to 145x is achievable on a GPU.
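The abstract does not reproduce the kernel itself; the following is a minimal NumPy sketch of the checkerboard idea on a 2D torus with uniform nearest-neighbor hopping. The lattice size, the hopping amplitude t, and the four-group bond partition are illustrative assumptions, not the authors' implementation. The kinetic matrix exp(dtau*K) is approximated by a product of four block-sparse factors; each factor contains only disjoint 2x2 bond blocks, whose exponentials are known exactly in terms of cosh and sinh, so applying a factor to a general matrix only mixes pairs of rows.

```python
import numpy as np

def checkerboard_bond_groups(nx, ny):
    """Partition the nearest-neighbor bonds of an nx-by-ny torus into four
    groups of pairwise disjoint bonds (even/odd bonds along x and along y).
    Site (x, y) is flattened to index x + nx * y. Assumes nx, ny are even."""
    idx = lambda x, y: (x % nx) + nx * (y % ny)
    groups = []
    for parity in (0, 1):  # even / odd bonds in the x direction
        groups.append([(idx(x, y), idx(x + 1, y))
                       for y in range(ny) for x in range(parity, nx, 2)])
    for parity in (0, 1):  # even / odd bonds in the y direction
        groups.append([(idx(x, y), idx(x, y + 1))
                       for x in range(nx) for y in range(parity, ny, 2)])
    return groups

def apply_expK(A, groups, dtau, t=1.0):
    """Left-multiply A by the checkerboard approximation of exp(dtau * K),
    where K has hopping amplitude t on every bond. Each group yields one
    block-sparse factor; its 2x2 bond blocks exponentiate exactly to
    [[cosh, sinh], [sinh, cosh]], so only rows i and j of A get mixed."""
    c, s = np.cosh(dtau * t), np.sinh(dtau * t)
    B = A.copy()
    for bonds in groups:
        # Bonds inside one group are disjoint: these updates are independent.
        for i, j in bonds:
            ri, rj = B[i].copy(), B[j].copy()
            B[i] = c * ri + s * rj
            B[j] = s * ri + c * rj
    return B

if __name__ == "__main__":
    nx = ny = 4
    n = nx * ny
    A = np.random.rand(n, n)
    groups = checkerboard_bond_groups(nx, ny)
    B = apply_expK(A, groups, dtau=0.05)
    print(B.shape)  # (16, 16)
```

Because the bonds within a group touch disjoint row pairs, the inner-loop updates are independent of one another; this independence is what allows the kernel to be mapped onto multicore threads or GPU thread blocks, which is the parallelization studied in the paper.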

Published in:

2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW)

Date of Conference:

21-25 May 2012