Quantum Monte Carlo (QMC) simulations in recent studies of complex materials face new computational challenges. The traditional approach of accelerating the simulations by running parallel Monte Carlo chains faces serious scalability problems, because the speedup is approaching the limit predicted by Amdahl's law. Fine-grained parallelization of the matrix kernels is therefore essential for better performance. In this paper, we investigate GPU performance optimization techniques for the most time-consuming computational kernel in Determinant Quantum Monte Carlo (DQMC) simulation: multiplication by matrix exponentials. The matrix, derived from the kinetic Hamiltonian, is highly sparse, and its exponential is approximated by the block checkerboard method, which represents a matrix exponential as a product of a sequence of sparse matrices. We focus on the matrix exponentials arising from 2D and 3D tori and propose various optimization techniques, such as data streaming and concurrent kernels. Experiments show that the proposed techniques improve the performance of the SpMM (sparse matrix multiplication) function, modified from the CUDA SDK SpMV function, by up to 16 times and 117 times for 2D and 3D problems, respectively.
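To make the checkerboard idea concrete, the following is a minimal CPU sketch in pure Python, not the GPU kernel studied in the paper. It assumes a 1D ring with an even number of sites (a simplification of the 2D and 3D tori), a uniform hopping amplitude `t`, and an imaginary-time step `dtau`; all function names are hypothetical. The bonds are split into two groups of disjoint site pairs, so the exponential of each group is a set of independent 2x2 blocks with entries cosh(t·dtau) and sinh(t·dtau), and multiplying by it only mixes pairs of rows.

```python
import math

def ring_bond_groups(n):
    """Split the bonds of an n-site ring into two groups of disjoint pairs."""
    if n < 4 or n % 2:
        raise ValueError("this sketch assumes an even n >= 4")
    even = [(i, i + 1) for i in range(0, n, 2)]
    odd = [(i, (i + 1) % n) for i in range(1, n, 2)]
    return [even, odd]

def apply_checkerboard(groups, a, t, dtau):
    """Return B*a with B = exp(t*dtau*A_1) * exp(t*dtau*A_2) * ...,
    where A_g is the adjacency matrix of bond group g.
    Each factor is sparse: it only combines the two rows of each pair."""
    c, s = math.cosh(t * dtau), math.sinh(t * dtau)
    out = [row[:] for row in a]
    # The rightmost factor acts first, so walk the groups in reverse.
    for pairs in reversed(groups):
        for i, j in pairs:
            ri, rj = out[i], out[j]
            out[i] = [c * x + s * y for x, y in zip(ri, rj)]
            out[j] = [s * x + c * y for x, y in zip(ri, rj)]
    return out

def adjacency(groups, n):
    """Dense adjacency matrix of all bonds (used only as a reference)."""
    k = [[0.0] * n for _ in range(n)]
    for pairs in groups:
        for i, j in pairs:
            k[i][j] += 1.0
            k[j][i] += 1.0
    return k

def expm_taylor(m, terms=30):
    """Dense matrix exponential by truncated Taylor series (reference only)."""
    n = len(m)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[sum(term[i][p] * m[p][j] for p in range(n)) / k
                 for j in range(n)] for i in range(n)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result

def max_abs_diff(x, y):
    """Largest elementwise absolute difference between two matrices."""
    return max(abs(p - q) for rx, ry in zip(x, y) for p, q in zip(rx, ry))

def checkerboard_error(n, t, dtau):
    """Error of the checkerboard product against the dense exponential."""
    groups = ring_bond_groups(n)
    eye = [[float(i == j) for j in range(n)] for i in range(n)]
    approx = apply_checkerboard(groups, eye, t, dtau)
    scaled = [[t * dtau * v for v in row] for row in adjacency(groups, n)]
    return max_abs_diff(approx, expm_taylor(scaled))

if __name__ == "__main__":
    for step in (0.1, 0.05):
        print(step, checkerboard_error(6, 1.0, step))
```

Because the pairs within a group are disjoint, every row-pair update is independent of the others, which is the kind of fine-grained parallelism a GPU kernel can exploit. The product of the two factors is only an approximation of the full exponential, since the bond groups do not commute; in this sketch the error shrinks as `dtau` is reduced.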