By Topic

Fast parallel cutoff pair interactions for molecular dynamics on heterogeneous systems

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

The purchase and pricing options are temporarily unavailable. Please try again later.
4 Author(s)
Qiang Wu ; School of Computer Science, National University of Defense Technology, Changsha 410073, China ; Canqun Yang ; Tao Tang ; Kai Lu

Heterogeneous systems with both Central Processing Units (CPUs) and Graphics Processing Units (GPUs) are frequently used to accelerate short-ranged Molecular Dynamics (MD) simulations. The most time-consuming task in short-ranged MD simulations is the computation of particle-to-particle interactions. Beyond a certain distance, these interactions decrease to zero. To minimize the operations to investigate distance, previous works have tiled interactions by employing the spatial attribute, which increases the memory access and GPU computations, hence decreasing performance. Other studies ignore the spatial attribute and construct an all-versus-all interaction matrix, which has poor scalability. This paper presents an improved algorithm. The algorithm first bins particles into voxels according to the spatial attributes, and then tiles the all-versus-all matrix into voxel-versus-voxel sub-matrixes. Only the sub-matrixes between neighboring voxels are computed on the GPU. Therefore, the algorithm reduces the distance examine operations and limits additional memory access and GPU computations. This paper also adopts a multi-level programming model to implement the algorithm on multi-nodes of Tianhe-lA. By employing (1) a patch design to exploit parallelism across the simulation domain, (2) a communication overlapping method to overlap the communications between CPUs and GPUs, and (3) a dynamic workload balancing method to adjust the workloads among compute nodes, the implementation achieves a speedup of 4.16× on one NVIDIA Tesla M2050 GPU compared to a 2.93 GHz six-core Intel Xeon X5670 CPU. In addition, it runs 2.41× faster on 256 compute nodes of Tianhe-lA (with two CPUs and one GPU inside a node) than on 256 GPU-excluded nodes.

Published in:

Tsinghua Science and Technology  (Volume:17 ,  Issue: 3 )