Skip to Main Content
Programming for CUDA devices presents the paradigm of warp-synchronous nature. This paper covers an implementation of an AES decryption kernel that makes use of warp-synchronicity. The operation of the various types of memories found in a CUDA device also required some analysis for a warp-synchronous implementation. The CUDA based implementation of AES256 was tested on several CUDA devices of various compute capabilities against the original single-threaded CPU implementation on various machines using encrypted messages of sizes ranging from 2MiB to 1GiB. The time taken for the GPUs to decrypt a message has varied between GPU architectures, with some achieving a 6x speedup.