Skip to Main Content
Dynamic Programming (DP) is a method for efficiently solving a broad range of search and optimization problems. As a result, techniques for managing large-scale DP problems are often critical to the performance of many applications. DP algorithms are often hard to parallelize. In this paper, we address the challenge of exploiting fine grain parallelism on a family of DP algorithms known as non-serial polyadic. We use an abstract formulation of non-serial polyadic DP, derived from RNA secondary structure prediction and matrix parenthesization approaches that are well-known and important problems from this family. We present a load balancing algorithm that achieves the best overall performance with this type of workload on many-core architectures. A divide-and-conquer approach previously used on multi-core architectures is compared against an iterative version. To evaluate these approaches, the algorithm was implemented on three NVIDIA GPUs using CUDA. We achieved up to 10 GFLOP/s performance and up to 228x speedup over the single-threaded CPU implementation. Moreover, the iterative approach results in up to 3.92x speedup over the divide-and-conquer approach.