Exploiting Online Locality and Reduction Parallelism for Sampled Dense Matrix Multiplication on GPUs | IEEE Conference Publication | IEEE Xplore