We propose four application-level implementation alternatives, called algorithm styles, and construct highly optimized implementations on two parallel platforms: an Intel Core i7 multicore processor and an NVIDIA GTX280 manycore processor. The highest-performing algorithm style varies with the implementation platform. On a 44-min speech data set, we demonstrate speedups of 3.4× on the Core i7 and 10.5× on the GTX280 over a highly optimized sequential implementation on the Core i7, without sacrificing accuracy. The parallel implementations contain less than 2.5% sequential overhead, which promises scalability and significant potential for further speedup on future platforms.
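The scalability claim can be made concrete with a rough Amdahl's-law estimate. The sketch below is illustrative only and is not taken from the paper: it assumes the reported "less than 2.5% sequential overhead" can be read as the serial fraction of total run time, and the worker counts are hypothetical. Under that reading, the speedup ceiling as the number of workers grows is 1 / 0.025 = 40×.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Upper bound on speedup under Amdahl's law for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


# Assumption: the abstract's "less than 2.5% sequential overhead" is treated
# as the serial fraction of the total run time.
SERIAL_FRACTION = 0.025

# Hypothetical worker counts, chosen only to show how the bound saturates.
for workers in (4, 32, 256, 4096):
    bound = amdahl_speedup(SERIAL_FRACTION, workers)
    print(f"{workers:5d} workers -> at most {bound:.1f}x")
```

The bound ignores memory bandwidth, synchronization, and communication costs, so measured speedups on real hardware would be expected to fall below it.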