Abstract:
A significant trend in machine learning is sparsifying the training of neural networks to reduce the amount of computation required. Algorithms such as the Sub-LInear Deep learning Engine (SLIDE) [2] use locality-sensitive hashing (LSH) to create sparsity. These sparse training algorithms were originally developed for multi-threaded multicore CPUs, but they have not been well studied or optimized for accelerator platforms such as GPUs and reconfigurable dataflow architectures (RDAs). In this paper, we study different variants of the SLIDE algorithm and investigate accuracy-performance tradeoffs on CPUs, GPUs, and RDAs. Our implementation targeting an RDA outperforms the GPU by 7.5×. We further improve performance on a limited-memory RDA with a smart caching algorithm that is 2× faster than the baseline RDA implementation, and we gain another 2× by keeping all of the weights on-chip using an RDA with sufficient memory. We believe our work will pave the way for future development of both algorithms and hardware architectures for sparse training.
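For readers unfamiliar with LSH-based sparse training, the sketch below illustrates the general idea the abstract refers to: hashing a layer's neuron weight vectors into buckets so that, for a given input activation, only the neurons that collide with it are computed and updated. This is a minimal illustrative example, not code from the paper; the SimHash (signed random projection) hash family, table sizes, and layer dimensions are assumptions chosen for brevity.

```python
import numpy as np

class SimHashTable:
    """Signed-random-projection (SimHash) LSH table over neuron weight vectors.
    Illustrative only; SLIDE itself uses multiple tables and different hash families."""

    def __init__(self, dim, num_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((num_bits, dim))  # random hyperplanes
        self.buckets = {}                                    # code -> set of neuron ids

    def _code(self, vec):
        # One bit per hyperplane: the sign of the projection.
        return tuple(((self.planes @ vec) > 0).tolist())

    def insert(self, neuron_id, weight_vec):
        self.buckets.setdefault(self._code(weight_vec), set()).add(neuron_id)

    def query(self, activation_vec):
        # Neurons whose weights fall in the same bucket as the input activation
        # are likely to have large inner products with it.
        return self.buckets.get(self._code(activation_vec), set())


# Toy usage: select a sparse subset of a 1024-neuron layer for one input.
dim, num_neurons = 128, 1024
rng = np.random.default_rng(1)
weights = rng.standard_normal((num_neurons, dim))

table = SimHashTable(dim, num_bits=6)
for nid in range(num_neurons):
    table.insert(nid, weights[nid])

x = rng.standard_normal(dim)
active = table.query(x)  # only these neurons would be computed/updated this step
print(f"active neurons: {len(active)} / {num_neurons}")
```

With 6 hash bits there are 64 buckets, so on average only a few percent of the neurons are touched per input, which is the source of the computational savings that the paper maps onto GPU and RDA hardware.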
Published in: 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Date of Conference: 30 May 2022 - 03 June 2022
Date Added to IEEE Xplore: 01 August 2022