Abstract:
Long short-term memory (LSTM) is a powerful type of deep neural network that has been widely used in many sequence analysis and modeling applications. However, the large model size of LSTM networks makes their practical deployment challenging, especially for video recognition tasks that require high-dimensional input data. Aiming to overcome this limitation and fully unlock the potential of LSTM models, in this paper we propose algorithm and hardware co-design towards high-performance, energy-efficient LSTM networks. At the algorithm level, we develop a fully decomposed hierarchical Tucker (FDHT) structure-based LSTM, namely FDHT-LSTM, which enjoys ultra-low model complexity while still achieving high accuracy. To fully reap this attractive algorithmic benefit, we further develop a corresponding customized hardware architecture to support efficient execution of the proposed FDHT-LSTM model. With a carefully designed memory access scheme, the complicated matrix transformations can be supported by the underlying hardware on the fly, without any access conflicts. Our evaluation results show that both the proposed ultra-compact FDHT-LSTM models and the corresponding hardware accelerator achieve very high performance. Compared with state-of-the-art compressed LSTM models, FDHT-LSTM enjoys both an order-of-magnitude reduction in model size (more than 1000×) and significant accuracy improvement (0.6% to 12.7%) across different video recognition datasets. Meanwhile, compared with TIE, the state-of-the-art hardware accelerator for tensor-decomposed models, our proposed FDHT-LSTM architecture achieves 2.5×, 1.46×, and 2.41× increases in throughput, area efficiency, and energy efficiency, respectively, on the LSTM-Youtube workload. On the LSTM-UCF workload, our proposed design also outperforms TIE, with 1.9× higher throughput, 1.83× higher energy efficiency, and comparable area efficiency.
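The abstract does not spell out the exact FDHT factorization, so the following is only a minimal, hypothetical sketch of the general idea behind tensor-decomposed weight layers that the paper builds on: reshape a dense LSTM weight matrix into a higher-order tensor and store small Tucker-style factors instead of the full matrix. The shapes, the rank, and the single-core Tucker format here are illustrative assumptions, not the paper's FDHT structure.

```python
# Illustrative sketch (NOT the paper's exact FDHT format): tensorize a
# dense weight matrix and replace it with small Tucker-style factors,
# then compare parameter counts. All shapes and ranks are made up.
import numpy as np

m1, m2, n1, n2 = 64, 64, 16, 16   # hypothetical mode sizes
M, N = m1 * m2, n1 * n2           # dense matrix is 4096 x 256
rank = 4                          # hypothetical decomposition rank

# Dense baseline: M * N parameters.
dense_params = M * N

# Tucker-style factors for the 4th-order tensor of shape (m1, m2, n1, n2):
# one small core plus one factor matrix per mode.
core = np.random.randn(rank, rank, rank, rank)
factors = [np.random.randn(dim, rank) for dim in (m1, m2, n1, n2)]
decomposed_params = core.size + sum(f.size for f in factors)

# Reconstruct the full tensor via mode-k products (contract each factor
# with the core), then flatten back to an M x N matrix.
W = core
for k, f in enumerate(factors):
    W = np.tensordot(f, W, axes=(1, k))  # contract rank mode k
    W = np.moveaxis(W, 0, k)             # restore mode ordering
W = W.reshape(M, N)

print(f"dense: {dense_params}, decomposed: {decomposed_params}, "
      f"compression: {dense_params / decomposed_params:.0f}x")
```

With these toy numbers the factors hold under 1,000 parameters versus about a million for the dense matrix, which gives a feel for how tensor factorization can yield the order-of-magnitude (1000×+) compression the abstract reports; the paper's FDHT format additionally decomposes the layer along a binary-tree hierarchy rather than a single core.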
Published in: IEEE Transactions on Computers ( Volume: 71, Issue: 12, 01 December 2022)
IEEE Keywords / Index Terms:
- Long Short-term Memory
- Long Short-term Memory Network
- Tensor Decomposition
- Video Recognition
- Hardware Co-design
- High Performance
- Deep Neural Network
- Energy Efficiency
- Short-term Memory
- Increase In Efficiency
- Model Size
- Transformation Matrix
- High Energy Efficiency
- Long Short-term Memory Model
- Hardware Architecture
- Hardware Accelerators
- Increase Energy Efficiency
- Increase In Throughput
- Convolutional Neural Network
- Weight Matrix
- Entire Model
- Order Tensor
- Space Complexity
- Compression Ratio
- Smaller Model Size
- Binary Tree
- Higher-order Tensors
- Row Index
- Linear Layer
- Tensor Factorization