nDirect2: A High-Performance Library for Direct Convolutions on Multicore CPUs


Abstract:

Convolution kernels are widely used in high-performance computing (HPC) and deep learning (DL) workloads and are often responsible for performance bottlenecks. Prior work has demonstrated that the direct convolution approach can outperform conventional convolution implementations. Although well studied, existing approaches for direct convolution are either incompatible with the mainstream DL data layouts or deliver suboptimal performance. We present nDirect2, a novel direct convolution approach targeting the multicore CPUs commonly found in smartphones and HPC systems. nDirect2 is compatible with the data layout formats used by mainstream DL frameworks and introduces new optimizations for the computational kernel, data packing, advanced operator fusion, and parallelization. We evaluate nDirect2 on representative convolution kernels across four distinct ARM-based CPUs and an x86-based CPU. Experimental results show that nDirect2 outperforms four state-of-the-art convolution approaches in most evaluation cases and hardware architectures.
Published in: IEEE Transactions on Computers ( Volume: 74, Issue: 6, June 2025)
Page(s): 1829 - 1843
Date of Publication: 19 February 2025


I. Introduction

Despite the recent success of Transformer-based neural networks, convolutional neural networks (CNNs) remain widely used for image classification [1], object detection [2], and natural language processing [3]. At the heart of CNNs is the convolution operation (Conv) [4], which often accounts for more than 90% of the execution time in classical CNN models like VGG [5]. As such, considerable effort has been devoted to optimizing convolution implementations to accelerate CNNs [6], [7].
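To make the operation being optimized concrete, the following is a minimal sketch of a direct 2-D convolution in the NHWC layout, written as a plain loop nest rather than via im2col + GEMM. It is an illustrative baseline only, not nDirect2's kernel; the function name, valid-padding choice, and stride parameter are assumptions for the example.

```python
import numpy as np

def direct_conv2d_nhwc(x, w, stride=1):
    """Naive direct convolution over an NHWC input (illustrative sketch).

    x: input of shape  (N, H, W, C_in)
    w: filter of shape (KH, KW, C_in, C_out)
    Returns output of shape (N, H_out, W_out, C_out) with valid padding,
    computed by sliding the filter directly over the input (no im2col buffer).
    """
    n, h, wd, c_in = x.shape
    kh, kw, _, c_out = w.shape
    h_out = (h - kh) // stride + 1
    w_out = (wd - kw) // stride + 1
    y = np.zeros((n, h_out, w_out, c_out), dtype=x.dtype)
    for b in range(n):
        for i in range(h_out):
            for j in range(w_out):
                # Extract the receptive field and contract it against the
                # filter over the (KH, KW, C_in) axes in one step.
                patch = x[b, i*stride:i*stride+kh, j*stride:j*stride+kw, :]
                y[b, i, j, :] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return y
```

A high-performance direct convolution replaces this scalar loop nest with a blocked, vectorized kernel; the sketch only fixes the semantics and the NHWC layout that such kernels must preserve.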

