Abstract:
Compute Unified Device Architecture (CUDA) was developed as a GPU parallel programming platform and API, primarily designed for use with C/C++. Over the years, fundamental linear algebra functionality on CUDA has reached a mature state, and much of it is now accessible through CUDA's GitHub repository. As other high-level programming languages began incorporating CUDA-compatible methods into their libraries, the Julia Programming Language introduced CUDA support in 2021, aiming to offer an abstraction level similar to that of C implementations. However, research has shown that Julia's linear algebra computations, despite leveraging CUDA for parallelization and computational reduction, have yet to match the execution speed achieved by C implementations. This study uses matrix multiplication as a representative linear algebra computation, given its well-optimized CUDA kernel. The study's outputs include an Nsight report file and an SQLite database, which are analysed with NVIDIA Nsight Systems to assess each kernel's runtime and memory usage for performance evaluation. Findings indicate that Julia's CUDA kernel invocation carries a high runtime overhead, growing at a rate of O(n²), which becomes a bottleneck when performing high-throughput computations on square binary matrices. This paper suggests that resolving this issue may involve developing a custom CUDA kernel in Julia that employs a more efficient reduction technique to reduce overhead and enhance performance.
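The custom Julia CUDA kernel the abstract proposes is not shown on this page; as a point of reference, the sketch below illustrates what writing and launching such a kernel looks like with the CUDA.jl package, using a naive one-thread-per-output-element matrix multiply. The kernel name, matrix size, and launch configuration are illustrative assumptions, not the authors' code.

```julia
using CUDA

# Naive custom matrix-multiply kernel: each thread computes one element of C.
# CUDA.jl thread/block indices are 1-based, hence the (blockIdx - 1) offset.
function matmul_kernel!(C, A, B)
    row = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    col = (blockIdx().y - 1) * blockDim().y + threadIdx().y
    if row <= size(C, 1) && col <= size(C, 2)
        acc = zero(eltype(C))
        for k in 1:size(A, 2)
            acc += A[row, k] * B[k, col]
        end
        C[row, col] = acc
    end
    return nothing
end

n = 1024                          # illustrative matrix size
A = CUDA.rand(Float32, n, n)
B = CUDA.rand(Float32, n, n)
C = CUDA.zeros(Float32, n, n)

threads = (16, 16)                # 256 threads per block
blocks  = (cld(n, 16), cld(n, 16))
@cuda threads=threads blocks=blocks matmul_kernel!(C, A, B)
```

A kernel launched this way bypasses the library-level dispatch path whose per-invocation overhead the paper measures, which is the motivation for the custom-kernel approach; the abstract's suggested improvement, a more efficient reduction over the inner loop, would replace the serial accumulation shown here.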
Published in: 2024 IEEE 17th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)
Date of Conference: 16-19 December 2024
Date Added to IEEE Xplore: 03 January 2025