Abstract:
The complexity of deep learning models has been growing rapidly. This advancement allows organizations in both the private and public sectors to apply intelligent systems to their use cases. High Performance Computing (HPC) infrastructure has pivoted toward GPU-oriented systems, enabling developers and researchers to train complex models on large datasets, unlike conventional clusters equipped only with CPU cores. However, power efficiency has received little attention on such systems, especially on newer platforms such as the DGX A100, for which there are few data points on how the GPUs consume power. Even though such an HPC cluster is powerful, always allowing it to run at maximum capacity ultimately imposes a financial cost on the HPC provider. Therefore, it is crucial for any organization operating such a system to balance cluster capability against overall power consumption, which can be costly in the long term. This paper identifies A100 GPU metrics relevant to power usage and describes GPU profiling applied to a deep learning workload on the cluster, saving up to 32% of power usage while sacrificing only 11.5% of training time compared to the default profile. The paper then reviews related literature that could be adopted on the current system at CMKL University as the next milestone.
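The paper itself does not include code in this abstract. As a rough illustration of the kind of power-oriented GPU profiling it describes, the following minimal sketch reads an A100's power draw and applies a reduced power cap through NVML. Everything here is an assumption for illustration: the pynvml (nvidia-ml-py) bindings, GPU index 0, and the 250 W cap are not taken from the paper, whose actual profiles and tooling are not specified in the abstract.

    # Minimal sketch (not the authors' code): query GPU power draw and
    # apply a lower power cap via NVML using the pynvml bindings.
    # Assumes: pynvml is installed, the script runs with privileges that
    # allow changing the limit, and 250 W is an illustrative cap only.
    import pynvml

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU on the node

        # Current draw and the supported power-limit range (NVML reports milliwatts).
        draw_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
        min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
        print(f"power draw: {draw_w:.1f} W, "
              f"limit range: {min_mw / 1000:.0f}-{max_mw / 1000:.0f} W")

        # Apply a reduced cap (e.g. 250 W rather than an A100 SXM's 400 W
        # default), clamped to the range the device actually supports.
        target_mw = max(min_mw, min(250_000, max_mw))
        pynvml.nvmlDeviceSetPowerManagementLimit(handle, target_mw)
    finally:
        pynvml.nvmlShutdown()

Sweeping such a cap across values and recording training time per cap is one plausible way to reproduce the power-versus-training-time trade-off the abstract reports (32% power saved for an 11.5% longer run), though the paper's exact methodology may differ.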
Published in: 2022 IEEE International Conference on Cloud Computing Technology and Science (CloudCom)
Date of Conference: 13-16 December 2022
Date Added to IEEE Xplore: 09 January 2023
Electronic ISSN: 2330-2186