
Streamlining distributed Deep Learning I/O with ad hoc file systems


Abstract:

With evolving techniques to parallelize Deep Learning (DL) and the growing amount of training data and model complexity, High-Performance Computing (HPC) has become increasingly important for machine learning engineers. Although many compute clusters already use learning accelerators or GPUs, HPC storage systems are not suited to the I/O requirements of DL workflows. Therefore, users typically copy the whole training data set to the worker nodes or distribute partitions of it. Because DL depends on randomized input data, prior work stated that partitioning impacts DL accuracy. Those solutions focused mainly on training I/O performance over a high-speed network but did not cover the data stage-in process, for example. We show in this paper that, in practice, (unbiased) partitioning is not harmful to distributed DL accuracy. Nevertheless, manual partitioning can be error-prone and inefficient: typically, data must be unpacked and shuffled before it is distributed to the nodes. We propose a solution that offers both efficient stage-in and fast access to a single global namespace, preventing biases. Our architecture is built around an ad hoc storage system that relies on a high-speed interconnect, enabling efficient stage-in of DL data sets into one global namespace. Our solution neither limits access to parts of the data set nor relies on data duplication, which also relieves the HPC storage system. We obtain high I/O performance during training and ensure minimal interference with the communication of the learning workers. The optimizations are transparent to DL applications, and their accuracy is not affected by our architecture.
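The unbiased partitioning that the abstract argues is harmless can be illustrated with a minimal, hypothetical sketch (not the authors' implementation): every worker shuffles the full list of sample indices with a shared seed and then takes a disjoint strided slice, so each partition is a uniform random sample of the global data set.

    import random

    def partition_indices(num_samples, rank, world_size, seed=0):
        """Return this worker's share of a globally shuffled index list."""
        indices = list(range(num_samples))
        random.Random(seed).shuffle(indices)  # identical seed on every rank
        return indices[rank::world_size]      # disjoint, unbiased slices

    # Example: 10 samples split across 2 workers, none lost or duplicated.
    print(partition_indices(10, rank=0, world_size=2))
    print(partition_indices(10, rank=1, world_size=2))

PyTorch's DistributedSampler follows essentially this scheme, reshuffling with an epoch-dependent seed. A biased partition, by contrast, would for instance assign contiguous, class-sorted ranges of the data set to workers.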
Date of Conference: 07-10 September 2021
Date Added to IEEE Xplore: 13 October 2021
Conference Location: Portland, OR, USA


I. Introduction

Deep Learning (DL) techniques are used by an increasing range of scientific disciplines and are responsible for many innovations in both industry and research. The fundamental technologies and mechanisms in DL and other data-driven applications rely on ever-growing data sets, raising the technical demands on the systems that run such applications. The European Centre for Medium-Range Weather Forecasts (ECMWF), for example, reported an annual growth rate of 45% for its archive, which already exceeded 100 petabytes in 2014 [1]. In climate and earth sciences, prediction models rely on training with terabytes of data [2]–[5]. Processing such huge amounts of data is therefore one of the major challenges in these fields.
