Abstract:
Transformers are a class of machine learning models that have attracted significant interest recently for a multitude of reasons. They can process multiple modalities efficiently and have excellent scalability. Despite these obvious advantages, training these large models is very time-consuming. Hence, there have been efforts to speed up the training process using efficient distributed implementations. Many different types of parallelism have been identified that can be employed standalone or in combination. However, naively combining different parallelization schemes can incur significant communication overheads, thereby potentially defeating the purpose of distributed training. Thus, it becomes vital to predict the right mapping of different parallelisms to the underlying system architecture. In this work, we propose AMPeD, an analytical model for performance in distributed training of transformers. It exposes all the transformer model parameters, potential parallelism choices (along with their mapping onto the system), the accelerator as well as system architecture specifications as tunable knobs, thereby enabling hardware-software co-design. With the help of 3 case studies, we show that the combinations of parallelisms predicted to be efficient by AMPeD conform with the results from the state-of-the-art literature. Using AMPeD, we also show that future distributed systems consisting of optical communication substrates can train large models up to 4× faster as compared to the current state-of-the-art systems without modifying the peak computational power of the accelerators. Finally, we validate AMPeD with in-house experiments on real systems and via published literature. The maximum observed error is limited to 12%. The model is available here: https://github.com/CSA-infra/AMPeD
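To make the idea of an analytical model with tunable knobs concrete, the sketch below shows a hypothetical step-time estimator that combines a compute term with a simple gradient all-reduce cost for given data/tensor/pipeline parallelism degrees. All names, parameters, and formulas here are illustrative assumptions by the editor, not the actual AMPeD interface; the real model is in the linked repository.

```python
# Minimal, hypothetical sketch of an analytical step-time estimate for
# distributed transformer training. Names and cost formulas are illustrative
# assumptions, not the actual AMPeD interface (see the linked repository).
from dataclasses import dataclass

@dataclass
class SystemSpec:
    peak_flops: float        # per-accelerator peak throughput (FLOP/s)
    link_bandwidth: float    # inter-accelerator bandwidth (bytes/s)

@dataclass
class ModelSpec:
    layers: int
    hidden: int
    seq_len: int
    batch_per_replica: int
    bytes_per_param: int = 2  # fp16 weights/gradients

def estimate_step_time(model: ModelSpec, sys: SystemSpec,
                       dp: int, tp: int, pp: int) -> float:
    """Roughly estimate one training step (seconds) for a given
    (data, tensor, pipeline) parallelism degree: compute time plus a
    ring all-reduce of gradients across the data-parallel group."""
    # ~12 * L * h^2 parameters and ~6 * params * tokens FLOPs per
    # forward+backward pass (common rules of thumb for transformers)
    params = 12 * model.layers * model.hidden ** 2
    tokens = model.batch_per_replica * model.seq_len
    flops = 6 * params * tokens
    compute_s = flops / (sys.peak_flops * tp * pp)

    # Gradients are sharded across tensor/pipeline ranks before the
    # data-parallel all-reduce
    grad_bytes = params * model.bytes_per_param / (tp * pp)
    comm_s = 0.0 if dp == 1 else 2 * (dp - 1) / dp * grad_bytes / sys.link_bandwidth
    return compute_s + comm_s

# Example: compare a few parallelism mappings on 64 hypothetical accelerators
model = ModelSpec(layers=48, hidden=6144, seq_len=2048, batch_per_replica=8)
system = SystemSpec(peak_flops=3.12e14, link_bandwidth=3e11)
for dp, tp, pp in [(64, 1, 1), (16, 4, 1), (8, 8, 1), (4, 8, 2)]:
    t = estimate_step_time(model, system, dp, tp, pp)
    print(f"dp={dp:2d} tp={tp} pp={pp} -> ~{t*1e3:.1f} ms/step")
```

Sweeping such knobs (parallelism degrees, accelerator and interconnect specifications) is the kind of co-design exploration the abstract describes, although AMPeD's actual cost model is considerably more detailed.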
Published in: 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
Date of Conference: 23-25 April 2023
Date Added to IEEE Xplore: 23 June 2023