
Minisymposium Presentation

Efficient Training of GNN-based Material Science Applications at Scale: An Orchestration of Data Movement Approach

Tuesday, June 4, 2024, 12:00 - 12:30 CEST
Climate, Weather and Earth Sciences
Chemistry and Materials
Computer Science and Applied Mathematics
Humanities and Social Sciences
Engineering
Life Sciences
Physics

Presenter

Khaled Ibrahim - Lawrence Berkeley National Laboratory

Khaled Ibrahim is a staff scientist with the Applied Mathematics and Computational Research Division at Lawrence Berkeley National Laboratory. His research interests include performance modeling and code optimization for high performance computing and machine learning applications. He has also done extensive research in communication runtime optimizations.

Description

Scalable data management techniques are crucial for processing large volumes of scientific data on HPC platforms when training distributed deep learning (DL) models. Because stochastic optimizers access data randomly and frequently, in-memory distributed storage, which keeps the dataset in the local memory of each compute node, is widely adopted over file-based I/O for its speed. In this presentation, we discuss the tradeoffs of various data-exchange mechanisms. We present a hybrid in-memory data loader with multiple communication backends for distributed graph neural network training, and we introduce a model-driven performance estimator that switches between communication mechanisms automatically at runtime. The performance estimator uses the Tree of Parzen Estimators (TPE), a Bayesian optimization method, to tune model parameters and dynamically select the most efficient communication method for data loading. We evaluate our approach on two US DOE supercomputers, NERSC Perlmutter and OLCF Summit, across a wide set of runtime configurations. Our optimized implementation outperforms a baseline using single-backend loaders by up to 2.83x and predicts the suitable communication method with an average success rate of 96.3% on Perlmutter and 94.3% on Summit.
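To make the selection mechanism concrete, the sketch below shows how a TPE-based estimator could be wired around a data loader with swappable communication backends. This is a minimal illustration rather than the presented implementation: the backend names, the load_batch placeholder, and its timing constants are hypothetical, and only the hyperopt library's TPE interface (fmin, tpe.suggest, hp) is an actual API.

import time

from hyperopt import Trials, fmin, hp, tpe

# Hypothetical communication backends a hybrid loader could switch between.
BACKENDS = ["mpi_allgather", "nccl_broadcast", "file_io"]


def load_batch(backend: str, batch_size: int) -> None:
    """Placeholder: fetch one training batch with the given backend.

    A real loader would issue MPI/NCCL collectives or file reads here;
    the per-sample costs below are made up for illustration only.
    """
    per_sample = {"mpi_allgather": 0.5e-3,
                  "nccl_broadcast": 0.2e-3,
                  "file_io": 1.5e-3}
    time.sleep(per_sample[backend] * batch_size)


def loading_cost(params: dict) -> float:
    """Objective for TPE: measured wall-clock time of one sample load."""
    start = time.perf_counter()
    load_batch(params["backend"], int(params["batch_size"]))
    return time.perf_counter() - start


# Search space: which backend to use and how large a batch to fetch.
space = {
    "backend": hp.choice("backend", BACKENDS),
    "batch_size": hp.quniform("batch_size", 32, 256, 32),
}

trials = Trials()
best = fmin(fn=loading_cost, space=space, algo=tpe.suggest,
            max_evals=30, trials=trials)

# hp.choice reports the index of the winning option, not its name.
print("selected backend:", BACKENDS[best["backend"]])
print("selected batch size:", int(best["batch_size"]))

In a deployment of this kind, the loader would keep using the selected backend for subsequent epochs and could re-run the estimator whenever the runtime configuration (node count, batch size, dataset partition) changes.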

Authors