Minisymposium
MS3C - Scalable Machine Learning and Generative AI for Materials Design
Description
The design and discovery of materials with desired functional properties is challenging because experimental measurements are labor-intensive and physics-based models are computationally expensive, which precludes a thorough exploration of large chemical spaces spanning numerous chemical compositions and many atomic configurations per composition. These bottlenecks have motivated the development of data-driven surrogate models that overcome experimental and computational limits to enable effective exploration of such vast chemical spaces. In this minisymposium, we discuss new generative artificial intelligence (AI) methods for materials design. A particular advantage of generative AI approaches is their ability to learn the context and syntax of molecular data governed by fundamental principles of physics and chemistry, providing a critical basis for the generative design of molecules. To ensure generalizability and robustness, a generative AI model must be trained on a large volume of data that thoroughly samples diverse chemical regions. Processing such large data volumes efficiently requires massive high-performance computing (HPC) resources for scalable training. This minisymposium broadly covers HPC aspects of scalable generative AI models across heterogeneous distributed computational environments.
Presentations
We discuss the challenges involved in developing large-scale training of generative AI models for materials design. We employ HydraGNN, a scalable graph neural network (GNN) framework, alongside DDStore, a distributed in-memory data store, to facilitate large-scale data distribution across supercomputing resources provided by the US Department of Energy (DOE). Our discussion includes insights into our implementation and the notable reduction in I/O overhead within HPC environments. The effectiveness of HydraGNN and DDStore is showcased through their application to molecular design, where a GNN model learns to predict the ultraviolet-visible spectrum from a dataset of over 10 million molecules. By enabling efficient training scale-up to thousands of GPUs on the Summit and Perlmutter supercomputers, DDStore delivers deep learning (DL) training up to 6.15 times faster than our initial methods. We will also discuss performance on the new Frontier supercomputer at Oak Ridge National Laboratory (ORNL), highlighting the evolving landscape of supercomputing in AI research.
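The core idea of a distributed in-memory data store can be sketched as follows. This is an illustrative simulation, not DDStore's actual API: the class name, block-partitioning layout, and `locate`/`get` methods are assumptions made for exposition. Each "rank" holds a contiguous shard of the dataset in local memory, and a global sample index resolves to an (owner, offset) pair so that any rank can fetch any sample during shuffled training.

```python
# Conceptual sketch of a distributed in-memory data store for DL training,
# in the spirit of DDStore (names and layout are illustrative, not
# DDStore's actual API).

class InMemoryShardStore:
    def __init__(self, dataset, num_ranks):
        self.num_ranks = num_ranks
        # Block-partition the dataset: rank r owns a contiguous slice.
        self.shard_size = (len(dataset) + num_ranks - 1) // num_ranks
        self.shards = [[] for _ in range(num_ranks)]
        for i, sample in enumerate(dataset):
            self.shards[i // self.shard_size].append(sample)

    def locate(self, global_idx):
        """Map a global sample index to (owner_rank, local_offset)."""
        return global_idx // self.shard_size, global_idx % self.shard_size

    def get(self, global_idx):
        # On a real system this would be a remote fetch (e.g. an MPI
        # one-sided get) when owner != caller; here we read directly.
        owner, offset = self.locate(global_idx)
        return self.shards[owner][offset]

# Toy dataset of 10 "molecular graphs" sharded across 4 ranks.
dataset = [f"graph_{i}" for i in range(10)]
store = InMemoryShardStore(dataset, num_ranks=4)
assert store.locate(7) == (2, 1)       # sample 7 lives on rank 2
assert store.get(7) == "graph_7"
```

The point of the ownership map is that shuffled access in a stochastic optimizer never touches the file system: every lookup is a memory read, local or remote.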
From seashells to mammal hooves to plant stems, biological materials have long captivated materials scientists and mechanical engineers with their impressive hierarchical structure-property relationships. Understanding biological insights and motifs empowers the design of bio-inspired materials, which is poised to benefit a diverse range of applications, including sustainability. Modern generative AI frameworks, especially large language models (LLMs), show remarkable potential for science-focused applications, excelling in the study of biological materials by exploiting a rich legacy literature. We present BioinspiredLLM, an open-source conversational large language model fine-tuned on a corpus of biological materials literature. The model shows strong abilities in knowledge recall, creative hypothesis generation, and seamless integration into multi-agent systems. Multi-agent (agentic) systems let multiple advanced AI systems interact, expanding the scope of knowledge, enhancing data retrieval, and fostering critical thinking. We demonstrate this approach through multiple bio-inspired materials design scenarios.
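The agentic pattern described above can be sketched in a few lines. The "agents" below are stub functions standing in for LLM calls such as BioinspiredLLM; every name and the critic's acceptance rule are hypothetical, and only the propose/critique/revise control flow is the point.

```python
# Minimal sketch of a two-agent (proposer/critic) loop of the kind used in
# agentic LLM systems. The agents are stubs, not real LLM calls.

def proposer(topic, feedback=None):
    # Stand-in for an LLM generating a design hypothesis; more feedback
    # items lead to a higher "revision" number.
    revision = len(feedback) if feedback else 0
    return f"hypothesis about {topic} (revision {revision})"

def critic(hypothesis):
    # Stand-in for a reviewing LLM: accepts once the hypothesis has been
    # revised, otherwise returns critique notes to fold back in.
    accepted = "(revision 0)" not in hypothesis
    notes = [] if accepted else ["add mechanism", "cite structure"]
    return accepted, notes

def agentic_loop(topic, max_rounds=5):
    feedback = None
    hypothesis = ""
    for _ in range(max_rounds):
        hypothesis = proposer(topic, feedback)
        accepted, notes = critic(hypothesis)
        if accepted:
            return hypothesis
        feedback = (feedback or []) + notes
    return hypothesis
```

In a real agentic system the critic's feedback would be natural-language critique appended to the proposer's context; the loop structure is the same.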
Scalable data management techniques are crucial for efficiently processing large volumes of scientific data on HPC platforms during distributed deep learning (DL) model training. Because stochastic optimizers access data randomly and frequently, in-memory distributed storage that keeps the dataset in the local memory of each compute node is widely adopted over file-based I/O for its speed. In this presentation, we discuss the tradeoffs among various data exchange mechanisms. We present a hybrid in-memory data loader with multiple communication backends for distributed graph neural network training, and introduce a model-driven performance estimator that switches between communication mechanisms automatically at runtime. The estimator uses the Tree-structured Parzen Estimator (TPE), a Bayesian optimization method, to optimize model parameters and dynamically select the most efficient communication method for data loading. We evaluate our approach on two US DOE supercomputers, NERSC Perlmutter and OLCF Summit, across a wide set of runtime configurations. Our optimized implementation outperforms a baseline using single-backend loaders by up to 2.83x and predicts the suitable communication method with an average success rate of 96.3% on Perlmutter and 94.3% on Summit.
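The selection mechanism can be illustrated with a deliberately simplified stand-in. The work above fits a TPE-based performance model; the sketch below instead uses a fixed latency-plus-bandwidth cost model with made-up coefficients and hypothetical backend names, purely to show how a loader can pick the cheapest communication path per request at runtime.

```python
# Illustrative sketch of runtime communication-backend selection for a
# hybrid data loader. The real estimator is learned (TPE); here we use a
# hand-written alpha-beta cost model with hypothetical coefficients.

BACKENDS = {
    # cost(nbytes) = startup latency + nbytes / bandwidth
    "mpi_p2p": lambda nbytes: 2e-6 + nbytes / 12e9,  # low latency, modest bw
    "rma_get": lambda nbytes: 5e-6 + nbytes / 25e9,  # higher latency, high bw
}

def pick_backend(sample_nbytes):
    """Choose the backend with the lowest modeled fetch cost."""
    return min(BACKENDS, key=lambda name: BACKENDS[name](sample_nbytes))

# Small samples favor the low-latency path; large ones favor bandwidth.
assert pick_backend(4_096) == "mpi_p2p"
assert pick_backend(4_000_000) == "rma_get"
```

Replacing the hand-written cost functions with a model fitted online from observed transfer times is what turns this into the adaptive estimator the presentation describes.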
Transformer-based large language models show remarkable potential to accelerate design optimization for applications such as drug development and materials discovery. Self-supervised pretraining of transformer models requires large-scale data sets, which are often sparsely populated in topical areas such as polymer science. State-of-the-art approaches for polymers use data augmentation to generate additional samples but unavoidably incur extra computational cost. In contrast, large-scale open-source data sets are available for small molecules and offer a potential solution to data scarcity through transfer learning. In this presentation, we discuss transformers pretrained on small molecules and fine-tuned on polymer properties. We find that this approach achieves accuracy comparable to models trained on augmented polymer data sets across a series of benchmark prediction tasks.
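The transfer-learning argument can be made concrete with a toy model. Nothing below involves transformers or real chemistry: the "source" and "target" tasks are synthetic one-parameter regressions, chosen only to show why warm-starting from an abundant related task beats training from scratch under the same fine-tuning budget.

```python
# Toy illustration of transfer learning: pretrain a 1-parameter linear
# model on an abundant "small molecule" task, then fine-tune on a scarce,
# related "polymer" task. All data and tasks are synthetic.

def gd(w, xs, ys, lr=0.02, steps=1):
    """Plain gradient descent on mean squared error for y ~ w * x."""
    n = len(xs)
    for _ in range(steps):
        grad = (2.0 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))
        w -= lr * grad
    return w

def mse(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
src = [2.0 * x for x in xs]            # abundant source task: y = 2.0 x
tgt = [2.2 * x for x in xs]            # related target task:  y = 2.2 x

w_pre = gd(0.0, xs, src, steps=30)     # "pretraining" on the source task
w_ft = gd(w_pre, xs, tgt, steps=5)     # fine-tune: few steps, warm start
w_scratch = gd(0.0, xs, tgt, steps=5)  # same small budget, cold start

# The warm-started model ends closer to the target than the cold start.
assert mse(w_ft, xs, tgt) < mse(w_scratch, xs, tgt)
```

The same logic scales up: pretraining places the transformer's weights near a region useful for the related polymer tasks, so a small fine-tuning budget suffices where training from scratch would need augmented data.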