Minisymposium
MS3C - Scalable Machine Learning and Generative AI for Materials Design
Description
The design and discovery of materials with desired functional properties is challenging because experimental measurements are labor-intensive and physics-based models are computationally expensive, which precludes a thorough exploration of large chemical spaces spanning numerous chemical compositions and many atomic configurations per composition. These bottlenecks have motivated the development of data-driven surrogate models that overcome experimental and computational limits to enable effective exploration of such vast chemical spaces. In this minisymposium, we discuss new generative artificial intelligence (AI) methods for materials design. A particular advantage of generative AI approaches is their ability to learn the context and syntax of molecular data governed by fundamental principles of physics and chemistry, providing a critical basis for the generative design of molecules. To ensure generalizability and robustness, a generative AI model must be trained on a large volume of data that thoroughly samples diverse chemical regions. Processing such large data volumes efficiently requires massive high-performance computing (HPC) resources for scalable training. This minisymposium broadly covers HPC aspects of scalable generative AI models across heterogeneous distributed computational environments.
Presentations
We discuss the challenges involved in developing large-scale training of generative AI models for materials design. We employ HydraGNN, a scalable graph neural network (GNN) framework, alongside DDStore, a distributed in-memory data store, to facilitate large-scale data distribution across supercomputing resources provided by the US Department of Energy (DOE). Our discussion includes insights into our implementation and the notable reduction in I/O overhead within HPC environments. The effectiveness of HydraGNN and DDStore is showcased through their application to molecular design, where a GNN model learns to predict the ultraviolet-visible spectrum from a dataset of over 10 million molecules. By enabling efficient training scale-up to thousands of GPUs on the Summit and Perlmutter supercomputers, DDStore delivers deep learning (DL) training up to 6.15 times faster than our initial methods. We will also discuss performance on the new Frontier supercomputer at Oak Ridge National Laboratory (ORNL), highlighting the evolving landscape of supercomputing in AI research.
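The core idea of a distributed in-memory data store can be sketched as follows. This is an illustrative simulation, not DDStore's actual API: the class name, block-partitioning layout, and `locate`/`get` methods are assumptions made for exposition. Each "rank" holds a contiguous shard of the dataset in local memory, and a global sample index resolves to an (owner, offset) pair so that any rank can fetch any sample during shuffled training.

```python
# Conceptual sketch of a distributed in-memory data store for DL training,
# in the spirit of DDStore (names and layout are illustrative, not
# DDStore's actual API).

class InMemoryShardStore:
    def __init__(self, dataset, num_ranks):
        self.num_ranks = num_ranks
        # Block-partition the dataset: rank r owns a contiguous slice.
        self.shard_size = (len(dataset) + num_ranks - 1) // num_ranks
        self.shards = [[] for _ in range(num_ranks)]
        for i, sample in enumerate(dataset):
            self.shards[i // self.shard_size].append(sample)

    def locate(self, global_idx):
        """Map a global sample index to (owner_rank, local_offset)."""
        return global_idx // self.shard_size, global_idx % self.shard_size

    def get(self, global_idx):
        # On a real system this would be a remote fetch (e.g. an MPI
        # one-sided get) when owner != caller; here we read directly.
        owner, offset = self.locate(global_idx)
        return self.shards[owner][offset]

# Toy dataset of 10 "molecular graphs" sharded across 4 ranks.
dataset = [f"graph_{i}" for i in range(10)]
store = InMemoryShardStore(dataset, num_ranks=4)
assert store.locate(7) == (2, 1)       # sample 7 lives on rank 2
assert store.get(7) == "graph_7"
```

The point of the ownership map is that shuffled access in a stochastic optimizer never touches the file system: every lookup is a memory read, local or remote.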
From seashells to mammal hooves to plant stems, biological materials have long captivated materials scientists and mechanical engineers with their impressive hierarchical structure-property relationships. Understanding biological insights and motifs empowers the design of bio-inspired materials, which is poised to benefit a diverse range of applications, including sustainability. Modern generative AI frameworks, especially large language models (LLMs), show remarkable potential for science-focused applications, excelling in the study of biological materials by exploiting a rich legacy literature. We present BioinspiredLLM, an open-source conversational large language model fine-tuned on a corpus of biological materials literature. The model shows strong abilities in knowledge recall, creative hypothesis generation, and seamless integration into multi-agent systems. Multi-agent (agentic) systems let multiple advanced AI systems interact, expanding the scope of knowledge, enhancing data retrieval, and fostering critical thinking. We demonstrate this approach through multiple bio-inspired materials design scenarios.
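The agentic pattern described above can be sketched in a few lines. The "agents" below are stub functions standing in for LLM calls such as BioinspiredLLM; every name and the critic's acceptance rule are hypothetical, and only the propose/critique/revise control flow is the point.

```python
# Minimal sketch of a two-agent (proposer/critic) loop of the kind used in
# agentic LLM systems. The agents are stubs, not real LLM calls.

def proposer(topic, feedback=None):
    # Stand-in for an LLM generating a design hypothesis; more feedback
    # items lead to a higher "revision" number.
    revision = len(feedback) if feedback else 0
    return f"hypothesis about {topic} (revision {revision})"

def critic(hypothesis):
    # Stand-in for a reviewing LLM: accepts once the hypothesis has been
    # revised, otherwise returns critique notes to fold back in.
    accepted = "(revision 0)" not in hypothesis
    notes = [] if accepted else ["add mechanism", "cite structure"]
    return accepted, notes

def agentic_loop(topic, max_rounds=5):
    feedback = None
    hypothesis = ""
    for _ in range(max_rounds):
        hypothesis = proposer(topic, feedback)
        accepted, notes = critic(hypothesis)
        if accepted:
            return hypothesis
        feedback = (feedback or []) + notes
    return hypothesis
```

In a real agentic system the critic's feedback would be natural-language critique appended to the proposer's context; the loop structure is the same.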
Scalable data management techniques are crucial for efficiently processing large volumes of scientific data on HPC platforms during distributed deep learning (DL) model training. Because stochastic optimizers access data randomly and frequently, in-memory distributed storage that keeps the dataset in the local memory of each compute node is widely adopted over file-based I/O for its speed. In this presentation, we discuss the tradeoffs among various data exchange mechanisms. We present a hybrid in-memory data loader with multiple communication backends for distributed graph neural network training, and introduce a model-driven performance estimator that switches between communication mechanisms automatically at runtime. The estimator uses the Tree-structured Parzen Estimator (TPE), a Bayesian optimization method, to optimize model parameters and dynamically select the most efficient communication method for data loading. We evaluate our approach on two US DOE supercomputers, NERSC Perlmutter and OLCF Summit, across a wide set of runtime configurations. Our optimized implementation outperforms a baseline using single-backend loaders by up to 2.83x and predicts the suitable communication method with an average success rate of 96.3% on Perlmutter and 94.3% on Summit.
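The selection mechanism can be illustrated with a deliberately simplified stand-in. The work above fits a TPE-based performance model; the sketch below instead uses a fixed latency-plus-bandwidth cost model with made-up coefficients and hypothetical backend names, purely to show how a loader can pick the cheapest communication path per request at runtime.

```python
# Illustrative sketch of runtime communication-backend selection for a
# hybrid data loader. The real estimator is learned (TPE); here we use a
# hand-written alpha-beta cost model with hypothetical coefficients.

BACKENDS = {
    # cost(nbytes) = startup latency + nbytes / bandwidth
    "mpi_p2p": lambda nbytes: 2e-6 + nbytes / 12e9,  # low latency, modest bw
    "rma_get": lambda nbytes: 5e-6 + nbytes / 25e9,  # higher latency, high bw
}

def pick_backend(sample_nbytes):
    """Choose the backend with the lowest modeled fetch cost."""
    return min(BACKENDS, key=lambda name: BACKENDS[name](sample_nbytes))

# Small samples favor the low-latency path; large ones favor bandwidth.
assert pick_backend(4_096) == "mpi_p2p"
assert pick_backend(4_000_000) == "rma_get"
```

Replacing the hand-written cost functions with a model fitted online from observed transfer times is what turns this into the adaptive estimator the presentation describes.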
Transformer-based large language models show remarkable potential to accelerate design optimization for applications such as drug development and materials discovery. Self-supervised pretraining of transformer models requires large-scale data sets, which are often sparsely populated in topical areas such as polymer science. State-of-the-art approaches for polymers use data augmentation to generate additional samples but unavoidably incur extra computational cost. In contrast, large-scale open-source data sets are available for small molecules and offer a potential solution to data scarcity through transfer learning. In this presentation, we discuss transformers pretrained on small molecules and fine-tuned on polymer properties. We find that this approach achieves accuracy comparable to models trained on augmented polymer data sets across a series of benchmark prediction tasks.
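The transfer-learning argument can be made concrete with a toy model. Nothing below involves transformers or real chemistry: the "source" and "target" tasks are synthetic one-parameter regressions, chosen only to show why warm-starting from an abundant related task beats training from scratch under the same fine-tuning budget.

```python
# Toy illustration of transfer learning: pretrain a 1-parameter linear
# model on an abundant "small molecule" task, then fine-tune on a scarce,
# related "polymer" task. All data and tasks are synthetic.

def gd(w, xs, ys, lr=0.02, steps=1):
    """Plain gradient descent on mean squared error for y ~ w * x."""
    n = len(xs)
    for _ in range(steps):
        grad = (2.0 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))
        w -= lr * grad
    return w

def mse(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
src = [2.0 * x for x in xs]            # abundant source task: y = 2.0 x
tgt = [2.2 * x for x in xs]            # related target task:  y = 2.2 x

w_pre = gd(0.0, xs, src, steps=30)     # "pretraining" on the source task
w_ft = gd(w_pre, xs, tgt, steps=5)     # fine-tune: few steps, warm start
w_scratch = gd(0.0, xs, tgt, steps=5)  # same small budget, cold start

# The warm-started model ends closer to the target than the cold start.
assert mse(w_ft, xs, tgt) < mse(w_scratch, xs, tgt)
```

The same logic scales up: pretraining places the transformer's weights near a region useful for the related polymer tasks, so a small fine-tuning budget suffices where training from scratch would need augmented data.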