Minisymposium

MS4E - In Situ Coupling of Simulations and AI/ML for HPC: Software, Methodologies, and Applications - Part II

Tuesday, June 4, 2024, 16:00 - 18:00 CEST
HG E 3


Session Chair

Thorsten Kurth (NVIDIA Inc.)

Description

Motivated by the remarkable success of artificial intelligence (AI) and machine learning (ML) in computer vision and natural language processing, the last decade has seen a host of successful applications of AI/ML to a variety of scientific domains. In most cases, the models are trained using the traditional offline (or post hoc) approach, wherein the training data is produced, assembled, and curated separately before training begins. While more straightforward, the offline training workflow imposes some important restrictions on the adoption of ML models for scientific applications. To overcome these limitations, in situ (or online) ML approaches, wherein ML tasks are performed concurrently with the ongoing simulation, have recently emerged as an attractive new paradigm. In this minisymposium, we explore novel approaches to coupling state-of-the-art simulation codes with different AI/ML techniques. We discuss the open-source software libraries being developed to address the software engineering challenges of in situ ML workflows, the methodologies adopted to scale them on modern HPC systems, and their applications to complex problems across computational science domains.
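
To make the offline/online distinction concrete, here is a minimal single-process sketch of the online pattern, with an illustrative toy model and a stand-in simulation step (all names are hypothetical): each batch is consumed for training as the simulation produces it, with no intermediate dataset assembled on disk.

```python
# Minimal sketch of in situ (online) training: the model trains on each batch
# as the simulation produces it, rather than on a dataset curated beforehand.
# The solver and model here are illustrative stand-ins.
import numpy as np
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def simulation_step(t: int):
    """Stand-in for one time step of a solver: returns features and targets."""
    x = np.random.rand(128, 3).astype(np.float32)
    y = np.sin(x.sum(axis=1, keepdims=True) + 0.01 * t)
    return x, y

for t in range(100):
    x, y = simulation_step(t)          # data produced by the running simulation
    loss = nn.functional.mse_loss(model(torch.from_numpy(x)),
                                  torch.from_numpy(y))
    opt.zero_grad()
    loss.backward()
    opt.step()                         # model updated concurrently with the run
```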

Presentations

16:00 - 16:30 CEST
SimAI-Bench: A Performance Benchmarking Tool for Coupled Simulation and AI Workflows

In situ AI/ML workflows, in which ML tasks are coupled to an ongoing simulation, are an attractive new paradigm for developing robust and predictive surrogate models and for accelerating time to science by steering simulation ensembles and replacing expensive computations. In the world of high performance computing (HPC), these workflows require scalable and efficient solutions to integrate the rapidly evolving ecosystem of ML frameworks with traditional simulation codes, transferring large volumes of data between the various components. To address these issues, several libraries have recently emerged from groups in industry, academia, and national labs. In this talk, we introduce SimAI-Bench, a new tool for benchmarking and comparing the performance of different coupled simulation and AI/ML workflows on current and future HPC systems. In particular, the talk will focus on workflows for in situ training of graph neural network (GNN) surrogate models from ongoing computational fluid dynamics (CFD) simulations, which require the transfer of training data between the two components. We will discuss how different open-source libraries enable such workflows and compare their data transfer performance and scaling efficiency on the Aurora supercomputer at the Argonne Leadership Computing Facility.

Riccardo Balin, Shivam Barwey, Ramesh Balakrishnan, Bethany Lusch, Saumil Patel, Tom Uram, and Venkatram Vishwanath (Argonne National Laboratory)
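
SimAI-Bench itself is introduced in the talk; as a hedged illustration of the kind of measurement such a benchmark performs, the sketch below times the transfer of training batches from a mock simulation process to a mock trainer. This is not the SimAI-Bench API; all names and parameters are hypothetical.

```python
# Hypothetical micro-benchmark of simulation-to-trainer data transfer:
# a producer ("simulation") pushes batches to a consumer ("trainer"),
# and the consumer reports achieved throughput.
import time
import numpy as np
from multiprocessing import Process, Queue

BATCHES, SHAPE = 50, (64, 1024)  # ~256 KB per float32 batch

def simulation(q: Queue):
    for _ in range(BATCHES):
        q.put(np.random.rand(*SHAPE).astype(np.float32))  # one training batch
    q.put(None)                                           # end-of-run sentinel

def trainer(q: Queue):
    nbytes, t0 = 0, time.perf_counter()
    while (batch := q.get()) is not None:
        nbytes += batch.nbytes       # a real trainer would run a step here
    dt = time.perf_counter() - t0
    print(f"moved {nbytes / 1e6:.1f} MB in {dt:.3f} s "
          f"({nbytes / dt / 1e6:.1f} MB/s)")

if __name__ == "__main__":
    q = Queue(maxsize=8)
    procs = [Process(target=simulation, args=(q,)),
             Process(target=trainer, args=(q,))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```
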
16:30 - 17:00 CEST
Scalable and Consistent Mesh-Based Modeling of Fluid Flows with Distributed Graph Neural Networks

Graph neural networks (GNNs) have shown considerable promise in accelerating mesh-based modeling for applications like fluid dynamics, where models must be compatible with unstructured grids to be practical in complex geometries. To realize the vision of robust mesh-based modeling, however, the question of scalability to large graph sizes (O(10M) nodes and beyond) must be addressed, particularly when interfacing with unstructured data produced by high-fidelity computational fluid dynamics (CFD) codes. As such, we focus on the development of a distributed GNN that relies on novel alterations to the baseline message passing layer to facilitate scalable operations with consistency. Here, consistency means that a GNN trained and evaluated on one rank produces results arithmetically equivalent to the same model evaluated on multiple ranks. Demonstrations are performed in the context of in situ coupling of GNNs with NekRS, an exascale CFD code, using the Polaris supercomputer at the Argonne Leadership Computing Facility. The crux of the NekRS-GNN approach is to show how the same CFD domain-decomposition strategy can be linked to the distributed GNN training and inference routines. Emphasis is placed on two modeling applications: (1) developing surrogates for unsteady fluid dynamics forecasts, and (2) mesh-based super-resolution of turbulent flows.

Shivam Barwey, Riccardo Balin, Bethany Lusch, Saumil Patel, Ramesh Balakrishnan, and Pinaki Pal (Argonne National Laboratory); Romit Maulik (Pennsylvania State University, Argonne National Laboratory); and Venkatram Vishwanath (Argonne National Laboratory)
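
As a toy illustration of the consistency property under stated assumptions (sum-aggregation message passing, a hypothetical two-way node partition, and a halo exchange simulated in a single process), the sketch below checks that partitioned aggregation reproduces the single-rank result. It is a conceptual analogue, not the distributed GNN from the talk.

```python
# Toy demonstration of "consistency": message passing on a graph partitioned
# across two ranks matches the single-rank result, provided each rank also
# sees the halo node features exchanged across the partition boundary.
import numpy as np

rng = np.random.default_rng(0)
N, F = 8, 4
h = rng.standard_normal((N, F))                 # node features
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (3, 0)]
edges += [(j, i) for i, j in edges]             # make the graph undirected

def aggregate(h, edges):
    # One message-passing step: sum of neighbor features into each node.
    out = np.zeros_like(h)
    for i, j in edges:
        out[i] += h[j]
    return out

full = aggregate(h, edges)                      # single-"rank" reference

# Partition: rank 0 owns nodes 0-3, rank 1 owns nodes 4-7.
owned = [range(0, 4), range(4, 8)]
result = np.zeros_like(h)
for r in (0, 1):
    # Halo nodes: off-rank neighbors whose features must be exchanged.
    halo = {j for i, j in edges if i in owned[r] and j not in owned[r]}
    local = set(owned[r]) | halo
    sub = [(i, j) for i, j in edges if i in owned[r] and j in local]
    # Indexing the global h stands in for the halo exchange here.
    part = aggregate(h, sub)
    result[list(owned[r])] = part[list(owned[r])]

assert np.allclose(result, full)                # arithmetically consistent
print("partitioned aggregation matches the single-rank result")
```
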
17:00 - 17:30 CEST
TorchFort: A Library for Online Deep Learning in Fortran HPC Programs

Deep learning has shown promise in reducing computational cost or serving as an alternative method for modeling physical phenomena across a broad range of scientific applications. In these domains, the data sources are numerical simulation programs typically implemented in C, C++, or, still often, Fortran. This is in contrast to popular deep learning frameworks, which users interact with through Python. A common source of friction is how to efficiently couple the simulation program with the DL framework for training or inference.

In this talk, we discuss TorchFort, a library for online DL training and inference implemented with LibTorch, the C++ backend used by PyTorch. This library can be invoked directly from Fortran, C, or C++, enabling transparent sharing of data arrays from the simulation program to the DL framework, all contained within the simulation process. We will discuss the library design and walk through implementation examples to highlight the opportunities this tight coupling opens up for DL applications.

Thorsten Kurth, Josh Romero, and Massimiliano Fatica (NVIDIA Inc.)
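
TorchFort's actual entry points are Fortran/C/C++; as a language-neutral sketch of the pattern it enables, here is a hedged Python analogue in which the simulation and the training step live in one process and share the same buffers. The class and function names are illustrative, not the TorchFort API.

```python
# Conceptual analogue of in-process coupling: the simulation's state arrays
# are handed to the DL framework with no copies or interprocess transfers,
# because solver and model run inside the same process.
import numpy as np
import torch
import torch.nn as nn

class OnlineModel:
    """Illustrative stand-in for a model handle created inside the solver."""
    def __init__(self, n_in: int, n_out: int):
        self.net = nn.Sequential(nn.Linear(n_in, 32), nn.Tanh(),
                                 nn.Linear(32, n_out))
        self.opt = torch.optim.Adam(self.net.parameters())

    def train_step(self, state: np.ndarray, label: np.ndarray) -> float:
        x = torch.from_numpy(state)    # zero-copy view of the solver's buffer
        y = torch.from_numpy(label)
        loss = nn.functional.mse_loss(self.net(x), y)
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        return float(loss)

    def inference(self, state: np.ndarray) -> np.ndarray:
        with torch.no_grad():
            return self.net(torch.from_numpy(state)).numpy()

model = OnlineModel(3, 1)
state = np.zeros((16, 3), dtype=np.float32)    # simulation-owned array
for step in range(10):
    state[:] = np.random.rand(16, 3)           # "solver" updates state in place
    label = state.sum(axis=1, keepdims=True)
    model.train_step(state, label)             # train inside the same process
```
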
17:30 - 18:00 CEST
Scaling Coupled Simulation and AI Workflows on Aurora with Dragon

The advent of exascale computing has enabled computational workflows coupling simulation and AI at unprecedented scale and complexity. However, the scale of these workflows presents challenges for the efficient distribution of data between the various compute tasks spread across large node counts. In this talk, we present the Dragon open-source library as a tool for designing and executing data-intensive scientific workflows on modern HPC systems. In particular, Dragon's sharded memory model allows compute tasks to access data stored in memory regardless of node locality by means of automated RDMA transfers, exposed to the user through high-level data transfer APIs in C, C++, and Python. This enables the transfer of interdependent data across the different components of the workflow, avoiding costly file system I/O or the deployment of a database. We demonstrate the use of Dragon and its performance on the Aurora supercomputer at the Argonne Leadership Computing Facility with a workflow designed to identify new candidate cancer drugs by combining simulation with ML training and inference to accelerate high-throughput screening of 22 billion molecular compounds.

Christine Simpson, Riccardo Balin, Archit Vasan, and Sam Foreman (Argonne National Laboratory); Peter Mendygral, Colin Wahl, and Nick Hill (HPE); and Venkatram Vishwanath (Argonne National Laboratory)
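
Dragon's documented drop-in start method for Python's multiprocessing module gives a feel for the programming model described above. The minimal sketch below moves an array from a producer to a consumer through a queue that, under Dragon, can span nodes; setup details may vary across Dragon versions, the task functions are hypothetical, and the script must be launched with the Dragon runtime.

```python
# Hedged sketch of multi-node data movement with Dragon's drop-in backend
# for Python multiprocessing: processes and queues may be placed on
# different nodes, with data moved by the runtime rather than via files.
import dragon                      # must be imported before multiprocessing
import multiprocessing as mp
import numpy as np

def screen(q):
    # Stand-in for a docking/simulation task producing candidate scores.
    q.put(np.random.rand(1024).astype(np.float32))

def train(q):
    scores = q.get()               # arrives via the runtime, not the filesystem
    print("received", scores.shape, "scores; a surrogate update would go here")

if __name__ == "__main__":
    mp.set_start_method("dragon")  # enable the Dragon multiprocessing backend
    q = mp.Queue()
    procs = [mp.Process(target=screen, args=(q,)),
             mp.Process(target=train, args=(q,))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```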