Minisymposium
MS4E - In Situ Coupling of Simulations and AI/ML for HPC: Software, Methodologies, and Applications - Part II
Session Chair
Description
Motivated by the remarkable success of artificial intelligence (AI) and machine learning (ML) in computer vision and natural language processing, the last decade has seen a host of successful applications of AI/ML across a variety of scientific domains. In most cases, the models are trained using the traditional offline (or post hoc) approach, wherein the training data is produced, assembled, and curated separately before the model is trained. While more straightforward, the offline training workflow can impose important restrictions on the adoption of ML models for scientific applications. To overcome these limitations, in situ (or online) ML approaches, wherein ML tasks are performed concurrently with the ongoing simulation, have recently emerged as an attractive new paradigm. In this minisymposium, we explore novel approaches to enable the coupling of state-of-the-art simulation codes with different AI/ML techniques. We discuss the open-source software libraries being developed to address the software engineering challenges of in situ ML workflows, the methodologies adopted to scale on modern HPC systems, and their applications to complex problems in different computational science domains.
Presentations
In situ AI/ML workflows, in which ML tasks are coupled to an ongoing simulation, are an attractive new paradigm for developing robust and predictive surrogate models that accelerate time to science by steering simulation ensembles and replacing expensive computations. In high performance computing (HPC), these workflows require scalable and efficient solutions to integrate the rapidly evolving ecosystem of ML frameworks with traditional simulation codes, transferring large volumes of data between the various components. To address these issues, several libraries have recently emerged from groups in industry, academia, and national labs. In this talk, we introduce SimAI-Bench, a new tool for benchmarking and comparing the performance of different coupled simulation and AI/ML workflows on current and future HPC systems. In particular, the talk will focus on workflows for in situ training of graph neural network (GNN) surrogate models from ongoing computational fluid dynamics (CFD) simulations, which require the transfer of training data between the two components. We will discuss how different open-source libraries enable such workflows and compare their data transfer performance and scaling efficiency on the Aurora supercomputer at the Argonne Leadership Computing Facility.
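As a rough illustration of the coupling pattern such workflows exercise, the sketch below streams snapshots from a stand-in "simulation" process to a stand-in "trainer" process through a bounded in-memory queue; the queue is only a placeholder for the staging layers the benchmarked libraries actually provide, and none of the names below are SimAI-Bench APIs.

```python
# Illustrative producer/consumer coupling: a stand-in "simulation" streams
# training snapshots to a stand-in "trainer" through an in-memory channel.
# The multiprocessing.Queue is a placeholder for the staging layer (database,
# RDMA-backed store, etc.) provided by the libraries compared in the talk.
import multiprocessing as mp
import numpy as np

N_STEPS = 100          # simulation time steps
SAMPLES_PER_STEP = 4   # snapshots shipped to the trainer per step

def simulation(channel):
    """Stand-in CFD solver: advances a state and ships snapshots."""
    state = np.random.rand(1024, 3).astype(np.float32)   # e.g. velocity field
    for _ in range(N_STEPS):
        state += 1e-3 * np.random.randn(*state.shape).astype(np.float32)
        for _ in range(SAMPLES_PER_STEP):
            channel.put(state.copy())        # in-memory transfer, no file I/O
    channel.put(None)                         # end-of-run sentinel

def trainer(channel):
    """Stand-in online trainer: consumes snapshots as they arrive."""
    n_seen = 0
    while True:
        sample = channel.get()
        if sample is None:
            break
        # ... assemble a graph batch and run a GNN training step here ...
        n_seen += 1
    print(f"trained on {n_seen} in situ samples")

if __name__ == "__main__":
    queue = mp.Queue(maxsize=64)   # bounded buffer provides back-pressure
    procs = [mp.Process(target=simulation, args=(queue,)),
             mp.Process(target=trainer, args=(queue,))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```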
Graph neural networks (GNNs) have shown considerable promise in accelerated mesh-based modeling for applications like fluid dynamics, where models must be compatible with unstructured grids to provide practical simulation capability in complex geometries. To realize the vision of robust mesh-based modeling, however, the question of scalability to large graph sizes (O(10M) nodes and beyond) must be addressed, particularly when interfacing with unstructured data produced by high-fidelity computational fluid dynamics (CFD) codes. We therefore focus on the development of a distributed GNN that relies on novel alterations to the baseline message passing layer to enable scalable operation with consistency, meaning that a GNN trained and evaluated on a single rank is arithmetically equivalent to one trained and evaluated across multiple ranks. Demonstrations are performed in the context of in situ coupling of GNNs with NekRS, an exascale CFD code, using the Polaris supercomputer at the Argonne Leadership Computing Facility. The crux of the NekRS-GNN approach is to show how the same CFD domain-decomposition strategy can be linked to the distributed GNN training and inference routines. Emphasis is placed on two modeling applications: (1) developing surrogates for unsteady fluid dynamics forecasts, and (2) mesh-based super-resolution of turbulent flows.
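A minimal numpy sketch of the consistency property described above, assuming a simple mean-aggregation message passing layer and a two-way partition with halo (ghost) nodes; it is purely illustrative and not the NekRS-GNN implementation.

```python
# Consistency check: mean-aggregation message passing evaluated on a
# partitioned graph (with halo copies of off-partition neighbors) reproduces
# the single-rank result for the nodes each partition owns.
import numpy as np

# Undirected 6-node ring: store each edge in both directions as (src, dst).
edges = [(i, (i + 1) % 6) for i in range(6)]
edges += [(d, s) for s, d in edges]
x = np.random.rand(6, 4)                        # node features

def mean_aggregate(feats, edge_list):
    """One message-passing step: mean of incoming neighbor features."""
    agg = np.zeros_like(feats)
    deg = np.zeros(feats.shape[0])
    for src, dst in edge_list:
        agg[dst] += feats[src]
        deg[dst] += 1
    return agg / np.maximum(deg, 1)[:, None]    # guard halo rows with no edges

# Reference result: the whole graph on a single "rank".
ref = mean_aggregate(x, edges)

# Partition the graph: rank 0 owns nodes {0,1,2}, rank 1 owns {3,4,5}.
# Each rank keeps halo copies of off-rank neighbors, mirroring the CFD
# solver's domain decomposition and its ghost layer.
for owned in ({0, 1, 2}, {3, 4, 5}):
    halo = {s for s, d in edges if d in owned and s not in owned}
    local = sorted(owned) + sorted(halo)        # local node ordering
    g2l = {g: l for l, g in enumerate(local)}   # global -> local indices
    local_edges = [(g2l[s], g2l[d]) for s, d in edges if d in owned]
    out = mean_aggregate(x[local], local_edges)
    # Rows for owned nodes match the single-rank reference.
    assert np.allclose(out[: len(owned)], ref[sorted(owned)])
print("partitioned message passing matches the single-rank result")
```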
Deep learning (DL) has shown promise in reducing computational cost or serving as an alternative method for modeling physical phenomena across a broad range of scientific applications. In these domains, the data sources are numerical simulation programs typically implemented in C, C++, or, still often, Fortran. This contrasts with popular deep learning frameworks, which users interact with through Python. A source of friction that often arises is how to efficiently couple the simulation program with the DL framework for training or inference.
In this talk, we discuss TorchFort, a library for online DL training and inference implemented with LibTorch, the C++ backend used by PyTorch. The library can be invoked directly from Fortran/C/C++, enabling transparent sharing of data arrays from the simulation program to the DL framework, all contained within the simulation process. We will discuss the library design and walk through implementation examples that highlight the opportunities this tight coupling offers for DL applications.
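As a rough functional analogue (not TorchFort's actual Fortran/C/C++ interface), the PyTorch sketch below shows the kind of online training step such an in-process coupling performs each simulation time step: the solver's current arrays are handed directly to the framework for a forward/backward pass, with no file I/O and no separate Python driver.

```python
# Functional analogue of an online, in-process training step: every time
# step, the solver's current state and target arrays are passed to the DL
# framework for one forward/backward pass. All names below are illustrative.
import numpy as np
import torch

model = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

def train_step(state, target):
    """One online training step on arrays produced by the solver this step."""
    x = torch.from_numpy(state)      # zero-copy view of the solver's array
    y = torch.from_numpy(target)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()

# Stand-in time loop: with TorchFort, the equivalent call is made from the
# Fortran/C/C++ solver itself, against the LibTorch backend.
state = np.random.rand(1024, 3).astype(np.float32)
for step in range(10):
    new_state = state + 1e-3 * np.random.randn(*state.shape).astype(np.float32)
    loss = train_step(state, new_state)   # learn the one-step update
    state = new_state
print(f"final loss: {loss:.3e}")
```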
The advent of exascale computing has enabled computational workflows of unprecedented scale and complexity that couple simulation and AI. However, the scale of these workflows presents challenges for the efficient distribution of data between the various compute tasks spread across large node counts. In this talk, we present the Dragon open-source library as a tool for designing and executing data-intensive scientific workflows on modern HPC systems. In particular, Dragon's sharded memory model allows compute tasks to access data stored in memory regardless of node locality by means of automated RDMA transfers, exposed to the user through high-level data transfer APIs in C, C++, and Python. This enables the transfer of interdependent data across different components of the workflow without costly I/O to the filesystem or the deployment of a separate database. We demonstrate the use of Dragon and its performance on the Aurora supercomputer at the Argonne Leadership Computing Facility with a workflow that identifies candidate cancer drugs by combining simulation with ML training and inference to accelerate high-throughput screening of 22 billion molecular compounds.
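A minimal sketch of this pattern, assuming Dragon's documented drop-in support for the Python multiprocessing API (the "dragon" start method); the scoring function and candidate list are illustrative stand-ins, not the actual screening workflow.

```python
# Illustrative high-throughput screening skeleton on top of Dragon's
# multiprocessing integration: worker processes may span multiple nodes,
# with task and result data moved through Dragon's sharded memory rather
# than the filesystem. The scoring function below is a placeholder.
import dragon                      # registers the "dragon" start method
import multiprocessing as mp

def score(candidate):
    """Stand-in for an ML-inference / docking score of one compound."""
    return candidate, float(len(candidate) % 7) / 7.0

if __name__ == "__main__":
    mp.set_start_method("dragon")  # processes may now be placed across nodes
    candidates = [f"compound-{i}" for i in range(100_000)]
    with mp.Pool(processes=128) as pool:
        results = pool.map(score, candidates, chunksize=1024)
    best = max(results, key=lambda kv: kv[1])
    print("top candidate:", best)
```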