Paper

AP1D - ACM Papers Session 1D

Fully booked
Monday, June 3, 2024
17:00 - 18:00 CEST
HG E 1.2

Presentations

17:00 - 17:30 CEST
Enabling Performance Portability for Shallow Water Equations on CPUs, GPUs, and FPGAs with SYCL

To make the best use of the diverse hardware architectures in present and future high-performance computers, developers and maintainers of scientific simulation codes strive for performance portability. The goal is to reach a good fraction of the practically achievable, hardware-specific performance while maintaining a largely unified codebase. In benchmarks and first production codes, SYCL has been demonstrated to be a promising programming model for this purpose when targeting different CPUs and GPUs. In this work, we utilize SYCL to develop a performance-portable implementation of the 2D shallow water equations, discretized on unstructured triangular meshes using the discontinuous Galerkin method with polynomial orders zero, one, and two. In addition to GPUs from three vendors and CPUs from two vendors, we also broaden the scope of target architectures by including Intel Stratix FPGAs, which have a fundamentally different execution model. We show that with a few targeted and encapsulated specializations, it is possible to adapt the execution flow to the respective targets. The performance analysis shows how FPGAs complement the other two architectures with particularly good performance for small problem sizes.

Markus Büttner (University of Bayreuth); Christoph Alt (Paderborn University, Friedrich-Alexander-Universität Erlangen-Nürnberg); Tobias Kenter (Paderborn University); Harald Köstler (Friedrich-Alexander-Universität Erlangen-Nürnberg); Christian Plessl (Paderborn University); and Vadym Aizinger (University of Bayreuth)
With Thorsten Kurth (NVIDIA Inc.)
17:30 - 18:00 CEST
Lockstep-Parallel Dualization of Surface Triangulations

We present a massively parallel lockstep algorithm for dualizing large numbers of surface triangulation graphs, and an effective implementation for CPU, GPU and multi-GPU. The algorithm is fully combinatorial, i.e., it does not require or use a planar or spatial embedding, only the graph.
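To illustrate what "fully combinatorial" dualization means, the following is a minimal sequential sketch in Python. It is not the lockstep-parallel SYCL algorithm presented in the paper. It assumes, purely for illustration, that the triangulation is given as a rotation system, i.e. each vertex lists its neighbours in a consistent cyclic order; the function name `dualize` and the `tetrahedron` input are hypothetical.

```python
def dualize(neighbours):
    """Dualize a surface triangulation given only as a graph.

    Assumption: neighbours[u] lists the neighbours of vertex u in a
    consistent cyclic order (a rotation system). No coordinates are used.
    Returns (faces, dual): faces[i] is a triangle (u, v, w), and dual[i]
    lists the three faces sharing an edge with face i.
    """
    arc_face = {}  # directed arc (u, v) -> index of the triangle it borders
    faces = []
    for u, ring in neighbours.items():
        for v in ring:
            if (u, v) in arc_face:
                continue
            # The third corner is the neighbour that follows u around v.
            v_ring = neighbours[v]
            w = v_ring[(v_ring.index(u) + 1) % len(v_ring)]
            face_id = len(faces)
            faces.append((u, v, w))
            for arc in ((u, v), (v, w), (w, u)):
                arc_face[arc] = face_id
    # Across each edge, the adjacent triangle owns the reversed arc,
    # so every dual node gets exactly three neighbours (a cubic graph).
    dual = [
        [arc_face[(b, a)] for a, b in ((u, v), (v, w), (w, u))]
        for (u, v, w) in faces
    ]
    return faces, dual


# Hypothetical input: a tetrahedron, the smallest triangulation of the sphere.
tetrahedron = {0: [1, 3, 2], 1: [2, 3, 0], 2: [0, 3, 1], 3: [0, 1, 2]}
faces, dual = dualize(tetrahedron)
```

The tetrahedron is self-dual, so its four triangles are pairwise adjacent in the resulting cubic graph. Each directed arc is handled independently of the others, which hints at why the problem suits massively parallel execution, although the paper's actual parallelization strategy is not described here.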

This work is motivated by a wish to perform computational chemistry experiments on entire isomerspaces of polyhedral molecules, comprising billions of distinct molecules, each represented by a cubic graph. However, the algorithm applies not only to triangulations of the sphere, but to any triangulations of oriented surfaces of any genus, for example toroidal topologies.

Our multi-vendor implementation in SYCL outperforms the previous sequential state of the art by 4 orders of magnitude on our consumer NVIDIA RTX 3080 Graphics Processing Unit (GPU), with an average throughput of 37 ps (+/- 0.1 ps) per vertex (varying from 50 ps to 31 ps for C72-C200). Thus, dualizing, for example, all 214,127,742 C200 fullerene molecules adds a mere 1.49 s (+/- 0.01 s) to the total processing time, negligible compared to the two hours required to generate the graphs. We subsequently perform extreme multi-node, multi-GPU scaling experiments on the LUMI-G supercomputer, achieving near-perfect scaling up to 1024 MI250X Graphics Compute Dies (GCDs), in total 14.5 million cores. Calculations show that dualization has moved from being a bottleneck to being ready to contribute to our planned large-scale chemical experiments for all 2.7 x 10^12 fullerene molecules from C20 through C400.

Jonas Dornonville de la Cour (Aarhus University); Carl-Johannes Johnsen (University of Copenhagen); and James Emil Avery (Aarhus University, University of Copenhagen)
With Thorsten Kurth (NVIDIA Inc.)