AP1D - ACM Papers Session 1D
To make the best use of the diverse hardware architectures in present and future high-performance computers, developers and maintainers of scientific simulation codes strive for performance portability. The goal is to reach a good fraction of the practically achievable hardware-specific performance while maintaining a largely unified codebase. In benchmarks and first production codes, SYCL has been demonstrated to be a promising programming model for this purpose when targeting different CPUs and GPUs. In this work, we use SYCL to develop a performance-portable implementation of the 2D shallow water equations, discretized on unstructured triangular meshes using the discontinuous Galerkin method with polynomial orders zero, one, and two. In addition to GPUs from three vendors and CPUs from two, we broaden the scope of target architectures by including Intel Stratix FPGAs, which have a fundamentally different execution model. We show that with a few targeted and encapsulated specializations, it is possible to adapt the execution flow to the respective targets. The performance analysis shows how FPGAs complement the other two architecture classes, with particularly good performance for small problem sizes.
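To make the idea of encapsulated specializations concrete, here is a minimal, hedged SYCL 2020 sketch (not the paper's code; the kernel body and all names such as num_elements and state are illustrative). On CPUs and GPUs, each mesh element maps to one work-item; the comment marks where an FPGA-specific execution flow, such as a single_task pipeline, would be swapped in behind the same interface.

    #include <sycl/sycl.hpp>
    #include <vector>

    int main() {
      constexpr size_t num_elements = 1024;          // illustrative problem size
      std::vector<float> state(num_elements, 1.0f);  // per-element DG coefficients

      sycl::queue q{sycl::default_selector_v};
      {
        sycl::buffer<float, 1> buf{state.data(), sycl::range<1>{num_elements}};
        q.submit([&](sycl::handler& h) {
          sycl::accessor acc{buf, h, sycl::read_write};
          // CPU/GPU flow: one work-item per mesh element.
          // An encapsulated FPGA specialization would replace this with a
          // different execution flow (e.g. a single_task pipeline kernel).
          h.parallel_for(sycl::range<1>{num_elements}, [=](sycl::id<1> i) {
            acc[i] *= 0.5f;  // placeholder for the element-local DG update
          });
        });
      }  // buffer destruction waits for the kernel and copies results back
    }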
We present a massively parallel lockstep algorithm for dualizing large numbers of surface triangulation graphs, together with an effective implementation for CPU, GPU, and multi-GPU systems. The algorithm is fully combinatorial, i.e., it does not require or use a planar or spatial embedding, only the graph itself.
This work is motivated by a wish to perform computational chemistry experiments on entire isomer spaces of polyhedral molecules, comprising billions of distinct molecules, each represented by a cubic graph. However, the algorithm applies not only to triangulations of the sphere, but to triangulations of oriented surfaces of any genus, for example toroidal topologies.
Our multi-vendor implementation in SYCL outperforms the previous sequential state of the art by four orders of magnitude on a consumer NVIDIA RTX 3080 Graphics Processing Unit (GPU), with an average throughput of 37 ps (±0.1 ps) per vertex (varying from 50 ps to 31 ps across C72 to C200). Thus, dualizing e.g. all 214,127,742 C200 fullerene molecules adds a mere 1.49 s (±0.01 s) to the total processing time, negligible compared to the two hours required to generate the graphs. We subsequently perform extreme multi-node multi-GPU scaling experiments on the LUMI-G supercomputer, achieving near-perfect scaling up to 1024 MI250x Graphics Compute Dies (GCDs), in total 14.5 million cores. These results show that dualization has gone from being a bottleneck to being ready for our planned large-scale chemical experiments on all 2.7 × 10^12 fullerene molecules from C20 through C400.
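For readers unfamiliar with combinatorial dualization, the following sequential C++ sketch shows the core operation that the paper performs in lockstep on GPUs (it is not the paper's algorithm; names like dualize are illustrative, and the sketch omits the cyclic neighbour ordering that an oriented surface provides). Each triangle becomes a dual vertex, and two dual vertices are adjacent exactly when their triangles share an edge; on a closed triangulation every edge borders two triangles, so the dual is a cubic graph.

    #include <algorithm>
    #include <array>
    #include <cstdint>
    #include <map>
    #include <utility>
    #include <vector>

    using Triangle = std::array<uint32_t, 3>;  // three vertex indices

    // Sequential sketch of combinatorial dualization; assumes fewer than
    // 2^32 triangles. Returns, for each triangle, its three edge-neighbours.
    std::vector<std::array<uint32_t, 3>> dualize(const std::vector<Triangle>& tris) {
      // Map each undirected edge (min, max) to the triangles containing it.
      std::map<std::pair<uint32_t, uint32_t>, std::vector<uint32_t>> edge_to_tris;
      for (uint32_t t = 0; t < tris.size(); ++t)
        for (int k = 0; k < 3; ++k) {
          uint32_t a = tris[t][k], b = tris[t][(k + 1) % 3];
          edge_to_tris[{std::min(a, b), std::max(a, b)}].push_back(t);
        }
      // Connect the two triangles sharing each interior edge.
      std::vector<std::array<uint32_t, 3>> dual(tris.size());
      std::vector<int> deg(tris.size(), 0);
      for (auto& [edge, ts] : edge_to_tris)
        if (ts.size() == 2) {  // on a closed surface, every edge qualifies
          dual[ts[0]][deg[ts[0]]++] = ts[1];
          dual[ts[1]][deg[ts[1]]++] = ts[0];
        }
      return dual;
    }

The edge-to-triangle map is what makes this sketch sequential; a lockstep GPU formulation would instead let every triangle locate its three neighbours independently, which is the kind of restructuring the paper's algorithm performs.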