AP2D - ACM Papers Session 2D
Tensor decompositions have emerged as powerful tools for multivariate data analysis, providing the foundation for numerous analysis methods. The Tucker decomposition in particular has been shown to be quite effective at compressing high-dimensional scientific data sets. However, applying these techniques to modern scientific simulation data is challenged by the massive data volumes these codes can produce, requiring scalable tensor decomposition methods that can exploit the hybrid parallelism available on modern computing architectures, as well as support for in situ processing to compute decompositions as these simulations generate data. In this work, we overcome these challenges by presenting a first-ever hybrid parallel and performance-portable approach for Tucker decomposition of both batch and streaming data. Our work builds on the TuckerMPI package, which provides scalable, distributed-memory Tucker decomposition techniques, as well as on prior work on a sequential streaming Tucker decomposition algorithm. We extend TuckerMPI to hybrid parallelism through the Kokkos/Kokkos-Kernels performance-portability packages, develop a hybrid parallel streaming Tucker decomposition algorithm, and demonstrate the performance and portability of these approaches on a variety of large-scale scientific data sets on both CPU and GPU architectures.
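As a rough illustration of the underlying operation (this is a minimal sequential sketch via truncated higher-order SVD, not the paper's TuckerMPI/Kokkos implementation; the function names and per-mode plain-SVD choice are assumptions of this sketch):

```python
import numpy as np

def hosvd(X, ranks):
    """Truncated HOSVD: a basic sequential Tucker decomposition of a
    dense tensor X to the given per-mode ranks."""
    factors = []
    for mode, r in enumerate(ranks):
        # Unfold X along `mode` and keep the leading r left singular vectors.
        unfolding = np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)
        U, _, _ = np.linalg.svd(unfolding, full_matrices=False)
        factors.append(U[:, :r])
    # Core tensor: contract X with each factor transpose along its mode.
    G = X
    for mode, U in enumerate(factors):
        G = np.moveaxis(np.tensordot(U.T, np.moveaxis(G, mode, 0), axes=1), 0, mode)
    return G, factors

def reconstruct(G, factors):
    """Rebuild the (approximate) tensor from core G and factor matrices."""
    X = G
    for mode, U in enumerate(factors):
        X = np.moveaxis(np.tensordot(U, np.moveaxis(X, mode, 0), axes=1), 0, mode)
    return X
```

For a tensor whose multilinear rank matches the truncation ranks, this reconstruction is exact up to floating-point error; the compression comes from storing the small core and thin factor matrices instead of the full tensor.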
This paper presents two new hybrid MPI-GPU algorithms for building distributed octrees. The first algorithm redistributes data between processes and is used to globally sort the points on which the octree is generated, according to their space-filling curve (SFC) codes. The second algorithm proposes a bottom-up approach to merge leaves from the maximum depth up to their final level, ensuring that each leaf contains no more than Nmax points. This method is better suited to GPU implementation because it maximises parallelism from the beginning of the algorithm. The methods have been implemented in the CWIPI library to reduce the execution time of the point-in-mesh location algorithm, which is performed several times when moving non-coincident meshes are used. Tests on large cases have shown speedups of up to 120× compared to a conventional CPU version, with scaling as good as the full CPU version.
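The global sort in the first algorithm keys each point by an SFC code; a common choice is the Morton (Z-order) code, obtained by interleaving the bits of the quantized coordinates. A minimal serial sketch (the function names and the unit-cube bounding-box assumption are ours, not CWIPI's API):

```python
def morton3d(ix, iy, iz, depth=10):
    """Interleave `depth` bits of each quantized coordinate into a Morton code."""
    code = 0
    for b in range(depth):
        code |= ((ix >> b) & 1) << (3 * b + 2)  # x bit
        code |= ((iy >> b) & 1) << (3 * b + 1)  # y bit
        code |= ((iz >> b) & 1) << (3 * b)      # z bit
    return code

def sort_points_by_sfc(points, depth=10):
    """Sort 3D points in [0,1)^3 along the Z-order curve (serial stand-in
    for the paper's distributed MPI-GPU sort)."""
    scale = (1 << depth) - 1
    def quantize(p):
        return tuple(min(scale, int(c * scale)) for c in p)
    return sorted(points, key=lambda p: morton3d(*quantize(p), depth))
```

Because nearby Morton codes correspond to nearby octree cells, a global sort by code places each process's points into contiguous spatial regions, which is what makes the subsequent bottom-up leaf merging local.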
In compiler theory, data-dependence analysis is used to exploit Instruction Level Parallelism (ILP). Three dependence types are exploited efficiently in modern compilers and hardware schemes and are fundamental to any code compilation. Read-after-read (RAR) has been left out, as it cannot cause a data hazard. This article introduces a novel method that uses this additional dependence information, contained in any code, to enhance automatic parallelization. The method builds groups of arbitrary sequential instruction chains during static code analysis and introduces potential transfers between these groups. This opens new opportunities when optimizing code for parallel processing hardware. The segmentation enables more information about the potential parallelization of the code to be gained during static code analysis and enhances optimization opportunities. The novel principle is introduced using a very simple example, and the segmentation is then applied to task- and data-parallelism examples. Automatic parallelization for a multicore platform is demonstrated based on the new segmentation method. The forecast of the optimal distribution of the segments, based on two key platform parameters, is compared to the measured speedups of the resulting codes.