AP2D - ACM Papers Session 2D
Tensor decompositions have emerged as powerful tools for multivariate data analysis, providing the foundation for numerous analysis methods. The Tucker decomposition in particular has been shown to be quite effective at compressing high-dimensional scientific data sets. However, applying these techniques to modern scientific simulation data is challenged by the massive data volumes these codes can produce, requiring scalable tensor decomposition methods that can exploit the hybrid parallelism available on modern computing architectures, as well as support for in situ processing to compute decompositions as these simulations generate data. In this work, we overcome these challenges by presenting a first-ever hybrid parallel and performance-portable approach for Tucker decomposition of both batch and streaming data. Our work builds on the TuckerMPI package, which provides scalable, distributed-memory Tucker decomposition techniques, as well as on prior work on a sequential streaming Tucker decomposition algorithm. We extend TuckerMPI to hybrid parallelism through the Kokkos/Kokkos-Kernels performance-portability packages, develop a hybrid parallel streaming Tucker decomposition algorithm, and demonstrate the performance and portability of these approaches on a variety of large-scale scientific data sets on both CPU and GPU architectures.
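As a rough illustration of the underlying operation (this is a minimal sequential sketch via truncated higher-order SVD, not the paper's TuckerMPI/Kokkos implementation; the function names and per-mode plain-SVD choice are assumptions of this sketch):

```python
import numpy as np

def hosvd(X, ranks):
    """Truncated HOSVD: a basic sequential Tucker decomposition of a
    dense tensor X to the given per-mode ranks."""
    factors = []
    for mode, r in enumerate(ranks):
        # Unfold X along `mode` and keep the leading r left singular vectors.
        unfolding = np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)
        U, _, _ = np.linalg.svd(unfolding, full_matrices=False)
        factors.append(U[:, :r])
    # Core tensor: contract X with each factor transpose along its mode.
    G = X
    for mode, U in enumerate(factors):
        G = np.moveaxis(np.tensordot(U.T, np.moveaxis(G, mode, 0), axes=1), 0, mode)
    return G, factors

def reconstruct(G, factors):
    """Rebuild the (approximate) tensor from core G and factor matrices."""
    X = G
    for mode, U in enumerate(factors):
        X = np.moveaxis(np.tensordot(U, np.moveaxis(X, mode, 0), axes=1), 0, mode)
    return X
```

For a tensor whose multilinear rank matches the truncation ranks, this reconstruction is exact up to floating-point error; the compression comes from storing the small core and thin factor matrices instead of the full tensor.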
This paper presents two new hybrid MPI-GPU algorithms for building distributed octrees. The first algorithm redistributes data between processes and is used to globally sort the points on which the octree is generated, according to their space-filling curve (SFC) codes. The second algorithm proposes a bottom-up approach to merge leaves from the maximum depth up to their final level, ensuring that each leaf contains no more than Nmax points. This method is better suited to GPU implementation because it maximises parallelism from the beginning of the algorithm. The methods have been implemented in the CWIPI library to reduce the execution time of the point-in-mesh location algorithm, which is performed several times when moving non-coincident meshes are used. Tests on large cases have shown speedups of up to 120× compared to a conventional CPU version, with scaling as good as the full CPU version.
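The global sort in the first algorithm keys each point by an SFC code; a common choice is the Morton (Z-order) code, obtained by interleaving the bits of the quantized coordinates. A minimal serial sketch (the function names and the unit-cube bounding-box assumption are ours, not CWIPI's API):

```python
def morton3d(ix, iy, iz, depth=10):
    """Interleave `depth` bits of each quantized coordinate into a Morton code."""
    code = 0
    for b in range(depth):
        code |= ((ix >> b) & 1) << (3 * b + 2)  # x bit
        code |= ((iy >> b) & 1) << (3 * b + 1)  # y bit
        code |= ((iz >> b) & 1) << (3 * b)      # z bit
    return code

def sort_points_by_sfc(points, depth=10):
    """Sort 3D points in [0,1)^3 along the Z-order curve (serial stand-in
    for the paper's distributed MPI-GPU sort)."""
    scale = (1 << depth) - 1
    def quantize(p):
        return tuple(min(scale, int(c * scale)) for c in p)
    return sorted(points, key=lambda p: morton3d(*quantize(p), depth))
```

Because nearby Morton codes correspond to nearby octree cells, a global sort by code places each process's points into contiguous spatial regions, which is what makes the subsequent bottom-up leaf merging local.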
In compiler theory, data-dependence analysis is used to exploit Instruction Level Parallelism (ILP). Three dependence types are exploited efficiently in modern compilers and hardware schemes and are fundamental to any code compilation. Read-after-read (RAR) has been left out, as it cannot cause a data hazard. This article introduces a novel method that uses this additional dependence information, contained in any code, to enhance automatic parallelization. The method builds groups of arbitrary sequential instruction chains during static code analysis and introduces potential transfers between these groups. This opens new opportunities when optimizing code for parallel processing hardware. The segmentation enables more information about the potential parallelization of the code to be gained during static code analysis and enhances optimization opportunities. The novel principle is introduced using a very simple example, and the segmentation is then applied to task- and data-parallelism examples. Automatic parallelization for a multicore platform is demonstrated based on the new segmentation method. The forecast of the optimal distribution of the segments, based on two key platform parameters, is compared to the measured speedups of the resulting codes.