
Minisymposium

MS6C - European Perspective on Converged HPC and Cloud Hardware & Software Architectures

Wednesday, June 5, 2024
11:30 - 13:30 CEST
HG E 1.1



Session Chair

Thorsten Kurth (NVIDIA Inc.)

Description

Today, cloud computing technologies have gained prevalence for their benefits in resource dynamism, automation, reproducibility, and resilience. HPC offers access to advanced computing techniques and massive processing capability for grand challenges and scientific discovery. Meanwhile, the computing landscape is changing rapidly towards complex workloads and workflows that combine simulations with data analytics and machine learning. These workloads aim to apply large-scale and distributed computing to domains with high societal impact, such as autonomous vehicles and smart cities. Under the Horizon Europe 2022 call "Open source for cloud-based services", several projects have come together to tackle different aspects of the challenge of integrating Cloud and HPC. In this minisymposium we discuss first results as well as the possible impact of some of these projects and their proposed architectures on the HPC landscape.

Presentations

11:30 - 12:00 CEST
Managing Converged HPC and Cloud Architectures with CSM/OCHAMI in OpenCUBE

The convergence of cloud and HPC technologies has become a major theme in recent years. Virtualization and orchestration are increasingly used to offer an integrated workflow experience across heterogeneous hardware, be it a supercomputer or web service. Within the OpenCUBE project, we aim to develop an innovative full-stack solution for a European cloud computing blueprint that bridges this continuum while incorporating European Processor Initiative hardware.

Cray System Management (CSM) is a cloud-based system management platform that merges microservices and cloud technologies with HPC software to enable the management of large-scale supercomputers. OCHAMI, launched as an open community effort involving LANL, LBNL, NERSC, CSCS, HPE, and the University of Bristol, further extends the CSM implementation to offer additional, tailored solutions.

In this talk, we explore the differences and commonalities between the cloud and HPC approaches to computing. We present the new OpenCUBE software and hardware stack, centered around CSM and OCHAMI in combination with a high-performance interconnect, and discuss how it addresses cloud/HPC integration issues at the architecture and cluster-management levels. Finally, we give an outlook on current and future developments in the field.
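
To give a flavor of the microservice-based management model, the sketch below queries a CSM/OCHAMI-style inventory service over REST. It is a minimal Python example assuming an SMD-like endpoint; the host, port, and response layout are illustrative assumptions, not a fixed OpenCUBE interface.

```python
import requests

# Assumed base URL of an SMD-style state management microservice;
# host and port are placeholders for illustration only.
SMD_URL = "http://smd.mgmt.example.local:27779"

def list_compute_nodes():
    """List compute node components known to the inventory service."""
    resp = requests.get(
        f"{SMD_URL}/hsm/v2/State/Components",
        params={"type": "Node", "role": "Compute"},
        timeout=10,
    )
    resp.raise_for_status()
    # Assumed response shape: {"Components": [{"ID": ..., "State": ...}, ...]}
    return resp.json().get("Components", [])

if __name__ == "__main__":
    for node in list_compute_nodes():
        print(node.get("ID"), node.get("State"))
```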

Nina Mujkanovic and Tiziano Müller (HPE)
12:00 - 12:30 CEST
Challenges and Opportunities in Running Kubernetes Workloads on HPC

Cloud and HPC increasingly converge in hardware platform capabilities and specifications, yet still differ largely in the software stack and how it manages available resources. The HPC world typically favors Slurm for job scheduling, whereas Cloud deployments rely on Kubernetes to orchestrate container instances across nodes. Running hybrid workloads is possible by using bridging mechanisms that submit jobs from one environment to the other. However, such solutions require costly data movements, while operating within the constraints set by each setup's network and access policies. In this presentation, we introduce a container-based design that enables running unmodified Kubernetes workloads directly on HPC systems: users deploy their own private Kubernetes mini Cloud, which internally converts container lifecycle management commands to use the HPC system-level Slurm infrastructure for scheduling and Singularity/Apptainer as the container runtime. We consider this approach practical for deployment in HPC centers, as it requires minimal pre-configuration and retains existing resource management and accounting policies.
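
As a rough sketch of what such a conversion could look like, the Python snippet below maps a simplified pod spec onto a Slurm batch script that launches the container through Apptainer. The pod_to_slurm_script helper and the manifest are hypothetical; a real bridge must also handle volumes, environment variables, secrets, and networking.

```python
import subprocess  # used only for the optional sbatch submission below

import yaml  # PyYAML

def pod_to_slurm_script(pod_manifest: str) -> str:
    """Map a (simplified) Kubernetes pod spec onto a Slurm batch script
    that runs the container image via Apptainer."""
    pod = yaml.safe_load(pod_manifest)
    container = pod["spec"]["containers"][0]
    command = " ".join(container.get("command", []))
    cpus = container.get("resources", {}).get("requests", {}).get("cpu", "1")
    return (
        "#!/bin/bash\n"
        f"#SBATCH --job-name={pod['metadata']['name']}\n"
        f"#SBATCH --cpus-per-task={cpus}\n"
        f"apptainer exec docker://{container['image']} {command}\n"
    )

pod_yaml = """
apiVersion: v1
kind: Pod
metadata:
  name: hello
spec:
  containers:
  - name: hello
    image: alpine:3.19
    command: ["echo", "hello from Slurm"]
    resources:
      requests:
        cpu: "2"
"""

print(pod_to_slurm_script(pod_yaml))
# On a Slurm system, the generated script could be submitted directly:
# subprocess.run(["sbatch"], input=pod_to_slurm_script(pod_yaml), text=True)
```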

Antony Chazapis (FORTH); Fotis Nikolaidis (SuperDuperDB); Manolis Marazakis (FORTH); and Angelos Bilas (FORTH, University of Crete)
12:30 - 13:00 CEST
Emerging Paradigms in the Convergence of High-Performance Computing and Cloud

We will present and discuss the design considerations for meeting HPC applications' requirements on performance and adaptivity in cloud-native containerized environments. As an example of emerging workflows for cloud computing, we will present a case study of transforming a molecule-docking workflow used in drug discovery into cloud-native Apache Airflow on a Kubernetes cluster. To accommodate the increasingly dynamic nature of HPC applications, we will also present work atop the Kubernetes autoscaler to enable elastic execution of HPC applications, specifically tackling the challenge of adapting tightly coupled MPI applications to cloud environments. Finally, an outlook on the need for reactive system services for interference mitigation will be presented.
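
For illustration, a cloud-native docking pipeline of this kind could be expressed as an Airflow DAG whose tasks run as pods on the Kubernetes cluster. The sketch below assumes Airflow 2.x with the cncf.kubernetes provider installed; the image names and scripts are placeholders, not the actual tools used in the case study.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="molecule_docking",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # triggered manually, not on a timetable
    catchup=False,
) as dag:
    prepare = KubernetesPodOperator(
        task_id="prepare_ligands",
        name="prepare-ligands",
        image="docking-tools:latest",  # placeholder image
        cmds=["/bin/sh", "-c", "prepare_ligands.sh"],
    )
    dock = KubernetesPodOperator(
        task_id="run_docking",
        name="run-docking",
        image="docking-tools:latest",  # placeholder image
        cmds=["/bin/sh", "-c", "run_docking.sh"],
    )
    prepare >> dock  # docking starts only after ligand preparation completes
```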

Ivy Peng and Daniel Medeiros (KTH Royal Institute of Technology)
13:00 - 13:30 CEST
The FDB: Developments Supporting a Semantic Approach to Scientific Data Management

Data management plays a vital role in complex scientific workflows. As systems and workflows expand, become more heterogeneous, and integrate components across the HPC-cloud ecosystem, the data management challenges grow larger.

We introduce the FDB, a specialised object store for meteorological data developed in-house at ECMWF, along with its metadata-driven API and access semantics optimised for time-critical forecasting workflows. The FDB was initially developed to absorb Numerical Weather Prediction model output, managing access to the global parallel filesystem in an HPC environment, but it has grown to provide a larger, more general, multi-system data ecosystem.
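
As a flavor of this metadata-driven access model, the sketch below lists and retrieves data through pyfdb, the FDB's Python bindings. The MARS-style request keys and values are illustrative and depend on what has actually been archived.

```python
import pyfdb  # ECMWF's Python bindings for the FDB

# Illustrative MARS-style metadata request; keys and values are
# placeholders and must match the data present in the FDB.
request = {
    "class": "od",
    "expver": "0001",
    "stream": "oper",
    "date": "20240605",
    "time": "0000",
    "domain": "g",
    "type": "fc",
    "levtype": "sfc",
    "step": "0",
    "param": "167",  # 2-metre temperature
}

fdb = pyfdb.FDB()

# Enumerate matching objects by metadata alone.
for entry in fdb.list(request):
    print(entry)

# Stream the matching GRIB messages to a local file.
reader = fdb.retrieve(request)
with open("t2m.grib", "wb") as out:
    out.write(reader.read())
```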

New developments in this ecosystem include a remote protocol, to provide access between distinct HPC and cloud systems, as well as GRIBJump, a library integrated with the FDB to enable users to directly and efficiently extract sub-features from large data objects (from a single data point, to large multi-dimensional subdomains).

We are developing further backends for the FDB, using Fabric Attached Memory (FAM) to facilitate direct in-memory transfer of data between HPC and cloud partitions within the OpenCUBE system. We are also exploring the role of additional flexibility in the metadata language. We discuss how these developments support scientific workflows in HPC and cloud domains.

Emanuele Danovaro, Christopher Bradley, Nicolau Manubens, Metin Cakircali, Simon Smart, and Tiago Quintino (ECMWF)