
Minisymposium Presentation

EMOI: CSCS Extensible Monitoring and Observability Infrastructure

Monday, June 3, 2024
12:30 - 13:00 CEST
Climate, Weather and Earth Sciences
Chemistry and Materials
Computer Science and Applied Mathematics
Humanities and Social Sciences
Engineering
Life Sciences
Physics

Description

The Swiss National Supercomputing Centre (CSCS) is expanding its computational capabilities with the Alps architecture, an HPE Cray EX system incorporating around 5000 GH200 modules in addition to the pre-existing nodes. This expansion poses monitoring challenges because of hardware heterogeneity: the system includes AMD Rome CPUs, AMD MI250X and MI300 GPUs, NVIDIA A100 GPUs, and the Arm-based NVIDIA Grace Hopper GH200. Power usage is a further concern, since measures that decrease it can help reduce the operational costs and environmental impact associated with supercomputers. To address these challenges, CSCS has developed an Extensible Monitoring and Observability Infrastructure (EMOI), designed to manage the substantial data influx and provide insightful analysis of the infrastructure's behavior. EMOI integrates with Cray System Management (CSM) and the Cray System Monitoring Application (SMA), emphasizing a Kafka-centric approach for enhanced interoperability. We will delve into the structure and quality of the collected datasets, focusing on power consumption data. We hope that our experience will be beneficial not only to CSCS but also to other HPE/Cray sites facing similar challenges in supercomputing infrastructure management.

Authors