Monitoring in BigHPC: Lessons Learned
Date: November 17, 2022 | 3.00 p.m. (GMT)
Speaker: Júlio Costa, Data Analyst, Wavecom
Moderator: Rohan Kadekodi, UT Austin
Monitoring relies on ETL (Extract, Transform, Load) processes, which turn raw data into a usable resource that can be easily visualized in a graphical interface. In the BigHPC project, the main mission of the monitoring component is to give users a better understanding of their jobs' workloads and to help system administrators predict possible malfunctions or misbehaving applications.
Big Data applications in HPC environments require special care because their behavior differs from that of typical HPC workloads, and so new challenges arise. In addition, the permissions granted by the scheduler are limited to those of the workload user. Together, these factors led to some trial and error during the development of the monitoring system. In this webinar we intend to give a general overview of the lessons learned and of the concepts and solutions implemented, and to offer guidance on how to create meaningful visualizations for HPC.