2021 article

Systemic Assessment of Node Failures in HPC Production Platforms

2021 IEEE 35th International Parallel and Distributed Processing Symposium (IPDPS), pp. 267–276.

author keywords: Root Cause; Node Failures; Holistic Analysis
TL;DR: External environmental influence is shown not to be strongly correlated with the root cause of node failures, and lead time enhancements are feasible for nodes showing fail-slow characteristics. (via Semantic Scholar)
Source: Web Of Science
Added: October 4, 2021

Production HPC clusters endure failures that reduce computational capability and resource availability. Although various failure prediction schemes exist for large-scale computing systems, sustained resilience requires a comprehensive understanding of how nodes fail across the components and layers of the system. This work performs a holistic diagnosis of node failures using a measurement-driven approach on contemporary system logs, which can help vendors and system administrators support exascale resilience. Our work shows that external environmental influence is not strongly correlated with the root cause of node failures. Though hardware and software faults trigger failures, the underlying root cause often lies in a malfunctioning application that causes the system to fail. Furthermore, lead time enhancements are feasible for nodes showing fail-slow characteristics. This study surfaces such empirical observations, which could facilitate better failure handling in production systems.