2022 article

P-ckpt: Coordinated Prioritized Checkpointing

2022 IEEE 36th International Parallel and Distributed Processing Symposium (IPDPS 2022), pp. 436–446.

author keywords: Fault Tolerance; High-Performance Computing; Failure Prediction; I/O subsystem; Checkpoint/Restart; Live Migration; Burst Buffers
TL;DR: A novel checkpointing technique, p-ckpt, that aims to maintain the performance efficiency of failure-aware C/R models even when failures are predicted with a small lead time, together with a hybrid p-ckpt model that integrates Live Migration for its cost-effectiveness and to reduce checkpoint frequency. (via Semantic Scholar)
Source: Web Of Science
Added: September 29, 2022

Good prediction accuracy and adequate lead time to failure are key to the success of failure-aware Checkpoint/Restart (C/R) models on current and future large-scale High-Performance Computing (HPC) systems. This paper develops a novel checkpointing technique, called p-ckpt, that aims to maintain the performance efficiency of failure-aware C/R models even when failures are predicted with a small lead time. The p-ckpt technique is developed for HPC systems with multi-level memory systems and prioritizes checkpoints from vulnerable nodes (nodes with a predicted failure) when a failure is predicted. It coordinates the nodes within an application so that vulnerable nodes' checkpoint data is stored to the Parallel File System (PFS) first, assigning priorities based on the lead time to failure. Vulnerable nodes thus have low-latency access on the critical path to the PFS before any failure happens. Further, we create the hybrid p-ckpt model by integrating Live Migration (LM), both for its cost-effectiveness and to reduce checkpoint frequency. Our hybrid p-ckpt C/R model considers prediction lead time and checkpoint latency to the PFS to decide on a feasible proactive action, such as p-ckpt or LM. Simulations of six real-world applications for the Summit supercomputer show a ≈53–65% reduction in overhead with the hybrid p-ckpt model, compared to a ≈31–61% reduction with a state-of-the-art solution. We assess our C/R models against multiple failure distributions and consider lead time variability and failure prediction accuracy. Based on this evaluation and assessment, we discuss the trade-offs of using these models and their impact on application overhead.
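
The two ideas in the abstract, lead-time-based prioritization of PFS writes and the choice between p-ckpt and LM, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the `Node` type, the function names, the rule "shortest lead time writes first", and the decision thresholds (prefer LM when the lead time covers a hypothetical migration latency, otherwise fall back to p-ckpt when it covers the checkpoint latency to the PFS) do not appear in the abstract.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    # Predicted seconds until failure; None means no failure is predicted.
    lead_time_s: Optional[float] = None


def pfs_write_order(nodes: List[Node]) -> List[Node]:
    """Order nodes for writing checkpoints to the PFS.

    Assumed policy: vulnerable nodes go first, shortest lead time first;
    nodes without a predicted failure keep their original relative order.
    """
    vulnerable = sorted(
        (n for n in nodes if n.lead_time_s is not None),
        key=lambda n: n.lead_time_s,
    )
    healthy = [n for n in nodes if n.lead_time_s is None]
    return vulnerable + healthy


def choose_action(lead_time_s: float,
                  migration_latency_s: float,
                  pfs_ckpt_latency_s: float) -> str:
    """Pick a proactive action for one predicted failure.

    Hypothetical rule (an assumption, not taken from the paper):
    use live migration when it fits in the lead time, otherwise a
    prioritized checkpoint if that fits, otherwise no proactive action.
    """
    if lead_time_s >= migration_latency_s:
        return "live_migration"
    if lead_time_s >= pfs_ckpt_latency_s:
        return "p-ckpt"
    return "none"


if __name__ == "__main__":
    nodes = [Node("n0"), Node("n1", 40.0), Node("n2", 12.0), Node("n3")]
    print([n.name for n in pfs_write_order(nodes)])
    print(choose_action(20.0, migration_latency_s=30.0, pfs_ckpt_latency_s=10.0))
```

In this sketch, node `n2` (12 s lead time) would write before `n1` (40 s), and both would write before the nodes with no predicted failure. The latency values are made-up inputs; the actual model evaluated in the paper may weigh these quantities differently.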