A Proactive Fault Tolerance Framework for High-Performance Computing

A. Litvinova; C. Engelmann; S.L. Scott

doi:10.2316/P.2010.676-024

A. Litvinova (UK), C. Engelmann, and S.L. Scott (USA)

Keywords

high-performance computing, fault tolerance, system monitoring, high availability, reliability

Abstract

As high-performance computing (HPC) systems continue to increase in scale, their mean time to interrupt decreases correspondingly. The current state of practice for fault tolerance (FT) is checkpoint/restart. However, with increasing error rates, increasing aggregate memory, and I/O capabilities that are not increasing proportionally, checkpoint/restart is becoming less efficient. Proactive FT avoids failures through preventative measures, such as migrating application parts away from nodes that are “about to fail”. This paper presents a proactive FT framework that performs environmental monitoring, event logging, parallel job monitoring, and resource monitoring to analyze HPC system reliability and to perform FT through such preventative actions.
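
The efficiency argument against checkpoint/restart can be illustrated with a small back-of-the-envelope sketch. It is not taken from the paper: it assumes Young's first-order approximation for the checkpoint interval, sqrt(2 × checkpoint cost × mean time to interrupt), and the function names and the 10-minute checkpoint cost are hypothetical values chosen only for illustration. The approximation is reasonable only when the checkpoint cost is much smaller than the mean time to interrupt.

```python
import math


def young_interval(checkpoint_cost_s, mtti_s):
    """Checkpoint interval from Young's first-order approximation.

    checkpoint_cost_s: time to write one checkpoint, in seconds (assumed).
    mtti_s: mean time to interrupt of the system, in seconds (assumed).
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtti_s)


def overhead_fraction(checkpoint_cost_s, mtti_s):
    """Approximate fraction of run time lost at the Young interval.

    First term: time spent writing checkpoints.
    Second term: expected rework after an interrupt (half an interval on
    average, once per mean time to interrupt).
    """
    tau = young_interval(checkpoint_cost_s, mtti_s)
    return checkpoint_cost_s / tau + tau / (2.0 * mtti_s)


if __name__ == "__main__":
    cost = 600.0  # hypothetical 10-minute checkpoint write
    for mtti_hours in (24, 8, 2):
        mtti = mtti_hours * 3600.0
        print(
            f"MTTI {mtti_hours:>2} h: "
            f"interval {young_interval(cost, mtti) / 60.0:.1f} min, "
            f"overhead {overhead_fraction(cost, mtti):.1%}"
        )
```

Under this model, the lost fraction grows as the mean time to interrupt shrinks or as the checkpoint cost rises, which matches the trend the abstract describes and motivates avoiding failures instead of only recovering from them.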

From Proceeding (676) Parallel and Distributed Computing and Networks - 2010
