Using Checkpointing for Fault Tolerance and Parallel Program Debugging

N. Thoai (Vietnam), D. Kranzlmüller, and J. Volkert (Austria)


Fault tolerance, Debugging, Checkpointing


Checkpointing and rollback recovery are widely used in fault-tolerant computing. These techniques allow a running program to be restarted from an earlier state of its execution when a failure occurs, which reduces the amount of lost work. Besides fault tolerance, such techniques are also used for cyclic debugging, where they are intended to reduce the waiting time in repeated debugging cycles. However, compared to checkpointing for fault tolerance, only a few methods are available for debugging. At the same time, some strict requirements of debugging rule out most methods used in fault tolerance. Therefore, a comparison of the requirements for checkpointing in both areas is important and useful for developing applicable methods for parallel program debugging. This paper discusses this problem and presents suitable methods for debugging, which enable cyclic debugging to be used for long-running programs while keeping the waiting time small.
