A Synchronous Algorithm for Checkpointing in Distributed Systems

D. Goswami and S. Sahu (India)

Keywords

Checkpoint, synchronous algorithm, global snapshot, consistent state, rollback recovery

Abstract

Nodes in a distributed system are susceptible to failures for many different reasons. When such a failure occurs, the distributed system as a whole must be restored to an error-free state that existed prior to the failure. This restoration is done by rolling back the computation at the nodes to an error-free state. To minimize the amount of computation that must be rolled back, checkpoints, or snapshots of a globally consistent state, are taken from time to time. We present a synchronous checkpointing algorithm that forces a minimum number of nodes to take a checkpoint. The underlying computation need not be blocked completely while the algorithm is in progress. No additional effort is needed to handle concurrent initiations of the algorithm, since the initiator node assumes responsibility for completing one instance before another can be initiated. Because the consistency of the snapshots is ensured at the time the global snapshot is taken, no time needs to be spent during recovery on finding a consistent state.
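The abstract does not spell out how the minimum set of nodes is determined. As an illustration only, the sketch below shows one common way such a set can be computed in coordinated checkpointing: the initiator asks every node it has received a message from since its last checkpoint to checkpoint as well, and the request propagates transitively. The function name `nodes_to_checkpoint` and the `received_from` bookkeeping structure are assumptions made for this example and are not taken from the paper.

```python
from collections import deque


def nodes_to_checkpoint(initiator, received_from):
    """Return the set of nodes that must checkpoint together with the initiator.

    received_from[n] is a hypothetical map: the set of nodes whose messages
    n has received since n's own last checkpoint.
    """
    required = {initiator}
    pending = deque([initiator])
    while pending:
        node = pending.popleft()
        # If this node's checkpoint records a receive, the sender's checkpoint
        # must record the matching send, so the sender is pulled in as well.
        for sender in received_from.get(node, set()):
            if sender not in required:
                required.add(sender)
                pending.append(sender)
    return required


# A received from B, B received from C, D received from A.
deps = {"A": {"B"}, "B": {"C"}, "C": set(), "D": {"A"}}
forced = nodes_to_checkpoint("A", deps)
print(sorted(forced))
```

In this example node D is not forced to checkpoint: D's earlier checkpoint does not record any receive that the other nodes' new checkpoints fail to record as sent, so the resulting global state has no orphan messages.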
