Checkpointing and Recovery in a Transaction-based DSM Operating System

M. Schoettner, S. Frenz, R. Goeckelmann, and P. Schulthess (Germany)


Reliability, Operating Systems, Distributed Shared Memory, Checkpointing.


The reliability of cluster systems can be improved by periodically saving checkpoints to stable storage. When an error occurs, backward error recovery can restart the cluster from the last checkpoint, thus avoiding a fallback to the initial state. Different strategies originally developed for message-passing systems have been adapted for Distributed Shared Memory (DSM) systems. However, it is not sufficient to save only the changed parts of the DSM in a checkpoint. Node-local process contexts must also be included. The latter is hard to realize because the Operating Systems (OS) in use are not designed for cluster operation and checkpointing. In this paper we describe the new distributed Plurix OS, which extends the well-known Single System Image (SSI) concept by storing everything in a DSM, including kernel code and data. Computation is performed in restartable transactions that are synchronized using an optimistic forward validation scheme. We show how these architectural properties simplify the implementation of checkpointing and recovery. Finally, we present measurements that support our approach.
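
The interplay of restartable transactions, forward validation, and checkpointing summarized above can be illustrated with a minimal single-process sketch. It is not the Plurix implementation: the names `ToyDSM`, `Transaction`, `checkpoint`, and `recover` are hypothetical, the DSM is modeled as a plain dictionary of pages, and the checkpoint copies the whole store instead of only the changed parts. The sketch assumes that a committing transaction wins and that any still-active transaction that has read one of the written pages is marked for restart.

```python
import copy


class Transaction:
    """Hypothetical restartable transaction: buffers writes until commit."""

    def __init__(self, dsm, name):
        self.dsm = dsm
        self.name = name
        self.read_set = set()   # pages read so far
        self.writes = {}        # buffered writes, page -> value
        self.aborted = False    # set when a conflicting commit invalidates us

    def read(self, page):
        self.read_set.add(page)
        if page in self.writes:
            return self.writes[page]
        return self.dsm.pages.get(page, 0)

    def write(self, page, value):
        self.writes[page] = value


class ToyDSM:
    """Toy shared store (assumption: all state lives in `pages`)."""

    def __init__(self):
        self.pages = {}
        self.active = []
        self.checkpoint_image = {}

    def begin(self, name):
        tx = Transaction(self, name)
        self.active.append(tx)
        return tx

    def commit(self, tx):
        self.active.remove(tx)
        if tx.aborted:
            return False  # caller restarts the transaction from scratch
        # Forward validation: compare our write set with the read sets
        # of the transactions that are still running, and abort those.
        for other in self.active:
            if other.read_set.intersection(tx.writes):
                other.aborted = True
        self.pages.update(tx.writes)
        return True

    def checkpoint(self):
        # Only committed state is saved; buffered writes are not included.
        self.checkpoint_image = copy.deepcopy(self.pages)

    def recover(self):
        # Backward error recovery: restore the saved image and drop all
        # running transactions, which can simply be restarted.
        self.pages = copy.deepcopy(self.checkpoint_image)
        self.active.clear()
```

Because uncommitted work lives only in the transaction's buffers, the saved image in this sketch never contains partial results, and recovery does not have to reconstruct per-transaction state separately.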
