Parallel Program Debugging based on Data-Replay

M. Maruyama, T. Tsumura, and H. Nakashima (Japan)


parallel debugger, Data-replay, reverse execution, checkpointing


The nondeterministic nature of parallel programs is the major difficulty in debugging them. Order-replay, a technique that addresses this problem, is widely used because of its small overhead. It has, however, several serious drawbacks: all processes of the parallel program have to participate in the replay even when some of them are clearly not involved in the bug, and the programmer cannot stop the process being debugged at an arbitrary point. We adopt another method for deterministic replay, Data-replay, which logs the contents of events rather than their order and makes it possible to run and stop each process independently. Data-replay also cooperates well with reverse execution mechanisms. We applied the Data-replay mechanism to MPI-based parallel programs. Our experiments with the NAS Parallel Benchmarks show that the mechanism works at a practical cost: logging communicated data incurs only 24% overhead while accelerating replayed execution by 38%, both on average.
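
The idea of logging event contents rather than event order can be illustrated with a minimal sketch. It is not the authors' implementation and does not use MPI: the class `DataReplayChannel`, the function `worker`, and the in-memory queue standing in for incoming messages are all hypothetical names chosen for illustration. In record mode the wrapper saves the contents of every received message; in replay mode it returns the saved contents without any communication, so the process under debug can run alone.

```python
import pickle
from collections import deque


class DataReplayChannel:
    """Hypothetical receive wrapper illustrating data-replay (assumption, not the paper's API)."""

    def __init__(self, recv_fn=None, log=None):
        # Record mode: recv_fn performs the real receive and the log starts empty.
        # Replay mode: recv_fn is None and the log holds previously recorded contents.
        self.recv_fn = recv_fn
        self.log = deque(log or [])
        self.replaying = recv_fn is None

    def recv(self):
        if self.replaying:
            # No communication: hand back the logged contents of the next event.
            return pickle.loads(self.log.popleft())
        msg = self.recv_fn()
        # Log the message contents, not just the order in which it arrived.
        self.log.append(pickle.dumps(msg))
        return msg


def worker(channel, n):
    # The result depends on the arrival order of messages, so it would be
    # nondeterministic if the senders raced with each other.
    total = 0
    for _ in range(n):
        total = total * 10 + channel.recv()
    return total


# Record run: a queue stands in for messages arriving from other processes.
incoming = deque([3, 1, 2])
recorder = DataReplayChannel(recv_fn=incoming.popleft)
recorded_result = worker(recorder, 3)
saved_log = list(recorder.log)

# Replay run: no senders exist, yet the worker sees the same data.
replayer = DataReplayChannel(log=saved_log)
replayed_result = worker(replayer, 3)
```

Because the replayed process reads only from its own log, it can be started, stopped, or rerun without the other processes. The cost is log size, which grows with the volume of communicated data rather than with the number of events.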
