Using Virtualization to Validate Fault-Tolerant Distributed Systems

I. Hsu, A. Gallagher, M. Le, and Y. Tamir (USA)


Fault Injection, Dependability, Validation Tools


Asynchronous events and complex system state distributed across independent nodes make it difficult to expose and diagnose flaws in distributed systems. The difficulty is exacerbated when the goal is to validate fault tolerance mechanisms, which are activated only by the occurrence of errors that are, by nature, rare. Fault tolerance mechanisms are often validated by injecting faults that emulate actual faults and "stress" the resilience mechanisms. Validation campaigns lasting days and involving thousands of fault injections are often necessary. We present an infrastructure that combines virtualization and software-implemented fault injection to automate validation campaigns and to support analysis of the behavior of a distributed system under test. Virtualization enables: 1) a flexible fault injector capable of emulating a wide variety of faults, and 2) a mechanism for autonomously recovering faulty nodes so that the campaign can continue running on a fully functional target system. As a case study, we use this infrastructure to validate a Byzantine-fault-tolerant cluster manager. Over 1280 hours of fault injection exposed 11 unique flaws in the cluster manager.
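
The campaign structure described in the abstract (inject a fault, record the run, recover faulty nodes, continue) might be sketched roughly as follows. This is a toy simulation written for illustration only, not the authors' infrastructure: the class `SimulatedNode`, the function `run_campaign`, the fault names in `FAULT_TYPES`, and the snapshot-style recovery are all assumptions introduced here.

```python
import random
from dataclasses import dataclass

# Hypothetical fault categories, chosen for illustration only.
FAULT_TYPES = ["crash", "hang", "message_corruption"]


@dataclass
class SimulatedNode:
    """Toy stand-in for a virtualized node of the system under test."""
    name: str
    healthy: bool = True

    def inject(self, fault: str) -> None:
        # In this sketch, any injected fault simply marks the node as faulty.
        self.healthy = False

    def restore_from_snapshot(self) -> None:
        # Stand-in for reverting a virtual machine to a known-good state.
        self.healthy = True


def run_campaign(nodes, injections, seed=0):
    """Run a simulated campaign and return one log record per injection."""
    rng = random.Random(seed)  # seeded so a campaign can be repeated
    log = []
    for run in range(injections):
        target = rng.choice(nodes)
        fault = rng.choice(FAULT_TYPES)
        target.inject(fault)
        # A real campaign would observe the system's behavior at this point.
        log.append({"run": run, "node": target.name, "fault": fault})
        # Recover every faulty node so the next run starts fully functional.
        for node in nodes:
            if not node.healthy:
                node.restore_from_snapshot()
    return log


if __name__ == "__main__":
    cluster = [SimulatedNode(f"node{i}") for i in range(4)]
    for record in run_campaign(cluster, injections=5):
        print(record)
```

The point of the sketch is the recovery step at the end of each iteration: because every run begins from a fully functional system, a long campaign can proceed without manual intervention between injections.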
