Intelligent Fault Tolerant Architecture for Cluster Computing: A High Level Overview

S. Corsava and V. Getov (UK)


Cluster, performance, resource allocation, fault tolerance, parallel processing.


As information technology requirements for high availability and uptime grow, infrastructures and topologies must be architected to meet them. Infrastructure downtime results in application unavailability, frustration and financial loss. To address the demand for highly reliable complex computer systems, a new concept of infrastructure architecture is needed. In this paper, we propose a model that can support any type of application in any configuration, such as client-server, distributed, parallel processing or peer-to-peer, with any middleware and protocols. Our infrastructure can self-configure, self-optimise, self-protect, self-manage and self-heal. It is built from new types of Unix-based clustered systems organised into large application/middleware groupings, each with a master cluster controller. Each controlling engine consists of self-healing intelligent entities that can adapt to and compensate for a variety of software or hardware problems. It has access to a pool of hardware and software resources, which it allocates on demand, automatically, transparently and mostly without service interruption. This design forces the infrastructure to make full use of its capacity and to be highly fault-tolerant. We also present evaluation results from our pilot implementation, which has been running for more than a year and a half in the production environment of a mobile phone/internet service provider.
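The idea of a master cluster controller drawing on a shared resource pool can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Node`, `ClusterController`, `spare_pool` and `heal` are hypothetical, and the sketch assumes that a fault is already detected and flagged on the node, and that healing simply means swapping in a spare.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a hardware or software resource."""
    name: str
    healthy: bool = True


@dataclass
class ClusterController:
    """Hypothetical master controller for one application/middleware grouping."""
    active: List[Node]
    spare_pool: List[Node] = field(default_factory=list)
    log: List[str] = field(default_factory=list)

    def heal(self) -> None:
        """Replace each unhealthy active node with a spare, if one is available."""
        for i, node in enumerate(self.active):
            if node.healthy:
                continue
            if not self.spare_pool:
                # Assumption: with no spare left, the fault is only recorded.
                self.log.append(f"{node.name}: failed, no spare available")
                continue
            spare = self.spare_pool.pop(0)
            self.active[i] = spare
            self.log.append(f"{node.name}: failed, replaced by {spare.name}")


controller = ClusterController(
    active=[Node("app-1"), Node("app-2")],
    spare_pool=[Node("spare-1")],
)
controller.active[1].healthy = False  # simulate a hardware or software fault
controller.heal()
print(controller.log)
```

A real controller would also need fault detection, state transfer and arbitration between groupings; the sketch covers only the on-demand allocation step.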
