CURRENT RESEARCH AND PRACTICE IN PROACTIVE FAULT MANAGEMENT

Y. Li and Z. Lan

References

  [1] S. Chakravorty, C. Mendes, & L. Kale, Proactive fault tolerance in large systems, Proc. HPCRI Workshop, 2005.
  [2] S. Pertet & P. Narasimhan, Proactive recovery in distributed CORBA applications, Proc. Int. Conf. on Dependable Systems and Networks, 2004, 357–366.
  [3] V. Castelli et al., Proactive management of software aging, IBM Journal of Research and Development, 45(2), 2001, 331.
  [4] Y. Huang, C. Kintala, N. Kolettis, & N. Fulton, Software rejuvenation: Analysis, module, and applications, Proc. Int. Symp. on Fault-Tolerant Computing, 1995, 381–390.
  [5] A. Avizienis, J.-C. Laprie, & B. Randell, Dependability and its threats – A taxonomy, IFIP Congress Topical Sessions, 2004, 91–120.
  [6] D. Tang, R. Iyer, & S. Subramani, Failure analysis and modelling of a VAXcluster system, Proc. Int. Symp. on Fault-Tolerant Computing, 1990, 244–251. doi:10.1109/FTCS.1990.89372
  [7] J. Xu, Z. Kalbarczyk, & R. Iyer, Networked Windows NT system field failure data analysis, Proc. Pacific Rim Int. Symp. on Dependable Computing, 1999.
  [8] R. Sahoo, A. Sivasubramaniam, M. Squillante, & Y. Zhang, Failure data analysis of a large-scale heterogeneous server environment, Proc. Int. Conf. on Dependable Systems and Networks, 2004, 772.
  [9] C. Lu, Scalable diskless checkpointing for large parallel systems, Ph.D. Thesis, University of Illinois, Urbana-Champaign, 2005.
  [10] J. Brevik, D. Nurmi, & R. Wolski, Automatic methods for predicting machine availability in desktop grid and peer-to-peer systems, Proc. 2004 IEEE Int. Symp. on Cluster Computing and the Grid, 190–199.
  [11] C. Leangsuksun, L. Shen, & S. Scott, Availability prediction and modelling of high availability OSCAR cluster, Proc. IEEE Int. Conf. on Cluster Computing, 2003, 380–386. doi:10.1109/CLUSTR.2003.1253337
  [12] J. Hellerstein, F. Zhang, & P. Shahabuddin, A statistical approach to predictive detection, Computer Networks: The International Journal of Computer and Telecommunications Networking, 2001, 77–95.
  [13] R. Vilalta et al., Predictive algorithms in the management of computer systems, IBM Systems Journal, 41(3), 2002.
  [14] R. Sahoo et al., Critical event prediction for proactive management in large-scale computer clusters, Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2003, 426–435.
  [15] G. Hamerly & C. Elkan, Bayesian approaches to failure prediction for disk drives, Proc. 18th Int. Conf. on Machine Learning, 2001, 1–9.
  [16] G. Hoffmann, F. Salfner, & M. Malek, Advanced failure prediction in complex software systems, Research Report Number 172, Department of Computer Science, Humboldt University, Berlin, 2004.
  [17] F. Salfner, Predicting failures with hidden Markov models, Proc. 5th European Dependable Computing Conf., April 2005.
  [18] S. Garg, A. Puliafito, & K. Trivedi, Analysis of software rejuvenation using Markov regenerative stochastic Petri net, Proc. 6th Int. Symp. on Software Reliability Engineering, 1995, 180–187.
  [19] Y. Li & Z. Lan, Exploit failure prediction for adaptive fault-tolerance in cluster computing, Proc. IEEE/ACM Int. Symp. on Cluster Computing and the Grid (CCGrid06), 2006, 531–538.
  [20] Y. Zhang et al., Performance implications of failures in large-scale cluster scheduling, Proc. 10th Workshop on Job Scheduling Strategies for Parallel Processing, 2004, 233–252.
  [21] A.J. Oliner, R.K. Sahoo, & A. Sivasubramaniam, Fault-aware job scheduling for BlueGene/L systems, Proc. 18th Int. Parallel and Distributed Processing Symp., 2004, 64. doi:10.1109/IPDPS.2004.1302991
  [22] R.L. Graham, G.M. Shipman, B.W. Barrett, R.H. Castain, G. Bosilca, & A. Lumsdaine, Open MPI: A high-performance, heterogeneous MPI, Proc. HeteroPar, 2006.
