Failure Prediction and Scalable Checkpointing for Reliable Large-Scale Grid Computing

Abstract

Computational clusters, the grids that federate them, and the applications that utilize their significant computing potential all continue to grow with advances in hardware technology, cluster management, and grid middleware solutions. As they do, the likelihood that large-scale, long-running grid and cluster applications will have to deal with underlying node unavailability and cluster failure increases as well. The primary weapons against this problem (checkpointing, migration, replication, and effective scheduling) do not currently scale well enough to be effective for the largest, most important grid and cluster applications. Complementary research efforts in upstate New York are beginning to address this issue at a variety of levels, including: (i) low-level mechanisms that will predict individual processor failures by observing and reacting to low-level indicators in their chip state; (ii) scalable cluster-level checkpointing solutions that do not require centralized storage for replicated checkpoints; and (iii) grid-level efforts to differentiate between node unavailability states, to characterize the behavior of nodes, to predict their near-future unavailability, and to make better grid scheduling decisions based on this information and on the characteristics and capabilities of applications.

Date: January 1, 1970
Authors: Brent Rood, John Paul Walters, Vipin Chaudhary, Michael J Lewis
Conference: IEEE HPDC’07
