Publications
Application-level checkpointing techniques for parallel programs
Abstract
In its simplest form, checkpointing is the act of saving a program’s computation state in a form external to the running program, e.g. to a filesystem. The checkpoint files can then be used to resume computation after the original process(es) fail, ideally with minimal loss of computing work. A checkpoint can be taken using a variety of techniques at every level of the system, from using special hardware or architectural checkpointing features to modifying the user’s source code. This survey discusses the various techniques used in application-level checkpointing, with special attention paid to techniques for checkpointing parallel and distributed applications.
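As an illustration of the simplest form described above, the following is a minimal sketch of application-level checkpointing for a single sequential process. It is not taken from the survey: the function names, the `interval` and `fail_at` parameters, and the use of `pickle` are all assumptions made for this example, and it does not address the coordination problems that arise in parallel and distributed programs.

```python
import os
import pickle


def save_checkpoint(state, path):
    """Write the state to a temporary file, then atomically replace the old checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the data to disk before the rename
    os.replace(tmp_path, path)  # a crash never leaves a half-written checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"next_i": 0, "total": 0}


def run(n, path, interval=100, fail_at=None):
    """Sum the integers 0..n-1, saving a checkpoint every `interval` iterations.

    `fail_at` (hypothetical, for demonstration only) simulates a crash
    just before iteration `fail_at` is processed.
    """
    state = load_checkpoint(path)
    for i in range(state["next_i"], n):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += i
        state["next_i"] = i + 1
        if state["next_i"] % interval == 0:
            save_checkpoint(state, path)
    return state["total"]
```

Calling `run(1000, path, fail_at=550)` raises partway through; a second call to `run(1000, path)` resumes from the last checkpoint (iteration 500) rather than from zero, so only the work done since that checkpoint is repeated.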
- Date
- December 5, 2025
- Authors
- John Walters, Vipin Chaudhary
- Conference
- Distributed Computing and Internet Technology
- Pages
- 221-234
- Publisher
- Springer Berlin/Heidelberg