Enhancing Checkpoint/Restart
To keep checkpoint/restart (CR) viable for future extreme-scale systems, we study CR protocol performance and optimizations, including:
- Compression: an application-independent way to decrease the sizes of checkpoints and message logs;
- Incremental Checkpointing: a low-overhead, hash-based approach that saves only the changes since the last checkpoint;
- Uncoordinated Checkpointing: understanding how collective communication patterns impact protocol performance;
- Task Replication: studying how replication can help lower overheads on future systems.
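As a rough illustration of the compression and incremental checkpointing items above, the following minimal sketch hashes fixed-size blocks of application state and saves, in compressed form, only the blocks whose hash changed since the previous checkpoint. It is not the project's implementation: the function names, the 4096-byte block size, and the choice of SHA-256 and zlib are all assumptions made for this example.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # hypothetical block granularity in bytes (an assumption)


def block_hashes(data, block_size=BLOCK_SIZE):
    """Return one SHA-256 digest per fixed-size block of `data`."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def incremental_checkpoint(data, prev_hashes, block_size=BLOCK_SIZE):
    """Return (delta, new_hashes).

    `delta` maps block index -> zlib-compressed block contents, and holds
    only blocks that are new or whose hash differs from `prev_hashes`.
    """
    new_hashes = block_hashes(data, block_size)
    delta = {}
    for idx, digest in enumerate(new_hashes):
        if idx >= len(prev_hashes) or prev_hashes[idx] != digest:
            start = idx * block_size
            delta[idx] = zlib.compress(data[start:start + block_size])
    return delta, new_hashes


def apply_delta(base, delta, new_len, block_size=BLOCK_SIZE):
    """Rebuild the state by applying `delta` on top of the earlier state `base`."""
    # Truncate or zero-pad the old state to the new length, then patch blocks.
    buf = bytearray(base[:new_len].ljust(new_len, b"\0"))
    for idx, blob in delta.items():
        block = zlib.decompress(blob)
        start = idx * block_size
        buf[start:start + len(block)] = block
    return bytes(buf)
```

In this sketch, a first checkpoint would pass an empty list as `prev_hashes`, so every block is saved; later checkpoints pass the hashes returned by the previous call. The trade-off is the cost of hashing the full state at each checkpoint against the reduction in data written.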
This project is part of a UNM/SNL collaboration.