[Colloquium] SCR: The Scalable Checkpoint/Restart Library
January 26, 2012
- Date: Thursday, January 26, 2012
- Time: 11:00 am — 12:15 pm
- Place: Mechanical Engineering 218
Kathryn Mohror
Lawrence Livermore National Lab
Applications running on high-performance computing systems can encounter mean times between failures on the order of hours or days. Commonly, applications tolerate failures by periodically saving their state to checkpoint files on reliable storage, typically a parallel file system. Writing these checkpoints can be expensive at large scale, taking tens of minutes to complete. To address this problem, we developed the Scalable Checkpoint/Restart library (SCR). SCR is a multi-level checkpointing library: it checkpoints to storage on the compute nodes in addition to the parallel file system. Through experiments and modeling, we show that multi-level checkpointing benefits existing systems, and we find that the benefits increase on larger systems. In particular, we developed low-cost checkpoint schemes that are 100x-1000x faster than the parallel file system and effective against 85% of our system failures. Our approach improves machine efficiency by up to 35%, while reducing the load on the parallel file system by a factor of two.
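The multi-level idea in the abstract can be illustrated with a small toy model. The sketch below is not SCR's API or its actual checkpoint schemes. The class name `MultiLevelCheckpointer`, the `pfs_every` parameter, the JSON file layout, and the two directories standing in for node-local storage and the parallel file system are all assumptions made for illustration. It writes every checkpoint to the cheap local level, writes only every few checkpoints to the expensive level, and on restart uses the newest checkpoint that survived.

```python
import json
import os
import shutil
import tempfile


class MultiLevelCheckpointer:
    """Toy two-level checkpointer (hypothetical, not the SCR API).

    Every checkpoint goes to node-local storage; every `pfs_every`-th
    checkpoint is also written to a directory standing in for the
    parallel file system.
    """

    def __init__(self, local_dir, pfs_dir, pfs_every=4):
        self.local_dir = local_dir
        self.pfs_dir = pfs_dir
        self.pfs_every = pfs_every
        self.count = 0

    def _write(self, directory, step, state):
        # Zero-padded step numbers keep file names in step order when sorted.
        path = os.path.join(directory, f"ckpt_{step:06d}.json")
        with open(path, "w") as f:
            json.dump({"step": step, "state": state}, f)

    def checkpoint(self, step, state):
        self.count += 1
        self._write(self.local_dir, step, state)    # cheap, frequent level
        if self.count % self.pfs_every == 0:
            self._write(self.pfs_dir, step, state)  # expensive, rare level

    @staticmethod
    def _latest(directory):
        # Return the newest checkpoint in `directory`, or None if it is gone.
        if not os.path.isdir(directory):
            return None
        names = sorted(n for n in os.listdir(directory) if n.startswith("ckpt_"))
        if not names:
            return None
        with open(os.path.join(directory, names[-1])) as f:
            return json.load(f)

    def restart(self):
        # Use whichever surviving level holds the most recent step.
        found = [c for c in (self._latest(self.local_dir),
                             self._latest(self.pfs_dir)) if c is not None]
        return max(found, key=lambda c: c["step"]) if found else None


def demo():
    with tempfile.TemporaryDirectory() as root:
        local = os.path.join(root, "node_local")
        pfs = os.path.join(root, "pfs")
        os.makedirs(local)
        os.makedirs(pfs)

        ckpt = MultiLevelCheckpointer(local, pfs, pfs_every=4)
        for step in range(1, 11):
            ckpt.checkpoint(step, {"x": step * step})

        # Failure that leaves node-local storage intact: resume from local.
        after_soft_failure = ckpt.restart()["step"]

        # Simulated node loss: local checkpoints vanish, fall back to the PFS copy.
        shutil.rmtree(local)
        after_node_loss = ckpt.restart()["step"]
        return after_soft_failure, after_node_loss


if __name__ == "__main__":
    print(demo())
```

In this toy run, a failure that spares local storage loses no completed steps, while a simulated node loss rolls back to the last copy on the slower level. That is the trade-off the abstract describes: most failures are handled by the fast level, so the parallel file system is written far less often.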
Bio: Kathryn Mohror is a Postdoctoral Research Staff Member at the Center for Applied Scientific Computing (CASC) at Lawrence Livermore National Laboratory. Kathryn's research on high-end computing systems is currently focused on scalable fault-tolerant computing and performance measurement and analysis. Her other research interests include scalable automated performance analysis and tuning, parallel file systems, and parallel programming paradigms. Kathryn received her Ph.D. in Computer Science in 2010, an M.S. in Computer Science in 2004, and a B.S. in Chemistry in 1999 from Portland State University in Portland, OR.