[Colloquium] On the Viability of Compression for Reducing the Overheads of Checkpoint/Restart-based Fault Tolerance

August 24, 2012

Watch Colloquium:

M4V file (330 MB)

Date: Friday, August 24, 2012
Time: 12:00 pm to 12:50 pm
Place: Centennial Engineering Center 1041

Dewan Ibtesham
Department of Computer Science, University of New Mexico

The increasing size and complexity of high-performance computing (HPC) systems have led to major concerns over fault frequencies and the mechanisms necessary to tolerate these faults. Previous studies have shown that state-of-the-field checkpoint/restart mechanisms will not scale sufficiently for future-generation systems. Therefore, optimizations that reduce checkpoint overheads are necessary to keep checkpoint/restart mechanisms effective. In this work, we demonstrate that checkpoint data compression is a feasible mechanism for reducing checkpoint commit latencies and storage overheads. Leveraging a simple model for checkpoint compression viability, we show that: (1) checkpoint data compression is feasible for many types of scientific applications expected to run on extreme-scale systems; (2) checkpoint compression viability scales with checkpoint size; (3) whether checkpoints are taken at the user level or the system level has little impact on checkpoint compression viability; and (4) checkpoint compression viability scales with application process count. Lastly, we describe the impact that checkpoint compression might have on future-generation extreme-scale systems.
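
The viability model mentioned in the abstract is not spelled out in this announcement. As a minimal sketch, assuming the simplest break-even form (compression pays off when the time to compress plus the time to commit the smaller checkpoint is less than the time to commit the original checkpoint), the check could look like the following. The function name, parameter names, and numbers are hypothetical illustrations and are not taken from the talk.

```python
def compression_is_viable(checkpoint_bytes, compression_factor,
                          compress_rate, commit_rate):
    """Return True if compressing a checkpoint before committing it saves time.

    Assumed break-even model (an illustration, not the speaker's exact model):
      checkpoint_bytes   -- size of the uncompressed checkpoint, in bytes
      compression_factor -- fraction of the size removed by compression (0..1)
      compress_rate      -- compression throughput, in bytes per second
      commit_rate        -- throughput to stable storage, in bytes per second
    """
    # Time to commit the checkpoint without compression.
    t_uncompressed = checkpoint_bytes / commit_rate
    # Time spent compressing the checkpoint.
    t_compress = checkpoint_bytes / compress_rate
    # Time to commit what remains after compression.
    t_commit_compressed = (1 - compression_factor) * checkpoint_bytes / commit_rate
    return t_compress + t_commit_compressed < t_uncompressed


# Hypothetical numbers: 1 GB checkpoint, 50% size reduction, 100 MB/s compressor.
slow_storage = compression_is_viable(1e9, 0.5, 100e6, 20e6)   # 20 MB/s commit
fast_storage = compression_is_viable(1e9, 0.5, 100e6, 200e6)  # 200 MB/s commit
print(slow_storage, fast_storage)
```

Under this assumed form the condition reduces to commit_rate / compress_rate < compression_factor, so compression helps most when storage bandwidth is scarce relative to compression throughput.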

Bio: Dewan Ibtesham is a third-year PhD student advised by Professor Dorian Arnold in the UNM Department of Computer Science. He received his bachelor's degree in Computer Science and Engineering from BUET (Bangladesh University of Engineering and Technology). After working for two and a half years in the software industry, he moved to the U.S. and started graduate school in fall 2009. His research interests are in high-performance computing and large-scale distributed systems, in particular making HPC systems fault tolerant and reliable so that users can take advantage of their full potential.
