[Colloquium] Fault-Tolerance for Extreme Scale Systems: A Systems Level Perspective

May 2, 2013

  • Date: Thursday, May 2, 2013 
  • Time: 11:00 am — 12:30 pm 
  • Place: Mechanical Engineering 218

Kurt Ferreira
Sandia National Laboratories 

Achieving the next three orders of magnitude in performance, to move from petascale to exascale computing, will require significant advances in several fundamental areas. Recent reports from the U.S. Department of Energy identify resilience as one of these challenges. The resilience challenge is cross-cutting and will likely require advances at multiple layers of the system software stack of these extreme-scale systems, from the OS to the application. In this talk, I will summarize current work at Sandia National Laboratories to address this important challenge. I will characterize the challenge in the context of extreme-scale capability computing, outline current approaches and their benefits, and point out unexplored areas where more work is needed.


Bio: A senior member of Sandia’s technical staff, Kurt Ferreira is an expert on system software and resilience/fault-tolerance methods for large-scale, massively parallel, distributed-memory scientific computing systems. Kurt has designed and developed many innovative, high-performance, and resilient implementations of low-level system software for a number of HPC platforms at Sandia National Laboratories. His research interests include the design and construction of operating systems for massively parallel processing machines and innovative application- and system-level fault-tolerance mechanisms for HPC.