[Colloquium] Fault-tolerant solvers via algorithm/system codesign
January 22, 2013
Watch Colloquium:
M4V file (803 MB)
- Date: Tuesday, January 22, 2013
- Time: 11:00 am — 11:50 am
- Place: Mechanical Engineering 218
Mark Hoemmen
Sandia National Laboratories USA
Protecting arithmetic and data from corruption due to hardware errors costs energy. However, energy increasingly constrains modern computer hardware, especially for the largest parallel computers being built and planned today. As processor counts continue to grow, it will become too expensive to correct all of these “soft errors” at the system level, before they reach user code. Yet many algorithms need reliability only for certain data and phases of computation, and can be designed to recover from some corruption. This suggests an algorithm/system codesign approach. We will show that if the system provides applications with a programming model that lets them apply reliability only when and where it is needed, we can develop “fault-tolerant” algorithms that compute the right answer despite hardware errors in arithmetic or data. We will demonstrate this for a new iterative linear solver we call “Fault-Tolerant GMRES” (FT-GMRES). FT-GMRES uses a system framework we developed that lets solvers control reliability per allocation and provides fault detection. This project has also inspired a fruitful collaboration between numerical algorithms developers and traditional “systems” researchers. The two groups have much to learn from each other, and will have to cooperate more to achieve the promise of exascale.
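The idea of applying reliability only where it is needed can be illustrated with a small sketch. This is not FT-GMRES and not the speaker's framework; it is a simplified analogue that assumes a reliable outer loop (residual computation and an accept/reject check) wrapped around an unreliable inner solver (a few Jacobi sweeps whose result is occasionally corrupted on purpose). All function names, the fault model, and the parameters are hypothetical and chosen only for illustration.

```python
import random


def matvec(A, x):
    """Dense matrix-vector product (treated as reliable in this sketch)."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def norm(v):
    """Euclidean norm."""
    return sum(vi * vi for vi in v) ** 0.5


def unreliable_inner_solve(A, r, sweeps, fault_rate, rng):
    """Approximately solve A d = r with Jacobi sweeps.

    Hypothetical fault model: with probability fault_rate, one entry of
    the result is silently replaced by a wildly wrong value.
    """
    n = len(r)
    d = [0.0] * n
    for _ in range(sweeps):
        d = [
            (r[i] - sum(A[i][j] * d[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
    if rng.random() < fault_rate:
        d[rng.randrange(n)] = 1e6  # simulated soft error
    return d


def selective_reliability_solve(A, b, tol=1e-10, max_outer=200,
                                fault_rate=0.3, seed=0):
    """Reliable outer correction loop around an unreliable inner solve.

    Returns (x, relative residual norm, number of rejected corrections).
    """
    rng = random.Random(seed)
    x = [0.0] * len(b)
    r = [bi - ai for bi, ai in zip(b, matvec(A, x))]
    rejected = 0
    for _ in range(max_outer):
        if norm(r) <= tol * norm(b):
            break
        # Unreliable phase: cheap, but the correction may be corrupted.
        d = unreliable_inner_solve(A, r, sweeps=5,
                                   fault_rate=fault_rate, rng=rng)
        # Reliable phase: recompute the true residual and accept the
        # correction only if it actually reduces the residual norm.
        x_new = [xi + di for xi, di in zip(x, d)]
        r_new = [bi - ai for bi, ai in zip(b, matvec(A, x_new))]
        if norm(r_new) < norm(r):
            x, r = x_new, r_new
        else:
            rejected += 1
    return x, norm(r) / norm(b), rejected


if __name__ == "__main__":
    A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
    b = [1.0, 2.0, 3.0]
    x, rel_res, rejected = selective_reliability_solve(A, b)
    print("solution:", x)
    print("relative residual:", rel_res)
    print("rejected corrections:", rejected)
```

In this sketch a corrupted inner result costs only a wasted iteration, because the reliable outer check discards any correction that fails to reduce the residual. Only the small amount of work in the outer loop needs protection.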
Bio: Mark Hoemmen is a staff member at Sandia National Laboratories in Albuquerque. He finished his PhD in computer science at the University of California, Berkeley in spring 2010. Mark has a background in numerical linear algebra and performance tuning of scientific codes. He is especially interested in the interaction between algorithms, computer architectures, and computer systems, and in programming models that expose the right details of the latter two to algorithms. He also spends much of his time working on the Trilinos library (trilinos.sandia.gov).