Fault Tolerance and Robustness
Robust-first Computing
Efficiency costs robustness. For the safety of society, and to let us build really big computers, we should put robustness first, ahead even of strict correctness and maximum efficiency. Robust-first computing embodies this principle across the entire computational stack.
UNM Investigators: David Ackley and Lance Williams
Robust Communication and Computation
Secure and robust multiparty computation and communication in networks with adversarial nodes are important to large-scale systems. This work addresses resource-efficient and cost-competitive algorithms in these contexts.
UNM Investigator: Jared Saia
Collaborators: Drexel U., U. of Michigan, U. of Victoria
Fault-tolerance for HPC Systems
To address the challenges of running applications on next-generation, large-scale, error-prone systems, we use modeling, simulation, and real frameworks to understand the impact of different resilience mechanisms on application performance.
UNM Investigator: Dorian Arnold
Collaborators: Sandia National Labs