Towards Efficient Fault Tolerance in Future Computer Systems

Lecture / Panel
For NYU Community

Speaker: Professor Chengmo Yang 

Host Faculty: Professor Ramesh Karri


While advances in semiconductor technology enable more and more cores to be integrated on a single chip, the underlying computational fabric is at the same time becoming increasingly unreliable. Transistors are pushed to operate near their quantum limit, raising the extraordinary challenge of guaranteeing application correctness in the face of elevated rates of transient, intermittent, and permanent errors. In such a highly unreliable environment, the challenge is not just to guarantee full fault resilience, but also to provide that resilience alongside the goals designers already face, such as high performance, low power, and low hardware cost.

To address this challenge, this talk presents two tightly coupled techniques that detect hardware faults and recover from them with a minimal number of comparison and checkpointing operations. Only the instruction results that either influence the final program results or are needed during re-execution are compared for fault detection. Meanwhile, the main memory is protected against contamination by execution faults, thus drastically reducing the checkpointing overhead. These two techniques can be implemented through a minimal hardware extension to the register file and the cache. Their ability to deliver full resilience with maximum efficiency will broaden the applicability of redundant execution to systems with tight power and resource constraints.

About the Speaker

Dr. Chengmo Yang received a B.S. degree in Microelectronics from Peking University, China, in 2003, and M.S. and Ph.D. degrees in Computer Engineering from the University of California, San Diego, in 2005 and 2010, respectively. She is currently an assistant professor in the Department of Electrical and Computer Engineering at the University of Delaware. Her research interests lie in the broad areas of computer architecture and embedded systems, with a particular focus on the development of reliable and power-efficient multi/many-core systems. She is currently recruiting Ph.D. students.