FRESH: Fully Reliable and Effective Protection Against Soft and Hard Errors

Bibliographic Details
Main Authors: Daehoon Son, Hwisoo So, Jinhyo Jung, Yohan Ko, Aviral Shrivastava, Kyoungwoo Lee
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/11016748/
Description
Summary: Advances in modern computing systems have led to an unprecedented spread of safety-critical applications in real-world environments. In such applications, preventing malfunctions due to faults is a primary design concern, because a malfunction can have catastrophic results. Software-level redundant multithreading (RMT) solutions require no hardware modifications and hence incur no hardware cost, which makes them attractive alternatives to hardware-level redundancy for hardware reliability problems such as soft and hard errors. However, existing software-level RMT solutions provide only fault detection and rely on external schemes for error recovery. This study investigated the potential of software-level RMT schemes for complete soft and hard error detection and recovery. First, a software-level triple redundant multithreading (STRMT) scheme was implemented as a baseline; it showed that naïve STRMT is ineffective, making the application even more vulnerable than the unprotected version because of its runtime overhead. Subsequently, Fully Reliable and Effective protection against Soft and Hard errors (FRESH) was introduced as a software-only RMT scheme that achieves comprehensive resilience against both soft and hard errors. The main idea of FRESH is to distribute and intertwine error detection and recovery operations among the redundant threads, building on the thread-level load-back checking of the state-of-the-art RMT scheme. FRESH further applies a lazy-fault-diagnosis optimization to reduce the number of thread-level synchronizations required for fault detection and recovery. Experimental results on a simulated microprocessor with an ARM Cortex-A53-like μ-architecture demonstrated that FRESH can reduce the program failure rate by around 99.88% compared to the unprotected versions.
ISSN: 2169-3536