A Hardware-Aware Failure-Detection Method for GPU Control-Logic

Bibliographic Details
Main Authors: Hiroaki Itsuji, Takumi Uezono, Tadanobu Toba, Kojiro Ito, Masanori Hashimoto
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/11062630/
Description
Summary: Graphics processing units (GPUs) are used for diverse applications and play a major role even in safety-critical applications. Although performance is usually the primary focus of GPUs, their reliability has become a major concern. One of the undesirable failures in GPUs is silent data corruption (SDC), which causes unexpected outputs without any warning. Various failure detection methods have been proposed for SDCs caused by faults in data units such as registers. However, effective methods for detecting SDCs resulting from faults in control logic, such as scheduling units, have not yet been established. This paper assumes three types of control-logic failures for a general GPU architecture and proposes efficient failure detection methods for each type. For instance, the proposed method efficiently detects GPU-specific control-logic failures caused by program counter faults with a detection rate of 99.5% and can be implemented with a runtime overhead of 5.3% and a memory-resource overhead of 4.2% for a matrix multiplication application. These methods are applicable to a wide range of applications and are expected to enhance system resiliency.
ISSN: 2169-3536