Characterizing the Temporal-Dependent System Vulnerability in Deep Neural Network Applications on GPUs for Cost-Effective Fault Tolerant Design

dc.contributor.advisor: Robinson, William H
dc.contributor.advisor: Zhang, Enxia
dc.creator: Qiu, Hao 2024
dc.description.abstract: With the escalating complexity of Deep Neural Network (DNN) tasks, heterogeneous computing systems are widely employed in real-time, safety-critical applications such as autonomous vehicles. These systems, while highly versatile and proficient in parallel computing, face reliability challenges due to factors such as: (1) radiation effects, (2) process, voltage, and temperature (PVT) variations, and (3) potential malicious attacks. Most fault mitigation techniques introduce significant power, performance, or area (PPA) overheads. This cost-reliability tradeoff underscores the need to explore the temporal variation in system resiliency in order to optimize fault-tolerant design. This dissertation presents an in-depth evaluation of system resiliency across the program lifetime using a customized hierarchical fault injection methodology. This approach dynamically segments CUDA applications to enable scalable, lifetime-aware analysis of program vulnerability. The time-dependent vulnerability of four variants of YOLO, a popular real-time object detection framework that uses a Convolutional Neural Network (CNN) as its backbone, is analyzed. Although the average Silent Data Corruption (SDC) rates for the YOLO variants are around 5%, the SDC rate can spike as high as 65% during certain phases of execution. The SDC rates of repeated dynamic invocations of the same static kernel also fluctuate significantly, ranging from 0% to 30%. We observed that data transformation kernels, which accelerate convolutions and matrix multiplications on GPUs, are likely to exhibit large fluctuations in vulnerability. A case study of selective hardening guided by temporal dependence demonstrated that exploiting the temporal variance of program vulnerability can further improve system resiliency and reduce overheads compared to uniform protection. Results show that selective hardening can reduce runtime overhead by 56% while retaining 97% SDC coverage for implicit convolution kernels.
This research also investigates the most prominent computing patterns to characterize intra-kernel, phase-based vulnerability behavior. Temporal variance patterns are observed at the instruction basic-block level for kernels in implicit convolution routines.
dc.subject: Fault injection, GPU, Deep Neural Networks, System reliability, Fault tolerance
dc.title: Characterizing the Temporal-Dependent System Vulnerability in Deep Neural Network Applications on GPUs for Cost-Effective Fault Tolerant Design
dc.type.material: text Engineering University Graduate School
dc.contributor.committeeChair: Robinson, William H
dc.contributor.committeeChair: Zhang, Enxia
