Show simple item record

Characterizing the Temporal-Dependent System Vulnerability in Deep Neural Network Applications on GPUs for Cost-Effective Fault Tolerant Design

dc.contributor.advisor: Robinson, William H
dc.contributor.advisor: Zhang, Enxia
dc.creator: Qiu, Hao
dc.date.accessioned: 2024-05-15T17:46:13Z
dc.date.created: 2024-05
dc.date.issued: 2024-01-24
dc.date.submitted: May 2024
dc.identifier.uri: http://hdl.handle.net/1803/19012
dc.description.abstract: With the escalating complexity of Deep Neural Network (DNN) tasks, heterogeneous computing systems are widely employed in real-time, safety-critical systems such as autonomous vehicles. These systems, while highly versatile and proficient in parallel computing, face reliability challenges due to factors such as (1) radiation effects, (2) process, voltage, and temperature (PVT) variations, and (3) potential malicious attacks. Most fault mitigation techniques introduce significant power, performance, or area (PPA) overheads. This cost-reliability tradeoff underscores the need to explore the temporal variation in system resiliency in order to optimize fault-tolerant design. This dissertation presents an in-depth evaluation of system resiliency across the program lifetime using a customized hierarchical fault injection methodology. The approach dynamically segments CUDA applications to enable scalable, lifetime-aware analysis of program vulnerability. The time-dependent vulnerability of four variants of YOLO, a popular real-time object detection framework that uses a Convolutional Neural Network (CNN) as its backbone, is analyzed. Although the average Silent Data Corruption (SDC) rates for the YOLO variants are around 5%, the SDC rate can spike as high as 65% during certain phases of execution. The SDC rates of repeated dynamic invocations of the same static kernel also fluctuate significantly, ranging from 0% to 30%. We observed that data transformation kernels, which accelerate convolutions and matrix multiplications on GPUs, are likely to exhibit high vulnerability fluctuations. A case study of selective hardening guided by temporal vulnerability demonstrated that exploiting the temporal variance of program vulnerability can further improve system resiliency and reduce overheads compared to uniform protection. Results show that selective hardening can reduce runtime overhead by 56% while maintaining 97% SDC coverage for implicit convolution kernels.
This research also investigates the most prominent computing patterns to characterize intra-kernel, phase-based vulnerability behavior. Temporal variance patterns are observed at the instruction basic-block level for kernels in implicit convolution routines.
dc.format.mimetype: application/pdf
dc.language.iso: en
dc.subject: Fault injection, GPU, Deep Neural Networks, System reliability, Fault tolerance
dc.title: Characterizing the Temporal-Dependent System Vulnerability in Deep Neural Network Applications on GPUs for Cost-Effective Fault Tolerant Design
dc.type: Thesis
dc.date.updated: 2024-05-15T17:46:14Z
dc.type.material: text
thesis.degree.name: PhD
thesis.degree.level: Doctoral
thesis.degree.discipline: Electrical Engineering
thesis.degree.grantor: Vanderbilt University Graduate School
local.embargo.terms: 2024-11-01
local.embargo.lift: 2024-11-01
dc.creator.orcid: 0009-0000-2690-5762
dc.contributor.committeeChair: Robinson, William H
dc.contributor.committeeChair: Zhang, Enxia
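
The abstract in this record describes selective hardening guided by time-dependent SDC rates. The following is a minimal Python sketch of that general idea, not the dissertation's actual method: it assumes faults arrive uniformly in time, that hardening cost is proportional to the runtime of the protected phase, and that phases are chosen greedily by SDC rate. The Phase type, the select_phases function, the phase names, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    name: str        # hypothetical label for a dynamic kernel invocation or phase
    sdc_rate: float  # fraction of injected faults in this phase that ended as an SDC
    runtime: float   # share of total program runtime spent in this phase


def select_phases(phases, target_coverage):
    """Greedily pick phases to harden until the target share of expected SDCs is covered.

    Assumption: the expected SDC contribution of a phase is sdc_rate * runtime,
    and the cost of hardening a phase is proportional to its runtime, so the
    benefit per unit cost is simply the phase's SDC rate.
    """
    total = sum(p.sdc_rate * p.runtime for p in phases)
    if total == 0:
        return [], 0.0, 0.0
    ranked = sorted(phases, key=lambda p: p.sdc_rate, reverse=True)
    chosen, covered, cost = [], 0.0, 0.0
    for p in ranked:
        if covered / total >= target_coverage:
            break
        chosen.append(p)
        covered += p.sdc_rate * p.runtime
        cost += p.runtime
    return chosen, covered / total, cost


# Illustrative, made-up profile (not measured data).
phases = [
    Phase("k0", 0.00, 0.30),
    Phase("k1", 0.65, 0.05),
    Phase("k2", 0.05, 0.40),
    Phase("k3", 0.30, 0.10),
    Phase("k4", 0.02, 0.15),
]

chosen, coverage, hardened_share = select_phases(phases, target_coverage=0.9)
print([p.name for p in chosen], round(coverage, 3), round(hardened_share, 2))
```

Under these assumptions, the share of runtime left unprotected (1 - hardened_share) approximates the overhead saved relative to uniformly protecting every phase, which is the tradeoff the abstract refers to.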

