Characterizing the Temporal-Dependent System Vulnerability in Deep Neural Network Applications on GPUs for Cost-Effective Fault Tolerant Design

Qiu, Hao

dc.contributor.advisor	Robinson, William H
dc.contributor.advisor	Zhang, Enxia
dc.creator	Qiu, Hao
dc.date.accessioned	2024-05-15T17:46:13Z
dc.date.created	2024-05
dc.date.issued	2024-01-24
dc.date.submitted	May 2024
dc.identifier.uri	http://hdl.handle.net/1803/19012
dc.description.abstract	With the escalating complexity of Deep Neural Network (DNN) tasks, heterogeneous computing systems are widely employed in real-time, safety-critical systems such as autonomous vehicles. These systems, while highly versatile and proficient in parallel computing, face reliability challenges due to factors such as: (1) radiation effects, (2) process, voltage, and temperature (PVT) variations, and (3) potential malicious attacks. Most fault mitigation techniques introduce significant power, performance, or area (PPA) overheads. This cost-reliability tradeoff underscores the need to explore the temporal variation in system resiliency in order to optimize fault-tolerant design. This dissertation presents an in-depth evaluation of system resiliency across the program lifetime using a customized hierarchical fault injection methodology. The approach dynamically segments CUDA applications to enable scalable, lifetime-aware analysis of program vulnerability. The time-dependent vulnerability of four variants of YOLO, a popular real-time object detection framework that uses a Convolutional Neural Network (CNN) as its backbone, is analyzed. Although the average Silent Data Corruption (SDC) rates for the YOLO variants are around 5%, the SDC rate can spike as high as 65% during certain phases of execution. The SDC rates of repeated dynamic invocations of the same static kernel also fluctuate significantly, ranging from 0% to 30%. We observed that data transformation kernels, which accelerate convolutions and matrix multiplications on GPUs, are likely to show large fluctuations in vulnerability. A case study of selective hardening guided by temporal dependence demonstrated that exploiting the temporal variance of program vulnerability can further improve system resiliency and reduce overheads compared to uniform protection. Results show that selective hardening can reduce runtime overhead by 56% while retaining 97% SDC coverage for implicit convolution kernels.
This research also investigates the most prominent computing patterns to characterize intra-kernel, phase-based vulnerability behavior. Temporal variance patterns are observed at the instruction basic-block level for kernels in implicit convolution routines.
dc.format.mimetype	application/pdf
dc.language.iso	en
dc.subject	Fault injection, GPU, Deep Neural Networks, System reliability, Fault tolerance
dc.title	Characterizing the Temporal-Dependent System Vulnerability in Deep Neural Network Applications on GPUs for Cost-Effective Fault Tolerant Design
dc.type	Thesis
dc.date.updated	2024-05-15T17:46:14Z
dc.type.material	text
thesis.degree.name	PhD
thesis.degree.level	Doctoral
thesis.degree.discipline	Electrical Engineering
thesis.degree.grantor	Vanderbilt University Graduate School
local.embargo.terms	2024-11-01
local.embargo.lift	2024-11-01
dc.creator.orcid	0009-0000-2690-5762
dc.contributor.committeeChair	Robinson, William H
dc.contributor.committeeChair	Zhang, Enxia

Files in this item

Name: QIU-DISSERTATION-2024.pdf
Size: 15.74Mb
Format: PDF

This item appears in the following Collection(s)

Electronic Theses and Dissertations
Electronic theses and dissertations of masters and doctoral students submitted to the Graduate School.