dc.description.abstract | With the escalating complexity of Deep Neural Network (DNN) tasks, heterogeneous computing systems are widely employed in real-time, safety-critical systems such as autonomous vehicles. These systems, while highly versatile and proficient in parallel computing, face reliability challenges due to factors such as (1) radiation effects, (2) process, voltage, and temperature (PVT) variations, and (3) potential malicious attacks. Most fault mitigation techniques introduce significant power, performance, or area (PPA) overheads. This cost-reliability tradeoff underscores the need to explore the temporal variation in system resiliency in order to optimize fault-tolerant design. This dissertation presents an in-depth evaluation of system resiliency across program lifetime using a customized hierarchical fault injection methodology. This approach dynamically segments CUDA applications to enable scalable, lifetime-aware analysis of program vulnerability. The time-dependent vulnerability of four variants of YOLO, a popular real-time object detection framework that uses a Convolutional Neural Network (CNN) as its backbone, is analyzed. Although the average Silent Data Corruption (SDC) rates of the YOLO variants are around 5%, the SDC rate can spike as high as 65% in certain phases of execution. The SDC rates of repeated dynamic invocations of the same static kernel also fluctuate significantly, ranging from 0% to 30%. We observed that data transformation kernels, which accelerate convolutions and matrix multiplications on GPUs, are likely to exhibit large vulnerability fluctuations. A case study of temporally guided selective hardening demonstrates that exploiting the temporal variance of program vulnerability can further improve system resiliency and reduce overheads compared to uniform protection. Results show that selective hardening can reduce runtime overhead by 56% while retaining 97% SDC coverage for implicit convolution kernels. 
This research also investigates the most prominent computing patterns to characterize intra-kernel, phase-based vulnerability behavior. Temporal variance patterns are observed at the instruction basic-block level for kernels in implicit convolution routines. | |