
Principles and Techniques for Robust and High-Performance Cloud-Native Deep Learning Systems

dc.creator: Kang, Zhuangwei
dc.date.accessioned: 2024-01-29T19:00:23Z
dc.date.created: 2023-12
dc.date.issued: 2023-10-23
dc.date.submitted: December 2023
dc.identifier.uri: http://hdl.handle.net/1803/18593
dc.description.abstract: Although cloud-native frameworks like Kubernetes (K8s) have become the de facto solution for orchestrating containerized, cloud-native deep learning (CNDL) applications, five unresolved challenges remain in realizing high-performance, reliable CNDLs. This dissertation provides a holistic solution to address these challenges. First, CNDLs require a robust network stack that ensures efficient inter-node communication for moving the large volumes of data that DL jobs need. However, prior studies that comprehensively analyze K8s network stack performance are scarce. To that end, this dissertation systematically evaluates the impact of various container network interfaces (CNIs) on K8s. The resulting insights pave the way to design CNI recommendation systems for CNDLs. Second, CNDLs are often provisioned with a high-performance storage layer backed by SSDs or memory to facilitate faster data loading. However, effective research that jointly leverages the strengths of different storage media is lacking. Moreover, existing caching mechanisms do not jointly consider storage capacity, access latency, and the actual I/O demands of DL jobs. To address these gaps, we propose a novel multi-tier dataset management solution and a runtime-aware, best-effort caching mechanism that can autoscale the cache according to runtime configurations, resource constraints, and training speed. Third, low-latency data access, which is crucial to CNDLs, is feasible only with maximal data localization. However, the K8s scheduler is not data-aware, so we introduce a cooperative job scheduling and data placement strategy to enhance data locality. Fourth, it is commonplace to train DL applications on a myriad of small files, which can impose a substantial strain on local or network I/O and storage resources. Existing solutions that address this problem typically employ static methods to consolidate data files but overlook the specific I/O requirements of DL jobs. To remedy this, we introduce a runtime-aware, online data merging technique designed to enhance data loading throughput. Finally, CNDL systems must continuously deliver reliable performance, which requires real-time anomaly detection so that proactive actions can be taken. This dissertation equips the KPI monitoring layer of K8s with a novel Normalizing Flow-based multivariate time-series anomaly detection method to effectively pinpoint unhealthy nodes.
dc.format.mimetype: application/pdf
dc.language.iso: en
dc.subject: Cloud-native systems, Deep Learning
dc.title: Principles and Techniques for Robust and High-Performance Cloud-Native Deep Learning Systems
dc.type: Thesis
dc.date.updated: 2024-01-29T19:00:23Z
dc.type.material: text
thesis.degree.name: PhD
thesis.degree.level: Doctoral
thesis.degree.discipline: Computer Science
thesis.degree.grantor: Vanderbilt University Graduate School
local.embargo.terms: 2024-12-01
local.embargo.lift: 2024-12-01
dc.creator.orcid: 0000-0002-2574-4271
dc.contributor.committeeChair: Gokhale, Aniruddha S

