Principles and Techniques for Robust and High-Performance Cloud-Native Deep Learning Systems

Kang, Zhuangwei

dc.creator	Kang, Zhuangwei
dc.date.accessioned	2024-01-29T19:00:23Z
dc.date.created	2023-12
dc.date.issued	2023-10-23
dc.date.submitted	December 2023
dc.identifier.uri	http://hdl.handle.net/1803/18593
dc.description.abstract	Although cloud-native frameworks like Kubernetes (K8s) have become the de facto solution for orchestrating containerized, cloud-native deep learning (CNDL) applications, five unresolved challenges stand in the way of high-performance and reliable CNDL systems. This dissertation provides a holistic solution to these challenges. First, CNDLs require a robust network stack for efficient inter-node communication to move the large volumes of data needed by DL jobs. However, prior studies that comprehensively analyze K8s network stack performance are scarce. To that end, this dissertation systematically evaluates the impact of various container network interfaces (CNIs) on K8s. The resulting insights pave the way to designing CNI recommendation systems for CNDLs. Second, CNDLs are often provisioned with a high-performance storage layer backed by SSDs or memory to speed up data loading. However, research that jointly leverages the strengths of different storage media is lacking. Moreover, existing caching mechanisms do not jointly consider storage capacity, access latency, and the actual I/O demands of DL jobs. To address these gaps, we propose a novel multi-tier dataset management solution and a runtime-aware, best-effort caching mechanism that autoscales the cache according to runtime configurations, resource constraints, and training speed. Third, low-latency data access, which is crucial to CNDLs, is feasible only with maximal data localization. However, the K8s scheduler is not data-aware, so we introduce a cooperative job scheduling and data placement strategy to enhance data locality. Fourth, it is commonplace to train DL applications on a myriad of small files, which can impose substantial strain on local or network I/O and storage resources. Existing solutions to this problem typically employ static methods to consolidate data files but overlook the specific I/O requirements of DL jobs. To remedy this, we introduce a runtime-aware, online data merging technique designed to improve data loading throughput. Finally, CNDL systems must continuously deliver reliable performance, which requires real-time anomaly detection so that proactive actions can be taken. This dissertation equips the KPI monitoring layer of K8s with a novel Normalizing Flow-based multivariate time series anomaly detection method to effectively pinpoint unhealthy nodes.
dc.format.mimetype	application/pdf
dc.language.iso	en
dc.subject	Cloud-native systems, Deep Learning
dc.title	Principles and Techniques for Robust and High-Performance Cloud-Native Deep Learning Systems
dc.type	Thesis
dc.date.updated	2024-01-29T19:00:23Z
dc.type.material	text
thesis.degree.name	PhD
thesis.degree.level	Doctoral
thesis.degree.discipline	Computer Science
thesis.degree.grantor	Vanderbilt University Graduate School
local.embargo.terms	2024-12-01
local.embargo.lift	2024-12-01
dc.creator.orcid	0000-0002-2574-4271
dc.contributor.committeeChair	Gokhale, Aniruddha S

Files in this item

Name: KANG-DISSERTATION-2023.pdf
Size: 7.824Mb
Format: PDF

This item appears in the following Collection(s)

Electronic Theses and Dissertations
Electronic theses and dissertations of master's and doctoral students submitted to the Graduate School.