
A Data-Driven Approach to Optimal Resource Management for Large-Scale Data Processing Platforms

dc.creator: Yan, Wei
dc.date.accessioned: 2020-08-22T17:32:47Z
dc.date.available: 2016-01-19
dc.date.issued: 2015-07-23
dc.identifier.uri: https://etd.library.vanderbilt.edu/etd-07172015-094917
dc.identifier.uri: http://hdl.handle.net/1803/13138
dc.description.abstract: In the era of “Big Data”, a variety of data processing and analysis frameworks (such as MapReduce/Hadoop, Dremel/Impala, and Storm) have emerged to support large-scale data processing and analysis tasks. The computing tasks from these frameworks are usually deployed and executed over shared computing infrastructures. Resource management of this shared infrastructure plays a central role in consolidating the different resource needs of these jobs, satisfying individual performance requirements while ensuring fairness among jobs. Yet designing and implementing a scalable resource management solution for large-scale data processing platforms remains an open challenge. First, the workload of a data processing job depends greatly on its input data: not only its size but, more importantly, its internal structure and semantics, which are usually unknown a priori. Second, data processing jobs are highly diverse in their performance requirements.

To address these challenges, this dissertation proposes a data-driven optimal resource management mechanism for large-scale data processing platforms. The proposed approach integrates efficient data profiling with resource management: using the knowledge of the job workload obtained through data profiling, the resource management mechanism makes informed scheduling and resource allocation decisions through an optimization framework.

This dissertation makes the following contributions. First, it presents an optimization-based resource management approach for the prevalent MapReduce/Hadoop data processing framework. The performance objective of a MapReduce job is captured by its completion time, which is determined by its longest-running reduce task. To capture data distribution statistics, a scalable data profiling structure is designed and integrated with the MapReduce framework. Based on the data profiles, a novel key assignment mechanism assigns intermediate keys to reduce tasks so as to minimize load skew and thus optimize the performance of a MapReduce job (an illustrative sketch appears after this record). Second, it presents an optimal resource allocation solution for large-scale interactive data query systems (e.g., Dremel/Impala) using a utility-based optimization framework. The objective is to optimize cluster resource utilization while maximizing the aggregate utility. By profiling the resource consumption of each query, a price-based algorithm efficiently allocates resources across multiple concurrent queries (see the second sketch after this record). The utility-based framework allows different fairness criteria (e.g., weighted proportional fairness and max-min fairness) to be expressed through the choice of utility function.
dc.format.mimetype: application/pdf
dc.subject: Resource management
dc.subject: large-scale
dc.subject: data processing
dc.subject: data profiling
dc.subject: MapReduce/Hadoop
dc.title: A Data-Driven Approach to Optimal Resource Management for Large-Scale Data Processing Platforms
dc.type: dissertation
dc.contributor.committeeMember: Amr A. Awadallah
dc.contributor.committeeMember: Douglas C. Schmidt
dc.contributor.committeeMember: Bradley A. Malin
dc.contributor.committeeMember: Aniruddha S. Gokhale
dc.type.material: text
thesis.degree.name: PHD
thesis.degree.level: dissertation
thesis.degree.discipline: Computer Science
thesis.degree.grantor: Vanderbilt University
local.embargo.terms: 2016-01-19
local.embargo.lift: 2016-01-19
dc.contributor.committeeChair: Yuan Xue
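
Illustrative sketch 1. The abstract does not detail the key assignment mechanism itself; as a minimal illustration of profile-driven key assignment, the Python sketch below uses a standard greedy longest-processing-time heuristic: keys are visited from heaviest to lightest, as measured by a hypothetical data profile of per-key intermediate-data sizes, and each is placed on the currently least-loaded reducer. The function name, profile numbers, and heuristic are illustrative assumptions, not the dissertation's exact algorithm.

    import heapq

    def assign_keys(key_sizes, num_reducers):
        # Greedy longest-processing-time heuristic: visit keys from
        # heaviest to lightest and place each on the reducer with the
        # smallest load so far, keeping the maximum reducer load small.
        loads = [(0, r) for r in range(num_reducers)]  # (load, reducer id)
        heapq.heapify(loads)
        assignment = {}
        for key, size in sorted(key_sizes.items(), key=lambda kv: -kv[1]):
            load, r = heapq.heappop(loads)
            assignment[key] = r
            heapq.heappush(loads, (load + size, r))
        return assignment

    # Hypothetical profiled intermediate-data sizes (e.g., MB) per key.
    profile = {"k1": 90, "k2": 60, "k3": 50, "k4": 40, "k5": 10}
    print(assign_keys(profile, num_reducers=2))
    # -> {'k1': 0, 'k2': 1, 'k3': 1, 'k4': 0, 'k5': 1}; loads 130 vs 120

Because job completion time tracks the longest reduce task, keeping the maximum per-reducer load small directly targets the performance objective stated in the abstract.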
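
Illustrative sketch 2. The price-based allocation described for Dremel/Impala-style query systems can be read as a dual-decomposition scheme from network utility maximization. The sketch below assumes weighted logarithmic utilities U_i(x_i) = w_i log x_i (weighted proportional fairness) and a single aggregate capacity constraint; the function name, step size, and numbers are hypothetical, not taken from the dissertation.

    def price_based_allocation(weights, capacity, step=0.01, iters=5000):
        # Dual (price-based) gradient method for
        #   maximize   sum_i w_i * log(x_i)
        #   subject to sum_i x_i <= capacity,
        # i.e., weighted proportional fairness over one shared resource.
        price = 1.0
        for _ in range(iters):
            # Each query independently maximizes w_i*log(x_i) - price*x_i,
            # which yields the demand x_i = w_i / price.
            demands = [w / price for w in weights]
            # The platform raises the price when aggregate demand exceeds
            # capacity and lowers it otherwise.
            price = max(1e-9, price + step * (sum(demands) - capacity))
        return demands

    # Two concurrent queries with weights 3 and 1 sharing 8 resource units.
    print(price_based_allocation([3.0, 1.0], capacity=8.0))
    # -> approximately [6.0, 2.0], the weighted proportional-fair split

Swapping in a different utility function changes the fairness criterion without changing the price-update loop, which is the flexibility the abstract attributes to the utility-based framework.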

