
A Data-Driven Approach to Optimal Resource Management for Large-Scale Data Processing Platforms

dc.creator: Yan, Wei
dc.date.accessioned: 2020-08-22T17:32:47Z
dc.date.available: 2016-01-19
dc.date.issued: 2015-07-23
dc.identifier.uri: https://etd.library.vanderbilt.edu/etd-07172015-094917
dc.identifier.uri: http://hdl.handle.net/1803/13138
dc.description.abstract: In the era of “Big Data”, a variety of data processing and analysis frameworks (such as MapReduce/Hadoop, Dremel/Impala, and Storm) have emerged to support large-scale data processing and analysis tasks. The computing tasks from these frameworks are usually deployed and executed over shared computing infrastructures. Resource management of this shared infrastructure plays a central role in consolidating the different resource needs of these jobs, satisfying individual performance requirements while ensuring fairness among jobs. Yet designing and implementing a scalable resource management solution for large-scale data processing platforms remains an open challenge. First, the workload of a data processing job depends greatly on its input data: not only its size but, more importantly, its internal structure and semantics, which are usually unknown a priori. Second, data processing jobs are highly diverse in their performance requirements.

To address these challenges, this dissertation proposes a data-driven optimal resource management mechanism for large-scale data processing platforms. The proposed approach integrates efficient data profiling with resource management: using the knowledge of the job workload obtained through data profiling, the resource management mechanism makes informed scheduling and resource allocation decisions through an optimization framework.

This dissertation makes the following contributions. First, it presents an optimization-based resource management approach for the prevalent MapReduce/Hadoop data processing framework. The performance objective of a MapReduce job is captured by its completion time, which is determined by its longest-running reduce task. To capture data distribution statistics, a scalable data profiling structure is designed and integrated with the MapReduce framework. Based on the data profiles, a novel key assignment mechanism assigns intermediate keys to reduce tasks so as to minimize load skew and thus optimize the performance of a MapReduce job (an illustrative sketch appears after this record). Second, it presents an optimal resource allocation solution for large-scale interactive data query systems (e.g., Dremel/Impala) using a utility-based optimization framework. The objective is to optimize cluster resource utilization while maximizing the aggregate utility. By profiling the resource consumption of each query, a price-based algorithm efficiently allocates resources across multiple concurrent queries (see the second sketch after this record). The utility-based framework allows different fairness criteria (e.g., weighted proportional fairness and max-min fairness) to be expressed through the choice of utility function.
dc.format.mimetype: application/pdf
dc.subject: Resource management
dc.subject: large-scale
dc.subject: data processing
dc.subject: data profiling
dc.subject: MapReduce/Hadoop
dc.title: A Data-Driven Approach to Optimal Resource Management for Large-Scale Data Processing Platforms
dc.type: dissertation
dc.contributor.committeeMember: Amr A. Awadallah
dc.contributor.committeeMember: Douglas C. Schmidt
dc.contributor.committeeMember: Bradley A. Malin
dc.contributor.committeeMember: Aniruddha S. Gokhale
dc.type.material: text
thesis.degree.name: PHD
thesis.degree.level: dissertation
thesis.degree.discipline: Computer Science
thesis.degree.grantor: Vanderbilt University
local.embargo.terms: 2016-01-19
local.embargo.lift: 2016-01-19
dc.contributor.committeeChair: Yuan Xue
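
Illustrative sketch 1. The abstract does not detail the key assignment mechanism itself; as a minimal illustration of profile-driven key assignment, the Python sketch below uses a standard greedy longest-processing-time heuristic: keys are visited from heaviest to lightest, as measured by a hypothetical data profile of per-key intermediate-data sizes, and each is placed on the currently least-loaded reducer. The function name, profile numbers, and heuristic are illustrative assumptions, not the dissertation's exact algorithm.

    import heapq

    def assign_keys(key_sizes, num_reducers):
        # Greedy longest-processing-time heuristic: visit keys from
        # heaviest to lightest and place each on the reducer with the
        # smallest load so far, keeping the maximum reducer load small.
        loads = [(0, r) for r in range(num_reducers)]  # (load, reducer id)
        heapq.heapify(loads)
        assignment = {}
        for key, size in sorted(key_sizes.items(), key=lambda kv: -kv[1]):
            load, r = heapq.heappop(loads)
            assignment[key] = r
            heapq.heappush(loads, (load + size, r))
        return assignment

    # Hypothetical profiled intermediate-data sizes (e.g., MB) per key.
    profile = {"k1": 90, "k2": 60, "k3": 50, "k4": 40, "k5": 10}
    print(assign_keys(profile, num_reducers=2))
    # -> {'k1': 0, 'k2': 1, 'k3': 1, 'k4': 0, 'k5': 1}; loads 130 vs 120

Because job completion time tracks the longest reduce task, keeping the maximum per-reducer load small directly targets the performance objective stated in the abstract.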
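
Illustrative sketch 2. The price-based allocation described for Dremel/Impala-style query systems can be read as a dual-decomposition scheme from network utility maximization. The sketch below assumes weighted logarithmic utilities U_i(x_i) = w_i log x_i (weighted proportional fairness) and a single aggregate capacity constraint; the function name, step size, and numbers are hypothetical, not taken from the dissertation.

    def price_based_allocation(weights, capacity, step=0.01, iters=5000):
        # Dual (price-based) gradient method for
        #   maximize   sum_i w_i * log(x_i)
        #   subject to sum_i x_i <= capacity,
        # i.e., weighted proportional fairness over one shared resource.
        price = 1.0
        for _ in range(iters):
            # Each query independently maximizes w_i*log(x_i) - price*x_i,
            # which yields the demand x_i = w_i / price.
            demands = [w / price for w in weights]
            # The platform raises the price when aggregate demand exceeds
            # capacity and lowers it otherwise.
            price = max(1e-9, price + step * (sum(demands) - capacity))
        return demands

    # Two concurrent queries with weights 3 and 1 sharing 8 resource units.
    print(price_based_allocation([3.0, 1.0], capacity=8.0))
    # -> approximately [6.0, 2.0], the weighted proportional-fair split

Swapping in a different utility function changes the fairness criterion without changing the price-update loop, which is the flexibility the abstract attributes to the utility-based framework.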

