A profiling and performance analysis based self-tuning system for optimization of Hadoop MapReduce cluster configuration
As a parallel data processing framework, MapReduce has become one of the most popular topics in the age of cloud computing since it was first proposed by Google. Despite its advantages, such as scalability, reliability, and flexibility, managing the resources of a MapReduce cluster so as to optimize the performance of the applications running on it remains a major issue in this field. This thesis introduces PPABS, a profiling and performance analysis based system that optimizes the performance of a Hadoop cluster by automatically tuning its configuration settings. PPABS works in four steps. First, profiling of MapReduce job performance is combined with a data mining technique to dynamically divide jobs into groups. Second, simulated annealing, a probabilistic metaheuristic for global optimization, is adapted to search for the optimal cluster configuration for each job group obtained in the first step. Third, an incoming job is run on only a small part of its input data set, and a pattern recognition technique classifies the new job based on that sample run. Finally, PPABS updates the cluster configuration to match the job's features before the whole job is run. The experimental results are promising and show the effectiveness of our approach in improving the performance of several Hadoop jobs running on a cluster built on Amazon EC2.
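The simulated annealing step can be pictured with a minimal sketch. This is not the PPABS implementation: the parameter names in SEARCH_SPACE, their candidate values, the synthetic_cost function, and the geometric cooling schedule are all assumptions made for illustration. In a real system the cost would presumably come from measured or profiled job runtimes rather than a formula.

```python
import math
import random

# Hypothetical search space: parameter names and candidate values are
# illustrative only, not the settings tuned in the thesis.
SEARCH_SPACE = {
    "sort_buffer_mb": [50, 100, 200, 400],
    "reduce_tasks": [1, 2, 4, 8, 16],
    "compress_map_output": [False, True],
}


def synthetic_cost(config):
    """Stand-in for a measured job runtime (assumption, lower is better)."""
    cost = abs(config["sort_buffer_mb"] - 200) / 50.0
    cost += abs(config["reduce_tasks"] - 8) / 2.0
    cost += 0.0 if config["compress_map_output"] else 1.5
    return cost


def neighbor(config, rng):
    """Return a copy of config with one randomly chosen parameter changed."""
    candidate = dict(config)
    name = rng.choice(sorted(SEARCH_SPACE))
    choices = [v for v in SEARCH_SPACE[name] if v != config[name]]
    candidate[name] = rng.choice(choices)
    return candidate


def simulated_annealing(cost_fn, steps=2000, t_start=5.0, t_end=0.01, seed=0):
    """Search SEARCH_SPACE for a low-cost configuration."""
    rng = random.Random(seed)
    current = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
    current_cost = cost_fn(current)
    best, best_cost = dict(current), current_cost
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        fraction = step / max(steps - 1, 1)
        temperature = t_start * (t_end / t_start) ** fraction
        candidate = neighbor(current, rng)
        candidate_cost = cost_fn(candidate)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept a worse move with
        # probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = dict(current), current_cost
    return best, best_cost
```

Calling simulated_annealing(synthetic_cost) returns the best configuration seen and its cost. Accepting occasional worse moves at high temperature is what lets the search leave local minima, which is the usual reason for preferring simulated annealing over plain hill climbing on a configuration space with interacting parameters.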