Principles and Techniques for Performance Management and Validation of Cloud Hosted Distributed Applications
Barve, Yogesh Damodar
Advances in distributed simulation and machine learning have enabled a new class of workloads to migrate to cloud computing platforms to take advantage of their scalability and elasticity. These applications are characterized by distributed computations hosted across heterogeneous cloud resources. In this context, multiple challenges arise in resource and performance management, monitoring and deployment, and accessibility. First, cloud platforms may exhibit substantial performance variability for applications running in their multi-tenant environments, caused by interference from noisy neighbors, which can lead to violations of application service level objectives (SLOs). Second, individual components of these distributed applications may exhibit variability in execution times, producing stragglers that degrade overall performance. Third, given the large space of resource configurations available on cloud platforms and the heterogeneous computations performed by tasks within a distributed application, determining the optimal set of resource configurations that meets application SLOs while minimizing operating cost is a complex problem. Fourth, application developers often lack the expertise needed to deploy and monitor these large-scale distributed applications. Finally, validating solutions to these challenges is itself a hard problem, since it requires large-scale experimental setups. To address these challenges, this dissertation makes the following research contributions: First, it develops a systematic, low-cost methodology for interference-aware performance modeling and benchmarking. Second, it describes model-driven techniques that simplify the configuration and deployment of the monitoring infrastructure needed for benchmarking. Third, it presents a Co-simulation-as-a-Service (CaaS) platform to support distributed co-simulations in the cloud, along with a resource recommendation engine that minimizes the impact of stragglers and reduces makespan. Finally, it develops a rapid validation platform that offers both practical and educational benefits.