Using Model-based techniques for improving performance and reliability in high performance scientific computing

dc.creator: Dubey, Abhishek
dc.date.accessioned: 2020-08-21T21:24:05Z
dc.date.available: 2011-03-28
dc.date.issued: 2009-03-28
dc.identifier.uri: https://etd.library.vanderbilt.edu/etd-03232009-133814
dc.identifier.uri: http://hdl.handle.net/1803/11104
dc.description.abstract: Data processing in scientific and workflow-oriented computing is carried out as analysis campaigns, which consist of an input dataset and a set of interdependent jobs. Traditionally, these massively parallel computations required the services of supercomputers. However, recent trends show that the share of scientific computing carried out on clusters of commodity computers is on the rise. Commodity computers yield the highest performance per dollar but exhibit intermittent faults, which can result in systemic failures when operated over long continuous periods for executing analysis campaigns. Diagnosing job problems and failures in this complex environment is difficult, especially when the success of a campaign can be affected by even a single job failure. Manual administration, though essential, is slow to respond to intermittent faults. Therefore, an autonomic approach is required that can ensure that the resources of the cluster are used to the best possible extent and improve the reliability of jobs, even in the presence of hardware/software failures. Model-based design is a formal system design methodology that has gained momentum in recent years as a sound methodology for applying computer-based modeling and synthesis methods to a variety of problem domains, including distributed systems. A benefit of using formal models is that they can be queried or transformed to produce a variety of domain-specific artifacts, which are critical to deployment and execution of the system, but are tedious and error-prone to produce manually. This dissertation presents the design and discusses the applicability of a model-based cluster management framework called the Scientific Computing Autonomic Reliability Framework (SCARF). Basic components of this framework are distributed monitoring units, fault-mitigation units and a workflow-management system for dealing with workflow-specific concerns in case of failures.
Model-based techniques are used to capture workflow specifications, along with preconditions, postconditions, and invariants for checking the validity of system state during execution. Formal data models are used to provide provenance and execution tracking of workflow jobs. Health monitoring is provided by synchronized, lightweight, distributed sensors that are augmented with a real-time fault-mitigation framework. This framework consists of hierarchical fault management entities called reflex engines, which use a timed-automaton-based abstraction for capturing failure management strategies. These engines track the state of components under their management zone and initiate reflexive mitigation actions upon the occurrence of certain events or timeouts. This mitigation framework is verified against properties written in timed computation tree logic (TCTL).
dc.format.mimetype: application/pdf
dc.subject: scientific computing
dc.subject: software health management
dc.subject: autonomic computing
dc.subject: Electronic data processing -- Distributed processing -- Reliability
dc.subject: Computer software -- Reliability
dc.subject: Computer system failures -- Prevention
dc.subject: Fault-tolerant computing
dc.title: Using Model-based techniques for improving performance and reliability in high performance scientific computing
dc.type: dissertation
dc.contributor.committeeMember: Theodore Bapty
dc.contributor.committeeMember: Sandeep Neema
dc.contributor.committeeMember: Sherif Abdelwahed
dc.contributor.committeeMember: Paul Sheldon
dc.type.material: text
thesis.degree.name: PHD
thesis.degree.level: dissertation
thesis.degree.discipline: Electrical Engineering
thesis.degree.grantor: Vanderbilt University
local.embargo.terms: 2011-03-28
local.embargo.lift: 2011-03-28
dc.contributor.committeeChair: Gabor Karsai

