    Using Model-based techniques for improving performance and reliability in high performance scientific computing

    Dubey, Abhishek
    https://etd.library.vanderbilt.edu/etd-03232009-133814
    http://hdl.handle.net/1803/11104
    Date: 2009-03-28

    Abstract

    Data processing in scientific and workflow-oriented computing is carried out as analysis campaigns, each of which consists of an input dataset and a set of interdependent jobs. Traditionally, these massively parallel computations required the services of supercomputers. However, recent trends show that the share of scientific computing carried out on clusters of commodity computers is on the rise. Commodity computers yield the highest performance per dollar but exhibit intermittent faults, which can result in systemic failures when the machines are operated over long continuous periods to execute analysis campaigns. Diagnosing job problems and failures in this complex environment is difficult, especially when the success of a campaign can be affected by even a single job failure. Manual administration, though essential, is slow to respond to intermittent faults. Therefore, an autonomic approach is required that ensures the resources of the cluster are used to the best possible extent and improves the reliability of jobs, even in the presence of hardware or software failures. Model-based design is a formal system design methodology that has gained momentum in recent years as a sound way of applying computer-based modeling and synthesis methods to a variety of problem domains, including distributed systems. A benefit of using formal models is that they can be queried or transformed to produce a variety of domain-specific artifacts, which are critical to the deployment and execution of the system but are tedious and error-prone to produce manually. This dissertation presents the design and discusses the applicability of a model-based cluster management framework called the Scientific Computing Autonomic Reliability Framework (SCARF). The basic components of this framework are distributed monitoring units, fault-mitigation units, and a workflow-management system for dealing with workflow-specific concerns in case of failures.
    Model-based techniques are used to capture workflow specifications, along with preconditions, postconditions, and invariants for checking the validity of the system state during execution. Formal data models are used to provide provenance and execution tracking of workflow jobs. Health monitoring is provided by synchronized, lightweight, distributed sensors that are augmented with a real-time fault-mitigation framework. This framework consists of hierarchical fault-management entities called reflex engines, which use a timed-automaton-based abstraction for capturing failure management strategies. These engines track the state of components within their management zone and initiate reflexive mitigation actions upon the occurrence of certain events or timeouts. This mitigation framework is verified against properties written in timed computation tree logic (TCTL).
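
The reflex-engine idea described in the abstract (track a component's state, react to events, and fire a mitigation action when a timeout expires) can be illustrated with a small sketch. This is not SCARF's implementation or API: the class names, the states (HEALTHY, SUSPECT, FAILED), the heartbeat event, the timeout values, and the action names are all hypothetical, chosen only to show what a single-clock, timed-automaton-style strategy might look like. It does not attempt the hierarchical arrangement of engines or the TCTL verification mentioned above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class State:
    """One location of the automaton (all names here are hypothetical)."""
    name: str
    timeout: Optional[float] = None        # max seconds allowed in this state
    on_timeout: Optional[str] = None       # state entered when the timeout expires
    timeout_action: Optional[str] = None   # mitigation action issued on timeout


@dataclass
class ReflexEngine:
    """Single-clock sketch of a reflex engine for one managed component."""
    states: Dict[str, State]
    # (current state, event) -> (next state, mitigation action or None)
    transitions: Dict[Tuple[str, str], Tuple[str, Optional[str]]]
    current: str
    clock: float = 0.0                     # time spent in the current state
    actions: List[str] = field(default_factory=list)

    def _enter(self, name: str, action: Optional[str]) -> None:
        # Every transition resets the clock and may record a mitigation action.
        self.current = name
        self.clock = 0.0
        if action is not None:
            self.actions.append(action)

    def on_event(self, event: str) -> None:
        # Events with no matching transition are ignored.
        key = (self.current, event)
        if key in self.transitions:
            self._enter(*self.transitions[key])

    def advance(self, dt: float) -> None:
        # Let dt seconds pass; take at most one timeout transition per call.
        self.clock += dt
        state = self.states[self.current]
        if state.timeout is not None and self.clock >= state.timeout:
            self._enter(state.on_timeout, state.timeout_action)


def make_node_monitor() -> ReflexEngine:
    """Assumed strategy: ping a silent node, then reschedule its jobs."""
    states = {
        "HEALTHY": State("HEALTHY", 3.0, "SUSPECT", "ping_node"),
        "SUSPECT": State("SUSPECT", 2.0, "FAILED", "reschedule_jobs"),
        "FAILED": State("FAILED"),
    }
    transitions = {
        ("HEALTHY", "heartbeat"): ("HEALTHY", None),  # resets the clock
        ("SUSPECT", "heartbeat"): ("HEALTHY", None),  # node recovered
    }
    return ReflexEngine(states, transitions, current="HEALTHY")


if __name__ == "__main__":
    engine = make_node_monitor()
    engine.advance(1.0)
    engine.on_event("heartbeat")   # heartbeat arrives in time
    engine.advance(3.0)            # silence: HEALTHY -> SUSPECT
    engine.advance(2.0)            # still silent: SUSPECT -> FAILED
    print(engine.current, engine.actions)
```

In this sketch each transition resets a single clock, which keeps the timeout logic easy to follow. A fuller timed-automaton model would typically allow several clocks and guards over them.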

    Files in this item

    Name: Abhishek_DissertationFinal.pdf
    Size: 18.85Mb
    Format: PDF

    This item appears in the following collection(s):

    • Electronic Theses and Dissertations

    Related items

    Showing items related by title, author, creator and subject.

    • HybrIDS: Embeddable Hybrid Intrusion Detection System 
      Lauf, Adrian Peter (2007-12-18)
      In order to provide preventative security to a homogeneous device network, techniques in addition to static encryption must be implemented to assure network integrity by identifying possible deviant nodes within the ...
    • A thread-safe implementation of a meta-programmable data model 
      Balasubramanian, Daniel Allen (2008-04-25)
      This thesis describes the design and implementation of a thread-safe meta-programmable data model that can be used in a multi-threaded environment without the need for user defined synchronization. The locking mechanisms ...
    • Distributed and Adaptive Parallel Computing for Computational Finance Applications 
      Varshneya, Pooja (2010-06-08)
      Analysts, scientists, engineers, and multimedia professionals require massive processing power to analyze financial trends, create test simulations, model climate, compile code, render video, decode genomes and other complex ...
