Component-based Fault Tolerance for Distributed Real-Time and Embedded Systems
Component middleware has become increasingly important in distributed real-time and embedded (DRE) systems. DRE systems are characterized by resource constraints and stringent quality of service (QoS) requirements. Growing demands on system dependability in turn increase the importance of fault tolerance as a QoS aspect. Research on fault tolerance in DRE systems has focused mainly on replication and recovery at the granularity of individual distributed objects and processes. Component middleware provides higher-level abstractions, such as a container infrastructure, means to assemble components into larger units of functionality, and standardized deployment mechanisms. These mechanisms provide new opportunities to standardize fault tolerance, but also pose new challenges, such as efficient synchronization of internal component state, failure correlation across groups of components, and configuration of fault-tolerance properties per component. This thesis makes three contributions to the research on component-based fault tolerance. First, we present Components with HEterogeneous State Synchronization (CHESS), a mechanism for component state replication that enables the flexible use of the most appropriate communication mechanism. Second, we present COmponent Replication based on Failover Units (CORFU), which provides fail-stop behavior and fault correlation across groups of components. Third, we present an evaluation of the proposed solutions in comparison to existing object-level fault-tolerance methods. The results show that component middleware eases the burden of application development for DRE systems by providing middleware support for fault tolerance at the level of components. The results also quantify the performance trade-off compared to object-level fault tolerance and show that it is acceptable for many DRE systems.