There are various ways to implement redundant (fault-tolerant) data-paths with OpenSplice:
- using IP-level third-party solutions
- many network adapters nowadays support the notion of ‘channel-bonding’, where multiple network interfaces on a host are combined for redundancy or increased throughput
- as this solution works below the socket level (on the packet or data-link layer), it is transparent to the middleware
- it is equally transparent to the middleware user, which makes it a preferred solution
- by configuring multiple network-services (RTNetworking only)
- in its ‘federated deployment mode’, OpenSplice allows multiple network-services to be deployed, each attached to a different network-interface, thus creating replicated, fault-tolerant data flows
- by combining ordering by source-timestamp with the DDS-pattern that old data never overwrites/replaces new data, the multiple streams are naturally ‘fused’ at the receiving side (see the reader sketch below this item)
- this solution is transparent for the middleware user, yet requires additional processing at the ‘receiving side’
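As an illustration of this receiving-side ‘fusion’, the following is a minimal sketch using the ISO C++ DCPS API; the topic type SensorData, its include file and the topic name are hypothetical placeholders, not taken from an existing system:

```cpp
#include <dds/dds.hpp>
#include "SensorData_DCPS.hpp"  // hypothetical include for an IDL-generated type

int main() {
    dds::domain::DomainParticipant dp(0);
    dds::topic::Topic<SensorData> topic(dp, "SensorTopic");
    dds::sub::Subscriber sub(dp);

    // Order samples by the timestamp assigned at the writer side, so a duplicate
    // arriving later over the redundant network service can never replace a newer
    // sample that already arrived over the other one.
    dds::sub::qos::DataReaderQos rqos = sub.default_datareader_qos()
        << dds::core::policy::DestinationOrder::SourceTimestamp()
        << dds::core::policy::History::KeepLast(1);

    dds::sub::DataReader<SensorData> reader(sub, topic, rqos);

    // read()/take() as usual; the middleware discards the older duplicates.
    return 0;
}
```

Note that DestinationOrder is a request/offered QoS, so the writers must offer BY_SOURCE_TIMESTAMP as well (the default is BY_RECEPTION_TIMESTAMP).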
- by utilizing multiple partitions
- OpenSplice allows explicit mapping of logical DDS-partitions onto physical ‘Network-Partitions’, which typically are multicast groups. By configuring the OS in such a way that different physical network-interfaces route different multicast-groups, data can be replicated over multiple routes.
- This solution is not transparent: it requires proper routing to be set up at the OS level, and it requires (a) the data to be written into distinct DDS-partitions, (b) network-partitions to be configured that map these logical partitions onto physical multicast-groups, and (c) subscribing applications to be ‘connected’ to the set of written logical DDS-partitions (see the sketch below this item)
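A sketch of (a) and (c) using the ISO C++ DCPS API; the partition names ‘route_A’/‘route_B’, the SensorData type and the topic name are hypothetical, and the mapping of these logical partitions onto multicast groups (b) is assumed to be done in the OpenSplice configuration file and is not shown here:

```cpp
#include <dds/dds.hpp>
#include "SensorData_DCPS.hpp"  // hypothetical include for an IDL-generated type

int main() {
    dds::domain::DomainParticipant dp(0);
    dds::topic::Topic<SensorData> topic(dp, "SensorTopic");

    // Writer side: publish the same data into two logical DDS partitions, each of
    // which is assumed to be mapped (in the OpenSplice configuration) onto a
    // different network partition / multicast group routed over its own interface.
    dds::pub::Publisher pub_a(dp, dp.default_publisher_qos()
        << dds::core::policy::Partition("route_A"));
    dds::pub::Publisher pub_b(dp, dp.default_publisher_qos()
        << dds::core::policy::Partition("route_B"));
    dds::pub::DataWriter<SensorData> writer_a(pub_a, topic);
    dds::pub::DataWriter<SensorData> writer_b(pub_b, topic);

    SensorData sample;            // payload left default-initialized for brevity
    writer_a.write(sample);       // replicated over route A
    writer_b.write(sample);       // and over route B

    // Reader side: connect to both partitions; ordering by source timestamp
    // (see the previous sketch) fuses the duplicates back into a single stream.
    dds::sub::Subscriber sub(dp, dp.default_subscriber_qos()
        << dds::core::policy::Partition(dds::core::StringSeq{"route_A", "route_B"}));
    dds::sub::DataReader<SensorData> reader(sub, topic);
    return 0;
}
```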
- by replicating applications
- as each application can be ‘tied’ to communicate over a specific interface, fault-tolerance can be achieved by active replication of (publishing) applications
- replication management is not part of DDS, so this aspect is not transparent
- the replicated applications themselves don’t need to be aware that they are replicated: duplicate information is again filtered out automatically, provided ordering by source-timestamp is used (which in turn implies a sufficient level of clock-synchronization between the nodes on the network); a writer-side sketch follows below this item
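A sketch of the writer side of such a replica (same hypothetical SensorData type and topic name as above); the code is identical in every replica, and because DestinationOrder is a request/offered QoS, the replicas must offer BY_SOURCE_TIMESTAMP for the reader-side filtering sketched earlier to work:

```cpp
#include <dds/dds.hpp>
#include "SensorData_DCPS.hpp"  // hypothetical include for an IDL-generated type

int main() {
    dds::domain::DomainParticipant dp(0);
    dds::topic::Topic<SensorData> topic(dp, "SensorTopic");
    dds::pub::Publisher pub(dp);

    // Offer ordering by source timestamp so that subscribers requesting it can
    // automatically discard the duplicates produced by the other replicas.
    dds::pub::qos::DataWriterQos wqos = pub.default_datawriter_qos()
        << dds::core::policy::DestinationOrder::SourceTimestamp();
    dds::pub::DataWriter<SensorData> writer(pub, topic, wqos);

    SensorData sample;            // payload left default-initialized for brevity
    writer.write(sample);         // every replica writes the same data
    return 0;
}
```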
- by utilizing DDS-standard ‘ownership strength’
- The DDS standard includes support for replication by offering ‘ownership strength’ as a writer QoS-attribute that assigns a logical ‘strength’ to the published data. The middleware then ensures that only the data of the highest-strength writer is delivered to the subscribing application.
- This implies an automatic fail-over to the next-highest strength when the ‘highest-strength’ writer application loses its liveliness
- Note that this mechanism removes the dependency on ordering by source-timestamp
- This solution is not transparent to the application, although a typical usage-pattern is to provide the to-be-used strength as an environment parameter to the replicated applications, so that the business-logic itself can be exactly the same between replicas (see the sketch below this item)
- here too, the data goes over the wire multiple times, as the selection of the highest strength is made at the receiving side
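A sketch of the usage-pattern mentioned above, again with the ISO C++ DCPS API; the environment variable name REPLICA_STRENGTH, the SensorData type and the topic name are hypothetical:

```cpp
#include <cstdlib>
#include <dds/dds.hpp>
#include "SensorData_DCPS.hpp"  // hypothetical include for an IDL-generated type

int main() {
    // Each replica is started with a different strength, e.g.
    //   REPLICA_STRENGTH=20 ./publisher    (primary)
    //   REPLICA_STRENGTH=10 ./publisher    (backup)
    const char* env = std::getenv("REPLICA_STRENGTH");
    const int strength = env ? std::atoi(env) : 0;

    dds::domain::DomainParticipant dp(0);
    dds::topic::Topic<SensorData> topic(dp, "SensorTopic");
    dds::pub::Publisher pub(dp);

    // EXCLUSIVE ownership: only data from the live writer with the highest
    // strength is delivered; if that writer loses its liveliness, the middleware
    // fails over to the next-highest strength automatically.
    dds::pub::qos::DataWriterQos wqos = pub.default_datawriter_qos()
        << dds::core::policy::Ownership::Exclusive()
        << dds::core::policy::OwnershipStrength(strength);
    dds::pub::DataWriter<SensorData> writer(pub, topic, wqos);

    SensorData sample;            // payload left default-initialized for brevity
    writer.write(sample);
    return 0;
}
```

Since Ownership is also a request/offered QoS, the subscribing side must request EXCLUSIVE ownership on its readers for this arbitration to take place.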
- by using ‘strength-aware’ writers
- this solution conserves bandwidth by preventing data from being sent multiple times
- as DDS allows applications to query the so-called ‘built-in topics’ that capture information about all DDS-entities in the system, a writer can detect that it is currently not the highest-strength writer (by consulting the DCPSPublication built-in topic for all relevant replicated writers) and can therefore refrain from actually publishing data
- this business-logic can easily be wrapped in a library, so that each individual application that utilizes it is shielded from the details (see the sketch below this item)
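A sketch of such a ‘strength-aware’ writer, assuming the standard ‘DCPSPublication’ built-in topic is reachable through the built-in subscriber of the ISO C++ DCPS API; SensorData, the topic name and REPLICA_STRENGTH are again hypothetical placeholders:

```cpp
#include <cstdlib>
#include <iterator>
#include <string>
#include <vector>
#include <dds/dds.hpp>
#include "SensorData_DCPS.hpp"  // hypothetical include for an IDL-generated type

// Returns true if the DCPSPublication built-in topic shows another writer for
// the given topic that advertises a higher ownership strength than ours.
bool stronger_writer_exists(const dds::domain::DomainParticipant& dp,
                            const std::string& topic_name,
                            int my_strength) {
    dds::sub::Subscriber builtin = dds::sub::builtin_subscriber(dp);

    std::vector<dds::sub::DataReader<dds::topic::PublicationBuiltinTopicData>> readers;
    dds::sub::find<dds::sub::DataReader<dds::topic::PublicationBuiltinTopicData>>(
        builtin, "DCPSPublication", std::back_inserter(readers));

    for (auto& r : readers) {
        auto samples = r.read();
        for (const auto& s : samples) {
            if (!s.info().valid()) continue;
            // Note: a production version should also verify that the remote
            // writer is still alive (instance state / liveliness); omitted here.
            const dds::topic::PublicationBuiltinTopicData& p = s.data();
            if (p.topic_name() == topic_name &&
                p.ownership_strength().value() > my_strength) {
                return true;
            }
        }
    }
    return false;
}

int main() {
    const char* env = std::getenv("REPLICA_STRENGTH");
    const int strength = env ? std::atoi(env) : 0;

    dds::domain::DomainParticipant dp(0);
    dds::topic::Topic<SensorData> topic(dp, "SensorTopic");
    dds::pub::Publisher pub(dp);
    dds::pub::DataWriter<SensorData> writer(pub, topic,
        pub.default_datawriter_qos()
            << dds::core::policy::Ownership::Exclusive()
            << dds::core::policy::OwnershipStrength(strength));

    SensorData sample;            // payload left default-initialized for brevity
    if (!stronger_writer_exists(dp, "SensorTopic", strength)) {
        writer.write(sample);     // only the strongest replica puts data on the wire
    }
    return 0;
}
```

In practice such a check would run periodically or be driven by publication/liveliness events, and it is exactly this logic that can be hidden in a small library so that the application code of each replica stays identical.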