There are a number of technical reasons why OpenSplice does not allow QoS changes, even though the OMG specification indicates that some should be allowed. This article covers that reasoning and explains why QoS changes, if allowed, can lead to inconsistencies.
Difficulties
The first difficulty is that the specification does not define whether an individual sample is associated with a specific QoS. We take the position that, when writing a sample, it is the QoS of the writer/publisher in effect at the time of the write that applies to that sample. If a QoS is changed afterwards, that change should not affect the interpretation of that sample.
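To make this concrete, here is a minimal sketch using the Eclipse Cyclone DDS C API purely for illustration (the OpenSplice API differs in detail; “MyType” and its generated descriptor MyType_desc are placeholders for an IDL-generated type). Under this position, the first write below is governed by ownership strength 10 regardless of the later change:

```c
#include <dds/dds.h>

int main (void)
{
  dds_entity_t participant = dds_create_participant (DDS_DOMAIN_DEFAULT, NULL, NULL);
  dds_entity_t topic = dds_create_topic (participant, &MyType_desc, "Example", NULL, NULL);

  dds_qos_t *qos = dds_create_qos ();
  dds_qset_ownership (qos, DDS_OWNERSHIP_EXCLUSIVE);
  dds_qset_ownership_strength (qos, 10);
  dds_entity_t writer = dds_create_writer (participant, topic, qos, NULL);

  MyType sample = { 0 };
  (void) dds_write (writer, &sample);   /* this sample: interpreted with strength 10 */

  dds_qset_ownership_strength (qos, 20);
  if (dds_set_qos (writer, qos) == DDS_RETCODE_OK)  /* implementations may reject this */
    (void) dds_write (writer, &sample); /* from here on: strength 20 */

  dds_delete_qos (qos);
  dds_delete (participant);
  return 0;
}
```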
The second difficulty is that, in the DDSI protocol, discovery, and thus QoS changes, are asynchronous. That means you cannot trust the subscribers to have the same view of the currently applicable QoS unless you (as sketched after this list):
1. stop the outgoing data stream until everything has been acknowledged;
2. distribute the QoS update;
3. wait for the QoS update to have been processed;
4. continue the outgoing data stream.
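A sketch of that protocol, again using the Eclipse Cyclone DDS C API for illustration (the function name and the 10-second timeout are arbitrary choices). Note that step 3 cannot be expressed at all:

```c
#include <dds/dds.h>

/* Sketch of the four-step protocol above; `writer` is an existing
   reliable writer and `newqos` the updated QoS. */
static dds_return_t precise_qos_change (dds_entity_t writer, const dds_qos_t *newqos)
{
  /* 1. stop writing; wait until all outstanding samples are acknowledged */
  dds_return_t rc = dds_wait_for_acks (writer, DDS_SECS (10));
  if (rc != DDS_RETCODE_OK)
    return rc;

  /* 2. distribute the QoS update via discovery */
  rc = dds_set_qos (writer, newqos);

  /* 3. wait until all subscribers have processed the update: the DDSI
     specification offers no interoperable way to do this, which is
     precisely the problem described in the text */

  /* 4. the caller may now continue the outgoing data stream */
  return rc;
}
```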
In a real-time environment, this is not a good strategy. Moreover, the specification doesn’t contain any statement that allows one to implement step 3 in an interoperable manner.
Alternatives?
An alternative appears to be what the DDSI specification calls “inline QoS”. In principle, it is possible to include all changed QoS settings in every sample (one cannot stop the data stream, because step 3 above cannot be implemented). The first problem here is that the specification doesn’t guarantee that the inline QoS overrules whatever has been discovered; it merely states that it “Contains QoS that may affect the interpretation of the message”, which is vague enough to be useless.
So that leaves no interoperable mechanism for “precise” QoS updates (“precise” in the same sense as in exception handling: the change applies exactly from the first sample written after it). Without such precise QoS updates, two readers may receive the QoS change at different points in the sample stream, so the same sample could be interpreted differently by each and end up in a different state. This violates the core “eventually consistent” model of DDS.
Matching QoS
All of this applies to QoS settings that do not affect the rules for matching readers and writers (such as the “ownership strength” setting used with the “exclusive ownership” mode). It applies doubly when changing QoS settings that do affect matching, such as the partition QoS. When the set of readers matched to a writer changes, the set of addresses to which the data needs to be sent may change as well.
For example, if a writer that is initially matched only with reader A then changes partition so that it matches only reader B, that writer will first send data to the IP address where reader A resides, and then to the IP address where reader B resides. But what if the last sample written before the change happened to be lost on the network? Then it is too late for reader A to find out and request a retransmit (it may be possible to keep historical discovery data around in the writer, but the cost of that in terms of complexity is very high).
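The scenario in code, as a sketch against the Eclipse Cyclone DDS C API (whether dds_set_qos accepts a partition change at all is implementation-dependent; switch_partition is a made-up helper name):

```c
#include <dds/dds.h>

/* Sketch: a publisher first in partition "A", then switched to "B". */
static void switch_partition (dds_entity_t publisher)
{
  const char *part_a[] = { "A" };
  const char *part_b[] = { "B" };
  dds_qos_t *qos = dds_create_qos ();

  dds_qset_partition (qos, 1, part_a);
  (void) dds_set_qos (publisher, qos);  /* writers now match readers in "A" */

  /* ... data written here goes to reader A's address; a sample lost on the
     network at this point can no longer be repaired after the switch ... */

  dds_qset_partition (qos, 1, part_b);
  (void) dds_set_qos (publisher, qos);  /* writers now match readers in "B" */

  dds_delete_qos (qos);
}
```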
Not only does this mean an undetected lost sample: if we add to the scenario a third reader that was matched initially and remains matched after the hypothetical partition change, it may be that this third reader does get the data that reader A should have received but never did. That is arguably worse than a lost sample, as it is an inconsistency in the system.
In another scenario, imagine that the writer switches back and forth between two partitions (so that reader A is matched some of the time and reader B the rest of the time). The readers will then observe sample loss and request retransmits. A reader requesting a retransmit is a matched reader, so if the data is still available, should the request be honoured? The specification suggests yes, but the data was never intended for that reader. This could be fixed by tracking the sequence number published by the writer at the time the reader was matched (see the sketch below), but the complications are multiplying.
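What such tracking would amount to, as a purely hypothetical sketch (none of this is an existing API, just an illustration of the bookkeeping a writer would have to do):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-reader bookkeeping for the fix suggested above. */
struct reader_match {
  uint64_t reader_id;     /* identifies the matched reader */
  uint64_t seq_at_match;  /* writer sequence number when the match was made */
};

/* Honour a retransmit request only for data written after the match:
   anything older was never intended for this reader. */
static bool honour_retransmit (const struct reader_match *m, uint64_t requested_seq)
{
  return requested_seq >= m->seq_at_match;
}
```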
Transient Local QoS and Late Joiners
There are worse scenarios. The above is the type of problem that occurs with “volatile” data. If one instead considers “transient-local” data, that is, data held onto by the writer so it can provide some context to a late-joining reader, one has to ask whether a reader that is newly matched as a consequence of a partition change should receive the data published prior to that change. The answer, in our view, is not a simple “yes” or “no”, or at least not if the partition changes multiple times. Then the simple solution of tracking the sequence number at the time of the match no longer suffices, and instead one must at the very least check the full QoS at the time of the write against the QoS of the reader.
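For reference, a sketch of a transient-local writer and a late-joining reader, again using the Eclipse Cyclone DDS C API for illustration (transient_local_example is a made-up name; participant and topic are assumed to exist):

```c
#include <dds/dds.h>

static void transient_local_example (dds_entity_t participant, dds_entity_t topic)
{
  dds_qos_t *qos = dds_create_qos ();
  dds_qset_durability (qos, DDS_DURABILITY_TRANSIENT_LOCAL);
  dds_qset_reliability (qos, DDS_RELIABILITY_RELIABLE, DDS_SECS (1));

  /* samples written by this writer are retained for late joiners */
  dds_entity_t writer = dds_create_writer (participant, topic, qos, NULL);
  (void) writer;

  /* a reader created later receives that retained history on matching;
     if the writer changed partition in between, which part of the history
     it should receive is exactly the open question above */
  dds_entity_t reader = dds_create_reader (participant, topic, qos, NULL);
  (void) dds_reader_wait_for_historical_data (reader, DDS_SECS (5));

  dds_delete_qos (qos);
}
```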
This raises the question: what if the reader changes its QoS as well? Deciding whether to honour a retransmit request then arguably depends on the QoS that the reader had in the past as well as the QoS that the writer had in the past. Keeping track of all this could quickly get out of control.
Another matter is that, for the purpose of providing the correct historical data set to a late-joining reader, one has to account for possible QoS changes in the past. In practice, this means that one would have to request historical data from all writers that may have matched in the past, even if they do not match now. If one doesn’t do this, the data sets of a reader that has been present from the beginning and of a reader that joined later will not be consistent, which violates the core principle of DDS; but if one does, one violates the DDSI specification by matching endpoints that have mismatching QoS settings (with all the ensuing interoperability problems).
DDS Security
Finally, the DDS Security specification explicitly rules out QoS changes that affect matching for non-volatile readers and writers. That means DDS Security could no longer be enabled on such a system without a rewrite. That’s not an attractive proposition.
Conclusion
Clearly, it is possible to implement QoS changes correctly for volatile data under some assumptions on how the processing of discovery data behaves. Without being careful about these assumptions (and indeed, without removing them), what exactly happens when a partition changes is unpredictable. That is, for us, sufficient reason not to support it.