SKEDSOFT

Real Time Systems

Requirements: The architectural requirements for the communication infrastructure of a distributed real-time system follow from the discussion about the properties of real-time data elaborated in the previous chapters. These requirements are substantially different from the requirements of non-real-time communication services.

Timeliness: The most important difference between a real-time communication system and a non-real-time communication system is the requirement for short message-transport latency and minimal jitter.

Short Message-Transport Latency. The real-time duration of a distributed real-time transaction, starting with the reading of sensors and terminating with the output of the results to an actuator, depends on the time needed for the computations within the components and the time needed for the message transport among the involved components. This duration should be as small as possible, so that the dead time in control loops is minimized. It follows that the worst-case message-transport latency of a real-time protocol should be small.

Minimal Jitter. The jitter is the difference between the worst-case message-transport latency and the best-case message-transport latency. A large jitter has a negative effect on the duration of the action delay and the precision of the clock synchronization.
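The definition of jitter can be illustrated with a minimal Python sketch. The function name and the sample values are hypothetical, and measured samples only estimate the jitter; the true worst case is a design bound of the protocol, not something a finite measurement can establish.

```python
def jitter(latencies_us):
    """Return worst-case minus best-case latency over a set of samples."""
    if not latencies_us:
        raise ValueError("need at least one latency sample")
    return max(latencies_us) - min(latencies_us)


# Hypothetical message-transport latencies in microseconds.
samples_us = [120, 135, 118, 410, 122]
print(jitter(samples_us))  # 410 - 118 = 292
```

A single slow outlier dominates the result, which is why the worst case matters more than the average for a real-time protocol.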

Clock Synchronization. A real-time image must be temporally accurate at the instant of use. In a distributed system, the temporal accuracy can only be checked if the duration between the instant of observation of an RT-entity, observed by the sensor node, and the instant of use, determined by the actuator node, can be measured. This requires the availability of a global time base of proper precision among all involved nodes. It is up to the communication system to establish such a global time and to synchronize the nodes, e.g., by following the IEEE 1588 standard for clock synchronization. If fault tolerance is required, two independent self-checking channels must be provided to link an end system to the fault-tolerant communication infrastructure. The clock synchronization messages must be provided on both channels in order to tolerate the loss of any one of them.
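As a rough sketch of why a global time base is needed, the following Python function checks whether a real-time image is still temporally accurate at its instant of use. The parameter names (`d_acc` for the permitted age of the image, `precision` for the precision of the global time) are assumptions introduced here, and adding the precision as a safety margin is one conservative choice, not a rule taken from the text.

```python
def is_temporally_accurate(t_observation, t_use, d_acc, precision):
    """Conservatively check the age of a real-time image.

    t_observation: timestamp taken at the sensor node (global time)
    t_use:         timestamp taken at the actuator node (global time)
    d_acc:         permitted age of the image (assumed parameter)
    precision:     precision of the global time base (assumed margin)
    """
    measured_age = t_use - t_observation
    if measured_age < 0:
        raise ValueError("instant of use precedes instant of observation")
    # The two timestamps come from different nodes, so the measured age
    # is only trustworthy up to the precision of the global time.
    return measured_age + precision <= d_acc


# Hypothetical values in microseconds.
print(is_temporally_accurate(1000, 1400, d_acc=500, precision=10))
print(is_temporally_accurate(1000, 1495, d_acc=500, precision=10))
```

Without synchronized clocks the subtraction `t_use - t_observation` would have no meaning, since the two timestamps would refer to unrelated time bases.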

Dependability:

Communication Reliability. In real-time communication, the use of robust channel encoding, the use of error-correcting codes for forward error correction, or the deployment of diffusion-based algorithms, where replicated copies of a message are sent on diverse channels (e.g., frequency hopping in wireless systems), possibly at different times, are the techniques of choice for improving the communication reliability. In many non-real-time communication systems, reliability is achieved by time redundancy, i.e., a lost message is retransmitted. This tradeoff between time and reliability increases the jitter significantly. It should not be part of the basic message transport service (BMTS), since it is up to the application to decide whether this tradeoff is desired.

Example: In the positive acknowledgment-or-retransmission (PAR) protocol, widely used in event-triggered non-real-time communication, a sender waits for a given time until it has received a positive acknowledgment message from the receiver indicating that the previous message has arrived correctly. If the timeout elapses before the acknowledgment message arrives at the sender, the original message is retransmitted. This procedure is repeated n times (n is protocol specific) before a permanent failure of the communication is reported to the sender. The jitter of the PAR protocol is substantial, since in most cases the first try will be successful, while in a few cases the message will arrive after n times the timeout value plus the worst-case message-transport latency. Since the timeout value must be longer than two worst-case message-transport latencies (one for the original message and one for the acknowledgment message), the jitter of PAR is larger than 2n worst-case message-transport latencies.
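The arithmetic behind the PAR jitter bound can be checked with a short Python sketch. The function names and the numeric values are hypothetical; the formulas follow the description above (worst case: n timeouts plus one worst-case latency; best case: one best-case latency).

```python
def par_worst_case_latency(n_retries, timeout, d_max):
    """Latest arrival: n timeouts elapse, then one worst-case transport."""
    return n_retries * timeout + d_max


def par_jitter(n_retries, timeout, d_max, d_min):
    """Worst-case minus best-case latency of the PAR protocol."""
    if timeout <= 2 * d_max:
        # The timeout must cover the message plus its acknowledgment.
        raise ValueError("timeout must exceed two worst-case latencies")
    return par_worst_case_latency(n_retries, timeout, d_max) - d_min


# Hypothetical values in milliseconds.
n, timeout, d_max, d_min = 3, 25, 10, 2
print(par_worst_case_latency(n, timeout, d_max))  # 3 * 25 + 10 = 85
print(par_jitter(n, timeout, d_max, d_min))       # 85 - 2 = 83
print(par_jitter(n, timeout, d_max, d_min) > 2 * n * d_max)  # 83 > 60
```

With these numbers the jitter is more than eight times the worst-case latency of a single transport, which illustrates why retransmission is left out of the basic message transport service.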

Temporal Fault Containment of Components. It is impossible to maintain the communication among the correct components using a shared communication channel if the temporal errors caused by a faulty component are not contained. A shared communication channel must erect temporal firewalls that contain the temporal faults of a component (a babbling idiot), so that the communication among the components that are not directly affected by the faulty component is not compromised. This requires that the communication system holds information about the intended (permitted) temporal behavior of a component and can disconnect a component that violates its temporal specification. If this requirement is not met, a faulty component can block the communication among the correct components.

Example: A faulty component that continuously sends high-priority messages on a CAN bus will block the communication among all other correct components and thus cause a total loss of communication among the correct components.
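One way to picture a temporal firewall is a guardian that knows the permitted sending rate of each component and disconnects any component that exceeds it. The Python sketch below is a simplified illustration under that assumption; the class `BusGuardian` and its minimum-interarrival-time rule are hypothetical and are not a mechanism of the CAN protocol.

```python
class BusGuardian:
    """Disconnects a sender that violates its permitted temporal behavior."""

    def __init__(self, min_interarrival):
        # Permitted behavior: sender name -> minimum time between sends.
        self.min_interarrival = dict(min_interarrival)
        self.last_send = {}
        self.disconnected = set()

    def admit(self, sender, now):
        """Return True if the message may pass onto the shared channel."""
        if sender in self.disconnected or sender not in self.min_interarrival:
            return False
        last = self.last_send.get(sender)
        if last is not None and now - last < self.min_interarrival[sender]:
            # Temporal specification violated: contain the faulty sender.
            self.disconnected.add(sender)
            return False
        self.last_send[sender] = now
        return True


guardian = BusGuardian({"A": 10, "B": 10})
print(guardian.admit("A", 0))   # True
print(guardian.admit("A", 2))   # False: A sends too early, is disconnected
print(guardian.admit("B", 3))   # True: B is unaffected by the babbling A
```

The key point is that the decision uses only a priori knowledge held by the communication system, so it does not depend on the cooperation of the faulty component.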

Error Detection. A message is an atomic unit that either arrives correctly or not at all. To detect if a message has been corrupted during transport, every message is required to contain a CRC field of redundant information so the receiver can validate the correctness of the data field. In a real-time system, the detection of a corrupted message or of message loss by the receiver is of particular concern.
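A minimal Python sketch of CRC-based error detection follows. It uses the CRC-32 function from the standard `zlib` module; the framing (a 4-byte big-endian CRC appended to the data field) and the function names `frame` and `unframe` are assumptions made for illustration, since the text does not prescribe a particular CRC or message layout.

```python
import struct
import zlib


def frame(payload: bytes) -> bytes:
    """Append a CRC-32 of the data field to the message."""
    return payload + struct.pack(">I", zlib.crc32(payload))


def unframe(message: bytes):
    """Return the data field if the CRC matches, otherwise None."""
    if len(message) < 4:
        return None
    payload = message[:-4]
    (received_crc,) = struct.unpack(">I", message[-4:])
    # A corrupted message is discarded: it arrives correctly or not at all.
    return payload if zlib.crc32(payload) == received_crc else None


message = frame(b"valve=open")
corrupted = bytes([message[0] ^ 0x01]) + message[1:]  # flip one bit
print(unframe(message))    # b'valve=open'
print(unframe(corrupted))  # None
```

Returning `None` for a corrupted message gives the receiver the atomic view described above, but the receiver still needs a separate mechanism, such as a timeout, to notice that an expected message never arrived.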

Example: Error detection on output. Consider a node at a control valve that receives output commands from a controller node. In case the communication is interrupted because the wires are cut, the control valve, the receiver, should enter a safe state, e.g., close the valve autonomously. The receiver, i.e., the control valve, must detect the loss of communication autonomously in order to be able to enter the safe state despite the fact that the wire has been cut.

The failure of a component of a distributed system should be detected by the communication protocol and should be reported consistently to all remaining correct components of the ensemble. In real-time systems, the prompt and consistent detection of component failures is the function of a membership service.
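The control-valve example can be sketched in Python as a receiver-side timeout. The class `ValveNode`, the timeout value, and the choice of "closed" as the safe position are hypothetical; the sketch assumes that the controller sends commands periodically, so that silence longer than the timeout indicates a loss of communication.

```python
SAFE_POSITION = "closed"  # assumed safe state for this valve


class ValveNode:
    """Receiver that enters a safe state when commands stop arriving."""

    def __init__(self, timeout, now):
        self.timeout = timeout
        self.last_command_at = now
        self.position = SAFE_POSITION
        self.in_safe_state = False

    def on_command(self, position, now):
        """Called whenever a valid output command is received."""
        self.last_command_at = now
        self.position = position
        self.in_safe_state = False

    def poll(self, now):
        """Called periodically by the valve node itself."""
        if now - self.last_command_at > self.timeout:
            # No command within the timeout: assume the link is lost.
            self.position = SAFE_POSITION
            self.in_safe_state = True
        return self.position


valve = ValveNode(timeout=100, now=0)
valve.on_command("open", now=10)
print(valve.poll(now=50))   # open
print(valve.poll(now=111))  # closed: 101 time units without a command
```

The detection runs entirely on the receiver, which is what allows the valve to react even though no message can reach it once the wire is cut.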

End-to-End Acknowledgment. End-to-end acknowledgment about the success or failure of a distributed action is needed in any scenario where multiple nodes cooperate to achieve a desired result [Sal84]. In a real-time system, the definitive end-to-end acknowledgment about the ultimate success or failure of a communication action can come from a component that is different from the receiver of an outgoing message. An outgoing message to an actuator in the environment must cause some intended physical effect in the environment. A sensor component that is different from the actuator component monitors this intended physical effect. The result observed by this sensor component is the definitive end-to-end acknowledgment of the outgoing message and the intended physical action.

Example: Figure 7.1 shows an example of an end-to-end acknowledgment of the output message to a control valve by a flow sensor that is connected to a different node.
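The valve and flow-sensor example can be expressed as a small Python check. The function name and the flow threshold are hypothetical, and the sketch assumes that the sensor reading is taken after the valve has had time to move.

```python
def end_to_end_ack(commanded_open, measured_flow, flow_threshold):
    """Return True if the observed flow matches the commanded valve state.

    commanded_open: True if the output message asked the valve to open
    measured_flow:  reading from the independent flow sensor
    flow_threshold: flow above which the valve is considered open (assumed)
    """
    flowing = measured_flow > flow_threshold
    return flowing == commanded_open


# Hypothetical readings in liters per minute.
print(end_to_end_ack(True, measured_flow=12.5, flow_threshold=1.0))  # True
print(end_to_end_ack(True, measured_flow=0.0, flow_threshold=1.0))   # False
```

A `False` result means that the intended physical effect did not occur, regardless of whether the message itself was delivered to the valve node.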

Determinism. The behavior of the basic message transport service (BMTS) should be deterministic such that the order of messages is the same on all channels and the instants of message arrival of replicated messages that travel on redundant independent channels are close together. This desired property, which has been discussed at length in Sect. 5.6, is required for the implementation of fault tolerance by active redundancy.

Example: If in a fault-tolerant configuration the message order on two independent communication channels is not the same, then the fault-masking capability may be lost due to the missing replica determinism.
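The two conditions named above (same message order, arrival instants close together) can be checked with a short Python sketch. The function name, the trace format, and the `max_skew` parameter are assumptions made for illustration.

```python
def replicas_consistent(channel_a, channel_b, max_skew):
    """Check message order and arrival skew on two redundant channels.

    Each channel is a list of (message_id, arrival_time) tuples in the
    order of arrival; max_skew is the permitted arrival-time difference.
    """
    ids_a = [msg_id for msg_id, _ in channel_a]
    ids_b = [msg_id for msg_id, _ in channel_b]
    if ids_a != ids_b:
        return False  # different order (or content): determinism is lost
    return all(
        abs(t_a - t_b) <= max_skew
        for (_, t_a), (_, t_b) in zip(channel_a, channel_b)
    )


a = [("m1", 100), ("m2", 200)]
b = [("m1", 102), ("m2", 203)]
b_swapped = [("m2", 102), ("m1", 203)]
print(replicas_consistent(a, b, max_skew=5))          # True
print(replicas_consistent(a, b_swapped, max_skew=5))  # False
```

In the swapped case the two replicas would process the messages in different orders and could reach different states, so a comparison of their outputs could no longer mask a fault.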