Generic replication framework architecture

Last modified by Thomas Mortagne on 2026/02/25 14:37

A message is given to the sender, which serializes it on disk (<data>/replication/sender/) before putting it in the send queue associated with each instance to which the message needs to be sent. The messages are picked from the queue and sent to each target instance through HTTP.

On the target instance side, the received message is serialized on disk and added to an in-memory handling queue, and an HTTP response is sent back to the sender instance. The messages are then picked from the handling queue and provided to the receivers associated with each message type.

If the target instance responds that the message was stored successfully, the message is removed from the sender's disk.
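
The store-and-forward flow described above can be sketched as follows. This is only an illustration under stated assumptions, not the framework's actual implementation: the `Sender` class, the JSON file layout, and the `transport` callable (standing in for the HTTP call) are all hypothetical, and deleting the file only once every target has confirmed is an assumption as well.

```python
import json
import uuid
from collections import deque
from pathlib import Path


class Sender:
    """Hypothetical store-and-forward sender (all names are illustrative)."""

    def __init__(self, store_dir, transport):
        self.store_dir = Path(store_dir)
        self.store_dir.mkdir(parents=True, exist_ok=True)
        # Stand-in for the HTTP call: returns True when the target
        # reports that it stored the message successfully.
        self.transport = transport
        self.queues = {}  # one in-memory send queue per target instance

    def send(self, message, targets):
        # 1. Serialize the message on disk before queueing it.
        message_id = uuid.uuid4().hex
        path = self.store_dir / f"{message_id}.json"
        path.write_text(json.dumps({"message": message, "targets": list(targets)}))
        # 2. Put it in the send queue associated with each target.
        for target in targets:
            self.queues.setdefault(target, deque()).append(message_id)
        return message_id

    def flush(self, target):
        # 3. Pick messages from the target's queue and send them in order.
        pending = self.queues.get(target, deque())
        while pending:
            path = self.store_dir / f"{pending[0]}.json"
            entry = json.loads(path.read_text())
            if not self.transport(target, entry["message"]):
                break  # leave it queued (and on disk) for a later retry
            pending.popleft()
            # 4. Assumption: the file is deleted once no target remains.
            entry["targets"].remove(target)
            if entry["targets"]:
                path.write_text(json.dumps(entry))
            else:
                path.unlink()
```

The message file stays on disk for as long as at least one delivery is unconfirmed, which is what makes recovery after a crash possible.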

If the target instance fails to serialize the message or is not reachable, the sender retries the message after an increasing delay: 1s, then 2s, then 4s, and so on, until the delay reaches 2h, after which it keeps retrying every 2h until the message is delivered.

The flushing of a sending queue is triggered in the following cases:

  • The associated target instance is sending a ping (which happens when the instance starts)
  • A message is received from the associated target instance
  • An administrator uses the button exposed by the administration UI to force the flushing of waiting messages
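
The retry schedule described above (1s, 2s, 4s, and so on, capped at 2h) amounts to an exponential backoff with a ceiling. Here is a minimal sketch, assuming the delay simply doubles on each attempt until it hits the cap; `retry_delay` and `MAX_DELAY_SECONDS` are illustrative names, not part of the framework.

```python
MAX_DELAY_SECONDS = 2 * 60 * 60  # the 2h ceiling


def retry_delay(attempt):
    """Return the delay in seconds before retry number `attempt` (1-based)."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    # Double the delay on every attempt, but never exceed the ceiling.
    return min(2 ** (attempt - 1), MAX_DELAY_SECONDS)


print([retry_delay(n) for n in range(1, 6)])  # prints [1, 2, 4, 8, 16]
```

Under this assumption the cap is reached on the 14th retry (the 13th waits 4096s), after which every retry waits 7200s.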

If the sending instance crashes or is stopped before all the messages are fully sent, they are loaded from the disk during the next initialization and re-injected in the send queue.
If the receiving instance crashes or is stopped before all the received messages have been handled by receivers, they are loaded from the disk during the next initialization and re-injected in the handling queue.
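
The recovery step can be pictured as rebuilding the in-memory queue from whatever files are still on disk at startup. The sketch below is hypothetical: the `reload_pending` function, the `.msg` extension, and the idea that file names sort in arrival order are assumptions made for illustration only.

```python
from collections import deque
from pathlib import Path


def reload_pending(store_dir):
    """Rebuild an in-memory queue from the message files left on disk."""
    # Assumption: file names sort in the order the messages were stored.
    files = sorted(Path(store_dir).glob("*.msg"), key=lambda p: p.name)
    return deque(p.name for p in files)
```

Because a file is deleted only after its message has been sent or handled, any file found at startup corresponds to unfinished work.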

Signature

When instances are linked, they exchange public keys. The private key is stored on the filesystem (<data>/replication/keys/).

When an instance sends a message, it also signs it so that the receiving instance can verify that the sender really is who it claims to be.

It's possible to reset an instance key from the administration UI (generally when the key is suspected to have been compromised).

Since everything on disk is indexed in memory, you need to stop the instance before making any direct modification to what's stored in <data>/replication/* (which is generally discouraged).

Clustering

Replication fully supports the XWiki clustering use case, and you generally don't need to take care of anything.

The private keys stored on disk are duplicated for each cluster member.

However, the messages to send are specific to each member that produced them, and are never shared between cluster members. This means that if a member is down, its remaining messages to send will only be sent when the cluster member is back online.

Consequently, in the rare cases where you need to reset the Replication-related data on disk, make sure to do it on all the cluster members.
