Generic replication framework architecture
A message is given to the sender, which serializes it on disk (<data>/replication/sender/) before putting it in the send queue associated with each instance to which the message needs to be sent. Messages are picked from each queue and sent to the corresponding target instance through HTTP.
On the target instance side, the received message is serialized on disk; it is then added to an in-memory handling queue, and an HTTP response is sent back to the sender instance. Messages are picked from the handling queue and dispatched to the receivers associated with each message type.
If the target instance responds that the message was stored successfully, it is removed from the sender's disk.
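As a rough sketch of this store-and-forward loop (the class and method names here are hypothetical, not the extension's actual API):

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of the per-target send queue: serialize first,
// send over HTTP, and only forget a message once it is acknowledged.
public class SenderQueue {
    private final Path storeDir;      // e.g. a subdirectory of <data>/replication/sender/
    private final URI targetEndpoint; // HTTP endpoint of the target instance
    private final BlockingQueue<Path> queue = new LinkedBlockingQueue<>();
    private final HttpClient http = HttpClient.newHttpClient();

    public SenderQueue(Path storeDir, URI targetEndpoint) {
        this.storeDir = storeDir;
        this.targetEndpoint = targetEndpoint;
    }

    // 1. Serialize the message on disk first, then enqueue it for sending.
    public void submit(byte[] serializedMessage, String messageId) throws IOException {
        Path file = storeDir.resolve(messageId);
        Files.write(file, serializedMessage);
        queue.add(file);
    }

    // 2. Send loop: POST each message; delete the file only after the target
    //    acknowledges that it stored the message successfully.
    public void sendLoop() throws Exception {
        while (true) {
            Path file = queue.take();
            HttpRequest request = HttpRequest.newBuilder(targetEndpoint)
                .POST(HttpRequest.BodyPublishers.ofFile(file))
                .build();
            HttpResponse<Void> response = http.send(request, HttpResponse.BodyHandlers.discarding());
            if (response.statusCode() == 200) {
                Files.delete(file); // acknowledged: safe to forget
            } else {
                queue.add(file);    // kept for retry (see the backoff sketch below)
            }
        }
    }
}
```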
If the target instance fails to serialize the message or is not reachable, the sender retries the message after an increasing delay: 1s, then 2s, then 4s, and so on, doubling until the delay reaches 2h, after which it keeps retrying every 2h until the send succeeds.
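A minimal sketch of that doubling delay, capped at two hours:

```java
import java.time.Duration;

// Sketch of the doubling retry delay described above: 1s, 2s, 4s, ... capped at 2h.
public final class RetryBackoff {
    private static final Duration INITIAL = Duration.ofSeconds(1);
    private static final Duration MAX = Duration.ofHours(2);

    // attempt is 0-based: attempt 0 -> 1s, attempt 1 -> 2s, attempt 2 -> 4s, ...
    public static Duration delayFor(int attempt) {
        // 2^13 s = 8192 s already exceeds 2 h, so capping the exponent is enough.
        long factor = 1L << Math.min(attempt, 13);
        Duration delay = INITIAL.multipliedBy(factor);
        return delay.compareTo(MAX) > 0 ? MAX : delay;
    }
}
```

From attempt 13 onward the computed delay (8192s) exceeds two hours, so the cap applies and every further retry waits 2h.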
The flushing of a sending queue is triggered in the following cases (see the sketch after this list):
- The associated target instance sends a ping (which happens when that instance starts)
- A message is received from the associated target instance
- The administration UI exposes a button to force the flushing of waiting messages
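As an illustration of these triggers, a hypothetical set of hooks (the real extension wires this through its own internal events):

```java
// Hypothetical hooks showing when a waiting send queue gets flushed.
public class FlushTriggers {
    private final Runnable flushSendQueue; // re-triggers sending of waiting messages

    public FlushTriggers(Runnable flushSendQueue) {
        this.flushSendQueue = flushSendQueue;
    }

    // The target instance pinged us, which happens when it starts.
    public void onPingReceived() {
        flushSendQueue.run();
    }

    // Receiving any message from the target proves it is reachable again.
    public void onMessageReceived() {
        flushSendQueue.run();
    }

    // The administration UI button forces a flush.
    public void onAdminForceFlush() {
        flushSendQueue.run();
    }
}
```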
If the sending instance crashes or is stopped before all the messages are fully sent, they are loaded from disk during the next initialization and re-injected into the send queue.
If the receiving instance crashes or is stopped before all the received messages have been handled by receivers, they are loaded from disk during the next initialization and re-injected into the receive queue.
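A minimal sketch of that recovery step, assuming one file per serialized message under the queue's directory (a hypothetical layout):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Queue;
import java.util.stream.Stream;

// Sketch of crash recovery: any message still serialized on disk at startup
// was not fully processed, so it is re-injected into the in-memory queue.
public final class QueueRecovery {
    public static void reloadPendingMessages(Path queueDir, Queue<Path> queue) throws IOException {
        if (!Files.isDirectory(queueDir)) {
            return; // nothing was pending when the instance stopped
        }
        try (Stream<Path> files = Files.list(queueDir)) {
            files.sorted()             // hypothetical: restore a deterministic order
                 .forEach(queue::add); // re-inject each serialized message
        }
    }
}
```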
Signature
When instances are linked, they exchange public keys. The private part of each key pair is stored on the filesystem (<data>/replication/keys/).
When an instance sends a message, it also signs it so that the receiving instance can verify that the sender really is who it claims to be.
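A minimal sketch of such a sign/verify exchange using the standard java.security API (the RSA key type and SHA256withRSA algorithm are assumptions; the extension's actual implementation and wire format may differ):

```java
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;

// Sketch of message signing: the sender signs with its private key,
// the receiver verifies with the public key exchanged at linking time.
public class MessageSigning {
    public static byte[] sign(byte[] message, PrivateKey privateKey) throws Exception {
        Signature signer = Signature.getInstance("SHA256withRSA"); // assumed algorithm
        signer.initSign(privateKey);
        signer.update(message);
        return signer.sign();
    }

    public static boolean verify(byte[] message, byte[] signature, PublicKey senderPublicKey)
        throws Exception {
        Signature verifier = Signature.getInstance("SHA256withRSA");
        verifier.initVerify(senderPublicKey);
        verifier.update(message);
        return verifier.verify(signature);
    }

    public static void main(String[] args) throws Exception {
        KeyPair keyPair = KeyPairGenerator.getInstance("RSA").generateKeyPair();
        byte[] message = "replication message".getBytes();
        byte[] signature = sign(message, keyPair.getPrivate());
        System.out.println(verify(message, signature, keyPair.getPublic())); // prints: true
    }
}
```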
It's possible to reset an instance's key from the administration UI (generally when the key is suspected to have been compromised).
Since everything on disk is indexed in memory, you need to stop the instance before making any direct modification to what is stored in <data>/replication/* (which is discouraged in general).
Clustering
Replication fully supports the XWiki clustering use case, and you generally don't need to do anything special.
The private keys stored on disk are duplicated on each cluster member.
However, the messages to send are specific to the member that produced them and are never shared between cluster members. This means that if a member is down, its remaining messages will only be sent when that cluster member is back online.
Consequently, in the rare cases where you need to reset the Replication-related data on disk, make sure to do it on all cluster members.
Troubleshooting
Messages are not sent
When an instance cannot send messages to another instance, details about the error (mainly the HTTP response of the latest send attempt) are available in the administration UI. This generally provides enough information to understand why the target instance could not be reached. If the problem is not a network problem but the target instance failed to store the message, you should find details about the cause in that instance's log (no space left on disk, invalid message, a bug, etc.).
Fully reset Replication setup
Replication-related data is stored in the following locations:
- Disk: the private key (used to sign messages), the remaining messages to send and the received messages not yet handled are stored on disk in <data>/replication
- Instance linking: all metadata related to other known replication instances is stored in the page XWiki.Replication.Instances
- Document state: metadata related to the state of replicated documents (their owner instance, their read-only and conflict statuses) is located in the database table replication_document
- Standard replication configuration: the configuration that indicates exactly where documents are replicated (when the standard controller is used) is located in the database table replication_entity_instances