Message Delivery Guarantees in a Distributed Setting?

What are the message delivery guarantees of Elixir when sending messages between two distributed Processes? I couldn’t find a clear answer but I think it is at-most-once, is this correct?

1 Like

Exactly once assuming that receiver is alive when message arrives to the remote node.

1 Like

Here’s a very relevant article.

Applicable section:

Sending messages to processes on other Erlang nodes works in the same way, albeit there’s now a risk of messages being lost in transit. Messages are guaranteed to be delivered as long as the distribution link between the nodes is active, but it gets tricky when the link goes down.

1 Like

I read that article before, but it remains a bit vague by saying “it gets tricky when the link goes down”. To me this sounds like it is possible that messages get lost when there is a disruption of the distribution link. Is that correct?

1 Like

Can the message get lost in transit? For example, being sent from one node but never arriving at the other node due to something going wrong in between.

1 Like

Regular TCP rules apply there. So if the message is lost it will be retried as per standard TCP rules.

3 Likes

It’s worth noting that send doesn’t make any guarantees that the pid you are sending a message to is alive. If you want acknowledgements you’ll need to wait to get an ack message back. The mechanics that @hauleth highlights are still relevant as far as the mechanics of distributed messaging, but even in a non distributed context just that send returned doesn’t mean that anybody handled your message.

3 Likes

Considering disconnection and messages being lost:

As you may have already found out, runtime does not return any information to the sender about whether message was sent or not. You don’t need distribution to see how it works

iex> pid = spawn fn -> raise "I am so dead" end
iex> # Wait for exception 
iex> send(pid, :anything)
:anything

So, like in any distributed system, sender must make sure that a message was received by awaiting on a response like:

iex> pid  = spawn fn -> receive do: ({x, caller} -> send(caller, :received)) end
iex> send(pid, {:something, self()})
iex> receive do
...>   :received -> :ok
...>   after 5_000 -> {:error, :no_response}
...> end

And this rule applies to literally any distributed system (even to parallel programs on a single machine). For further explanations on how this works check out GenServer.call documentation and gen_server:call implementation


Considering other guarantees

  1. send either sends one message, either sends nothing. There is no automatic retry and there is no check if the sender is even alive. However, it is guaranteed that one send will never result in two or more messages in message queue of the receiver

  2. send(:x) ; send(:y) if both messages are successfully sent, the :x will be the first in message queue of the receiver and the :y will be the second in queue

4 Likes