How to deal with messages when a cluster node crashes?

max-vc · July 30, 2020, 4:40pm

Hi All,

I am building an instant messaging app with a Phoenix backend.
I use each user's unique username as the channel name, so users can send messages to each other by name.
Now I have two nodes running, and they are connected to each other.

The question is: how can I know that a message cannot reach the other node, for example because the target node has crashed?
And should I check which node the user is connected to, and whether that node is online, before broadcasting or sending the message?

Thanks.

hubertlepicki · July 30, 2020, 7:11pm

That is very little info about how your infrastructure and code are set up, but sending back an acknowledgement message generally seems like the way to go…

max-vc · July 31, 2020, 1:09am

Thanks for the suggestion.
Yes, I will use an ACK flag to decide whether to resend a message when the user reconnects.

Here is the code I use to broadcast a user's message.
The question is: should I check whether the user or the user's node is online before broadcasting the message?

def handle_in("new_message", payload, socket) do
  %{"to" => to_username, "message" => conversation} = payload
  from_username = socket.assigns.username

  message = %{
    from: from_username,
    message: conversation
  }

  # Should I check whether the node hosting channel dm:`to_username` is online?
  # If not, store the message in the database and send it again when the user's node is back online.
  socket.endpoint.broadcast!("dm:#{to_username}", "new_message", message)
  # handle_in/3 must return a channel reply tuple; broadcast!/3 only returns :ok
  {:noreply, socket}
end

hubertlepicki · July 31, 2020, 1:37pm

I think the main problem here is that you are using Phoenix.PubSub (endpoint.broadcast), which is precisely what it says it is: a pub-sub mechanism. Pub-sub mechanisms are generally fire-and-forget, and Phoenix does it precisely this way. So, if you want to listen for some ack message, implementing that is up to you, and it won't be super easy, as you have to handle timeouts, out-of-order messages, etc.
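
To illustrate what a hand-rolled ack involves, here is a minimal sketch using plain processes rather than Phoenix channels. The module name AckSketch and the message shapes are made up for the example. It tags each message with a unique reference so that only the matching ack is accepted, and gives up after a timeout:

```elixir
defmodule AckSketch do
  # Sends `message` to `pid` and waits up to `timeout` ms for the matching ack.
  def send_with_ack(pid, message, timeout \\ 1_000) do
    ref = make_ref()
    send(pid, {:deliver, self(), ref, message})

    receive do
      # Only an ack carrying this exact ref counts; acks for other
      # messages (late or out of order) are left in the mailbox.
      {:ack, ^ref} -> :ok
    after
      timeout -> {:error, :timeout}
    end
  end

  # Receiver side: take one delivery and acknowledge it to the sender.
  def receive_and_ack do
    receive do
      {:deliver, from, ref, message} ->
        send(from, {:ack, ref})
        {:ok, message}
    end
  end
end
```

Note that a timeout here only means no ack arrived in time, not that the message was lost, so the receiver would still need to cope with duplicates if the sender retries.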

What I would do instead is register the process globally within the cluster and then perform a normal GenServer.call. GenServer.call already sets up a monitor on the destination process and will exit with an error if the process you are calling crashes. It also blocks while waiting for the response, so once your GenServer.call returns, you can be sure the destination process did in fact receive the message and didn't crash while processing it.

There are a number of ways to perform cluster-wide process registration, and maybe you'd need to group the processes too. There's pg2 and Horde.Registry, but also :global and others. I would have a look at Horde, as this is what I am using, and it works pretty much flawlessly in my relatively simple scenarios.
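
A minimal sketch of the register-and-call idea, using :global because it ships with OTP (Horde.Registry would be wired in through a :via tuple instead). The UserInbox module, its naming scheme, and the one-process-per-user layout are assumptions for illustration, not something taken from your code:

```elixir
defmodule UserInbox do
  use GenServer

  # Hypothetical naming scheme: one inbox process per username,
  # registered cluster-wide through :global.
  def name(username), do: {:global, {:user_inbox, username}}

  def start_link(username) do
    GenServer.start_link(__MODULE__, username, name: name(username))
  end

  # Returns :ok only after the inbox process has handled the message.
  def deliver(username, message, timeout \\ 5_000) do
    GenServer.call(name(username), {:deliver, message}, timeout)
  catch
    # GenServer.call exits if the name is not registered, the process or
    # its node goes down during the call, or the call times out.
    :exit, reason -> {:error, reason}
  end

  @impl true
  def init(username), do: {:ok, %{username: username, messages: []}}

  @impl true
  def handle_call({:deliver, message}, _from, state) do
    {:reply, :ok, %{state | messages: [message | state.messages]}}
  end
end
```

With something like this, the channel's handle_in could call UserInbox.deliver(to_username, message) and fall back to storing the message in the database whenever it gets {:error, reason} back.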

max-vc · July 31, 2020, 4:25pm

Thank you so much.
I was not aware that Elixir has this concept of process registration. Since your last reply, I have been looking at Phoenix Tracker, trying to find a way to register the user's process info when they join the channel.
It seems Horde.Registry is the answer.
Sir, one last question: as the IM app may have millions of clients connected, should I worry about performance when using Horde.Registry?