Distributed Erlang same node with different host names?

dantswain · April 18, 2021, 10:04pm

Hi! I was wondering if anyone knows whether it’s possible to connect two nodes when the host name used to connect differs from the host name the target node was started with?

The context here is deployment in a docker cluster, but I can reproduce this locally without docker.

For example, suppose my local network IP is 192.168.1.186:

# in terminal 1
iex --name node1@192.168.1.186
# in terminal 2
iex --name node2@192.168.1.186

Then I can connect to node 2 from node 1 as expected if I use that IP:

iex(node1@192.168.1.186)3> Node.connect(:"node2@192.168.1.186")
true
iex(node1@192.168.1.186)4> Node.list()                         
[:"node2@192.168.1.186"]

However, if I try to connect using `127.0.0.1`, it fails to connect:

iex(node1@192.168.1.186)3> Node.connect(:"node2@127.0.0.1")
false
iex(node1@192.168.1.186)4> Node.list()                         
[]

I’ve done a bit of digging on this using recon_trace on the dist_util module and it seems to fail to connect during the handshake step. I was thinking maybe the cookie value gets hashed with the node name somehow, but it doesn’t look like that’s the case.

If nothing else, I’m curious if anyone can explain what’s going on here just for my own education. I was able to work around this in my deployments, but I’d love to understand better.

Thanks!

hauleth · April 18, 2021, 10:13pm

It is simple - there is no such node registered in the EPMD that is listening on the given machine, so there is no node that you can connect to.

cmkarlsson · April 18, 2021, 10:38pm

This is often a source of confusion.
The node name is the full name. You can’t split it up into a name and a hostname/IP. If you named your node node1@192.168.1.186 then you must use that full name to connect. You can’t replace 192.168.1.186 with another IP for the same host, or with a hostname. I don’t know why it was designed this way.

If you must connect to a node using different IP addresses you need to set up DNS and have the name resolve differently. And the name needs to be an FQDN (which is not quite true, but it must have a dot in it).

dantswain · April 18, 2021, 10:52pm

I don’t think it’s epmd, for a couple of reasons. First, I can see the connection being initiated in the trace on both nodes, so it is at least attempting an actual connection to the node. Second, I have reproduced this experiment with epmd disabled (replaced with a different implementation) and gotten the same results. Possibly it’s not finding the node in epmd and then making some kind of broadcast attempt to connect?

I agree it seems that somewhere there is a check on the literal value of the node name. It seems to happen after an initial connection attempt is made, i.e. some part of the stack does split the host part of the name off from the whole thing, but the handshake appears to check the full name. I was hoping someone could explain just for the sake of education, as I couldn’t find anything in the docs that would explain it.

Fwiw I did get my setup to work using a forked version of caravan (uberbrodt/caravan on GitHub: epmd implementation and other OTP apps to make running Erlang/Elixir apps with Nomad and Consul easier).

dantswain · April 19, 2021, 1:14am

I redid my experiment with 3 nodes, again using tracing to see connection activity on all 3 nodes simultaneously. When node1 attempts to connect to node2, I see activity on node2 and not node3, and likewise trying to connect to node3 results in activity on node3 and not node2. So epmd does appear to be matching up the connection based on the first (before the @) part of the node name.
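
To make that observation concrete, here is a minimal sketch of the distinction, assuming the behaviour described in this thread: the part before the @ is what gets matched when locating the node, while the handshake appears to compare the whole name. `NodeNameSketch` and its functions are hypothetical helpers written for illustration; they are not part of OTP and do not reproduce the actual `dist_util` logic.

```elixir
defmodule NodeNameSketch do
  # Hypothetical helper: split a node name atom into the part before
  # the @ and the host part after it.
  def split(node) when is_atom(node) do
    [name, host] = node |> Atom.to_string() |> String.split("@", parts: 2)
    {name, host}
  end

  # True when two node names share the part before the @, which is
  # what the lookup step seemed to match on in the 3-node experiment.
  def same_short_name?(a, b) do
    elem(split(a), 0) == elem(split(b), 0)
  end
end

requested = :"node2@127.0.0.1"      # the name passed to Node.connect/1
actual = :"node2@192.168.1.186"     # the name node2 was started with

# The short names match, so the right node is located...
IO.inspect(NodeNameSketch.same_short_name?(requested, actual))

# ...but the full names are different atoms, which is the kind of
# mismatch the handshake seems to reject.
IO.inspect(requested == actual)
```

The first check prints true and the second prints false, which lines up with seeing trace activity on the correct node while `Node.connect/1` still returns false.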

samfrench · August 6, 2021, 1:08pm

I have exactly the same question. Do you have a link to the forked version which solved it for you?

dantswain · August 6, 2021, 10:37pm

I’ll try to put something together. I was working with a private repo so it isn’t trivial to extract. It might take a couple days.