Node down even when app is running

tfwright · August 13, 2022, 5:32pm

I have a mysterious problem running my app on multiple nodes I hope someone can advise on.

A while back I set up my deployment script to build 2 separate releases of my app two serve behind nginx for load balancing/zero down time deployment purposes (with RELEASE_NAME set to something like “my_app” and “my_app2”). I am managing each app process as a systemd service. This all works great–after a deployment I can tail the logs of each server and observe them both serving requests, and use systemd to stop 1 app instance and observe that nginx is still responsive.

The problem is that when I go to connect to my_app using the remote command I get the error Could not contact remote node app@127.0.0.1, reason: :nodedown. Aborting... Connecting to my_app2 works fine.

I can fix this issue by running service my_app restart. After that, I can connect to both instances.

However, if I run service my_app2 restart the app node again appears to be down.

I am using libcluster to manage the nodes using the following config:

    [
      my_app: [
        strategy: Cluster.Strategy.LocalEpmd
      ]
    ]

my_app env.sh:

export RELEASE_DISTRIBUTION=name
export RELEASE_NODE=my_app@127.0.0.

my_app2 env.sh:

export RELEASE_DISTRIBUTION=name
export RELEASE_NODE=my_app2@127.0.0.

Thanks in advance for any clues or suggestions of things to try!

kokolegorille · August 13, 2022, 7:13pm

No RELEASE_COOKIE?

You should have one shared by all the nodes…

And provide it as well when using remote.

tfwright · August 13, 2022, 9:21pm

I set the RELEASE_COOKIE sys var when I built the releases and it appears to be set properly:

iex(my_app@127.0.0.1)1> (System.get_env()
...(my_app@127.0.0.1)1> |> Enum.map(fn {k, v} -> "#{k}=#{v}" end)
...(my_app@127.0.0.1)1> |> Enum.filter(&String.starts_with?(&1, "RELEASE_")))
["RELEASE_BOOT_SCRIPT_CLEAN=start_clean",
 "RELEASE_ROOT=/home/my_app/apps/my_app/releases/20220812151723/api_v2/_build/prod/rel/my_app",
 "RELEASE_SYS_CONFIG=/home/deployer/apps/my_app/releases/20220812151723/api_v2/_build/prod/rel/my_app/releases/0.1.0/sys",
 "RELEASE_VSN=0.1.0", "RELEASE_DISTRIBUTION=name",
 "RELEASE_COOKIE=[redacted]",
 "RELEASE_VM_ARGS=/home/my_app/apps/my_app/releases/20220812151723/api_v2/_build/prod/rel/my_app/releases/0.1.0/vm.args",
 "RELEASE_BOOT_SCRIPT=start",
 "RELEASE_TMP=/home/my_app/apps/my_app/releases/20220812151723/api_v2/_build/prod/rel/my_app/tmp",
 "RELEASE_COMMAND=start", "RELEASE_MODE=embedded", "RELEASE_NAME=my_app",
 "RELEASE_NODE=my_app@127.0.0.1"]

Note: the output is exactly the same in both nodes, which I can connect to after the restart I mentioned above, aside from the node name/paths.

After restarting the first instance, I can see both nodes:

sudo service  my_app restart
erts-11.1.8/bin/epmd -names
epmd: up and running on port X with data:
name my_app at port Y
name my_app2 at port Z

But after restarting the second instance, the first is gone?

sudo service my_app2 restart
erts-11.1.8/bin/epmd -names
epmd: up and running on port X with data:
name my_app2 at port Z

tfwright · May 19, 2025, 6:03pm

Forgot to update this, but I eventually discovered this was due to Quantum trying to run jobs on a connected node that itself is not running Quantum (see How to setup quantum against libcluster · Issue #485 · quantum-elixir/quantum-core · GitHub). Switching to the Local run strategy resolved the issue.