Why does my release not announce itself to epmd?

I have a release produced by the new Mix release task in Elixir 1.9. The shell script generated for the release can start the service, but none of the other functions that depend on connecting to the running node, such as restart, stop, or pid, work. All fail with --rpc-eval : RPC failed with reason :nodedown. When I query epmd with epmd -names, it shows no registered nodes.

I can see that the service was started with -sname my_app… so why would epmd not have this name listed?


Which system is this running under? Windows? Linux? Something else?

How did you find out that the service was started with -sname my_app?

It’s running on Linux – Ubuntu Server 18.04. I found that the -sname option was used by inspecting the command line arguments like this: cat /proc/$(pidof beam.smp)/cmdline | tr '\000' ' ' && echo.

I have restarted epmd and the service and now things are working properly, but the question remains: how did we get into this broken state?

Hi, I’m getting this :nodedown error, too.

My app was working on Google Compute Engine with Ubuntu 16.04.

I’m switching to DigitalOcean with Ubuntu 18.04 and seeing this :nodedown error.

Have you learned anything since this post?
Thanks,
David

We encounter this regularly in our prod and qa environments. I’m convinced it’s a bug, though I haven’t had the time to pin down the exact cause.

It’s lazy, but we avoid the problem by changing the deployment procedure:

  1. scp myapp_0.1.1.tar.gz server:~/deploy_dir/
  2. ssh server
  3. cd ~/deploy_dir
  4. ./bin/myapp stop
  5. tar xzf myapp_0.1.1.tar.gz
  6. ./bin/myapp daemon

So, in a nutshell: stop the running version first, unpack the new version, then start the new version. We don’t depend on hot code loading, so this is considered good enough. :unamused:
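The procedure above can be sketched as a small script. Everything here is a placeholder (myapp, the tarball name, the stub release script), and a throwaway sandbox directory stands in for ~/deploy_dir on the server:

```shell
#!/usr/bin/env sh
set -eu

# Throwaway sandbox standing in for ~/deploy_dir on the server.
DEPLOY_DIR="$(mktemp -d)"
cd "$DEPLOY_DIR"

# Stub release script standing in for the real bin/myapp.
mkdir -p bin
printf '#!/bin/sh\necho "myapp $1"\n' > bin/myapp
chmod +x bin/myapp

# Fake tarball standing in for myapp_0.1.1.tar.gz.
mkdir -p src
echo "new version" > src/VERSION
tar czf myapp_0.1.1.tar.gz -C src VERSION

# The procedure from the post: stop, unpack, start.
./bin/myapp stop || true   # ignore failure if nothing is running
tar xzf myapp_0.1.1.tar.gz
./bin/myapp daemon
cat VERSION
```

Running the real thing is the same three commands; the stub only exists so the sequence can be exercised without a server.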


I am seeing this exact same behavior. If I run the release locally, I am able to start/stop the app. epmd reports the same name both locally and on my production Debian server, but on the Debian server I am only able to start the app; I get the :nodedown error when I try to stop it. Any ideas on how to troubleshoot this?

I am also experiencing the same problem on a Debian server on GCP. The solution offered by @barndon above does not work for me, since trying to stop the server first fails and the CI exits with a failure.

I got a nice explanation of why this might be happening on this blog.

Summary: it is caused by a cookie mismatch.

Solution (straight from the blog):
The first thing we need to do is ensure that the same cookie is used every time we start our release. Fortunately, this can easily be done by using the RELEASE_COOKIE environment variable or by putting the cookie in our release configuration in mix.exs:

def project do
  [
    app: :app_name,
    ...
    releases: [
      app_name: [
        cookie: "<YOUR COOKIE>",
        steps: [:assemble, :tar]
      ]
    ]
  ]
end

The documentation recommends using a long, randomly generated string for your cookie, which can be produced with the following code:

Base.url_encode64(:crypto.strong_rand_bytes(40))
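If Elixir isn’t available on the target machine, an equivalent cookie can be generated in the shell and exported via the RELEASE_COOKIE environment variable mentioned above. This is my own stand-in for the Elixir one-liner, using openssl for the same 40 random bytes in URL-safe base64:

```shell
# 40 random bytes, base64-encoded; tr maps the result onto the URL-safe
# alphabet so it matches what Base.url_encode64/1 would produce.
RELEASE_COOKIE="$(openssl rand -base64 40 | tr -d '\n' | tr '+/' '-_')"
export RELEASE_COOKIE

# 40 bytes encode to 56 base64 characters (including '=' padding).
echo "${#RELEASE_COOKIE}"
```

Exporting RELEASE_COOKIE before every start keeps the running node and the rpc/stop/pid commands agreeing on the cookie.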

I don’t want to necro a thread, but this IS the number one Google search result for this problem (at least for me), and I feel this solution could help others.

For us the solution was the following: we were deriving the IP for the long name using $(hostname -I) in our env.sh.eex file. For some reason this hostname command returned the IP address with a single trailing space, which confused most of the commands (rpc, stop, pid, remote, etc.).

Trimming the whitespace when building the node name fixed the issue: instead of node@127.0.0.1  <- space here, we got node@127.0.0.1, and the scripts started working as expected.

It turns out this whitespace was stupidly difficult to actually see. You can test whether you have this problem by adding echo "$RELEASE_NODE|" to the bootstrapping script under one of the commands (rpc, pid, or whatever); if you’re suffering from trailing whitespace, you’ll see the space before the pipe: node@127.0.0.1 |
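A minimal reproduction of the trim, assuming a POSIX shell (ip_raw simulates the hostname -I output with its trailing space; the variable names are mine):

```shell
#!/usr/bin/env sh
# Simulated `hostname -I` output: the IP comes back with a trailing space.
ip_raw="127.0.0.1 "

# The naive node name keeps the space and breaks rpc/stop/pid/remote.
bad_node="node@${ip_raw}"

# Strip all whitespace before building RELEASE_NODE, as described above.
ip="$(printf '%s' "$ip_raw" | tr -d '[:space:]')"
RELEASE_NODE="node@${ip}"
export RELEASE_NODE

# The pipe trick from the post makes the stray space visible.
echo "bad:  ${bad_node}|"
echo "good: ${RELEASE_NODE}|"
```

The same tr -d '[:space:]' pipe dropped into env.sh.eex is what fixed it for us.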
