Erlang binary not responding

Greetings.
I’m having a weird issue with Distillery deployments and I’m wondering if anyone has encountered a similar problem.
I have a script that builds a release for a given environment, and it works perfectly on my development machine (Arch Linux).
I use one server for a GitLab runner whose CI pipeline runs the tests, builds a release, pings it, then stops it (it says ok).
It then deploys the release to a “review” server.
Both servers run Ubuntu 16, and both releases work and can be pinged.

What happens is: if I try stopping the review release, Erlang keeps running and the binary becomes completely unresponsive (ping, start, stop, console, …).
Similarly, if I try starting the release on the build server, nothing happens, even though it was pinged successfully before.
Even if I kill the process (found with ps aux), nothing changes.

Seemingly, the binary stops responding once it has been run. This does not happen on the development machine, where starting, stopping and pinging all work correctly.

Here’s my mix.exs:

defmodule Windfarm.Mixfile do
  use Mix.Project

  def project do
    [app: :windfarm,
     version: "0.0.1",
     elixir: "~> 1.5",
     elixirc_paths: elixirc_paths(Mix.env),
     compilers: [:phoenix, :gettext] ++ Mix.compilers,
     build_embedded: Mix.env == :prod,
     start_permanent: Mix.env == :prod,
     aliases: aliases(),
     deps: deps()]
  end

  def application do
    [mod: {Windfarm, []},
     applications: [:phoenix, :phoenix_html, :cowboy, :logger, :gettext,
                    :phoenix_ecto, :postgrex, :httpoison, :gen_mqtt]]
  end

  defp elixirc_paths(:test), do: ["lib", "web", "test/support"]
  defp elixirc_paths(_),     do: ["lib", "web"]

  defp deps do
    [
      {:phoenix, "~> 1.3"},
      {:postgrex, ">= 0.0.0"},
      {:phoenix_ecto, "~> 3.2"},
      {:phoenix_html, "~> 2.10"},
      {:phoenix_live_reload, "~> 1.1", only: :dev},
      {:gettext, "~> 0.9"},
      {:cowboy, "~> 1.1"},
      {:gen_mqtt, "~> 0.3.1"},
      {:httpoison, "~> 0.10.0"},
      {:distillery, "~> 1.4", runtime: false}
    ]
  end

  defp aliases do
    ["ecto.setup": ["ecto.create", "ecto.migrate", "run priv/repo/seeds.exs"],
     "ecto.reset": ["ecto.drop", "ecto.setup"]]
  end
end

This is my Distillery config:

Path.join(["rel", "plugins", "*.exs"])
|> Path.wildcard()
|> Enum.map(&Code.eval_file(&1))

use Mix.Releases.Config,
    default_release: :default,
    default_environment: Mix.env()

environment :review do
  set include_erts: true
  set cookie: :"cookie"
  set vm_args: "rel/vm.args"
end

release :windfarm do
  set version: current_version(:windfarm)
  set applications: [
    :runtime_tools
  ]
end

This is my vm.args:

-name ${NODE_NAME}@127.0.0.1
-setcookie cookie

Here the cookies are redacted and NODE_NAME is set in the environment.
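
For reference, the release is launched roughly like this (release name windfarm from the config above; in Distillery, REPLACE_OS_VARS=true is what enables the ${NODE_NAME} substitution in vm.args):

export NODE_NAME=windfarm    # consumed by the ${NODE_NAME} placeholder in vm.args
export REPLACE_OS_VARS=true  # ask Distillery to substitute OS vars at boot
bin/windfarm start
bin/windfarm ping            # should answer if the node came up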

To me this implies that you have code in your application module’s stop or pre_stop callback which is blocking forever, or perhaps some path with timeouts/shutdown set to :infinity with the same problem. Does the bin/myapp stop command block indefinitely? Or does it return right away while the app stays up? If the latter, can you connect a remote shell to it afterwards with bin/myapp remote_console?
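
Something along these lines, assuming your release is named windfarm:

bin/windfarm stop            # does this hang here, or return immediately?
bin/windfarm remote_console  # if the app is still up, can you attach a shell?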

stop doesn’t seem to do anything: no ok, nothing; it returns instantly. After that, the app stops responding to every other command. I can only kill the process then.

If you send SIGUSR1 to the Erlang OS process, does it generate a crash dump? If yes, you could open it with the crashdump viewer (part of the observer application) and try to find what’s stuck.

Yeah, I would try the approach recommended by @dom: kill -s USR1 pid or kill -s SIGUSR1 pid, depending on your platform. Can you also clarify what the exit status of the bin/myapp stop command is? You can check with echo "$?" right after running the command. This will tell us whether the boot script is failing, or whether it is succeeding and the runtime is locking up. It could be a combination of both, but right now it’s hard to see what might be going wrong.
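
For example, again assuming the release name windfarm:

bin/windfarm stop
echo "$?"           # 0 means the boot script succeeded; anything else means it failed

kill -s USR1 <pid>  # <pid> taken from ps aux; ERTS should then write an erl_crash.dump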

With the process running, stop returns 1 as its exit status.
This is also true for start and ping once the process is killed, i.e. it cannot be started anymore, not until the release directory is deleted and recreated (from the tar.gz-ed release).
So it cannot be stopped with stop and cannot be started again. If you try pinging it, it prints nothing and returns 1.

Killing the process with the SIGUSR1 signal doesn’t seem to create a crash dump, at least not anywhere in the release directory.
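
Worth noting: the dump may land outside the release directory, since ERTS writes it to the node’s working directory by default and the ERL_CRASH_DUMP variable can redirect it (windfarm again being the release name):

find / -name 'erl_crash.dump' 2>/dev/null             # look for a dump written elsewhere

ERL_CRASH_DUMP=/tmp/windfarm.dump bin/windfarm start  # pin the dump location explicitly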

I was working with @svetigrah on this issue.

The problem was this line:
bldred=${txtbld}$(tput setaf 1) # Red

It can be found in the bin/myapp script.

In our server environment, the $TERM environment variable was set to vt220, which caused tput setaf 1 to fail with exit status 1.
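
The failure is easy to reproduce outside the boot script, and guarding the substitution is one possible local workaround (a sketch, not the upstream fix):

TERM=vt220 tput setaf 1
echo "$?"   # non-zero: vt220 has no color capability

# One way to make the assignment safe regardless of $TERM:
bldred=${txtbld}$(tput setaf 1 2>/dev/null || true)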

This has been addressed on master recently, but has not yet been included in a release, just FYI.
