Exrm upgrade kills worker

I have been testing the hot code reloading functionality of the Erlang VM, using Elixir and making releases with exrm.

Here is the application module

defmodule RelTest do
  use Application

  # See http://elixir-lang.org/docs/stable/elixir/Application.html
  # for more information on OTP Applications
  def start(_type, _args) do
    import Supervisor.Spec, warn: false

    port = Application.get_env(:APP_NAME, :listen_port, 9000)
    {:ok, socket} = :gen_tcp.listen(port, [:binary, active: false, reuseaddr: true])

    # Define workers and child supervisors to be supervised
    children = [
      # Starts a worker by calling: RelTest.Worker.start_link(arg1, arg2, arg3)
      # worker(RelTest.Worker, [arg1, arg2, arg3]),
      worker(Task, [fn -> TestListener.start(socket) end])
    ]

    # See http://elixir-lang.org/docs/stable/elixir/Supervisor.html
    # for other strategies and supported options
    opts = [strategy: :one_for_one, name: RelTest.Supervisor]
    Supervisor.start_link(children, opts)
  end
end

And here is the worker

defmodule TestListener do
  require Logger

  def start(socket) do
    {:ok, client} = :gen_tcp.accept(socket)
    Logger.info "A client connected"
    Task.async(fn -> loop(client) end)
    start(socket)
  end

  def loop(socket) do
    case :gen_tcp.recv(socket, 0) do
      {:ok, _} ->
        say_hello(socket)
        Logger.info "Said hello to client ;)"
        loop(socket)
      {:error, _} ->
        Logger.info "Oops, client had error :("
        :gen_tcp.close(socket)
    end
  end

  def say_hello(socket) do
    :ok = :gen_tcp.send(socket, <<"Hey there!\n">>)
  end

end

This is version 0.1.0. So I run these:

MIX_ENV=prod mix compile
MIX_ENV=prod mix release

and I get a nice release. I run it with ./rel/rel_test/bin/rel_test console and everything works. Now I'm going to bump the code and version, so here is version 0.1.1 of the listener:

defmodule TestListener do
  require Logger

  def start(socket) do
    {:ok, client} = :gen_tcp.accept(socket)
    Logger.info "A client connected"
    Task.async(fn -> loop(client) end)
    start(socket)
  end

  def loop(socket) do
    case :gen_tcp.recv(socket, 0) do
      {:ok, _} ->
        say_hello(socket)
        Logger.info "Said hello to client ;)"
        loop(socket)
      {:error, _} ->
        Logger.info "Oops, client had error :("
        :gen_tcp.close(socket)
    end
  end

  def say_hello(socket) do
    :ok = :gen_tcp.send(socket, <<"Hey there, next version!\n">>)
  end

end

Now I run

MIX_ENV=prod mix compile
MIX_ENV=prod mix release

and the appup is created successfully. Then, to do the hot upgrade:

./rel/rel_test/bin/rel_test upgrade "0.1.1"

and the upgrade works, but it kills my listener after the upgrade.

I tested with nc localhost 9000 (9000 being the listener's port), staying connected and running the upgrade command. The connection gets killed and I get a message in the console:

=SUPERVISOR REPORT==== 31-Aug-2016::23:40:09 ===
     Supervisor: {local,'Elixir.RelTest.Supervisor'}
     Context:    child_terminated
     Reason:     killed
     Offender:   [{pid,<0.601.0>},
                  {id,'Elixir.Task'},
                  {mfargs,
                      {'Elixir.Task',start_link,
                          [#Fun<Elixir.RelTest.0.117418367>]}},
                  {restart_type,permanent},
                  {shutdown,5000},
                  {child_type,worker}]

So why does this happen? Is it something I'm missing, or is it the expected behavior? Is this not a use case for hot code reloading?

I have read LYSE, but the author says the running code should keep running; only the external calls made after the upgrade are served by the new version.

Maybe Elixir's releases are not supposed to work like raw Erlang's, but in the end, both run on the BEAM, don't they?

Then why kill the worker?

P.S. Here is the SO question just in case.

I am by no means an expert on hot code reloading, but it may be related to the fact that you aren't using proper OTP children. A Task is not an OTP behaviour and is not a safe candidate for hot reloading. Your looping code isn't either.

@fishcakez would know more but it strikes me that this is relevant.

Likely your process is getting killed because the upgrade instructions specify a brutal purge for the post-purge: kill all processes still running old code once the new code has been loaded, then purge the old code. This creates a race condition where every process using that module must get onto the new code as soon as it is loaded - otherwise it is killed.

With a GenServer, or another OTP process, it is possible to use an update instruction that suspends all the GenServers currently running the module and resumes them after loading. This should mean that none of the GenServers are running the old code, so none get purged. Of course, any new GenServers started during the upgrade may get killed.
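
To illustrate, here is a minimal sketch (an assumption, not code from the original post) of what the listener could look like as a GenServer, so that an `{update, Mod, {advanced, Extra}}` appup instruction can suspend it, run `code_change/3`, and resume it on the new code:

```elixir
defmodule TestListener do
  use GenServer
  require Logger

  # Hypothetical supervised entry point replacing the bare Task.
  def start_link(socket) do
    GenServer.start_link(__MODULE__, socket, name: __MODULE__)
  end

  @impl true
  def init(socket) do
    {:ok, %{socket: socket}}
  end

  # Called between suspend and resume during a hot upgrade; migrate
  # the state here if its shape changed between versions.
  @impl true
  def code_change(_old_vsn, state, _extra) do
    {:ok, state}
  end
end
```

The matching appup entry would then be an update instruction such as `{update, 'Elixir.TestListener', {advanced, []}}` rather than a plain load_module.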

However, when not doing an update with an OTP callback module, the post-purge is likely to kill any processes running long-lived functions. It is possible to do a soft purge instead, which only purges the old code once processes stop using it, rather than immediately killing all processes using it and then purging. Also, an update doesn't occur, just a load_module. The safest way to use this is to keep the external contract the same (keep the arguments and return values the same types) and use a soft_purge for the post-purge if possible. This is exactly the type of loading you will do when using Phoenix/Plug on the non-OTP HTTP processes.

In the above example the functions do not have an opportunity to use a new version of the code, so they require a soft_purge. However, a second appup will kill the processes, because a third version of the same module cannot be loaded - the process calling :gen_tcp.accept will still be using the initial version of the module. This can be avoided by calling __MODULE__.start(socket) and __MODULE__.loop(socket), since external (fully qualified) calls always use the latest loaded version of the code, whereas a local start(socket) call stays on the current version and never picks up a newer one. The contract remains the same, as the argument is always a socket and the return value is eventually :ok/ignored.
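
Applied to the listener above, the change is just the recursive calls - a sketch of the module with fully qualified self-calls:

```elixir
defmodule TestListener do
  require Logger

  def start(socket) do
    {:ok, client} = :gen_tcp.accept(socket)
    Logger.info "A client connected"
    Task.async(fn -> TestListener.loop(client) end)
    # Fully qualified call: after a load_module this jumps to the
    # newest loaded version instead of staying on the old one.
    __MODULE__.start(socket)
  end

  def loop(socket) do
    case :gen_tcp.recv(socket, 0) do
      {:ok, _} ->
        say_hello(socket)
        Logger.info "Said hello to client ;)"
        __MODULE__.loop(socket)
      {:error, _} ->
        Logger.info "Oops, client had error :("
        :gen_tcp.close(socket)
    end
  end

  def say_hello(socket) do
    :ok = :gen_tcp.send(socket, "Hey there, next version!\n")
  end
end
```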

After applying that simple change, the appup would need an entry like:

{load_module, 'Elixir.TestListener', brutal_purge, soft_purge, []}
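
For context, a complete .appup file for this upgrade might look roughly like the following (the version strings are taken from the post; the exact file exrm generates may differ):

```erlang
%% rel_test.appup - upgrade and downgrade instructions between versions.
{"0.1.1",
 %% Upgrading from 0.1.0: load the new TestListener, soft-purging
 %% afterwards so processes still on old code are not killed.
 [{"0.1.0",
   [{load_module, 'Elixir.TestListener', brutal_purge, soft_purge, []}]}],
 %% Downgrading back to 0.1.0 uses the same kind of instruction.
 [{"0.1.0",
   [{load_module, 'Elixir.TestListener', brutal_purge, soft_purge, []}]}]}.
```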

Note that it seems like you want Task.start_link and not Task.async, as you never await the result. Otherwise the acceptor's mailbox will grow, because an :ok and a :DOWN message are sent for every connection and never received. Also, it is common practice to have one module per process, as different module loading schemes might be needed for different processes. For example, the recv/send processes may become a GenServer and require an update.
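
A minimal sketch of that change in the acceptor (assuming everything else stays as in the post):

```elixir
def start(socket) do
  {:ok, client} = :gen_tcp.accept(socket)
  Logger.info "A client connected"
  # Task.start_link/1 fires and forgets: unlike Task.async/1, no reply
  # or :DOWN message is delivered to the acceptor, so its mailbox
  # cannot grow unbounded as connections come and go.
  Task.start_link(fn -> loop(client) end)
  start(socket)
end
```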

Pretty nice explanation. Thanks!

Just to be sure that I understood, here is the simple version:

exrm detects OTP behaviours when generating appups and produces an appup that calls the code_change callback (via an update instruction, or similar).
But when there are no OTP behaviours, it just does a brutal_purge and kills all processes running old code. To prevent this, one has to edit the appup, change brutal_purge to soft_purge, and make sure the code uses external calls, i.e. __MODULE__.fun().

Yes, but to clarify: the soft_purge is for the post-purge (there is also a pre-purge), where you may still require a brutal_purge to ensure no old code is running. Also, it is only the main loop function of a process where the external call might be needed; in most situations it is not a requirement.