Supervised GenServer process disappeared and wasn't restarted

Greetings!

Tonight we had an incident in production where a somewhat important GenServer process seemingly disappeared and was not restarted by its supervisor. This is the first such incident in the 2.5 years since this worker was introduced.

The GenServer in question has basically the following structure:

defmodule MyWorker do
  use GenServer, restart: :transient

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl GenServer
  def init(opts) do
    Process.send_after(self(), :do_work, 500)

    {:ok, opts}
  end

  @impl GenServer
  def handle_info(:do_work, state) do
    do_some_work(state)
    Process.send_after(self(), :do_work, 500)

    {:noreply, state}
  end

  defp do_some_work(state) do
    # query DB
    # send HTTP request
    # that's all - no OTP messaging, exits, etc.
  end
end

This GenServer is started as part of a Supervisor (let’s call it MySupervisor). MySupervisor uses the :one_for_one strategy; all other init options are left at their defaults.
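For completeness, the supervisor side looks roughly like this (a minimal sketch; the module body and child list are simplified, only MyWorker is shown):

defmodule MySupervisor do
  use Supervisor

  def start_link(opts) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl Supervisor
  def init(_opts) do
    children = [
      MyWorker
      # ...other children omitted...
    ]

    # :one_for_one with the default :max_restarts (3) and :max_seconds (5)
    Supervisor.init(children, strategy: :one_for_one)
  end
end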

When I connected to the remote console, I found that the process was not alive: Process.whereis(MyWorker) returned nil, and Supervisor.which_children(MySupervisor) returned the following entry for MyWorker:

{MyWorker, :undefined, :worker, [MyWorker]}

As far as I understand, this means MyWorker’s child spec was known to MySupervisor, but the worker process itself was not running. I restarted MyWorker with Supervisor.restart_child/2 and started figuring out what went wrong.
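In code, the checks above were roughly (return values abbreviated):

Process.whereis(MyWorker)
#=> nil

Supervisor.which_children(MySupervisor)
#=> [..., {MyWorker, :undefined, :worker, [MyWorker]}, ...]

# :undefined in the pid position: the child spec is known,
# but no process is currently running for it
Supervisor.restart_child(MySupervisor, MyWorker)
#=> {:ok, #PID<...>}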

The obvious culprit is restart: :transient - since MyWorker was intended to run indefinitely and never terminate successfully, it should have been :permanent. If we assume MyWorker terminated normally, that explains why it wasn’t restarted.

The problem is figuring out why MyWorker died, because it neither stops itself nor is explicitly stopped by any other app code. I have two hypotheses left:

  • some developer connected to the remote console and stopped the worker by hand (very unlikely);
  • the worker terminated with an error and was not restarted due to a bug in Supervisor (extremely unlikely).

Am I missing anything else?

1 Like

Hi 🙂 Welcome back!

Is there exit(:normal) or exit({:shutdown, some_value}) somewhere in your code that could have been called by that process?
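For illustration, something like the following (purely hypothetical) buried anywhere in the worker’s call path would be enough for a :transient child to stop quietly and never come back:

# Hypothetical: an exit with a "successful" reason inside a callback makes the
# GenServer terminate with that reason; for a :transient child the supervisor
# does not restart it, and no crash report is logged.
def handle_info(:do_work, state) do
  do_some_work(state)
  Process.send_after(self(), :do_work, 500)
  {:noreply, state}
end

defp do_some_work(_state) do
  exit(:normal)   # or exit({:shutdown, :whatever})
end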

Or maybe it was linked to another process that exits with shutdown? I don’t remember how GenServer behaves in that case.

2 Likes

Thanks for the reply, I think you nailed the two most plausible cases that I hadn’t thought of!

No, there are no explicit exits that can be called from this process (unless I missed something). I also doubt that Ecto or HTTPoison have such exits in their codebases that could leak into user code.

As for how GenServer behaves in that case, I think this part of the Process.link/1 docs is relevant:

When two processes are linked, each one receives exit signals from the other (see also exit/2). Let’s assume pid1 and pid2 are linked. If pid2 exits with a reason other than :normal (which is also the exit reason used when a process finishes its job) and pid1 is not trapping exits (see flag/2), then pid1 will exit with the same reason as pid2 and in turn emit an exit signal to all its other linked processes. The behavior when pid1 is trapping exits is described in exit/2.

So if my understanding is correct, if MyWorker was linked to another process, and that process exited with a :shutdown (or {:shutdown, value}), then it would indeed cause MyWorker to terminate normally as well. Unfortunately, MyWorker isn’t linked to any other process.
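Just to make the mechanism concrete for myself (a hypothetical sketch, since MyWorker doesn’t link to anything):

# Hypothetical: the worker links to a process that exits with :shutdown.
# The worker does not trap exits, so it terminates with reason :shutdown too,
# and for a :transient child the supervisor treats that as a successful
# termination and does not restart it.
def handle_info(:link_to_something, state) do
  spawn_link(fn -> exit(:shutdown) end)
  {:noreply, state}
end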

2 Likes

There are two things to take into consideration as well:

① Restart intensity, specifically :max_restarts (the maximum number of restarts allowed in a time frame, defaults to 3) and :max_seconds (the time frame in which :max_restarts applies, defaults to 5). If something went south and there were 4 failed restart attempts within the 5-second window (the defaults), the GenServer would end up not restarted. This is likely your case if, e.g., the DB connection dropped for 3 seconds: the culprit would be the 500 ms interval in init/1, which lets 4 consecutive unsuccessful attempts to execute do_some_work/1 fit into 5 seconds and shut the process down for good (see the sketch after point ②).

② Mailbox overflow. This technically might be your case, but it’s extremely unlikely; just mentioning it for the complete picture.
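To make ① concrete: both limits are plain Supervisor.init/2 options, and the values below are the documented defaults (a sketch, not your actual supervisor):

children = [MyWorker]

Supervisor.init(children,
  strategy: :one_for_one,
  max_restarts: 3, # default: give up after more than 3 restarts...
  max_seconds: 5   # ...within a 5-second window
)

As for ②, the mailbox length of a live process can be checked with Process.info(pid, :message_queue_len).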

6 Likes

I was under the impression that if a supervised process restarts more than :max_restarts times within a :max_seconds period, this leads to the restart of the whole supervisor (and not to the supervised process simply not being restarted). Is that not the case? I didn’t find specific mentions of this behaviour in Elixir’s Supervisor docs, but the Erlang docs say (emphasis mine):

Restart intensity and period

To prevent a supervisor from getting into an infinite loop of child process terminations and restarts, a maximum restart intensity is defined using two integer values specified with keys intensity and period in the above map. Assuming the values MaxR for intensity and MaxT for period, then, if more than MaxR restarts occur within MaxT seconds, the supervisor terminates all child processes and then itself. The termination reason for the supervisor itself in that case will be shutdown. intensity defaults to 1 and period defaults to 5.

1 Like

I have no access to iex atm, but it could be easily validated with 10 LoCs.

Now that you mention it, I recall the same experience: the supervisor terminates, but then it comes up to the Application, and I doubt the Application would terminate itself as well. This is a dark spot anyway, because too much internal VM machinery is involved, and God only knows how the 3 and 5 are counted in the real world.

I just wanted to mention it because, in my experience, if you don’t really expect an infinite loop there, it’s safer to increase the :max_restarts parameter and forget about this potential issue.

2 Likes

The app will terminate if the root pid exits, and it does so immediately on the first exit, with no restarts. And if the app is :permanent, that will make the whole VM stop.

6 Likes

I have no access to iex atm, but it could be easily validated with 10 LoCs.

Good call! It seems that exceeding :max_restarts restarts within a :max_seconds window does indeed lead to the restart of the supervisor:

defmodule TestSup.GoodWorker do
  use GenServer

  require Logger

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  def init(opts) do
    Logger.info("#{inspect(__MODULE__)}::#{inspect(__ENV__.function)}")

    {:ok, opts}
  end
end

defmodule TestSup.BadWorker do
  use GenServer, restart: :transient

  require Logger

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  def init(opts) do
    Logger.info("#{inspect(__MODULE__)}::#{inspect(__ENV__.function)}")

    Process.send_after(self(), :do_work, 500)
    {:ok, opts}
  end

  def handle_info(:do_work, state) do
    Logger.warning("#{inspect(__MODULE__)}::#{inspect(__ENV__.function)}: about to die")

    raise "error"

    {:noreply, state}
  end
end

defmodule TestSup.Supervisor do
  use Supervisor

  require Logger

  def start_link(opts), do: Supervisor.start_link(__MODULE__, opts, name: __MODULE__)

  def init(_init_arg) do
    Logger.info("#{inspect(__MODULE__)}::#{inspect(__ENV__.function)}")

    children = [
      TestSup.GoodWorker,
      TestSup.BadWorker
    ]

    Supervisor.init(children, strategy: :one_for_one, max_restarts: 3, max_seconds: 5)
  end
end

defmodule TestSup.Application do
  use Application

  require Logger

  def start(_type, _args) do
    Logger.info("#{inspect(__MODULE__)}::#{inspect(__ENV__.function)}")

    children = [
      TestSup.Supervisor
    ]

    # See https://hexdocs.pm/elixir/Supervisor.html
    # for other strategies and supported options
    opts = [strategy: :one_for_one, name: TestSup.AppSupervisor]
    Supervisor.start_link(children, opts)
  end
end

Running this code leads to an endless stream of errors and log messages from GenServer and Supervisor restarts:

$ mix run --no-halt
Compiling 1 file (.ex)
Generated test_sup app

20:29:51.683 [info] TestSup.Application::{:start, 2}

20:29:51.685 [info] TestSup.Supervisor::{:init, 1}

20:29:51.685 [info] TestSup.GoodWorker::{:init, 1}

20:29:51.685 [info] TestSup.BadWorker::{:init, 1}

20:29:52.186 [warning] TestSup.BadWorker::{:handle_info, 2}: about to die

20:29:52.188 [error] GenServer TestSup.BadWorker terminating
** (RuntimeError) error
    (test_sup 0.1.0) lib/test_sup/application.ex:32: TestSup.BadWorker.handle_info/2
    (stdlib 6.2.2.2) gen_server.erl:2345: :gen_server.try_handle_info/3
    (stdlib 6.2.2.2) gen_server.erl:2433: :gen_server.handle_msg/6
    (stdlib 6.2.2.2) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
Last message: :do_work
State: []

20:29:52.192 [info] TestSup.BadWorker::{:init, 1}

20:29:52.693 [warning] TestSup.BadWorker::{:handle_info, 2}: about to die

20:29:52.693 [error] GenServer TestSup.BadWorker terminating
** (RuntimeError) error
    (test_sup 0.1.0) lib/test_sup/application.ex:32: TestSup.BadWorker.handle_info/2
    (stdlib 6.2.2.2) gen_server.erl:2345: :gen_server.try_handle_info/3
    (stdlib 6.2.2.2) gen_server.erl:2433: :gen_server.handle_msg/6
    (stdlib 6.2.2.2) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
Last message: :do_work
State: []

20:29:52.693 [info] TestSup.BadWorker::{:init, 1}

20:29:53.194 [warning] TestSup.BadWorker::{:handle_info, 2}: about to die

20:29:53.194 [error] GenServer TestSup.BadWorker terminating
** (RuntimeError) error
    (test_sup 0.1.0) lib/test_sup/application.ex:32: TestSup.BadWorker.handle_info/2
    (stdlib 6.2.2.2) gen_server.erl:2345: :gen_server.try_handle_info/3
    (stdlib 6.2.2.2) gen_server.erl:2433: :gen_server.handle_msg/6
    (stdlib 6.2.2.2) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
Last message: :do_work
State: []

20:29:53.194 [info] TestSup.BadWorker::{:init, 1}

20:29:53.695 [warning] TestSup.BadWorker::{:handle_info, 2}: about to die

20:29:53.695 [error] GenServer TestSup.BadWorker terminating
** (RuntimeError) error
    (test_sup 0.1.0) lib/test_sup/application.ex:32: TestSup.BadWorker.handle_info/2
    (stdlib 6.2.2.2) gen_server.erl:2345: :gen_server.try_handle_info/3
    (stdlib 6.2.2.2) gen_server.erl:2433: :gen_server.handle_msg/6
    (stdlib 6.2.2.2) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
Last message: :do_work
State: []

20:29:53.695 [info] TestSup.Supervisor::{:init, 1}

20:29:53.695 [info] TestSup.GoodWorker::{:init, 1}

20:29:53.695 [info] TestSup.BadWorker::{:init, 1}

20:29:54.196 [warning] TestSup.BadWorker::{:handle_info, 2}: about to die

20:29:54.196 [error] GenServer TestSup.BadWorker terminating
** (RuntimeError) error
    (test_sup 0.1.0) lib/test_sup/application.ex:32: TestSup.BadWorker.handle_info/2
    (stdlib 6.2.2.2) gen_server.erl:2345: :gen_server.try_handle_info/3
    (stdlib 6.2.2.2) gen_server.erl:2433: :gen_server.handle_msg/6
    (stdlib 6.2.2.2) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
Last message: :do_work
State: []

[...repeated...]

You can see that TestSup.Supervisor has been restarted.

If you add max_restarts: 1 to opts in TestSup.Application, you’ll see it will eventually lead to the shutdown of the application:

20:34:07.870 [notice] Application test_sup exited: shutdown

I just wanted to mention it because, in my experience, if you don’t really expect an infinite loop there, it’s safer to increase the :max_restarts parameter and forget about this potential issue.

Yeah, that’s solid advice which I will implement (combined with changing the :restart type of MyWorker to :permanent); I just wanted to get to the bottom of the issue, if possible.
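For the record, the worker-side change is just the restart type (a sketch):

defmodule MyWorker do
  # :permanent - restart the worker no matter how it terminated,
  # so even a "successful" exit (:normal / :shutdown) brings it back
  use GenServer, restart: :permanent

  # ...rest of the module unchanged...
end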

1 Like

Yes, the supervisor will exit if it has to restart a child too many times, and it will then be restarted by its own supervisor.

You can have a chain of Sup → Sup → Sup → Sup → Worker and all elements of the chain will behave like this.

A good way to think about this is that the code describing the supervision tree (each supervisor’s children, the children of each child supervisor, and so on) describes “how the app should look at runtime”. The system will try its best to maintain that state, or otherwise fail and shut down entirely.

Of course, that description can be altered at runtime by calling start_child/delete_child to add or remove parts of the tree, and by having :transient, :temporary or :significant children.

But there is no way it will skip some parts of the tree unless you tell it to. Max-restarts and the like only have a temporary impact on the state of the system; the app will always get back to the desired state.

So if my understanding is correct, if MyWorker was linked to another process, and that process exited with a :shutdown (or {:shutdown, value}), then it would indeed cause MyWorker to terminate normally as well. Unfortunately, MyWorker isn’t linked to any other process.

If the child spec for MyWorker has restart: :transient, then yes! (:shutdown is not exactly “normally”, but the supervisor will indeed not restart it).

Any progress on finding the cause?

  • Another possibility is that the child spec is overridden somewhere and :transient is replaced with :temporary (see the sketch after this list); but I guess this is unlikely.
  • Are there any calls to Supervisor.terminate_child or GenServer.stop in your codebase?
  • Are you using :significant with :any_significant or :all_significant?
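For the first bullet, a hypothetical override like this anywhere the child is declared would silently downgrade the restart behaviour:

children = [
  # Overrides the restart type from `use GenServer, restart: :transient`;
  # :temporary children are never restarted, whatever the exit reason.
  Supervisor.child_spec(MyWorker, restart: :temporary)
]

Supervisor.init(children, strategy: :one_for_one)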
3 Likes

Are there any logs available that could help identify why the process shut down?

I might be wrong, but I believe that if the process terminated multiple times in quick succession, the supervisor should have eventually shut down as well, which would have brought the entire VM down. However, that doesn’t seem to be the case here — the supervisor somehow stayed up but stopped restarting the process.

It’s also possible that someone accidentally connected to the VM and manually stopped the process.

1 Like

Unfortunately, no progress so far. Also, “no” to all of your points here.

For now we decided to change the restart type to :permanent and monitor this issue closely. I will follow up with more information if it arrives.

We didn’t find any suspicious entries in the logs (we know the approximate time the issue occurred from the application metrics). There were also no relevant errors reported in our Sentry instance in that time frame.

It’s possible, but very unlikely. There are maybe two people in our company other than me with the knowledge required to gracefully terminate a supervised process, and the issue happened in the middle of the night. We also asked our Kubernetes admins to check whether anyone accessed the pods in that time frame, just to be sure.