What's the pitfall of calling Task.Supervisor.start_link by hand?

The Task.Supervisor docs state:

You can also start it by calling start_link/1 directly … But this is recommended only for scripting and should be avoided in production code. Generally speaking, processes should always be started inside supervision trees.

However, this is not elaborated on.

In an application I’m writing, my supervision tree contains some number of instances of a single GenServer. This GenServer will spawn several tasks during its lifetime whose return values I care about; they are not fire-and-forget. As part of the GenServer’s state, I store the pid of a Task.Supervisor that I start in GenServer.init/1. In my mind, this makes sense, because if the Task.Supervisor crashes (due to too many task failures or what have you), my GenServer crashes and its supervisor restarts it.

If, however, I were to run the Task.Supervisor and the GenServer under the same supervision tree, I foresee race conditions where the GenServer is trying to spawn a task on a currently restarting Task.Supervisor (e.g. imagine that the Task.Supervisor has crashed, and before its supervisor restarts it, the GenServer attempts to spawn a task… this will surely fail). I can imagine this getting worse as more of the GenServers send tasks to this supervisor, as it will be even easier to trip the max_restarts of the Task.Supervisor.

Why do the docs say to prefer this approach?


Great question.

So, one of the reasons the docs mention that the supervised approach is preferred is shutdown. When a supervisor gets stopped (for example, by your application shutting down gracefully), it gives its children a shutdown period in which they can shut down gracefully. The shutdown follows the supervision tree, with the leaves shutting down first and then going up the tree.

If a GenServer under a supervisor starts a Task.Supervisor with start_link/1, then the main supervisor doesn’t know about the Task.Supervisor. When you shut down the main supervisor (for example, restarting your application for a deploy), it will shut down your GenServer gracefully, but the GenServer is not a supervisor, so it won’t wait for the Task.Supervisor to shut down gracefully. This means that the tasks under that supervisor might not shut down gracefully, which can be an issue.

The simplest option is to consider whether you really need one Task.Supervisor per GenServer. You might get around this by having a global Task.Supervisor (or a pool of them, maybe with PartitionSupervisor), and using Task.Supervisor.async/3 to spawn tasks that are still linked to the GenServer. This should keep the reciprocal crash semantics.
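Here is a minimal sketch of the shared-supervisor option. The names `MyApp.TaskSupervisor` and `Worker`, and the doubling "work", are made up for illustration and are not from the original post. The task is started under the shared supervisor, but because it is started with async it stays linked to the GenServer that asked for it.

```elixir
defmodule Worker do
  use GenServer

  # Hypothetical name of the shared, application-wide task supervisor.
  @task_sup MyApp.TaskSupervisor

  def start_link(opts), do: GenServer.start_link(__MODULE__, :ok, opts)

  def double(server, n), do: GenServer.call(server, {:double, n})

  @impl true
  def init(:ok), do: {:ok, %{pending: %{}}}

  @impl true
  def handle_call({:double, n}, from, state) do
    # The task is supervised by @task_sup but linked to this GenServer (the caller).
    task = Task.Supervisor.async(@task_sup, fn -> n * 2 end)
    {:noreply, %{state | pending: Map.put(state.pending, task.ref, from)}}
  end

  @impl true
  def handle_info({ref, result}, state) when is_reference(ref) do
    # The task replied: drop its monitor and answer the waiting caller.
    Process.demonitor(ref, [:flush])

    case Map.pop(state.pending, ref) do
      {nil, _pending} ->
        {:noreply, state}

      {from, pending} ->
        GenServer.reply(from, result)
        {:noreply, %{state | pending: pending}}
    end
  end
end

# Both processes live in the supervision tree, so both get a graceful shutdown.
children = [
  {Task.Supervisor, name: MyApp.TaskSupervisor},
  {Worker, name: Worker}
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)
```

With this in place, `Worker.double(Worker, 21)` returns `42`, and the GenServer stays responsive while the task runs because the reply is sent from `handle_info/2`.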

Alternatively, you can have a supervision structure for each GenServer that looks like this:

[ParentSupervisor (Supervisor)] ← uses strategy :rest_for_one
          /            \ 
[Task.Supervisor]  [GenServer]

With this setup and using :rest_for_one, you achieve this: if the Task.Supervisor crashes, then GenServer is stopped too (since it appears after the task supervisor in the ParentSupervisor), so you don’t have to worry about the GenServer trying to spawn tasks on a dead task supervisor.

If the GenServer crashes, the Task.Supervisor won’t need to crash. If you started your tasks with async/3, they’ll be brought down anyways. If this isn’t good, you can always change the strategy to :one_for_all.
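A rough sketch of that structure follows. The module names (`Demo.ParentSupervisor`, `Demo.Worker`, `Demo.TaskSupervisor`) are hypothetical, and fixed names like these only work for a single pair; with several pairs each one would need unique names.

```elixir
defmodule Demo.Worker do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(opts) do
    # Remember which task supervisor to spawn tasks on.
    {:ok, %{task_sup: Keyword.fetch!(opts, :task_sup)}}
  end
end

defmodule Demo.ParentSupervisor do
  use Supervisor

  def start_link(opts \\ []), do: Supervisor.start_link(__MODULE__, :ok, opts)

  @impl true
  def init(:ok) do
    children = [
      # Order matters: with :rest_for_one, a crash of the task supervisor
      # also stops and restarts everything listed after it.
      {Task.Supervisor, name: Demo.TaskSupervisor},
      {Demo.Worker, task_sup: Demo.TaskSupervisor}
    ]

    Supervisor.init(children, strategy: :rest_for_one)
  end
end

{:ok, _parent} = Demo.ParentSupervisor.start_link()
```

Swapping `:rest_for_one` for `:one_for_all` in `Supervisor.init/2` is the only change needed if a GenServer crash should also restart the task supervisor.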

The only issue with this approach is that the GenServer doesn’t know “how to reach” the Task.Supervisor. A practical solution to this is to do something like generate a random term (a ref for example) in the ParentSupervisor.init/1 callback, and pass it down to both children, which use it to register themselves in a global Registry or something like that.
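One way this could look, as a sketch only: the names `Pair.Registry`, `Pair.Supervisor`, and `Pair.Worker` are invented for the example. In this variant only the Task.Supervisor registers itself (through a `:via` tuple), and the GenServer simply receives that same tuple as an option.

```elixir
defmodule Pair.Worker do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, Keyword.fetch!(opts, :task_sup))

  def double(server, n), do: GenServer.call(server, {:double, n})

  @impl true
  def init(task_sup), do: {:ok, task_sup}

  @impl true
  def handle_call({:double, n}, _from, task_sup) do
    # Blocking on the task keeps the sketch short; real code may prefer
    # to handle the reply in handle_info/2 instead.
    result = task_sup |> Task.Supervisor.async(fn -> n * 2 end) |> Task.await()
    {:reply, result, task_sup}
  end
end

defmodule Pair.Supervisor do
  use Supervisor

  def start_link(opts \\ []), do: Supervisor.start_link(__MODULE__, :ok, opts)

  @impl true
  def init(:ok) do
    # A fresh ref per pair, so any number of pairs can coexist.
    id = make_ref()
    task_sup = {:via, Registry, {Pair.Registry, {:task_sup, id}}}

    children = [
      {Task.Supervisor, name: task_sup},
      {Pair.Worker, task_sup: task_sup}
    ]

    Supervisor.init(children, strategy: :rest_for_one)
  end
end

# The registry has to be running before any pair is started.
{:ok, _registry} = Registry.start_link(keys: :unique, name: Pair.Registry)
{:ok, sup} = Pair.Supervisor.start_link()

{_id, worker, _type, _modules} =
  List.keyfind(Supervisor.which_children(sup), Pair.Worker, 0)
```

In a real application the Registry would itself sit in the supervision tree, above the pair supervisors, rather than being started by hand as it is here.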


Thank you so much! This makes total sense to me


I thought a bit about this specific point, and after looking at the implementation of Task.Supervisor, I don’t understand how this can happen.

Since the Task.Supervisor (or to be precise its implementation, DynamicSupervisor) is trapping exits, and it’s itself a GenServer, its terminate callback will be called whenever the parent process dies. The terminate callback shuts down the tasks. So a graceful shutdown will take place in this case, and the parent GenServer doesn’t need to wait for the Task.Supervisor, or even be aware of its existence. In fact, this is the same callback that would be triggered if the Task.Supervisor were started under a supervisor, and the parent supervisor sent a :shutdown signal.

The only situation I can think of where a graceful shutdown won’t take place is when the parent GenServer “brutally kills” the Task.Supervisor with an explicit :kill signal.

The drawback I see here of not starting Task.Supervisor under a supervision tree is rather that it breaks the semantics of “children shutting down before parents” that you were describing, because no one is monitoring the Task.Supervisor and waiting for it to shut down before shutting down processes higher up in the supervision tree.

Is this correct or am I missing something?

Actually, neither really. You already mentioned the issue, but dismissed it as a non-issue.

Graceful shutdown will only happen (by accident) if shutting down the other parts of the system takes long enough after the GenServer was considered “shut down” without waiting for the Task.Supervisor. That might or might not be the case. The VM will stop once all applications are shut down, no matter whether there are processes still exiting “in the background”. That’s why each “parent” process in the supervision tree (usually a supervisor) needs to wait for its children to have exited to ensure their graceful shutdown.

Don’t understand what you mean here. What did I disregard as a non-issue?

Ok, this is the issue then. If shutting down the task supervisor takes so long that the shutdown process reaches the top of the supervision tree while the task supervisor is still shutting down tasks, then the VM will stop before the clean shutdown of the Task.Supervisor completes. This is what I meant with “breaking semantics of children shutting down before parents”, and this is the only way the graceful shutdown of Task.Supervisor can be interrupted. Makes more sense now, thanks.

FWIW, this won’t happen because supervisors have a shutdown: :infinity value in their child spec.

Thanks, but the situation we were talking about, described in the OP, was a GenServer starting a Task.Supervisor directly using start_link/1, outside of a supervision tree. So in that case, if the GenServer sends a :kill signal to the Task.Supervisor, it will die right away with no terminate callback being called. At least that’s my understanding.
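A small sketch of that difference, using a hypothetical `Noisy` GenServer as a stand-in for the Task.Supervisor (it traps exits like a supervisor does, and reports when its terminate callback runs). The process running the snippet plays the role of the parent GenServer.

```elixir
defmodule Noisy do
  use GenServer

  def start_link(report_to), do: GenServer.start_link(__MODULE__, report_to)

  @impl true
  def init(report_to) do
    # Like a supervisor, trap exits so terminate/2 runs when the parent stops us.
    Process.flag(:trap_exit, true)
    {:ok, report_to}
  end

  @impl true
  def terminate(reason, report_to) do
    # Stand-in for cleanup work such as shutting down running tasks.
    send(report_to, {:terminated, self(), reason})
  end
end

# Trap exits here too, so the linked children's exits don't take this process down.
Process.flag(:trap_exit, true)

{:ok, polite} = Noisy.start_link(self())
{:ok, killed} = Noisy.start_link(self())

Process.exit(polite, :shutdown)
Process.exit(killed, :kill)

# Wait briefly for each child to report that its cleanup ran.
polite_cleanup =
  receive do
    {:terminated, ^polite, reason} -> reason
  after
    500 -> :no_cleanup
  end

killed_cleanup =
  receive do
    {:terminated, ^killed, reason} -> reason
  after
    500 -> :no_cleanup
  end
```

Here `polite_cleanup` ends up as `:shutdown`, because the exit signal from the parent triggers `terminate/2`, while `killed_cleanup` is `:no_cleanup`, because `:kill` cannot be trapped and the process dies without running any callback.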


This is the important bit to keep in mind 🙃