OTP Semantics / GenServer starting a GenServer, how to organise in a Supervisor?

Hi, I’m particularly hoping to get the attention of OTP experts like @sasajuric here

I’ve found myself starting to write significant amounts of code where basically I start a GenServer, which in turn starts a couple more GenServers in the init phase. I then store the pids of these GenServers in my GenServer and … hope for the best.

It’s gradually dawning on me that this may not be optimal, and I’m having trouble getting my application to stop, which I suspect is due to dangling processes.

I guess, 2 questions:

  1. How should I best rearrange my app to avoid this situation?!
  2. I presume that it’s sometimes valid to implement as I have done. However, what do I need to do to make it “safe” and idiomatic? Particularly I think I have corner cases at present if the top GenServer is shut down with :normal, possibly also during an immediate/kill shutdown? And I guess potentially other corner cases?

I will try and present a more concrete example.

I was writing a GenServer which monitors a serial port, so I made use of the Nerves.UART library, something like the following:

defmodule UARTMonitor do
  use GenServer

  alias Nerves.UART

  def start_link(args \\ []) do
    ...
    GenServer.start_link(__MODULE__, args, opts)
  end

  def init(args) do
    {:ok, pid} = UART.start_link()
    :ok = UART.open(pid, args[:uart], some_other_args)

    state = %{
      uart_pid: pid,
      other_state: blah
    }

    # init/1 must return {:ok, state}
    {:ok, state}
  end

  ...
end

Now, I have reviewed some posts by Sasa, and I’m skimming through the Elixir language examples and also Elixir in Action ch. 9 (I confess I read it a long while back and it’s only now starting to make sense…). These make me think that I should be following the general hand-waving guideline to always start things through a supervisor, hence my code above feels “wrong”?

Should I have created a DynamicSupervisor and had that call my UART.start_link()?

However, such a change potentially increases boilerplate quite a bit and leaves me unsure how to always create the semantics that I want:

a) If I want the supervisor to restart my UART process if it dies, then I guess I need a process registry to give me some kind of consistent PID naming (so I can keep calling functions in that process)? However, when spawning multiple processes, it sometimes seems difficult to invent unique dynamic names? (See the naming sketch after these questions.)

b) I’m not sure how to create the semantics that, if the main process dies, it should kill the UART process too? I guess I would start the UART under (say) a DynamicSupervisor, then just link() it to my own process?

c) What if I needed to guarantee some cleanup that must be done if the UART process stops (not just crashes)? Do I need a third process linked to the UART, or can I reliably catch exits from the top process (as it dies) and clean up the UART process?

d) How does shutdown happen? I can’t get my head around whether it’s enough to start the (say) DynamicSupervisor for the UART after the top-level process? I’m thinking about ensuring hypothetical cleanup runs correctly (imagine it was important to squirt some “bye” message down the UART before closing it).
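For (a), this is roughly what I mean by consistent naming through a registry. A sketch with made-up names (MyApp.UARTRegistry, the "ttyUSB0" key, and the :status call are all just illustrations):

defmodule MyApp.UARTNaming do
  # Assumes {Registry, keys: :unique, name: MyApp.UARTRegistry} has been
  # started somewhere in the supervision tree.
  def via(device), do: {:via, Registry, {MyApp.UARTRegistry, device}}
end

# The monitor is started (and restarted) under the same via name:
#   GenServer.start_link(UARTMonitor, args, name: MyApp.UARTNaming.via("ttyUSB0"))
#
# Callers keep reaching it by device name, regardless of restarts:
#   GenServer.call(MyApp.UARTNaming.via("ttyUSB0"), :status)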

Going in the other direction, what do I need to do to make the current situation “safe”? Do I need to catch exits? What conditions need to be handled (assuming that if the main process is going down it needs to stop the UART process as well)? Is anything else needed?

There must be some good articles on “how to do this stuff”? It seems like I’m missing some real 101 getting-started material? It does feel as though things could sometimes be simplified if there were a form of start_link() which would also shut down dependent processes in the case of a :normal shutdown.

I think I’m on the right track to proceed with (in general) “start everything through a supervisor”. So my next thought is about how to construct libraries and whether there are ways that the library can wrap some or all of these concerns. In general, should a library provide some of the supervisor pieces, and if yes, how should it offer those in a way that can be inserted into the application’s supervision tree?

I would appreciate some thoughts on how best to structure libraries which need a resource checkout to create some kind of usage handle? This is your database library and the like. We need a coordinator which will do some work to acquire a handle, plus a process to store this handle and do the actual work with it. Another process will request the resource.

My thought is that this is:

  • ResourceAllocator module - does the work to figure out a handle and starts a process through
  • ResourceDynamicSupervisor - which holds the resource processes

A process needing a handle will call into the ResourceAllocator, which will start_link the new process under the DynamicSupervisor and also monitor/link it with the calling process (as appropriate).
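To make that concrete, something like this is what I’m picturing (all names are made up, and acquire_handle is a stand-in for whatever work produces the real handle):

defmodule ResourceAllocator do
  # Figure out a handle, then start a worker that owns it under the
  # DynamicSupervisor. The worker is given the caller's pid so it can
  # monitor/link the requesting process as appropriate.
  def checkout(args) do
    handle = acquire_handle(args)

    DynamicSupervisor.start_child(
      ResourceDynamicSupervisor,
      {ResourceWorker, handle: handle, owner: self()}
    )
  end

  # Placeholder for the real handle-acquisition work.
  defp acquire_handle(_args), do: make_ref()
end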

Is this a good pattern to be using?

If yes, why don’t more libraries include the supervisor part in their implementation? Why didn’t the Nerves.UART library choose to include a supervisor to track these processes? Nothing is black and white, but would it generally be a good strategy to include some kind of DynamicSupervisor in the library?

There is good documentation on Supervisors and GenServers, but I feel we could benefit from more description of good patterns for using these building blocks, especially on how to allocate and clean up resources at runtime and at shutdown. Is it a common pattern to build a new supervisor module which wraps and provides utility functions to start processes (I’m thinking of Task.Supervisor), or is it more normal to keep a separate module for managing this?

If I look at a lot of libraries in the wild it seems like it’s most common to provide just a start_link() function, and leave the library user to figure out how to add it to a supervision tree. However, if I set out to design something which would immediately be stuffed into a DynamicSupervisor, then I would probably structure the implementation quite differently? Possibly even offering a custom supervisor module in the style of Task.Supervisor?

Why is the Task.Supervisor style of implementation not the de facto interface for many libraries? Why are we more commonly offering an interface to start processes and not an implementation to start and supervise a process in one go? I realise that in many cases it will be useful to design a custom supervision strategy, but I sense that a significant number would not?

Does anyone else have any good articles on how to build solid apps? How to structure apps which need to start multiple instances of things, how to structure the layers of supervisors, and what techniques to use to manage/wrap the starting of the processes?

1 Like

I’m in the same situation, trying to build a decent supervision tree for the first time.
I think nearly all you need to know is in chapter 9 of “Elixir in Action” and the Supervisor docs (Elixir v1.11.4).

I’ll try to answer your questions, but handle with care.

a) If the UART always goes down with the UARTMonitor (see (b)), then I don’t think you need that. You can start the UART supervised from the monitor and get its new pid.
b) This is the point of supervision trees; see the docs.
c) If the UART is taken down by a supervisor because a linked process failed, you have to configure the UART’s supervisor so that the UART process has enough time to clean up. You don’t need an extra process. Whether you need to clean up depends on how you access the UART. It may be enough to kill the process holding the UART (the supervisor does that). It’s surely not OK to let the process live as a zombie.
d) You need to catch exits if you want to clean up.

You only need a dynamic supervisor if processes are spawned dynamically at runtime, e.g. if the user can open UARTs at will. If there is a known set of processes you don’t.
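For a known set it can be as simple as a static tree, something like this (sketch, reusing the UARTMonitor from above; the uart value is made up):

children = [
  {UARTMonitor, [uart: "ttyUSB0"]}
]

Supervisor.start_link(children, strategy: :one_for_one)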

(handle with care, just learning this)

Usually it’s best if each child is running under some supervisor, because a supervisor doesn’t run any custom logic, and so it can’t crash or get stuck. As a result, the fault tolerance of the system is improved.

That said, there are many situations where this approach doesn’t bring much benefit (if any), while it complicates the code and the process structure. This typically happens when one worker is a logical parent of other worker(s), which seems to be the case in the example you posted. In such cases it’s fine to start workers as direct children of a worker.

When you’re directly parenting other workers, you need to pay attention to a couple of things. First, the parent process should trap exits. The main reason for this is to ensure that the terminate callback is invoked if the parent of the parent is stopping. In this callback you should stop the child and wait for it to exit. This will ensure that the parent stops only after its children stop, which can prevent some subtle race conditions. For the same reason you should also set the shutdown timeout of the parent process to :infinity. Finally, since you’re trapping exits, you need to handle the :EXIT message in some way, e.g. by restarting a child, or stopping the parent.
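To make that shape concrete, here is a rough sketch; ChildWorker is just a placeholder for whatever worker is being parented, and the details will vary:

defmodule ParentWorker do
  use GenServer

  def child_spec(arg) do
    %{
      id: __MODULE__,
      start: {__MODULE__, :start_link, [arg]},
      # Give this parent as long as it needs to take its child down first.
      shutdown: :infinity
    }
  end

  def start_link(arg), do: GenServer.start_link(__MODULE__, arg)

  @impl true
  def init(arg) do
    # Trapping exits converts the shutdown signal from our own parent into
    # a terminate/2 invocation instead of an immediate exit.
    Process.flag(:trap_exit, true)
    {:ok, child_pid} = ChildWorker.start_link(arg)
    {:ok, %{child_pid: child_pid}}
  end

  @impl true
  def handle_info({:EXIT, pid, reason}, %{child_pid: pid} = state) do
    # The child died; decide what that means here (restart it, or stop).
    {:stop, reason, %{state | child_pid: nil}}
  end

  def handle_info({:EXIT, _pid, _reason}, state), do: {:noreply, state}

  @impl true
  def terminate(_reason, %{child_pid: pid}) when is_pid(pid) do
    # Stop the child and wait for it to exit before this parent finishes.
    GenServer.stop(pid)
  end

  def terminate(_reason, _state), do: :ok
end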

I discussed this topic in this talk. I also wrote a library called Parent which can help with such challenges. In particular, if you want to parent children from a worker, Parent.GenServer could be useful. See the caveats section for some tips on writing a resilient custom parent.

7 Likes

OK! Thanks for the replies! This is real lightbulb moment stuff!! Sasa, I saw your video about “Parent” when it came out, but the significance didn’t click then; it is now!

OK, if I may press for a couple more clarifications on practice. Hopefully this is useful for others (might make a good book!).

So to do the easy stuff first:

  • Sasa, you actually do NOT advocate libraries starting their own global supervisors just to ensure that processes get a parent. Noted.
  • However, the practice of start_link-ing a random child GenServer from within another parent GenServer seems to be much less bad than I thought? If I understand your words, I count:

a) If the parent GenServer chooses to shut down with exit(:normal), then this leaves a dangling child. However, this is easy to code around.
b) In every other circumstance, e.g. a kill of the parent, the same signal is also sent to the child? So the child will get some signal to die? However, there is no control over whether the child will exit before the parent.
c) There is no retry on a :terminate signal sent to the child. So if the child ignores the :terminate then you get a dangling child. However, I can’t see why it’s insufficient to simply pause and wait for your children to die? If they don’t die then you don’t die, so your supervisor will call :kill on you, and presumably this gets passed to the children as well? (Or does :kill get turned into a :killed when forwarded to the child?)
d) No hot code upgrading.
e) Do you lose some introspection of state with a bare start_link to another GenServer?

So it seems like the main thing is ensuring that the children get killed? Right up to the point I want to handle the child restarting as well… Now I’m rewriting a supervisor…

So I sense that the answer to the next would be “use Parent”, but let me sketch out what is a common pattern for me. I would like to know if it can be solved with standard GenServers, not least because I also like to use gen_statem for some of my code. I’ll try to be generic, but it’s easier to explain with a more real example:

Situation

  • I have a file containing some config
  • I want to slurp it, parse it and then you can ask me questions about the contents
  • If the file changes on disk I want to reload it

The general class of problem is a chain of things, each taking events from the step below, refining them and passing them up the chain. OTP generally seems designed to deal with reliable things that fail rarely, yet I regularly have a situation where I want to wrap an unreliable producer and turn it into a reliable one, without bothering the layers above. The above situation exists in real code for me, also as a process tailing a file, which sends new lines to a parent which processes and dispatches them. However, there are usually too many other concerns to make this something which feels like it can be wedged into a GenStage shape.

So the above has a problem in that my file watcher needs to be configured to know who to send events to AFTER it’s been started. So I have a dependency problem. I’m using this library: GitHub - falood/file_system: Filesystem monitor for elixir

Code from the example:

  def init(args) do
    {:ok, watcher_pid} = FileSystem.start_link(args)
    # Call to give the watcher *our* pid to send messages to
    FileSystem.subscribe(watcher_pid)
    {:ok, %{watcher_pid: watcher_pid}}
  end

If I assume that the parent needs to run forever and that the watcher is unreliable in this context, then I need to supervise the child (it could be tailing a file, reading tweets from Twitter, consuming some API, etc.). However, the case here is not uncommon and is problematic in that both sides need to know about each other’s pids?

General statement of problem:

  • “producer” library which is “unreliable”
  • consumer/parent process doesn’t want to have to know too much about the producer restarting
  • the producer needs to be “configured” using knowledge of/from the parent, e.g. setting up the event subscription here, but it could be many other things. This dramatically complicates the restart process because it needs knowledge from the “already running thing” and the “starting up thing”.

Can I propose some solutions (other than Parent), and you suggest which would seem best?

Idea 1)

  • Assuming there is only a single instance of parent, start it from a one_for_one supervisor, give it a name
  • Create a new empty module with just a child spec and a start_link() function which implements the following (perhaps this can be squashed directly into the child spec itself as a custom start function?):
    {:ok, watcher_pid} = FileSystem.start_link(args)
    FileSystem.subscribe(ParentProc)
    {:ok, watcher_pid}
  • Start this just after the parent in the same supervisor.

Results:

  • This seems to create the correct producer/consumer setup, however,
  • both processes will now restart without affecting the other (which is likely the intention for this situation)
  • Stopping the parent won’t stop the child (without extra effort)
  • this seems difficult to scale to something where you needed lots of similar parents with the same child structure

Idea 2)

  • Start the parent as above
  • Start a Task.Supervisor to manage the children (or a DynamicSupervisor with :temporary children)
  • Have the parent do a Task.Supervisor.start_child(fn -> restart_child_and_call_subscribe() end)
  • Monitor/link to the task and add code to restart it if it dies

Results

  • This seems more attractive as the restart code can know the parent pid and “do appropriate setup” on the restart event.
  • Feels like it could be used in the case of dynamic creation of parent processes, no need for child to know the name of the parent (dynamic restart handler means we can now pass pids during restart phase)
  • Need to be careful on Task.Supervisor shutdown that we don’t try to restart the child, i.e. filter out :shutdown exit events in the restart monitor. Feels ugly to have to special-case this?
  • Stopping parent probably stops child now (assuming we link() to the child and it’s :temporary)
  • Have to write our own supervisor restart strategy now…
  • Still extra code to trap exits, etc

I can see how Parent could clean this up a lot, but I would like to know some good ways to handle this within the current core libraries before I jump and migrate completely.

Having written this missive… I think I can summarise the two ideas more clearly as:
Idea 1) Have a :permanent supervisor monitoring the child. Shoehorn extra code into the start phase to examine the running environment and configure the child (because the assumption is that we didn’t write the child library, so we need to wrap it rather than change it).

Idea 2) Start the child as :temporary, and instead take control of the start/restart from within the parent. This makes the start function simpler to write, but moves a lot more responsibility onto the parent to emulate a supervisor.

Thoughts gratefully appreciated

Irrespective of this topic, I try to avoid a supervision tree in the lib app as much as possible. In most cases it’s better to provide the API for starting a supervisor process and leave it to the user of the lib to inject that process in their own supervision tree. This allows much more flexibility, b/c the client can leverage the supervision tree to start/stop processes when they want. The most frequent exception to this I’ve experienced is a global process registry. If I need that, I’ll typically start it in the lib’s supervision tree, because it keeps the API simpler, with no particular downsides.

Yeah, it’s fine if you have good reasons.

In a well-designed tree a parent process should never be brutally killed. However, if you’re not trapping exits, a parent process could receive a shutdown exit signal from its own parent and exit immediately. The child will terminate a bit after that. However, it’s possible that in the meantime a new parent has already been restarted, and that it has started a new child. So you might end up with a brief period of two incarnations of the child running at the same time. Or, if the child is registered, a new child may fail to start. To prevent this, a parent should trap exits. In this case, the exit signal from its own parent will be converted into a terminate callback (this is automatically done by behaviours such as GenServer).

If a parent is manually shutting down its child in terminate (as it should), then yes, the parent should wait for the child to stop.

IMO a parent process should always be configured with shutdown: :infinity to prevent this situation (this is btw the default for supervisors). Otherwise, the parent may be killed forcefully, and you end up with the same subtle race condition as explained earlier.

Not that I can think of.

This won’t work as sketched, because you need a sibling proc, not the parent pid. Getting that proc is a bit tricky, because you can’t invoke e.g. Supervisor.which_children from the child’s start_link (it would cause a deadlock), so you need to do it in handle_continue.
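A rough sketch of what I mean, with placeholder names (MyApp.Supervisor, Consumer, the :watcher id, the dirs path), and with the watcher started before the consumer so the consumer can subscribe itself once it is up:

defmodule Consumer do
  use GenServer

  def start_link(arg), do: GenServer.start_link(__MODULE__, arg)

  @impl true
  def init(_arg) do
    # Don't call Supervisor.which_children/1 here: the supervisor is still
    # busy starting us, so the call would deadlock.
    {:ok, %{}, {:continue, :subscribe}}
  end

  @impl true
  def handle_continue(:subscribe, state) do
    # Safe now: init/1 has returned, so the supervisor can answer.
    {:watcher, watcher_pid, _, _} =
      MyApp.Supervisor
      |> Supervisor.which_children()
      |> List.keyfind(:watcher, 0)

    # subscribe/1 registers the calling process, i.e. this consumer.
    FileSystem.subscribe(watcher_pid)
    {:noreply, Map.put(state, :watcher_pid, watcher_pid)}
  end

  @impl true
  def handle_info({:file_event, _watcher_pid, {_path, _events}}, state) do
    {:noreply, state}
  end
end

# In the tree, the watcher is a plain sibling started before the consumer:
children = [
  %{id: :watcher, start: {FileSystem, :start_link, [[dirs: ["/some/dir"]]]}},
  Consumer
]

Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)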

This would be more like it. I used to do this at some point, but then I figured that all these extra supervisors don’t bring a lot of benefit, while the code becomes more complicated.

You don’t need to jump into Parent. You could instead try to implement it yourself by trapping exits, starting the child process directly, and handling exit messages. I basically wrote Parent after becoming fed up with doing this repeatedly in the scenarios where I had to handle multiple children :)

The Parent API is exposed in layers. Most of the logic is in fact implemented in the foundational, imperative Parent module. Mixing that capability into your own gen_statem (or other behaviours such as GenStage) should be straightforward, with or without creating a generic Parent.GenStatem.

3 Likes

You are (all) incredibly helpful and I just want to thank you for taking the time to offer your experience here. It’s very much appreciated!

2 Likes