Parent - custom parenting of processes

sasajuric · April 8, 2018, 7:08pm

Hi folks,

I’ve written a small experimental library motivated by a couple of scenarios I’ve encountered in production. The library is tentatively called parent, and it aims to help with scenarios where a GenServer is used to directly parent children.

At this moment this is an exploration, so it’s not published on hex. I’ve written in more detail about the library, motivation, and example scenarios, and I’m interested in any sort of feedback. See the GH repo for more details.

Looking forward to hearing your thoughts!

mbuhot · April 8, 2018, 10:02pm

I’d definitely appreciate a library like this to simplify code that manages child processes.

When should I put some Supervisor-like capabilities in a GenServer, vs. creating a custom Supervisor, e.g. ConsumerSupervisor?

sasajuric · April 9, 2018, 7:14am

That’s a very good question!

I think that putting supervisor-like capabilities in a GenServer is essentially not much different from creating a custom supervisor. They both yield a module which is a GenServer (or GenStage, gen_statem, …) and acts as a parent of its children.

However, the former is a hardcoded solution, so I think it’s more appropriate for unique scenarios, whereas the latter would be a better option when you need it in multiple places.

So taking the examples from the project readme, if you wanted to provide a generic abstraction for periodic execution, you might make it a supervisor, so you could start it like Periodic.start_link(child_spec, opts), where Periodic is a supervisor, child_spec is the spec of a job, and opts is a data-driven interface to control periodic execution.

It’s worth noting that the main thing that makes supervisor special is the type: :supervisor in the child spec. As far as I know, this field is only used by the release handler when doing code reloading. When the release handler wants to determine the process hierarchy, it starts with the top process, and recursively goes deeper for any process which is marked as a supervisor (with type: :supervisor). For every supervisor, the release handler will ask it for the list of its children.

Therefore, to make the custom supervisor work with code reloading, the module needs to handle Supervisor (or more precisely Erlang’s :supervisor) specific messages, such as :which_children, and return the result in the same shape. This is somewhat hacky, so I advise caution with going there. I think that in many cases a custom supervisor can also be implemented with two standard supervisors and a GenServer (or any other desired behaviour).
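As a rough illustration of the :which_children point, here is a minimal sketch of a GenServer that parents its children and answers the call issued by Supervisor.which_children/1. CustomParent and its start_link/1 argument are hypothetical names made up for this example (they are not part of the parent library), and the sketch covers only this one message, so it should not be read as everything the release handler may ask of a supervisor.

```elixir
defmodule CustomParent do
  # Hypothetical module for illustration only.
  use GenServer

  # Marks the process as a supervisor in the child spec, so tools which walk
  # the supervision tree will ask it for its children.
  def child_spec(arg) do
    %{
      id: __MODULE__,
      start: {__MODULE__, :start_link, [arg]},
      type: :supervisor,
      shutdown: :infinity
    }
  end

  # child_funs is a keyword list of {id, initial_state_fun} pairs.
  # Each child is started as an Agent, purely to keep the example short.
  def start_link(child_funs), do: GenServer.start_link(__MODULE__, child_funs)

  @impl GenServer
  def init(child_funs) do
    Process.flag(:trap_exit, true)

    children =
      for {id, fun} <- child_funs, into: %{} do
        {:ok, pid} = Agent.start_link(fun)
        {id, pid}
      end

    {:ok, children}
  end

  # Supervisor.which_children/1 boils down to a :which_children call, and
  # expects a list of {id, child, type, modules} tuples in return.
  @impl GenServer
  def handle_call(:which_children, _from, children) do
    reply = for {id, pid} <- children, do: {id, pid, :worker, [Agent]}
    {:reply, reply, children}
  end

  # Forget children which have terminated.
  @impl GenServer
  def handle_info({:EXIT, pid, _reason}, children) do
    remaining = for {id, child} <- children, child != pid, into: %{}, do: {id, child}
    {:noreply, remaining}
  end
end
```

With this in place, Supervisor.which_children(parent) can be invoked on the process started by CustomParent.start_link(counter: fn -> 0 end), and it returns a one-element list describing the :counter child.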

mjadczak · April 10, 2018, 1:16am

I love it! I’ve also encountered situations where I’d write code like this by hand, and now I realise it was less robust than I thought (e.g. I didn’t realise that killing a process does not take down any children it start_linked automatically), so it’s good to have a library out for that. As you said, this is one of those “know it when you need it” types of libraries, rather than something which should be a default for supervising things.

cmkarlsson · April 10, 2018, 1:30am

It depends on how you kill it. A normal exit signal will not kill linked processes, and if the linked process traps exits, it depends on how the exit messages are handled. Sending a kill signal will always kill linked processes.

sasajuric · April 10, 2018, 8:00am

When you take down the parent with a non-normal reason, a linked child will usually be stopped too. However, there are some exceptions, as pointed out by @cmkarlsson. In addition, there is a slight ordering problem. If the parent is not explicitly taking down its children, a child might linger on for a bit longer before it’s taken down.

So it’s not completely guaranteed that when the parent stops all of its descendants are already down. This can lead to some strange race conditions, which are admittedly not very likely to happen, but are still possible.
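The link behaviour described above can be checked with a small experiment. This is only a sketch: LinkDemo and child_survives?/2 are made-up names, and the 200 ms wait is an arbitrary assumption about how long exit signal propagation might take.

```elixir
defmodule LinkDemo do
  # Starts a parent which links a child, makes the parent exit with
  # `exit_reason`, and reports whether the child outlives the parent.
  def child_survives?(exit_reason, trap_exit? \\ false) do
    test = self()

    parent =
      spawn(fn ->
        spawn_link(fn ->
          Process.flag(:trap_exit, trap_exit?)
          send(test, {:ready, self()})
          Process.sleep(:infinity)
        end)

        receive do
          :stop -> exit(exit_reason)
        end
      end)

    # Wait until the child is linked and has set its trap_exit flag.
    child =
      receive do
        {:ready, pid} -> pid
      end

    ref = Process.monitor(child)
    send(parent, :stop)

    receive do
      # The child was taken down through the link.
      {:DOWN, ^ref, :process, ^child, _reason} -> false
    after
      200 ->
        # The child is still running, so clean it up ourselves.
        Process.exit(child, :kill)
        true
    end
  end
end
```

Here LinkDemo.child_survives?(:normal) returns true, LinkDemo.child_survives?(:shutdown) returns false, and LinkDemo.child_survives?(:shutdown, true) returns true, because a child which traps exits receives the exit signal as a message instead of being terminated.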

IMO, a good approach to building an OTP supervision tree would be as follows:

Every parent is a supervisor.
A child which is a supervisor has the :shutdown option set to :infinity (this is the default for supervisors).
A supervisor process (i.e. a parent) is only taken down through its own parent.

Such an approach guarantees that a parent process terminates only after all of its descendants are down. I believe that this is a clean approach which completely eliminates some possible race conditions.

When you’re manually parenting children, you can ensure the same in the terminate/2 callback, but it will require some work, and you need to remember to do it. I’ve just browsed through some of our code, and noticed that explicit children termination is not done, probably because I forgot to implement it when I wrote the original code.
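For reference, a minimal sketch of what explicit child termination in terminate/2 could look like. ManualParent is a hypothetical module, the children are plain Agents to keep the example small, the child count is assumed to be positive, and the 5 second shutdown timeout is an arbitrary choice.

```elixir
defmodule ManualParent do
  # Hypothetical module for illustration only.
  use GenServer

  def start_link(child_count), do: GenServer.start_link(__MODULE__, child_count)

  def children(parent), do: GenServer.call(parent, :children)

  @impl GenServer
  def init(child_count) do
    # Exits must be trapped, otherwise terminate/2 is not invoked when the
    # parent of this process sends it a shutdown exit signal.
    Process.flag(:trap_exit, true)

    children =
      for _ <- 1..child_count do
        {:ok, pid} = Agent.start_link(fn -> nil end)
        pid
      end

    {:ok, children}
  end

  @impl GenServer
  def handle_call(:children, _from, children), do: {:reply, children, children}

  # Forget children which have terminated on their own.
  @impl GenServer
  def handle_info({:EXIT, pid, _reason}, children),
    do: {:noreply, List.delete(children, pid)}

  @impl GenServer
  def terminate(_reason, children) do
    # Stop the children in reverse start order, and wait for each one to go
    # down before moving on to the next.
    for pid <- Enum.reverse(children) do
      ref = Process.monitor(pid)
      Process.exit(pid, :shutdown)

      receive do
        {:DOWN, ^ref, :process, ^pid, _reason} -> :ok
      after
        5_000 ->
          # The child did not stop in time, so kill it and wait again.
          Process.exit(pid, :kill)

          receive do
            {:DOWN, ^ref, :process, ^pid, _reason} -> :ok
          end
      end
    end
  end
end
```

Since terminate/2 waits for a :DOWN message from every child, the parent in this sketch stops only after all of its children are down. For example, once GenServer.stop(parent) returns, none of the previously fetched child pids should be alive.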

sasajuric · May 31, 2018, 7:49am

I added some docs, and pushed the library to hex:

The library also includes a lightweight scheduler for periodic jobs, which provides finer-grained control with respect to OTP supervision trees and requires no app env based configuration.

We’ve recently started using the library in our project. It’s still early days, but so far it looks good.

sasajuric · January 27, 2020, 8:51am

Released version 0.7.0 with various improvements to the periodic job scheduler.

hauleth · January 27, 2020, 9:19am

I have only just spotted this, but is there any reason not to use Director?

sasajuric · January 27, 2020, 10:25am

This is the first time I’ve heard of this library, thanks for mentioning it! Obviously, the main reason why I didn’t use it is that I didn’t know about its existence at the time I wrote Parent.

Let me first briefly summarize Parent’s intention. It’s basically a GenServer-like behaviour where callback code can do regular GenServer stuff (handle calls, casts, infos), as well as start/stop children dynamically and react to their termination. The behaviour itself also takes over the supervisor roles, ensuring proper child termination, and presenting itself to the outer world as a supervisor (so any logic traversing the supervision tree would also traverse Parent’s children).

In other words, Parent is basically a fusion between Supervisor and GenServer. In theory you could reimplement Supervisor on top of parent, though I’m not suggesting doing that.

Director seems to share some similar goals, but looking at the callback spec, the GenServer part is missing, so it seems that director can only be controlled externally (from outer processes). If that’s indeed the case, it wouldn’t be fit for any of the scenarios for which I wrote Parent (all of which are mentioned in the rationale doc). For example, I couldn’t write Periodic the way it is written now, because it is based on internal handling of send_after messages.

Beyond that, at first glance Director seems packed with a bunch of other features, such as managing children of other processes, and a custom ETS or Mnesia based registry. Parent is deliberately designed with a small feature set to keep it easy to reason about. By saying that Parent is a GenServer-like behaviour which has some supervisor roles, we’ve basically explained the gist of the lib in terms of regular OTP parlance. I don’t expect that a seasoned OTP developer would have to dive into the code to understand what the behaviour does. Such a design keeps Parent simple, and at the same time very flexible, since you can implement arbitrary behaviour on top of Parent.

It’s worth repeating that this is the first time I’ve heard of Director, so obviously I’m not familiar with how it works. Take my comments with a grain of salt.

sasajuric · January 28, 2020, 8:35am

Wrote a blog post which presents Periodic in more detail. Hope you’ll enjoy it!

sasajuric · September 8, 2020, 8:20am

Recently I’ve done some work on Parent which basically adds the remaining supervision features, such as process restarts and lifecycle bindings. I admit that this was not something I intended to do when I first started working on parent, and I know that reimplementing a complete supervisor is a controversial idea, but I still did it for a couple of reasons:

I found myself occasionally manually implementing a naive one_for_one restart strategy on top of Parent.GenServer, so I figured it would be nice to have something like that done by the foundational abstraction.
I wanted to explore different approaches to binding process lifecycles, i.e. alternatives to rest_for_one and one_for_all.
I wanted to explore the idea of bundling a basic registry inside the supervisor.
It was fun
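To make the first point concrete, a naive one_for_one style restart done by hand might look roughly like the sketch below. It uses a plain GenServer rather than Parent.GenServer, NaiveRestarter is a made-up name, and there is deliberately no restart intensity limit, which a real supervisor would enforce.

```elixir
defmodule NaiveRestarter do
  # Hypothetical module for illustration only.
  use GenServer

  # start_fun is a zero-arity function which starts a linked child and
  # returns {:ok, pid}.
  def start_link(start_fun), do: GenServer.start_link(__MODULE__, start_fun)

  def child(server), do: GenServer.call(server, :child)

  @impl GenServer
  def init(start_fun) do
    Process.flag(:trap_exit, true)
    {:ok, pid} = start_fun.()
    {:ok, %{start_fun: start_fun, child: pid}}
  end

  @impl GenServer
  def handle_call(:child, _from, state), do: {:reply, state.child, state}

  # The current child went down, so start a replacement right away.
  @impl GenServer
  def handle_info({:EXIT, pid, _reason}, %{child: pid} = state) do
    {:ok, new_pid} = state.start_fun.()
    {:noreply, %{state | child: new_pid}}
  end

  # Ignore anything else, including exits of processes we no longer track.
  def handle_info(_other, state), do: {:noreply, state}
end
```

For example, after starting it with NaiveRestarter.start_link(fn -> Agent.start_link(fn -> 0 end) end) and killing the pid returned by NaiveRestarter.child/1, a later call to NaiveRestarter.child/1 returns a different, live pid.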

You can see some highlights in the branch readme. The implementation is still somewhat rough around the edges, and the docs need more work. Most importantly, I want to test drive this branch on a production system. Therefore, this work won’t be merged very soon, but in the meantime I’m curious to gather any feedback on these ideas. Note that I don’t advise using this branch in production, because the API is still very unstable.

LostKobrakai · September 8, 2020, 8:43am

This one sounds quite interesting. At the beginning of the year I worked on a nerves project, where I basically had a chain of “requirements” like “ssd is attached > dockerd runs > docker images are fetched > docker app is started”, which should work like rest_for_one, but not quite. Basically, if one stage doesn’t recover, it should only involve the next step up the chain instead of restarting the whole chain. If that doesn’t work, go one step up again, and so on. With supervisors this would’ve been quite some nesting (I never fully implemented it).

sasajuric · September 8, 2020, 8:59am

Interesting problem. The lifecycle bindings introduced here are roughly similar to rest_for_one and one_for_all, so I don’t think they would help in this case, but Parent.GenServer with handle_child_terminated could work here, and I think this can already be done with the current package release.

hauleth · September 8, 2020, 9:32am

Out of interest, why do so with:

Elixir instead of system supervisor?
Docker instead of systemd/podman?

Both of these approaches would make it much more straightforward and IMHO clearer.

LostKobrakai · September 8, 2020, 9:38am

Nerves afaik doesn’t come with systemd and docker was meant to run only postgres (which did change though). I didn’t know of podman at the time. It’s likely the better option.

wolf4earth · September 8, 2020, 6:09pm

Where exactly would I read more about how links and exit reasons interact? I assume it’s somewhere in the OTP docs?

hauleth · September 8, 2020, 6:42pm

Indeed. It can be found in docs of erlang:link/1.

sasajuric · October 12, 2020, 7:45am

Version 0.11.0 with support for rich supervision is out.