Is it bad practice to start other services in a GenServer's init function?

Let’s say we have a GenServer that uses some other (maybe 3rd-party) services while doing its job. Is it a bad design if the server just starts those other services in its init function with {:ok, pid} = OtherService.start_link() and keeps the pids of those server processes in its state for further use?
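
Something like this minimal sketch (AnotherService is just a second placeholder service):

```elixir
defmodule MyServer do
  use GenServer

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts)
  end

  def init(_opts) do
    # The services are linked directly to this GenServer,
    # not started under a separate supervisor.
    {:ok, service1} = OtherService.start_link()
    {:ok, service2} = AnotherService.start_link()
    {:ok, %{service1: service1, service2: service2}}
  end
end
```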

I know I could start them under a supervisor, give them names and use the names in my GenServer, or somehow send the pids to the GenServer. But is this any better than just linking them directly to the GenServer, without the registered names or the burden of sharing pids? Links will do their job when it comes to failures or restarts. Is it okay to have a process hierarchy where it’s no longer true that the leaves are workers and the nodes with children are supervisors? My GenServer also has linked children in this case, but it’s definitely not a supervisor. Should I use Supervisor.Spec.supervisor() instead of worker() when starting a GenServer like this, i.e. one with an internal hierarchy of processes?

BTW, isn’t it strange that Registry has to be introduced into a supervisor with Supervisor.Spec.supervisor() and not worker()? Isn’t this leaking an implementation detail that the user of the Registry isn’t supposed to know about?

I used this approach previously, but I mostly avoid it now. That said, it’s not necessarily a bad design if you’re aware of some caveats. The TL;DR is that the parent needs to assume some of the roles of a supervisor; I’ll expand on this below.

Links will do their job when it comes to failures or restarts.

This is not necessarily true. If the child process is trapping exits, it might linger on for some time, possibly even a long time, for example if it’s busy when the parent terminates.

Even if the child is not trapping exits, I think it’s possible that it briefly lingers after the parent dies. For a short period of time you might even end up with two copies of the child process: the old one still dying while the restarted parent has already started a new one. The reason is that exit signals are themselves asynchronous, meaning that when the parent terminates, the child doesn’t die immediately.

Finally, if the parent happens to stop with the reason :normal, the child will not notice it (unless it’s trapping exits), and will linger on forever.

In addition, since you’re linking the parent to all the children, a termination of a single child will terminate the parent and all the siblings. Perhaps that’s fine, but if not, you need to do something about it.

The solution to all these problems is to make the parent trap exits and implement the terminate callback. Inside that callback, you need to stop your children one by one, preferably in the opposite order of them being started. To stop a child, you monitor it, send it a {:shutdown, reason} exit signal, then wait for the corresponding :DOWN message. If the message doesn’t arrive after some time (say 5 seconds), you send a :kill exit signal and wait again for the :DOWN message.
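
A sketch of what that could look like, assuming the parent keeps the child pids in start order under the :children key of its state (shutdown_child/2 is my own helper, not an OTP function):

```elixir
def init(_opts) do
  # Required, so the parent gets the chance to run terminate/2.
  Process.flag(:trap_exit, true)
  {:ok, child1} = OtherService.start_link()
  {:ok, child2} = AnotherService.start_link()
  {:ok, %{children: [child1, child2]}}
end

def terminate(reason, state) do
  # Stop the children in the opposite order of them being started.
  state.children
  |> Enum.reverse()
  |> Enum.each(&shutdown_child(&1, reason))
end

defp shutdown_child(pid, reason) do
  ref = Process.monitor(pid)
  Process.exit(pid, {:shutdown, reason})

  receive do
    {:DOWN, ^ref, :process, ^pid, _} -> :ok
  after
    5_000 ->
      # The child didn't stop in time, so take it down unconditionally.
      Process.exit(pid, :kill)

      receive do
        {:DOWN, ^ref, :process, ^pid, _} -> :ok
      end
  end
end
```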

Finally, you need to handle the :EXIT message in the parent, which arrives when a child terminates. At that point you need to decide what to do. You can either return {:stop, reason}, which will in turn invoke the terminate callback and terminate the other children, or you can restart the child.
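
Sketched out, with restartable?/1 and replace_child/3 being hypothetical helpers you’d write yourself:

```elixir
def handle_info({:EXIT, pid, reason}, state) do
  if restartable?(reason) do
    # Restart the crashed child and remember its new pid.
    {:ok, new_pid} = OtherService.start_link()
    {:noreply, replace_child(state, pid, new_pid)}
  else
    # Returning {:stop, ...} invokes terminate/2, which stops the remaining children.
    {:stop, reason, state}
  end
end
```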

In addition to all this, code reloading won’t work properly for the child processes. The reason is that the release handler walks the supervision tree recursively, assuming that only a supervisor process can be a parent of other processes. This might be hacked around if the parent pretends to be a supervisor: you need to use supervisor instead of worker when inserting the parent into the supervision tree, and I think you’d also need to handle a which_children call message and return the same result a supervisor would return.
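
Since Supervisor.which_children/1 boils down to a plain GenServer call, I’d expect the parent could answer it along these lines (treat the details as an assumption):

```elixir
def handle_call(:which_children, _from, state) do
  # Mimic a supervisor's reply: a list of {id, child, type, modules} tuples.
  # In a real implementation the ids would have to be unique per child.
  children = for pid <- state.children, do: {OtherService, pid, :worker, [OtherService]}
  {:reply, children, state}
end
```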

So there are some potentially subtle issues, most of which revolve around proper termination and process cleanup. Those issues can be resolved, and except for code reloading, the solution is not very complicated. So if you have some strong reasons to avoid the extra supervisor, you can consider this option. An example of this in practice is the Phoenix socket, which is a GenServer that is the direct parent of the channel processes.

Personally, I most often go with Supervisors, because then I’m more confident about the proper termination behaviour.

I agree that this is leaking some implementation detail. There have been some discussions on the core list to improve this situation.


When a process dies, its pid becomes invalid, but your GenServer still has it in its state and might use that stale pid while processing subsequent messages, until the message arrives that tells it that a service has died and needs to be restarted.

By having a separate supervisor and sending messages to named processes, you will drastically reduce the chance of getting out of sync.
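
Roughly like this, using Supervisor.Spec and assuming OtherService.start_link/1 accepts a :name option (the names are just placeholders):

```elixir
import Supervisor.Spec

children = [
  # Registered under a name, so MyServer can reach the service with
  # GenServer.call(OtherService, ...) instead of holding on to a pid.
  worker(OtherService, [[name: OtherService]]),
  worker(MyServer, [])
]

Supervisor.start_link(children, strategy: :one_for_one)
```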

A process that is responsible for restarting other processes should be focused on only that purpose, to avoid “gaps”.


In my opinion this is a valid design option. I’ve seen it done, for example, when a different supervision behaviour is needed, such as an exponential backoff algorithm for restarts.

One drawback of this approach is that it may lead you down the route of re-implementing your own supervisor. There may be tricky edge cases lurking.

Yes. There is nothing stopping a GenServer from starting other GenServers or processes that are not directly in the supervision tree. The important thing is that you use linked processes and that the top process in the tree is under a supervisor.

No, I would keep this as a worker.

What you need to think about is how things can fail. If one of the processes started in your GenServer dies, what would you like to happen? If you just start it from the GenServer, the GenServer will also die because they are linked, and, assuming the GenServer is in a supervision tree, it will get restarted.

I’d recommend that you prototype your initial design, then play around with various failure scenarios and make sure it behaves as you want and expect.

A few things to give you an idea:

  • The top GenServer dies and gets restarted. Do all the processes get started again?
  • A child process can’t be started in the GenServer’s init call. What happens?
  • A child process dies. What happens? Does everything recover? (See the sketch after this list.)
  • The 3rd-party service goes away for a longer period of time. What happens? Is this what you want?
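
For example, to poke at the third scenario from IEx, reusing the MyServer sketch from the question:

```elixir
{:ok, parent} = MyServer.start_link()
%{service1: child} = :sys.get_state(parent)

ref = Process.monitor(parent)
Process.exit(child, :kill)

# Since parent and child are linked (and the parent isn't trapping exits),
# the parent dies too; a supervisor above it would then restart everything.
receive do
  {:DOWN, ^ref, :process, ^parent, reason} -> {:parent_died, reason}
after
  1_000 -> :parent_survived
end
```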

I just want to add to this. If, in a supervisor, you start a child as a supervisor, then it must actually be a Supervisor behaviour, not just something which “happens” to supervise things. In a supervision tree the supervisors are expected to behave as supervisors when things happen, and this is something a GenServer won’t do[1]. In the same vein, all the processes in a supervision tree should be behaviours, as they are expected to behave in a certain way, which behaviours do[2].

  1. You can implement a “real” supervisor yourself, preferably on top of a GenServer. This is actually how the Supervisor behaviour is implemented.
  2. If you implement your own process without using a behaviour to use in a supervision tree it has to follow all the “rules”. This is actually not difficult and there is support which does most of the work.

Robert


When a process dies, its pid becomes invalid, but your GenServer still has it in its state

It’s not gonna happen, because the processes are linked. The problem you are talking about arises when you start the service processes under a supervisor and need a mechanism to always have the current pid of those services in the client. You can monitor them and take over the responsibility from the supervisor to start them again, just to know the new pid. Or, if it is possible, you can ask those processes to register themselves with your GenServer (or in a Registry) when starting up. This hassle is why it’s tempting to just link those services to my GenServer directly.

By having a separate supervisor and sending messages to named processes, you will drastically reduce the chance of getting out of sync.

In that case another problem comes up. There’s a short period of time when the registered name is not available. Processes can crash all around your system during that gap if they try to reference the process by its name. (Except with GenServer.cast, because that has changed recently and it no longer crashes when the name is not registered; but, for example, GenEvent.notify and GenServer.call crash. I believe the async versions, like GenEvent.notify, shouldn’t crash.) This is another reason for me to avoid named processes when possible. The linked version feels more secure to me, but less elegant.
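
The gap is easy to see in IEx with a name that isn’t (or isn’t yet re-) registered:

```elixir
# A call to a missing name exits the caller with :noproc:
GenServer.call(:not_yet_reregistered, :ping)

# A cast to the same missing name just returns :ok instead of crashing:
GenServer.cast(:not_yet_reregistered, :ping)
```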


When talking about two processes, it can never be fully guaranteed that one process always holds a reference to a live process. This is simply a consequence of the two processes running separately, possibly on different scheduler threads. Therefore, when one terminates, it takes some time, however short, for this information to propagate.

The same problem exists in the linked version. If the parent is trapping exits and is processing a message while the child crashes, the pid it holds is no longer valid.

In fact, the Registry information is likely to be updated more quickly. The reason is that in the “linked” setup (where the worker is also a manual supervisor), the parent is doing something else besides supervising, so it is more likely to be busy when a child crashes. In contrast, if the child is supervised by an OTP supervisor, the supervisor can restart it while the other GenServer (the former parent) is still busy.

The following scenario is possible with supervised children, but not with the manually supervised (linked) ones:

  1. GenServer A starts handling the message X
  2. At the same time GenServer B crashes
  3. Parent supervisor starts new GenServer B
  4. While handling the message X, GenServer A talks to the restarted GenServer B