Testing a bunch of related processes brings random test fails

So I have something like that:

SystemManager (GenServer)
  RegistrationsSupervisor
     children: [Registration (GenServer), Registration (GenServer) ... ]

It’s a top level worker SystemManager, that starts new registration whenever user visits registration page. Such registration is supervised and contains user’s progress during registration (it’s a wizard).

This set up works pretty well so far, however, I get random test failures. Usually it’s all green, but I get:

  1) test should initialize a registration by spawning a new registration and return a token (HC.Core.SystemManagerTest)
     test/core/system_manager_test.exs:12
     ** (EXIT from #PID<0.222.0>) an exception was raised:
         ** (MatchError) no match of right hand side value: {:error, {:already_started, #PID<0.217.0>}}
             (core) lib/core/registrations_supervisor.ex:9: HC.Core.RegistrationsSupervisor.start_link/0
             (core) lib/core/system_manager.ex:17: HC.Core.SystemManager.init/1
             (stdlib) gen_server.erl:328: :gen_server.init_it/6
             (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3

I don’t load the application for running ExUnit at all, i.e. I added this to my mix.exs:

aliases: [test: "test --no-start"]

This prevents the :core application from booting. I start the SystemManager manually in setup block of the test suite:

{:ok, _pid} = SystemManager.start_link() 

SystemManager starts RegistrationsSupervisor in it’s init callback.

Okay, so there is a race condition in such set up. I think the problem is that whenever a test case process terminates, it brings down all linked processes including my SystemManager. I want that, it’s cool. Next, the situation happens again, that the SystemManager when terminates brings down all linked processes including RegistrationSupervisor. I also want that, I want to clean up the whole process tree I started in setup.

But the whole killing children thing (sic!) is happening in parallel, i.e. it’s asynchronous. Next test already starts and tries to start SystemManager, and it tries to start it’s own instance of RegistrationsSupervisor from it’s init, and it can happen before the cleanup from previous test case finished.

I am currently dealing with this by using a :timer.sleep(10) in my setup blocks. But maybe there is a way in Erlang/Elixir to wait for the process and all it’s children to be fully terminated instead?

If not, any tips on how people test such scenarios?

A solution is to stop the process synchronously using GenServer.stop at the end of the test using the on_exit hook.

Hmm I actually tried that first and I think it’d only work if I propagated the GenServer.stop from the top level processes to all children, and children of children - right?

Ah, that’s the issue related to having a supervisor started inside a gen server. Terminating a supervisor will synchronously terminate all the children. A gen server won’t do that by default.

This is one of the reasons why it is generally advised that if a process has children that trap exists (as supervisors do), it should itself trap exits itself and handle shutdown gracefully. Otherwise, issues like this may arise.

EDIT: this test failure only highlights a race condition bug, that would probably present itself in production in case the SystemManager process were to be restarted by its supervisor.

Ah yes, you are obviously right.

I am trying something new with this one - i.e. been having flat structures so far. This time I want to go multiple levels with the app. This is very much helpful, thank you so much!

Okay, I gave this another shot. I compiled a simple synthetic example that illustrates the problem I am having:

This actually works as I want when I uncomment the line 29, i.e. call GenServer.stop(ChildSupervisor) from within the ParentServer's terminate/2 function.

So it seems the solution is not only to Process.flag(:trap_exit, true) in both of my GenServers - this does ensure the terminate/2 function gets called properly, but also I need to manually stop the processes I started in my init function, i.e. cleanly shutdown the ChildSupervisor.

Does this look about right way to build supervision trees, where I mix GenServers with Supervisors and have the need to shut down/restart them as I please?

@sasajuric can I pick your brains on the above? Does the gist look about right?

EDIT: I am getting the feeling that what I do is completely wrong. And I should not have GenServers that are parents of Supervisors. Supervisors would have parents that are Supervisors only, and all of my GenServers would be leafs in the tree.

Ok, after reading some more, and looking at some examples of projects that use supervision trees, it does look indeed a lot like I do want my GenServers to only be leafs in the tree. It’s possible to not do it like that, but it’s tedious if you have to trap exit and also there can be other issues like when terminate/2 is not being called etc. Also, a single error in a GenServer that has children, would take down the whole tree.

I think the initial design is flawed, which results in problems I had. I have re-shaped the supervision tree in a form that avoids GenServers to have supervised children, and I’m pretty happy with how it turns out so far.

2 Likes

I was on a trip to Empex, so didn’t have the time to look into this. I see you pretty much figured it out, but here’s my comment anyway.

The root cause of your issue is that when a process terminates, it is not guaranteed that the linked processes are immediately taken down. If you want to ensure that children are taken down before the parent stops, you should trap exits, and stop the children from the parent’s terminate callback.

This is precisely what OTP supervisor does, which is why I usually say that in most cases a parent process should be an OTP supervisor, while only leafs should be workers.

There are some cases where it’s better to have a worker parenting other workers, but those are IME rare exceptions. If you do go down that road, you need to do some work of the supervisor, which includes trapping exits, and terminating your child processes. Otherwise, you might miss some edge cases, such as this one where the parent is down, but its children are still alive, even if only for a short amount of time.

1 Like

Thank you so much for confirming that. It is very helpful.