DynamicSupervisor stop issue

sbuttgereit · November 28, 2022, 11:52pm

OK, I’ve come across an issue that’s got me stumped and I’ve not been able to find anything that would suggest why I’m seeing the behavior I’m seeing. The issue here is that I’m seeing DynamicSupervisor.stop/3 fail under a specific integration test scenario that I have. The thing that’s really throwing me off is that I can resolve it with IO.inspect/2 (more on this in a minute).

The function that’s failing looks like this (because it is this):

defp stop_application_supervisor(application_name, opts) do
    opts =
      MscmpSystUtils.resolve_options(opts, supervisor_shutdown_timeout: 60_000)

    DynamicSupervisor.stop(
      {:via, Registry, {@registry, {:application_supervisor, application_name}}},
      :normal,
      opts[:supervisor_shutdown_timeout]
    )
  end

A couple notes, “application” here does not refer to an OTP application; the project this is a part of can have one or more “applications” running which are just a collection of services (database connectivity, genservers) which work together. I put the services under a DynamicSupervisor specific to each of these applications since it reasonably aligns with a service/failure domain. This function is in an Elixir project that is basically a library for my broader project. This library has its own tests and the code in this function is exercised by the test and it works as designed in that context. I’ve run this test a large number of times to test for intermittent failures and found none. So far so good.

I’ve made that library a dependency of a higher level Elixir project and am writing integration tests there. In this case, the test code is almost cut and paste from the library test code. But in the higher level application this call is failing with:

test/integration_test.exs:242
     ** (exit) exited in: GenServer.stop(
           {:via, Registry, {MscmpSystInstance.Registry, {:application_supervisor, "test_app"}}}, 
           :normal, 60000)
         ** (EXIT) exited in: :sys.terminate(#PID<0.1030.0>, :normal, 60000)
             ** (EXIT) shutdown

OK Something to troubleshoot and this is where the confusion starts to crop up and thus this topic posting. To start troubleshooting I just wanted to get a sense that the data was looking correct so I threw in an IO.inspect/2 into that original function:

defp stop_application_supervisor(application_name, opts) do
    opts =
      MscmpSystUtils.resolve_options(opts, supervisor_shutdown_timeout: 60_000) |> IO.inspect()

    DynamicSupervisor.stop(
      {:via, Registry, {@registry, {:application_supervisor, application_name}}},
      :normal,
      opts[:supervisor_shutdown_timeout]
    )
  end

and then I run the test again so see the failure along with the parameters and options… and the test works as expected. Maybe this is intermittent… I run a bunch of times with the inspect call there and a bunch without. Every time I don’t have the inspect call my higher level project integration test fails… every time I have it there the test succeeds. And so the questions here are what am I missing and why is IO.inspect/2 making this kind of difference?

To the best of my (limited) knowledge IO.inspect/2 shouldn’t be changing the outcome at all. The best I can guess at is that there’s some sort of timing thing going on; a DynamicSupervisor will be a different process to that from which I’m calling stop after all, though I would expect the DynamicSupervisor.stop/3 function to be synchronous with the shutdown of the DynamicSupervisor process at least before determining a return value.

The only other bits I can think to add without throwing out more code is that tests, both those that regularly succeed and fail, do call the function in question in a loop for multiple of these “application” supervisors; naturally there is other code being called, too, given that the function I’ve pasted above is private.

Thanks in advance for anyone taking a look at this.

Steve

al2o3cr · November 29, 2022, 12:12am

One way I’ve seen flaky, timing-sensitive tests happen is when there’s a race between the process registering its name and an attempt to shut it down, BUT that doesn’t appear to be the case based on the backtrace including :sys.terminate with a PID:

github.com

erlang/otp/blob/f8c6b3ed3cc40ce4698324bb4e2bb0a046543e10/lib/stdlib/src/proc_lib.erl#L1078-L1082


      
          stop(Process, Reason, Timeout) ->
              Mref = erlang:monitor(process, Process),
              T0 = erlang:monotonic_time(millisecond),
              RemainingTimeout = try
          	sys:terminate(Process, Reason, Timeout)

The exit reason of :shutdown is also odd; does anything in your application exit with that reason?

A thing that does use that reason is the max_restarts mechanism in Supervisors…

sbuttgereit · November 29, 2022, 2:14pm

Thanks for your thoughts on this.

Unfortunately not. Tracing through the code a bit more there are additional supervisors being shutdown, including through to Ecto Repos, but all of them including in Ecto are all shutdown with reason :normal.

The Registry calls were a good idea. As you point out, I don’t think they’re contributing in this case, but it seems like another place where indeterminate timing could sneak into the picture.

With a clearer mind this morning, I was able to show that all IO.inspect/2 was doing was providing extra time. I replaced the inspect call with: Process.sleep(1) and was able to duplicate my results. Whenever the sleep is not in that function 100% failure with the shutdown exit, when the sleep is there, 100% success. I guess I’m just surprised that if this is a race condition with such narrow timing being the differentiating factor, that I’m seeing such regularity in the results (I would expect to see some intermittency in results); and that essentially the same test in a different spot succeeds 100% percent of time without the extra time inserted.