Debugging an unexpected process exit

zachallaun · June 1, 2023, 2:46pm

Hi friends,

I’m working on a contribution to Finch and am having a hell of a time debugging an unexpected process exit that’s causing a test to fail.

The code being tested is rather unfinished and very much a work-in-progress, but I wanted to have a baseline before I start refactoring and moving things around. When the test reaches the call to cancel the async request, the test fails (not immediately, but after a very small period), with the following message:

Things I’ve tried so far:

Re-creating the issue without using Finch (I couldn’t)
Running the code outside of the context of a test (it works as expected)
Creating the process with a Task supervisor (what’s currently in the code)
Creating the process with Task.async
Creating the process with spawn
Various exit trapping

I have a hunch this has something to do with ExUnit’s GenServer, but I’m really struggling to find more information and I’m at a loss for how to debug further.

al2o3cr · June 1, 2023, 4:24pm

The message looks a lot like the symptom from this issue:

github.com/PSPDFKit-labs/bypass

Crash on shutdown in `dispatch_awaiting_callers`

opened 04:19PM - 01 Dec 21 UTC

digitalcora

I noticed a flaky test failure with the reason `** (exit) shutdown`, and the fol…lowing output logged: ``` ** (stop) exited in: GenServer.stop(:normal, :normal, :infinity) ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started (elixir 1.12.1) lib/gen_server.ex:972: GenServer.stop/3 (bypass 2.1.0) lib/bypass/instance.ex:385: Bypass.Instance.dispatch_awaiting_callers/1 (bypass 2.1.0) lib/bypass/instance.ex:62: Bypass.Instance.handle_info/2 (stdlib 3.14) gen_server.erl:689: :gen_server.try_dispatch/4 (stdlib 3.14) gen_server.erl:765: :gen_server.handle_msg/6 (stdlib 3.14) proc_lib.erl:226: :proc_lib.init_p_do_apply/3 ``` Looking at the referenced code, it seems `GenServer.stop/3` is [not being called correctly](https://github.com/PSPDFKit-labs/bypass/blob/d1c445a/lib/bypass/instance.ex#L385): the first argument is supposed to be a reference to a server, but the Bypass code is `GenServer.stop(:normal)`. So if there are any `callers_awaiting_exit`, this code will always crash.

This could be happening when the Task terminates while still waiting for a reply from Bypass, causing the Bypass process to crash and taking the test process down along with it.

You could try trapping exits in the test and calling receive explicitly to see what’s bubbling up.

Another thing to try: add a sleep for longer than the sleep in the Bypass.expect block to the very end of the test, so that the Bypass process is definitely finished before the test ends.

zachallaun · June 1, 2023, 5:22pm

That looks really promising. I cloned Bypass and changed the dep to a local path-based one and then tried fixing the line referenced in that issue and removing it altogether, both to no avail. It does seem likely that it’s taking down the Bypass server somehow.

hkrutzer · June 1, 2023, 5:27pm

Maybe sys debug can help? Debug tracing · Elixir Recipes

zachallaun · June 1, 2023, 7:38pm

That was a good idea.

......................................self: #PID<0.1343.0>
bypass: #PID<0.1344.0>
async req pid: #PID<0.1360.0>
finch: #PID<0.1357.0>
*DBG* <0.1344.0> got {'DOWN',#Ref<0.1067949558.531890179.24259>,process,
                             <0.1363.0>,shutdown}
*DBG* <0.1344.0> new state #{callers_awaiting_down => [],
                             callers_awaiting_exit => [],
                             expectations =>
                                 #{{any,any} =>
                                       #{expected => once_or_more,
                                         'fun' =>
                                             #Fun<Elixir.FinchTest.1.116636333>,
                                         path_parts => [],request_count => 1,
                                         results =>
                                             [{exit,{exit,shutdown,[]}}],
                                         retained_plugs => #{}}},
                             monitors => #{},pass => false,port => 43377,
                             ref => #Ref<0.1067949558.531890179.24222>,
                             socket => #Port<0.203>,
                             unknown_route_error => nil}
*DBG* <0.1344.0> got call on_exit from <0.1364.0>
*DBG* <0.1344.0> sent {exit,{exit,shutdown,[]}} to <0.1364.0>, new state #{callers_awaiting_down =>
                                                                            [],
                                                                           callers_awaiting_exit =>
                                                                            [],
                                                                           expectations =>
                                                                            #{{any,
                                                                               any} =>
                                                                               #{expected =>
                                                                                  once_or_more,
                                                                                 'fun' =>
                                                                                  #Fun<Elixir.FinchTest.1.116636333>,
                                                                                 path_parts =>
                                                                                  [],
                                                                                 request_count =>
                                                                                  1,
                                                                                 results =>
                                                                                  [{exit,
                                                                                    {exit,
                                                                                     shutdown,
                                                                                     []}}],
                                                                                 retained_plugs =>
                                                                                  #{}}},
                                                                           monitors =>
                                                                            #{},
                                                                           pass =>
                                                                            false,
                                                                           port =>
                                                                            43377,
                                                                           ref =>
                                                                            #Ref<0.1067949558.531890179.24222>,
                                                                           socket =>
                                                                            nil,
                                                                           unknown_route_error =>
                                                                            nil}


  1) test async_request/3 with HTTP/1 can be canceled with cancel_async_request/2 (FinchTest)
     test/finch_test.exs:771
     ** (exit) shutdown

................................
Finished in 7.3 seconds (5.0s async, 2.2s sync)
71 tests, 1 failure

Randomized with seed 853761

So it looks like Bypass is sending an exit shutdown message to what I presume to be the test server. I’m not entirely sure why or what I can do to prevent it.

Edit: Looking at the code, it’s Bypass that’s exiting here.

Edit 2: And, indeed, if I modify Bypass to not exit at that line, the test works.

pcruz · January 19, 2024, 11:32am

I know it’s a bit late, but I have also encountered this problem, and I was able to resolve it by calling the Bypass.down function before the end of the test:

Bypass.down(bypass)

I hope this helps someone in the future.

RudManusachi · March 1, 2024, 11:15pm

Bypass.down(bypass) for us didn’t work.
However, what works is “overriding” the on_exit for the bypass so that it doesn’t try to verify the expectations.

on_exit({Bypass, bypass.pid}, fn -> :ok end)

Some notes on my investigation:

In order to reproduce the issue we could do the following:

bypass = Bypass.open()

Bypass.stub(bypass, "GET", "/", fn conn ->
  # suppose the response is slow
  Process.sleep(1000)
end)

Task.start(fn -> Req.get!("http://localhost:#{bypass.port}") end)
# give it a time to be sure that Task makes the request prior test to finish
Process.sleep(100)

Basically the idea is to simulate that in the code, despite being asynchronous, actually makes the request that reaches out Bypass server… But the test exits prior to the Bypass sending the response and. And since the test exists it shuts down Bypass!

However, Bypass.open() prematurely sets on_exit callback where it makes a synchronous GenServer.call to Bypass to verify all its expectations. And the state of Bypass at that moment is erroneous :exit, :shutdown that’s being raised.

Knowing those details, I didn’t really want to comment that line in the Bypass source, since I think it’s still valuable for Bypass.expect (when the request was made, but test finishes prior to Bypass responding… though the error could have been better, but in that case the developer might want explicitly synchronize the test with Bypass maybe by some message passing and assert_receive)

I made this patch but I thing it’s not as reliable as overriding on_exit callback… since for Bypass.stub there should be no verification of expectations… at lieast not on the number of requests