Eprof sometimes claims it isn't started although it was just started (flaky tests)

PragTob · June 19, 2022, 7:22am

Hey everyone,

ever since the introduction of profiling into Benchee the test suite has experienced flakiness.

We basically call the built-in tasks like this Eprof task in Elixir.

Now, with 40 build configuration running, quite often we get a failure like this one:

  1) test hooks all kinds of hooks with inputs work (Benchee.ProfileTest)
Error:      test/benchee/profile_test.exs:164
     ** (exit) exited in: :gen_server.call(:eprof, {:profile_start, [], {:_, :_, :_}, {:erlang, :apply, [#Function<6.45063716/0 in Benchee.Benchmark.Runner.main_function/2>, []]}, [set_on_spawn: true]}, :infinity)
         ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
     code: retrying(fn ->
     stacktrace:
       (stdlib) gen_server.erl:223: :gen_server.call/3
Warning:        (mix) lib/mix/tasks/profile.eprof.ex:197: Mix.Tasks.Profile.Eprof.profile_and_analyse/2
       (mix) lib/mix/tasks/profile.eprof.ex:186: Mix.Tasks.Profile.Eprof.profile/2
Warning:        (benchee) lib/benchee/benchmark/collect/profile.ex:17: Benchee.Benchmark.Collect.Profile.collect/2
       (benchee) lib/benchee/benchmark/runner.ex:291: Benchee.Benchmark.Runner.collect/3
Warning:        (benchee) lib/benchee/benchmark/run_once.ex:15: Benchee.Benchmark.RunOnce.run/3
       (elixir) lib/enum.ex:783: Enum."-each/2-lists^foreach/1-0-"/2
Warning:        (elixir) lib/enum.ex:783: Enum.each/2
       (benchee) lib/benchee/profile.ex:59: Benchee.Profile.profile/2
Warning:        (ex_unit) lib/ex_unit/capture_io.ex:151: ExUnit.CaptureIO.do_capture_io/2
       (ex_unit) lib/ex_unit/capture_io.ex:121: ExUnit.CaptureIO.do_capture_io/3
Warning:        test/benchee/profile_test.exs:166: anonymous fn/0 in Benchee.ProfileTest."test hooks all kinds of hooks with inputs work"/1
       (benchee) test/support/test_helpers.ex:17: Benchee.TestHelpers.retrying/2
Warning:        test/benchee/profile_test.exs:165: (test)

Which is weird, as the elixir code explicitly starts eprof (yes even in 1.9.4, I double checked, those are lines 196-197):

    :eprof.start()
    :eprof.profile([], fun, Keyword.get(opts, :matching, {:_, :_, :_}))

Some thoughts:

Now you could assume cross test taint but the test file under question runs with async: false.
The return values of :eprof.start() are also basically “started correctly” or “I was already started, here is my pid” so I don’t think checking these would help Erlang -- eprof
It seems to happen mostly with eprof, but that might be because eprof is our default profiler.
rerunning a failed run with the same seed also works (I was worried that a specific order of tests might crash eprof or something but since we start it again )

Basically I’m at wits end. I now started retrying these tests (finally correctly ) but that also doesn’t feel right. I hope I’m just missing something stupid, which I’d be happy for y’all to tell me about

If anyone had any idea why these keep failing I’d be happy to hear them and fix it for real

Thanks!
Tobi

PS: cc @pablocostass if you have any idea

al2o3cr · June 20, 2022, 4:05am

It doesn’t seem likely, but what about a race between one test’s :eprof.stop actually deregistering the process and the next test’s :eprof.start checking for that registration?

That could result in :eprof.start returning {:error, {:already_started, PID}} with a PID that is already stopping.

PragTob · July 19, 2022, 12:08pm

(sorry for the late response)

That seems like a good idea. I had hoped that :eprof.stop() was sync/waited for all to finish but maybe it isn’t

The docs say nothing about it being sync or async, however the type signature makes it look like it’d be sync stop() -> stopped

I’ll check it out, thanks!

PragTob · July 19, 2022, 12:33pm

Unless I’m missing something it all looks really sync to me although I’m not sure in which order exactly GenServer itself handles stuff…

stop()  -> gen_server:call(?MODULE, stop, infinity).

handle_call(stop, _FromTag, S) ->
    {stop, normal, stopped, S}.

al2o3cr · July 19, 2022, 3:22pm

My reading - stop() will return when the GenServer replies, which is done right after calling terminate:

github.com

erlang/otp/blob/09c601fa2183d4c545791ebcd68f869a5ab912a4/lib/stdlib/src/gen_server.erl#L1191-L1196


      
          	{ok, {stop, Reason, Reply, NState}} ->
          	    try
          		terminate(Reason, ?STACKTRACE(), Name, From, Msg, Mod, NState, [])
          	    after
          		reply(From, Reply)
          	    end;

but the name won’t be deregistered until the GenServer’s process exits.

PragTob · July 19, 2022, 5:26pm

Interesting - thank you!

So, checking and waiting in case the process isn’t shut down at the end of the test might solve it - I’ll give that a shot.

Unless you got any other ideas which I’d be happy to hear.

Thank you very much