Eprof sometimes claims it isn't started although it was just started (flaky tests)

:wave:

Hey everyone,

ever since the introduction of profiling into Benchee the test suite has experienced flakiness.

We basically call the built-in tasks like this Eprof task in Elixir.

Now, with 40 build configuration running, quite often we get a failure like this one:

  1) test hooks all kinds of hooks with inputs work (Benchee.ProfileTest)
Error:      test/benchee/profile_test.exs:164
     ** (exit) exited in: :gen_server.call(:eprof, {:profile_start, [], {:_, :_, :_}, {:erlang, :apply, [#Function<6.45063716/0 in Benchee.Benchmark.Runner.main_function/2>, []]}, [set_on_spawn: true]}, :infinity)
         ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
     code: retrying(fn ->
     stacktrace:
       (stdlib) gen_server.erl:223: :gen_server.call/3
Warning:        (mix) lib/mix/tasks/profile.eprof.ex:197: Mix.Tasks.Profile.Eprof.profile_and_analyse/2
       (mix) lib/mix/tasks/profile.eprof.ex:186: Mix.Tasks.Profile.Eprof.profile/2
Warning:        (benchee) lib/benchee/benchmark/collect/profile.ex:17: Benchee.Benchmark.Collect.Profile.collect/2
       (benchee) lib/benchee/benchmark/runner.ex:291: Benchee.Benchmark.Runner.collect/3
Warning:        (benchee) lib/benchee/benchmark/run_once.ex:15: Benchee.Benchmark.RunOnce.run/3
       (elixir) lib/enum.ex:783: Enum."-each/2-lists^foreach/1-0-"/2
Warning:        (elixir) lib/enum.ex:783: Enum.each/2
       (benchee) lib/benchee/profile.ex:59: Benchee.Profile.profile/2
Warning:        (ex_unit) lib/ex_unit/capture_io.ex:151: ExUnit.CaptureIO.do_capture_io/2
       (ex_unit) lib/ex_unit/capture_io.ex:121: ExUnit.CaptureIO.do_capture_io/3
Warning:        test/benchee/profile_test.exs:166: anonymous fn/0 in Benchee.ProfileTest."test hooks all kinds of hooks with inputs work"/1
       (benchee) test/support/test_helpers.ex:17: Benchee.TestHelpers.retrying/2
Warning:        test/benchee/profile_test.exs:165: (test)

Which is weird, as the elixir code explicitly starts eprof (yes even in 1.9.4, I double checked, those are lines 196-197):

    :eprof.start()
    :eprof.profile([], fun, Keyword.get(opts, :matching, {:_, :_, :_}))

Some thoughts:

  • Now you could assume cross test taint but the test file under question runs with async: false.
  • The return values of :eprof.start() are also basically “started correctly” or “I was already started, here is my pid” so I don’t think checking these would help Erlang -- eprof
  • It seems to happen mostly with eprof, but that might be because eprof is our default profiler.
  • rerunning a failed run with the same seed also works (I was worried that a specific order of tests might crash eprof or something but since we start it again :person_shrugging: )

Basically I’m at wits end. I now started retrying these tests (finally correctly :see_no_evil: ) but that also doesn’t feel right. I hope I’m just missing something stupid, which I’d be happy for y’all to tell me about :grinning:

If anyone had any idea why these keep failing I’d be happy to hear them and fix it for real :grin:

Thanks!
Tobi

PS: cc @pablocostass if you have any idea :green_heart:

It doesn’t seem likely, but what about a race between one test’s :eprof.stop actually deregistering the process and the next test’s :eprof.start checking for that registration?

That could result in :eprof.start returning {:error, {:already_started, PID}} with a PID that is already stopping. :thinking:

1 Like

(sorry for the late response)

That seems like a good idea. I had hoped that :eprof.stop() was sync/waited for all to finish but maybe it isn’t :thinking:

The docs say nothing about it being sync or async, however the type signature makes it look like it’d be sync stop() -> stopped

I’ll check it out, thanks!

Unless I’m missing something it all looks really sync to me although I’m not sure in which order exactly GenServer itself handles stuff…

stop()  -> gen_server:call(?MODULE, stop, infinity).

handle_call(stop, _FromTag, S) ->
    {stop, normal, stopped, S}.

My reading - stop() will return when the GenServer replies, which is done right after calling terminate:

but the name won’t be deregistered until the GenServer’s process exits.

1 Like

Interesting - thank you!

So, checking and waiting in case the process isn’t shut down at the end of the test might solve it - I’ll give that a shot.

Unless you got any other ideas which I’d be happy to hear.

Thank you very much :green_heart: