Eprof sometimes claims it isn't started although it was just started (flaky tests)

:wave:

Hey everyone,

ever since the introduction of profiling into Benchee the test suite has experienced flakiness.

We basically call the built-in tasks like this Eprof task in Elixir.

Now, with 40 build configuration running, quite often we get a failure like this one:

  1) test hooks all kinds of hooks with inputs work (Benchee.ProfileTest)
Error:      test/benchee/profile_test.exs:164
     ** (exit) exited in: :gen_server.call(:eprof, {:profile_start, [], {:_, :_, :_}, {:erlang, :apply, [#Function<6.45063716/0 in Benchee.Benchmark.Runner.main_function/2>, []]}, [set_on_spawn: true]}, :infinity)
         ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
     code: retrying(fn ->
     stacktrace:
       (stdlib) gen_server.erl:223: :gen_server.call/3
Warning:        (mix) lib/mix/tasks/profile.eprof.ex:197: Mix.Tasks.Profile.Eprof.profile_and_analyse/2
       (mix) lib/mix/tasks/profile.eprof.ex:186: Mix.Tasks.Profile.Eprof.profile/2
Warning:        (benchee) lib/benchee/benchmark/collect/profile.ex:17: Benchee.Benchmark.Collect.Profile.collect/2
       (benchee) lib/benchee/benchmark/runner.ex:291: Benchee.Benchmark.Runner.collect/3
Warning:        (benchee) lib/benchee/benchmark/run_once.ex:15: Benchee.Benchmark.RunOnce.run/3
       (elixir) lib/enum.ex:783: Enum."-each/2-lists^foreach/1-0-"/2
Warning:        (elixir) lib/enum.ex:783: Enum.each/2
       (benchee) lib/benchee/profile.ex:59: Benchee.Profile.profile/2
Warning:        (ex_unit) lib/ex_unit/capture_io.ex:151: ExUnit.CaptureIO.do_capture_io/2
       (ex_unit) lib/ex_unit/capture_io.ex:121: ExUnit.CaptureIO.do_capture_io/3
Warning:        test/benchee/profile_test.exs:166: anonymous fn/0 in Benchee.ProfileTest."test hooks all kinds of hooks with inputs work"/1
       (benchee) test/support/test_helpers.ex:17: Benchee.TestHelpers.retrying/2
Warning:        test/benchee/profile_test.exs:165: (test)

Which is weird, as the elixir code explicitly starts eprof (yes even in 1.9.4, I double checked, those are lines 196-197):

    :eprof.start()
    :eprof.profile([], fun, Keyword.get(opts, :matching, {:_, :_, :_}))

Some thoughts:

  • Now you could assume cross test taint but the test file under question runs with async: false.
  • The return values of :eprof.start() are also basically “started correctly” or “I was already started, here is my pid” so I don’t think checking these would help Erlang -- eprof
  • It seems to happen mostly with eprof, but that might be because eprof is our default profiler.
  • rerunning a failed run with the same seed also works (I was worried that a specific order of tests might crash eprof or something but since we start it again :person_shrugging: )

Basically I’m at wits end. I now started retrying these tests (finally correctly :see_no_evil: ) but that also doesn’t feel right. I hope I’m just missing something stupid, which I’d be happy for y’all to tell me about :grinning:

If anyone had any idea why these keep failing I’d be happy to hear them and fix it for real :grin:

Thanks!
Tobi

PS: cc @pablocostass if you have any idea :green_heart:

It doesn’t seem likely, but what about a race between one test’s :eprof.stop actually deregistering the process and the next test’s :eprof.start checking for that registration?

That could result in :eprof.start returning {:error, {:already_started, PID}} with a PID that is already stopping. :thinking: