Eprof sometimes claims it isn't started although it was just started (flaky tests)


Hey everyone,

ever since the introduction of profiling into Benchee the test suite has experienced flakiness.

We basically call the built-in tasks like this Eprof task in Elixir.

Now, with 40 build configuration running, quite often we get a failure like this one:

  1) test hooks all kinds of hooks with inputs work (Benchee.ProfileTest)
Error:      test/benchee/profile_test.exs:164
     ** (exit) exited in: :gen_server.call(:eprof, {:profile_start, [], {:_, :_, :_}, {:erlang, :apply, [#Function<6.45063716/0 in Benchee.Benchmark.Runner.main_function/2>, []]}, [set_on_spawn: true]}, :infinity)
         ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
     code: retrying(fn ->
       (stdlib) gen_server.erl:223: :gen_server.call/3
Warning:        (mix) lib/mix/tasks/profile.eprof.ex:197: Mix.Tasks.Profile.Eprof.profile_and_analyse/2
       (mix) lib/mix/tasks/profile.eprof.ex:186: Mix.Tasks.Profile.Eprof.profile/2
Warning:        (benchee) lib/benchee/benchmark/collect/profile.ex:17: Benchee.Benchmark.Collect.Profile.collect/2
       (benchee) lib/benchee/benchmark/runner.ex:291: Benchee.Benchmark.Runner.collect/3
Warning:        (benchee) lib/benchee/benchmark/run_once.ex:15: Benchee.Benchmark.RunOnce.run/3
       (elixir) lib/enum.ex:783: Enum."-each/2-lists^foreach/1-0-"/2
Warning:        (elixir) lib/enum.ex:783: Enum.each/2
       (benchee) lib/benchee/profile.ex:59: Benchee.Profile.profile/2
Warning:        (ex_unit) lib/ex_unit/capture_io.ex:151: ExUnit.CaptureIO.do_capture_io/2
       (ex_unit) lib/ex_unit/capture_io.ex:121: ExUnit.CaptureIO.do_capture_io/3
Warning:        test/benchee/profile_test.exs:166: anonymous fn/0 in Benchee.ProfileTest."test hooks all kinds of hooks with inputs work"/1
       (benchee) test/support/test_helpers.ex:17: Benchee.TestHelpers.retrying/2
Warning:        test/benchee/profile_test.exs:165: (test)

Which is weird, as the elixir code explicitly starts eprof (yes even in 1.9.4, I double checked, those are lines 196-197):

    :eprof.profile([], fun, Keyword.get(opts, :matching, {:_, :_, :_}))

Some thoughts:

  • Now you could assume cross test taint but the test file under question runs with async: false.
  • The return values of :eprof.start() are also basically “started correctly” or “I was already started, here is my pid” so I don’t think checking these would help Erlang -- eprof
  • It seems to happen mostly with eprof, but that might be because eprof is our default profiler.
  • rerunning a failed run with the same seed also works (I was worried that a specific order of tests might crash eprof or something but since we start it again :person_shrugging: )

Basically I’m at wits end. I now started retrying these tests (finally correctly :see_no_evil: ) but that also doesn’t feel right. I hope I’m just missing something stupid, which I’d be happy for y’all to tell me about :grinning:

If anyone had any idea why these keep failing I’d be happy to hear them and fix it for real :grin:


PS: cc @pablocostass if you have any idea :green_heart:

It doesn’t seem likely, but what about a race between one test’s :eprof.stop actually deregistering the process and the next test’s :eprof.start checking for that registration?

That could result in :eprof.start returning {:error, {:already_started, PID}} with a PID that is already stopping. :thinking: