Another race condition in ExUnit? Need help with finding it!

ColmB · May 29, 2024, 8:28am

We are upgrading our Elixir services from 14 to 16. We are seeing an issue in one service that seems very similar to https://github.com/elixir-lang/elixir/issues/12692 , though obviously that issue is fixed in 16. The key difference is that we are not seeing any ExUnit errors being logged.

The symptoms are basically the same - in some circumstances we see failed tests where the capture log is empty, for example …

 1) test Data In roster upload completed event no notification is published when the recipients list is empty (NotificationWorkers.RosterEventsTest)

     apps/notification_workers/test/notification_workers/roster_events_test.exs:384
     Expected truthy, got false
     code: assert String.contains?(
             log,
             "No roster notification published, no recipients were provided for roster: 99"
           )
     arguments:
         # 1
         ""
         # 2
         "No roster notification published, no recipients were provided for roster: 99"
     stacktrace:
       test/notification_workers/roster_events_test.exs:401: (test)

This happens consistently in our CI Jenkins environment and can be forced to happen in a local docker run by increasing --max-cases. Limiting max-cases to 4 (actually looks like only 1 will guarantee a pass) allows the CI run to pass.

With no error showing in the logs it is difficult to know where to start. Does anyone have suggestions for how to surface an error, or just further debug this?

ColmB · May 30, 2024, 3:18pm

Figured it out, and it isn’t a race condition, apologies for the title!

I eventually realised that the max-cases thing was a red herring, the tests were failing or passing randomly regardless of that setting. I tried --seed 0 and sure enough all tests passed consistently in CI. Grabbing the seed from failing tests I narrowed it down (over many hours of CI runs!) to this test:

    test "should handle >0 enrollments" do
      Application.put_env(:notification, :scanner_impl, MoreTestScanner)
      MockTopicPublisher
      |> expect(:publish_message, 3, fn _, _ -> {:ok} end)
      log = capture_log([level: :info], fn ->
        Logger.configure(level: :error)
        EndPublisher.scan_and_publish()
        Logger.configure(level: :error)
      end)
      assert String.contains?(log, "Found 3 participants")
    end

This is an old test and of course the person responsible is long gone so I can’t ask them what the thinking was here. I suspect that there was some weirdness around the capture_log level setting at the time and the two configures were attempting to get around that. Anyways, it is specifcally the Logger.configure(level: :error) that causes the loss of the capture server. I can place that line inside a capture_log in the module that has the actual failing test above and anything after that test will have no captured logs.

Adzz · May 31, 2024, 6:01pm

Application is essentially global mutable state, so updating it in an async test is not really recommended. If you had two tests running at the same time that both set the log level they could fight with each other which could lead to this.

dimitarvp · June 10, 2024, 8:31pm

Well, it can’t have been very good – sorry for not being diplomatic or generous here.

In such cases it’s best to refactor the code having this implementation being passed as a function argument and just be done with it. If it’s so disruptive and annoying (say, used in dozens of places) then you can always introduce a wrapper that puts the implementation in a process dictionary; that’s still bad but nowhere near as catastrophic as just overwriting global application config for just one test.

So whatever the original dev thought – disregard it. Make it right.

ColmB · June 11, 2024, 9:47am

All the tests in this service are actually async:false, but I take the general point.

What is interesting is that this worked prior to the upgrade. Playing around with a pre-upgrade checkout, it looks like this was because the order the test modules were run in didn’t change - and it just so happened that the module that had the Logger.configure change in it always came last. Since the upgrade, the order of the modules is now random and so sometimes the module that changes the Logger config is happening before the other module that also captures logs. This may be related to the big underlying change in this upgrade whereby map order is actually random rather than theoretically random.

Anyways, it is great to have understood what was happening and to be able to get back to finishing the upgrade

dimitarvp · June 11, 2024, 5:10pm

Make sure this is the assumption you are operating under (taken from ExUnit docs):

Note that we pass “async: true”, this runs the test case
concurrently with other test cases. The individual tests
within each test case are still run serially.

It seems that you do, I am mostly making sure your expectations are aligned.

But TBF Elixir’s Logger is a bit finicky, lately I had to contend with modernizing a hobby project and it definitely took me way longer than it should have. A lot of stuff is not obvious and that’s very closely related to Elixir trying to follow Erlang’s :logger closely which sadly led to a lot of controversial info. Generating a proper Logger configuration for the right Elixir version is probably a prime candidate for ChatGPT.