Unit tests are randomly timing out

Our team has some flaky unit tests that we’re really struggling to track down. Every once in a while a random unit test will time out. We’ve increased the depth of the stack trace in hopes of tracking it down, but it’s still all over the place. We’d love some help here :pray:

A lot of the time they end with :error_handler.undefined_function/3 but it’s always deep within library code like Phoenix or Absinthe, which seems super strange.

Here are some of the stack traces we’re seeing:

Hi @Billzabob! Seems like the tests are taking more than 1 min? Did you try the @tag thing the error message suggests?
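
For anyone finding this later, the “@tag thing” is ExUnit’s per-test timeout override. A minimal sketch (the module name and the sleep are just placeholders):

```elixir
defmodule MyApp.SlowTest do
  use ExUnit.Case, async: true

  # Per-test override of ExUnit's 60_000 ms default (value in ms; :infinity also works).
  @tag timeout: 120_000
  test "something that legitimately takes a while" do
    # Stand-in for the real slow work.
    Process.sleep(100)
    assert true
  end
end
```

The same option can be set for a whole module with @moduletag timeout: ..., or suite-wide via ExUnit.start(timeout: ...) in test_helper.exs.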

Hey @pdgonzalez872! Yes I did. They still time out, unfortunately :frowning:

I should also mention that this is mostly during our GitLab pipeline. When running them locally, the whole suite completes in about 6 seconds, and it’s been super rare to see this timeout issue, although it has happened once or twice. That makes it even harder to debug.

I don’t have the foggiest idea what this could be, but based on those stack traces I noticed an interesting thing.

Two of them point to code:ensure_loaded in error_handler.erl

AND

There’s this PR from José (released in OTP 26, according to the GitHub tags) that specifically mentions “contention on the code_server”:

That middle one in prim_inet.recv0 doesn’t fit that theory, tho :thinking:
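
Still, to make the code_server theory concrete: in dev/test the VM loads modules lazily, so the first call into a not-yet-loaded module goes :error_handler.undefined_function/3 -> code:ensure_loaded -> the code_server, and a bunch of async tests doing that at the same time all queue on that one process. A rough sketch of a way to take that out of the picture while you debug (the app names are placeholders, and this is a workaround/diagnostic, not a confirmed fix) would be to preload everything in test_helper.exs:

```elixir
# test_helper.exs - sketch only; replace :my_app (and friends) with your real apps.
# Preloading modules up front means tests shouldn't need to hit the code_server
# via :error_handler on first use.
for app <- [:my_app, :phoenix, :absinthe],
    module <- Application.spec(app, :modules) || [] do
  Code.ensure_loaded(module)
end

ExUnit.start()
```

If preloading makes the random timeouts go away, that would also be decent evidence that the OTP 26 change is the real fix.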


Unrelated to the above: is there anything unusual about the performance of the machines that CI is running on? I’ve seen weird stuff like this happen when running on “burstable” instances (when the CI instance runs out of CPU or IOPS), though you mention it also happening locally.


this still timed out

It timed out when set to infinity? Geez

There’s this PR from José (released in OTP 26, according to the GitHub tags) that specifically mentions “contention on the code_server”:

Nice pull! Would be awesome if this were the case!

I should also mention that this is mostly during our GitLab pipeline

As @al2o3cr said, those machines are usually a lot slower. Locally we have a lot more processing power, and these timeouts rarely (if ever) happen.

Sorry, this is an annoying problem. Could be a race condition somewhere as well. Do you have a Task.start/1 somewhere (or similar) that you don’t mock and that gets called directly?
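
To illustrate the kind of race I mean (all module/function names below are made up, just a sketch): a fire-and-forget Task.start/1 that nothing waits on usually wins the race on a fast local machine, but on a slow CI runner it can still be running when the test process exits, which tends to show up as random hangs instead of clean failures. Making the task observable and waiting on it removes the timing dependency:

```elixir
defmodule MyApp.Notifier do
  # Fire-and-forget: returns before the work has actually happened.
  def notify_async(user) do
    {:ok, pid} = Task.start(fn -> deliver(user) end)
    pid
  end

  defp deliver(_user), do: Process.sleep(50)
end

defmodule MyApp.NotifierTest do
  use ExUnit.Case, async: true

  test "waits for the background work instead of racing it" do
    pid = MyApp.Notifier.notify_async(%{email: "someone@example.com"})

    # Monitor the task and wait for it to finish so the test doesn't depend on
    # the task happening to complete before the test process (and anything it
    # owns, like a DB sandbox connection) goes away.
    ref = Process.monitor(pid)
    assert_receive {:DOWN, ^ref, :process, ^pid, _reason}, 1_000
  end
end
```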

This advice could be pointless, but what kind of GitLab pipeline is it? Shared runners or dedicated? Is the CI image close to how you execute tests locally? Could you pull the same Docker image and simulate the pipeline locally? That’s not always trivial, but my first thought is that CI may not be 1:1 with your local setup at the OS level, e.g. macOS locally vs. potentially Alpine with musl instead of glibc in CI. The fact that it also happens locally should ideally rule out the Docker image, but for something this hard to debug I’d be reaching into everything.