Hello everyone,
I have run into a strange bottleneck when running tests. The execution time of specific tests scaled with the number of schedulers online: each additional scheduler added about 100ms to the execution time.
## CI Setup
I’m running CI tests with a GitLab runner on Amazon EC2 c6g instances; the runner has 32 CPUs.
## Local Setup
8 CPUs, Mac M1.
## Test Details
I have 2 partitions for tests, each with about 3.5k tests, run with the command:

    MIX_TEST_PARTITION=1 elixir --erl "+S 8:8" -S mix test --partitions 2
The unstable test uses the test_server library to open a socket on a random port with plug_cowboy. There are no heavy computations inside the mocked request; it just responds with a hardcoded small JSON map. Two subsequent requests are made.
From the client side, I’m using Finch with Mint, with a pool size of 10.
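For reference, a pool of that size is configured through Finch's child spec, roughly like this (the `MyApp.Finch` name is illustrative, not the actual production config; Finch uses Mint for its connections by default):

```elixir
# Minimal Finch setup sketch: one pool of 10 connections per host.
children = [
  {Finch, name: MyApp.Finch, pools: %{default: [size: 10, count: 1]}}
]

Supervisor.start_link(children, strategy: :one_for_one)
```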
## Test Running Time
Below are the execution times for the unstable test when running locally on 8 CPUs:
| Run          | Time  |
|--------------|-------|
| Single       | 20ms  |
| During suite | 800ms |
I then experimented with the number of schedulers:
| Schedulers online | Time   |
|-------------------|--------|
| 1                 | ~100ms |
| 2                 | ~200ms |
| 3                 | ~300ms |
| 8                 | ~800ms |
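The scheduler counts above can be varied with the BEAM's `+S Schedulers:SchedulersOnline` emulator flag (shown here as an assumption about how the experiment was run):

```shell
# 8 schedulers created, 1 online
MIX_TEST_PARTITION=1 elixir --erl "+S 8:1" -S mix test --partitions 2
# 8 schedulers created, 2 online
MIX_TEST_PARTITION=1 elixir --erl "+S 8:2" -S mix test --partitions 2
```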
## Telemetry
To find the cause, I attached telemetry handlers to finch and cowboy:
    finch.queue.stop    duration: 388ms  : idle time: 0ms
    finch.connect.stop  duration: 982ms
    finch.send.stop     duration: 395ms  : idle time: 0ms
    cowboy.request.stop duration: 1916ms : req_body_duration: 0ms : resp_duration: 0ms
    finch.recv.stop     duration: 3248ms : idle time: 0ms
    finch.request.stop  duration: 5608ms
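For completeness, the measurements above can be captured with a single handler over Finch's and cowboy_telemetry's `:stop` events; this is a sketch of roughly what I attached (the handler id is arbitrary):

```elixir
# Attach one handler to the client- and server-side spans measured above
# and print each span's duration, converted from native time units to ms.
events = [
  [:finch, :queue, :stop],
  [:finch, :connect, :stop],
  [:finch, :send, :stop],
  [:finch, :recv, :stop],
  [:finch, :request, :stop],
  [:cowboy, :request, :stop]
]

:telemetry.attach_many(
  "test-timing-handler",
  events,
  fn event, measurements, _metadata, _config ->
    ms = System.convert_time_unit(measurements.duration, :native, :millisecond)
    IO.puts("#{Enum.join(event, ".")} duration: #{ms}ms")
  end,
  nil
)
```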
Given that no work is done inside dispatching except Plug.Conn.send_resp/3, this seemed very strange.
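To illustrate how little the server side does, the mocked endpoint amounts to something like this plug (module name and JSON body are illustrative, not the actual test code):

```elixir
# Sketch of the mocked request handler: no computation,
# just a hardcoded small JSON map sent back to the client.
defmodule MockEndpoint do
  import Plug.Conn

  def init(opts), do: opts

  def call(conn, _opts) do
    conn
    |> put_resp_content_type("application/json")
    |> send_resp(200, ~s({"status":"ok"}))
  end
end
```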
## Running fprof
    { suspend, 70, 680.139, 0.000},   %
    [ ]}.

    {[{{error_handler,undefined_function,3},               1, 659.626, 0.003},
      {{'Elixir.Acme.Operations.SF.IN.SyncServiceContracts',
        request_page,2},                                   1,  10.840, 0.002}],
    ....

    { {code_server,call,1},                               30, 383.299, 0.411}, %
      [{suspend,                                          33, 382.695, 0.000},
       {{erlang,demonitor,2},                             30,   0.152, 0.119},
       {{erlang,monitor,2},                               30,   0.041, 0.041}]}.

    {[{{error_handler,undefined_function,3},              19, 360.224, 0.044}],
These are the most expensive nodes in the entire fprof log. It seems the process was descheduled and sat suspended most of the time while code was being loaded into memory inside Mix.Compilers.Test.
## Researching a Similar Issue
After a while, I found a discussion and tried explicitly setting the max-requires option. It turns out that with max-requires set to 100, the test ran in under 100ms even on 8 CPUs. In Elixir core, it defaults to the maximum of the schedulers online and 2.
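Concretely, the workaround was to pass `--max-requires` to `mix test` (shown here with the same command as above):

```shell
# Cap the number of test files compiled/required in parallel at 100
MIX_TEST_PARTITION=1 elixir --erl "+S 8:8" -S mix test --partitions 2 --max-requires 100
```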
However, the docs lack detail on how this parameter affects test execution time. Can someone explain how to calibrate it?