Updates!
Thanks so much for the ideas y’all! Interesting turn of events, I went digging into the DB side of this because we have addressed issues like transaction locks in the past and implemented unique values for constrained columns, but I think I misled us! 
Red Herring - DB timeout
My current hypothesis is that the DBConnection.ConnectionError) owner #PID<0.726.0> timed out because it owned the connection for longer than 120000ms
timeout is a redherring! It is timing out and worth looking into, but the DB isn’t the cause of the timeout, it’s just reporting that it’s been blocked for quite a while with an open connection(? I think).
I re-ran tests locally to discover that the cover
portion of the test command is using most of the unaccounted-for time!
### Just Test (21s)
> time MIX_DEBUG=1 mix test test/crowbar/scan/vciso_card_test.exs > tmp.txt && elixir ./analyze.exs
### Partitioned (21s)
> time MIX_TEST_PARTITION=1 MIX_DEBUG=1 mix test test/crowbar/scan/vciso_card_test.exs --partitions 2 > tmp.txt && elixir ./analyze.exs
### Cover (61s)
> time MIX_DEBUG=1 mix test test/crowbar/scan/vciso_card_test.exs --cover > tmp.txt && elixir ./analyze.exs
### Partition & Cover (73s)
> time MIX_TEST_PARTITION=1 MIX_DEBUG=1 mix test test/crowbar/scan/vciso_card_test.exs --partitions 2 --cover > tmp.txt && elixir ./analyze.exs
I put the analyze.exs into a Gist here if you want to recreate. Should be code base agnostic.
New Hypothesis: ExCoveralls is using the time!
I annotated ExCoveralls locally to reproduce and I found the time here
-> Running mix test test/crowbar/scan/vciso_card_test.exs --cover (inside Crowbar.Mixfile)
-> Running ExCoveralls Cover.compile
<- Ran ExCoveralls Cover.compile in 25757ms
…
Finished in 4.9 seconds (4.9s async, 0.00s sync)
…
-> Running ExCoveralls Cover.execute
<- Ran ExCoveralls Cover.execute in 13249ms
<- Ran mix test in 43973ms
This example just ran 1 test file, but the time breakdown is clear
44s total = 26s Cover.compile
+ 13s Cover.execute
+ 5s tests
note: These time are all much faster because it’s local on a multicore laptop vs. CI with the default github action runner
Mystery solved and if y’all happen to have any thoughts or experience on how to address things like this in CI, I’d love to hear your thoughts!
My likely next step is to make a --cover
Action for the end of CI, so we can run the tests and get feedback as fast as possible without coverage.