Tests randomly failing when run in a VM

ybycode · April 1, 2018, 3:09pm

Hi guys,

I’m running code in docker containers for development and releases. I have a docker-compose configuration to start the database and run tests against it, and it works really well, but… When I run the same docker-compose, using the same images, in Gitlab-ci or in a Virtualbox VM on my laptop, I get something like up to 1% of my tests failing. If I run the tests over and over, I get errors 95% of the time, and not on the same tests.

The docker image I use for my tests is based on “bitwalker/alpine-elixir-phoenix” (so with Alpine Linux).
Running the same tests with a docker image based on ubuntu, I get no error. I find strange that Alpine in a container could behave differently depending on the host, but that’s the only beginning of explanation I have thus far. Does it make any sense?

To recap, here a the different configurations I tried, running the exact same code, several times in a row:

docker-compose, alpine based image, running locally on my laptop: no failing test,
docker-compose, alpine based image, in a VM (Virtualbox on my laptop, or in Gitlab-ci): tests fail randomly,
docker-compose, ubuntu based image, in a VM (Virtualbox on my laptop, or in Gitlab-ci): no failing test.

Any help would be greatly appreciated!

Thanks

idi527 · April 1, 2018, 3:33pm

Can you show the errors with which the tests fail?

ybycode · April 1, 2018, 4:45pm

Sure, here are a few:

[error] an exception was raised:
    ** (DBConnection.OwnershipError) cannot find ownership process for #PID<0.2025.0>.

,

test/app_web/channels/notification_channel_test.exs:191
** (MatchError) no match of right hand side value: {:error, %{reason: "join crashed"}}
stacktrace:
  test/app_web/channels/notification_channel_test.exs:27: AppWeb.NotificationChannelTest.__ex_unit_setup_1/1
  test/app_web/channels/notification_channel_test.exs:1: AppWeb.NotificationChannelTest.__ex_unit__/2

,

test/app/messages/message_boards_supervisor_test.exs:41
** (exit) exited in: GenServer.call({:via, Registry, {:messages_boards_registry, "team:df3bfc72-379b-4d95-b9e8-131c85c1e93e"}}, {:get, 40}, 5000)
    ** (EXIT) an exception was raised:
        ** (DBConnection.OwnershipError) cannot find ownership process for #PID<0.2956.0>.

,

test/app/messages/message_boards_supervisor_test.exs:41
GenServer {:messages_boards_registry, "team:d8bade3c-8432-4ea0-8bf2-8af5b2a6d62a"} terminating
** (DBConnection.OwnershipError) cannot find ownership process for #PID<0.2930.0>.

,

test/app/messages/message_board_test.exs:77
** (MatchError) no match of right hand side value: {:error, {:already_started, #PID<0.2372.0>}}
stacktrace:
  test/app/messages/message_board_test.exs:12: App.MessageBoardTest.__ex_unit_setup_1/1
  test/app/messages/message_board_test.exs:1: App.MessageBoardTest.__ex_unit__/2

Almost all tests fail due to this DBConnection.OwnershipError.

In this last error however it looks like the test is started before the process created by the previous test died, which usually doesn’t happen since they are run synchronously. Here’s the relative code:

  use App.DataCase

  alias App.MessageBoard

  @name :test_server

  setup do
    project_id = UUID.uuid4()
    {:ok, _} = MessageBoard.start_link(project_id, [name: @name])
    {:ok, project_id: project_id}
  end

sneako · April 3, 2018, 12:24am

Looks like you need to use Ecto’s sandbox https://hexdocs.pm/ecto/Ecto.Adapters.SQL.Sandbox.html

ybycode · April 3, 2018, 9:20am

Thanks sneako for your reply.

I didn’t say but that’s a phoenix app, and those failing tests use Ecto’s sandbox in shared mode, by using the DataCase behaviour (that comes out the box with a phoenix app):

(in test/support/data_case.ex):

  setup tags do
    :ok = Ecto.Adapters.SQL.Sandbox.checkout(App.Repo)

    unless tags[:async] do
      Ecto.Adapters.SQL.Sandbox.mode(App.Repo, {:shared, self()})
    end

    :ok
  end

Plus the code is the exact same between configurations…

mbaeuerle · April 13, 2018, 9:38am

We experience the same issue but in our case this also happens outside a VM when the machine running the tests is fast enough.
In our case this happens when a worker process is using the database connection but even when waiting in the test for the worker to finish, the ** (DBConnection.OwnershipError) cannot find ownership process error arises.

At the moment I am preparing a minimal test case to reproduce the error. I will post it here when it’s finished.

mbaeuerle · April 13, 2018, 1:25pm

Here is the test case: https://github.com/semlabs/ecto_sandbox
and here you can see the failing test after a couple of tries: https://semaphoreci.com/semlabs/ecto_sandbox/branches/master/builds/25

It’s a very simple app:

Create supervised workers which run as GenServers (Module Worker)
Make a request with a specific worker name to a POST route sending some data (Module CollectorController)
The worker processes and stores the data in the database (Module Server)
Make sure the current test waits until the data was stored (Module Notification)
Enable ownership logging for db_connection
The test uses ConnCase which sets the ownership mode to shared

As you can see the test randomly fails with:

** (DBConnection.OwnershipError) cannot find ownership process for #PID<0.438.0>.

When using ownership, you must manage connections in one
of the four ways:

* By explicitly checking out a connection
* By explicitly allowing a spawned process
* By running the pool in shared mode
* By using :caller option with allowed process

The first two options require every new process to explicitly
check a connection out or be allowed by calling checkout or
allow respectively.

The third option requires a {:shared, pid} mode to be set.
If using shared mode in tests, make sure your tests are not
async.

The fourth option requires [caller: pid] to be used when
checking out a connection from the pool. The caller process
should already be allowed on a connection.

If you are reading this error, it means you have not done one
of the steps above or that the owner process has crashed.

See Ecto.Adapters.SQL.Sandbox docs for more information.
    (db_connection) lib/db_connection.ex:934: DBConnection.checkout/2
    (db_connection) lib/db_connection.ex:750: DBConnection.run/3
    (db_connection) lib/db_connection.ex:592: DBConnection.prepare_execute/4
    (ecto) lib/ecto/adapters/postgres/connection.ex:86: Ecto.Adapters.Postgres.Connection.execute/4
    (ecto) lib/ecto/adapters/sql.ex:256: Ecto.Adapters.SQL.sql_call/6
    (ecto) lib/ecto/adapters/sql.ex:542: Ecto.Adapters.SQL.struct/8
    (ecto) lib/ecto/repo/schema.ex:547: Ecto.Repo.Schema.apply/4
    (ecto) lib/ecto/repo/schema.ex:213: anonymous fn/14 in Ecto.Repo.Schema.do_insert/4
    (ecto_sandbox) lib/ecto_sandbox/workers/server.ex:20: EctoSandbox.Workers.Worker.Server.create/1
    (ecto_sandbox) lib/ecto_sandbox/workers/server.ex:11: EctoSandbox.Workers.Worker.Server.store/2
    (ecto_sandbox) lib/ecto_sandbox/workers/server.ex:31: EctoSandbox.Workers.Worker.Server.handle_call/3
    (stdlib) gen_server.erl:636: :gen_server.try_handle_call/4
    (stdlib) gen_server.erl:665: :gen_server.handle_msg/6
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
Last message (from #PID<0.436.0>): {:process, %{"data" => "test"}}

Notice that the test runs with #PID<0.436.0> as seen in the last line.
And a little bit before we can see the ownership changes. So the test process loses its proxy before the worker gets it:

.13:01:06.667 [error] #PID<0.436.0> owns proxy #PID<0.437.0>
13:01:06.671 [error] #PID<0.428.0>, #PID<0.430.0> lose proxy #PID<0.429.0>
13:01:06.731 [error] #PID<0.436.0> lose proxy #PID<0.437.0>

And that’s even before the test ends.

josevalim · April 16, 2018, 1:21pm

I believe the issue is that you are cleaning up the workers inside on_exit but then you have a race between executing the on_exit callback and the worker trying to use the connection to perform a database query. You need to make sure to terminate it inline, after the test.

mbaeuerle · April 16, 2018, 2:49pm

@josevalim thanks for your response!
At first I also thought it is a race condition. But then I added the notification system (via PhoenixPubSub) which notifies the subscribers here after performing the insert. As this is the only database operation of that specific worker it should not use the connection afterwards.
And because the tests wait for the worker to insert the data here the cleanup should happen after the workers did finish their job.
To verify this I removed the cleanup. As you can see here the error persists: https://semaphoreci.com/semlabs/ecto_sandbox/branches/master/builds/26

josevalim · April 17, 2018, 6:10am

You need to keep the clean-up. You have to invoke it at the end of each test though:

test "Test 1" do
  ...
after
  cleanup...
end

mbaeuerle · April 17, 2018, 7:44am

I added the cleanup after each test and also added an output to check if the test reaches the end:

test "Test 1", %{conn: conn, worker: worker} do
  conn = post(conn, "/collectors/#{worker.id}", ~s({"data": "test"}))

  assert conn.status == 201
  assert_receive {:chunk_stored, "stored"}
  IO.puts("Finished test 1")
after
  WorkerCleaner.cleanup_dynamic_workers()
end

Unfortunately it does not circumvent the problem: https://semaphoreci.com/semlabs/ecto_sandbox/branches/master/builds/28
Interestingly the output finished test 1 is missing when test 1 is failing. So the test doesn’t even finish correctly?
And as before the test (#PID<0.435.0>) loses its connection even before the worker process can handle the call:

.07:31:37.545 [error] #PID<0.435.0> owns proxy #PID<0.436.0>
07:31:37.550 [error] #PID<0.428.0>, #PID<0.430.0> lose proxy #PID<0.429.0>
07:31:37.607 [error] #PID<0.439.0> owns proxy #PID<0.440.0>
07:31:37.607 [error] #PID<0.435.0> lose proxy #PID<0.436.0>
07:31:37.616 [error] GenServer {Reg, "6b7f29ea-c6e2-4ece-881f-78a09fdb5fe4"} terminating
** (DBConnection.OwnershipError) cannot find ownership process for #PID<0.437.0>.

josevalim · April 17, 2018, 8:19am

You need to call DynamicSupervisor terminate_child and not the Supervisor one. Maybe we should make those calls send different messages to make sure they don’t work by accident.

EDIT: Sorry, nevermind, it shouldn’t make a difference.

It is not clear how to solve this issue then and I cannot reproduce it locally.

mbaeuerle · April 17, 2018, 8:52am

@josevalim thanks for sticking to this issue!
I cannot reproduce this issue locally myself. It seems there is some timing issue which only appears when the machine is very fast. We experienced this running mix test directly but also when running it in a docker container (both on semaphore).
I tried to dig into db_connection and its ownership part and I noticed that the owning process (the test) loses its connection because of shutdown (here: https://github.com/elixir-ecto/db_connection/blob/master/lib/db_connection/ownership/manager.ex#L159), but as I mentioned earlier this is before the test is getting to the end and before the GenServer uses it.

PS: If you think it is beneficial I can add you to the Semaphore project. There is the possibility to ssh to the machine and investigate.

mbaeuerle · April 27, 2018, 8:09am

@josevalim can we provide other information to help resolve this issue?

josevalim · April 27, 2018, 8:12am

@mbaeuerle without a mechanism to reproduce the issue there isn’t much I can do. Unfortunately I do not have enough time to ssh into the machine and reproduce it. A minimal test case would be ideal. I am sorry.

axelson · May 17, 2018, 12:12am

@mbaeuerle how reproducible is this error on CI for you? I think I just ran into it, but I’ve retried the build 10 more times without issue so I’m not sure how to replicate the issue.

mbaeuerle · May 17, 2018, 7:17am

@axelson that heavily depends. Sometimes there are several fails in a row and then I have to run it 10 or 20 times to get the error. So it sounds likely you are running into this issue.
In the minimal test case I set up the CI to repeat until it hits the error, in the last case this was on the third try: https://semaphoreci.com/semlabs/ecto_sandbox/branches/master/builds/26

axelson · September 14, 2018, 7:44pm

Just hit this twice in a row on CI. Very frustrating. We’re seeing an error maybe 1/20 times, seems to happen the most in our Phoenix channel tests.

josevalim · September 16, 2018, 9:55am

One thing you can try is to run those particular tests as:

elixir --erl "+T 9" -S mix test path/to/test/file

And see if it reproduces the issue more frequently. Basically, we need a mechanism to reproduce the issue consistently so we can investigate the root cause and work on a fix.

axelson · September 17, 2018, 1:58am

Thanks for the recommendation! I will try it out and report back.