We are having trouble either in understanding or using Ecto sandbox in our end-to-end testing.
Our setup is a Phoenix server (used as a json API) and a single page app (Emberjs) serving as the client. In CI we boot up the Phoenix server and run end-to-end tests (using playwright) against the client app, allowing us to test the full interaction from client through to the server.
We have this setup working reliably when running test serially however, we face issues when trying to run the tests in parallel.
For reference, I have uploaded an example Phoenix app demonstrating the sandbox setup of the backend (see ecto-sandbox-example) and I will highlight the main approach here as well.
Before each test is run, the test will make a request to the API to start a new genserver to manage and checkout a new db connection.
# Start the genserver responsible for managing the connection
# https://github.com/swelham/ecto-sandbox-example/blob/main/lib/parallel_sandbox_web/controllers/sandbox_controller.ex
DynamicSupervisor.start_child(
ParallelSandbox.SandboxConnSupervisor,
{ParallelSandbox.SandboxDbConn, [group_id: group_id]} # Group id is passed in by the end-to-end test
)
# This will then hit the genserver handle_continue
# https://github.com/swelham/ecto-sandbox-example/blob/main/lib/parallel_sandbox/sandbox_db_conn.ex
def handle_continue(:open_conn, state) do
:ok = Sandbox.checkout(ParallelSandbox.Repo)
{:noreply, state}
end
Then, for each request the end-to-end makes, the the same group_id
value used above
will be attached as a header. A plug picks up on this header and uses it to lookup which db connection should be used for the Repo
queries during this request using Sandbox.allow/3
.
# Finds the group_id value fron the conn and calls into the genserver started above to allow the sharing the connection with this process
# https://github.com/swelham/ecto-sandbox-example/blob/main/lib/parallel_sandbox_web/plugs/sandbox_plug.ex
with group_id when is_binary(group_id) <- get_group_id(conn) do
SandboxDbConn.use(group_id)
end
# This will then call into the genserver assigned to that group_id and call Sandbox.allow for the caller's pid.
# https://github.com/swelham/ecto-sandbox-example/blob/main/lib/parallel_sandbox/sandbox_db_conn.ex
def use(group_id) do
owner_process = pid_for_group_id(group_id)
GenServer.call(owner_process, {:use, self()})
end
def handle_call({:use, caller}, _, state) do
Sandbox.allow(ParallelSandbox.Repo, self(), caller)
{:reply, :ok, state}
end
Once the test has finished, it will make a final request to checkin the db connection using the group_id
.
# The delete request closes the db conn and terminates the genserver
# https://github.com/swelham/ecto-sandbox-example/blob/main/lib/parallel_sandbox_web/controllers/sandbox_controller.ex
def delete(conn, %{"id" => group_id}) do
SandboxDbConn.close(group_id)
owner = SandboxDbConn.pid_for_group_id(group_id)
DynamicSupervisor.terminate_child(
ParallelSandbox.SandboxConnSupervisor,
owner
)
...
# This will then end up calling the genserver handle_call for :close, which will checkin the Repo
# https://github.com/swelham/ecto-sandbox-example/blob/main/lib/parallel_sandbox/sandbox_db_conn.ex
def handle_call(:close, _, state) do
:ok = Sandbox.checkin(ParallelSandbox.Repo)
{:reply, :ok, state}
end
As I mentioned above this works great when running a single test at a time. However, when running tests concurrently we consistently see the following errors occurring at (seemingly) random intervals throughout the test.
Error 1
11:20:15.666 [error] Postgrex.Protocol (#PID<0.4330.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.5648.0> timed out because it queued and checked out the connection for longer than 15000ms
Client #PID<0.5648.0> is still using a connection from owner at location:
:prim_inet.recv0/3
(postgrex 0.17.2) lib/postgrex/protocol.ex:3188: Postgrex.Protocol.msg_recv/4
(postgrex 0.17.2) lib/postgrex/protocol.ex:2213: Postgrex.Protocol.recv_bind/3
(postgrex 0.17.2) lib/postgrex/protocol.ex:2193: Postgrex.Protocol.rebind_execute/4
(ecto_sql 3.10.1) lib/ecto/adapters/sql/sandbox.ex:375: Ecto.Adapters.SQL.Sandbox.Connection.proxy/3
(db_connection 2.5.0) lib/db_connection/holder.ex:354: DBConnection.Holder.holder_apply/4
(db_connection 2.5.0) lib/db_connection.ex:1432: DBConnection.run_execute/5
(db_connection 2.5.0) lib/db_connection.ex:1527: DBConnection.run/6
The connection itself was checked out by #PID<0.5648.0> at location:
(postgrex 0.17.2) lib/postgrex.ex:340: Postgrex.query/4
Error 2
11:20:43.223 [error] #PID<0.6734.0> running OurAppWeb.Endpoint (connection #PID<0.6732.0>, stream id 2) terminated
Server: localhost:5002 (http)
Request: GET /api/our-endpoint/b133ad35-f99b-4e54-8bbf-40a74168085d
** (exit) an exception was raised:
** (DBConnection.ConnectionError) could not checkout the connection owned by #PID<0.6734.0>. When using the sandbox, connections are shared, so this may imply another process is using a connection. Reason: connection not available and request was dropped from queue after 270ms. You can configure how long requests wait in the queue using :queue_target and :queue_interval. See DBConnection.start_link/2 for more information
(ecto_sql 3.10.1) lib/ecto/adapters/sql.ex:913: Ecto.Adapters.SQL.raise_sql_call_error/1
(ecto_sql 3.10.1) lib/ecto/adapters/sql.ex:828: Ecto.Adapters.SQL.execute/6
(ecto 3.10.3) lib/ecto/repo/queryable.ex:229: Ecto.Repo.Queryable.execute/4
(ecto 3.10.3) lib/ecto/repo/queryable.ex:19: Ecto.Repo.Queryable.all/3
(ecto 3.10.3) lib/ecto/repo/queryable.ex:151: Ecto.Repo.Queryable.one/3
(our_app 0.1.0) lib/our_app/repo.ex:13: our_app.Repo.fetch_one/2
(our_app 0.1.0) lib/our_app_web/controllers/api/our_endpoint_controller.ex:17: OurAppWeb.Api.OurEndpointController.show/2
(our_app 0.1.0) lib/out_app_web/controllers/api/our_endpoint_controller.ex:1: OurAppWeb.Api.OurEndpointController.action/2
Any guidence on our approach/implementation of this would be massively appriciated.