Using Ecto sanbox to build end-to-end tests

We are having trouble either in understanding or using Ecto sandbox in our end-to-end testing.

Our setup is a Phoenix server (used as a json API) and a single page app (Emberjs) serving as the client. In CI we boot up the Phoenix server and run end-to-end tests (using playwright) against the client app, allowing us to test the full interaction from client through to the server.

We have this setup working reliably when running test serially however, we face issues when trying to run the tests in parallel.

For reference, I have uploaded an example Phoenix app demonstrating the sandbox setup of the backend (see ecto-sandbox-example) and I will highlight the main approach here as well.

Before each test is run, the test will make a request to the API to start a new genserver to manage and checkout a new db connection.

# Start the genserver responsible for managing the connection
# https://github.com/swelham/ecto-sandbox-example/blob/main/lib/parallel_sandbox_web/controllers/sandbox_controller.ex
DynamicSupervisor.start_child(
  ParallelSandbox.SandboxConnSupervisor,
  {ParallelSandbox.SandboxDbConn, [group_id: group_id]} # Group id is passed in by the end-to-end test
)

# This will then hit the genserver handle_continue
# https://github.com/swelham/ecto-sandbox-example/blob/main/lib/parallel_sandbox/sandbox_db_conn.ex
def handle_continue(:open_conn, state) do
  :ok = Sandbox.checkout(ParallelSandbox.Repo)

  {:noreply, state}
end

Then, for each request the end-to-end makes, the the same group_id value used above
will be attached as a header. A plug picks up on this header and uses it to lookup which db connection should be used for the Repo queries during this request using Sandbox.allow/3.

# Finds the group_id value fron the conn and calls into the genserver started above to allow the sharing the connection with this process
# https://github.com/swelham/ecto-sandbox-example/blob/main/lib/parallel_sandbox_web/plugs/sandbox_plug.ex
with group_id when is_binary(group_id) <- get_group_id(conn) do
  SandboxDbConn.use(group_id)
end

# This will then call into the genserver assigned to that group_id and call Sandbox.allow for the caller's pid.
# https://github.com/swelham/ecto-sandbox-example/blob/main/lib/parallel_sandbox/sandbox_db_conn.ex
def use(group_id) do
  owner_process = pid_for_group_id(group_id)

  GenServer.call(owner_process, {:use, self()})
end

def handle_call({:use, caller}, _, state) do
   Sandbox.allow(ParallelSandbox.Repo, self(), caller)

   {:reply, :ok, state}
end

Once the test has finished, it will make a final request to checkin the db connection using the group_id.

# The delete request closes the db conn and terminates the genserver
# https://github.com/swelham/ecto-sandbox-example/blob/main/lib/parallel_sandbox_web/controllers/sandbox_controller.ex
def delete(conn, %{"id" => group_id}) do
  SandboxDbConn.close(group_id)

  owner = SandboxDbConn.pid_for_group_id(group_id)

  DynamicSupervisor.terminate_child(
    ParallelSandbox.SandboxConnSupervisor,
    owner
  )
...

# This will then end up calling the genserver handle_call for :close, which will checkin the Repo
# https://github.com/swelham/ecto-sandbox-example/blob/main/lib/parallel_sandbox/sandbox_db_conn.ex
def handle_call(:close, _, state) do
  :ok = Sandbox.checkin(ParallelSandbox.Repo)

  {:reply, :ok, state}
end

As I mentioned above this works great when running a single test at a time. However, when running tests concurrently we consistently see the following errors occurring at (seemingly) random intervals throughout the test.

Error 1

11:20:15.666 [error] Postgrex.Protocol (#PID<0.4330.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.5648.0> timed out because it queued and checked out the connection for longer than 15000ms

Client #PID<0.5648.0> is still using a connection from owner at location:

    :prim_inet.recv0/3
    (postgrex 0.17.2) lib/postgrex/protocol.ex:3188: Postgrex.Protocol.msg_recv/4
    (postgrex 0.17.2) lib/postgrex/protocol.ex:2213: Postgrex.Protocol.recv_bind/3
    (postgrex 0.17.2) lib/postgrex/protocol.ex:2193: Postgrex.Protocol.rebind_execute/4
    (ecto_sql 3.10.1) lib/ecto/adapters/sql/sandbox.ex:375: Ecto.Adapters.SQL.Sandbox.Connection.proxy/3
    (db_connection 2.5.0) lib/db_connection/holder.ex:354: DBConnection.Holder.holder_apply/4
    (db_connection 2.5.0) lib/db_connection.ex:1432: DBConnection.run_execute/5
    (db_connection 2.5.0) lib/db_connection.ex:1527: DBConnection.run/6

The connection itself was checked out by #PID<0.5648.0> at location:

    (postgrex 0.17.2) lib/postgrex.ex:340: Postgrex.query/4

Error 2

11:20:43.223 [error] #PID<0.6734.0> running OurAppWeb.Endpoint (connection #PID<0.6732.0>, stream id 2) terminated
Server: localhost:5002 (http)
Request: GET /api/our-endpoint/b133ad35-f99b-4e54-8bbf-40a74168085d
** (exit) an exception was raised:
    ** (DBConnection.ConnectionError) could not checkout the connection owned by #PID<0.6734.0>. When using the sandbox, connections are shared, so this may imply another process is using a connection. Reason: connection not available and request was dropped from queue after 270ms. You can configure how long requests wait in the queue using :queue_target and :queue_interval. See DBConnection.start_link/2 for more information
        (ecto_sql 3.10.1) lib/ecto/adapters/sql.ex:913: Ecto.Adapters.SQL.raise_sql_call_error/1
        (ecto_sql 3.10.1) lib/ecto/adapters/sql.ex:828: Ecto.Adapters.SQL.execute/6
        (ecto 3.10.3) lib/ecto/repo/queryable.ex:229: Ecto.Repo.Queryable.execute/4
        (ecto 3.10.3) lib/ecto/repo/queryable.ex:19: Ecto.Repo.Queryable.all/3
        (ecto 3.10.3) lib/ecto/repo/queryable.ex:151: Ecto.Repo.Queryable.one/3
        (our_app 0.1.0) lib/our_app/repo.ex:13: our_app.Repo.fetch_one/2
        (our_app 0.1.0) lib/our_app_web/controllers/api/our_endpoint_controller.ex:17: OurAppWeb.Api.OurEndpointController.show/2
        (our_app 0.1.0) lib/out_app_web/controllers/api/our_endpoint_controller.ex:1: OurAppWeb.Api.OurEndpointController.action/2

Any guidence on our approach/implementation of this would be massively appriciated.

I didn’t read your example thoroughly (sorry, short on time), but I feel like you’re trying to recreate this functionality. Maybe that link or this one could help you.

Any reason not to use Phoenix.Ecto.SQL.Sandbox — Phoenix/Ecto v4.4.2?

2 Likes

Any reason not to use Phoenix.Ecto.SQL.Sandbox — Phoenix/Ecto v4.4.2

Thanks, I never saw this!

I will take a look through those docs.

1 Like

@swelham Please let us know if you ever get parallel tests working using [ Phoenix.Ecto.SQL.Sandbox.

Our experience with Phoenix.Ecto.SQL.Sandbox with playwright is that the sandbox never permits more than one sandbox checkout at a time, so we have to disable parallelism in our end to end test. Here some snippets you might find useful:

in endpoint.ex

if Application.compile_env!(:app, :is_e2e) do
  plug AppWeb.EnsureSandbox

  plug Phoenix.Ecto.SQL.Sandbox,
    at: "/sandbox",
    repo: App.Repo,
    timeout:
      Application.compile_env(:app, :phoenx_ecto_sandbox_timeout),
    # Because the websocket only allows us to entract `x-*` headers from the initial connection
    # See https://hexdocs.pm/phoenix/Phoenix.Endpoint.html#socket/3-connect-info
    header: "sandbox"
end

in ensure_sandbox.ex

defmodule AppWeb.EnsureSandbox do
  @moduledoc """
  Ensures any request that doesn't have a sandbox id header while we are in e2e
  mode, will error out with a 400.
  """

  @behaviour Plug

  require Logger

  @sandbox_header_key "sandbox"

  alias Plug.Conn
  alias App.{Repo}

  def init(_) do
    []
  end

  def call(%{request_path: "/health" <> _path} = conn, _opts) do
    conn
    |> Phoenix.Controller.text("ok")
    |> Conn.halt()
  end

  def call(%{request_path: "/phoenix/" <> _path} = conn, _opts) do
    conn
  end

  def call(%{request_path: "/__/fonts" <> _path} = conn, _opts) do
    conn
  end

  def call(%{request_path: "/socket" <> _path} = conn, _opts) do
    conn
  end

  def call(%{method: "POST", request_path: "/sandbox"} = conn, []) do
    checkout_sandbox(conn)
  end

  def call(%{method: "GET", request_path: "/__e2e__/reset"} = conn, []) do
    change_sandbox_mode_from_auto_to_manual_before_send(conn)
  end

  def call(
        %{
          method: "POST",
          request_path: "/__e2e__/gq",
          query_string: "introspection=true"
        } = conn,
        []
      ) do
    change_sandbox_mode_from_auto_to_manual_before_send(conn)
  end

  def call(conn, []) do
    {conn, {sandbox_id, _repo_pid} = sandbox_id_and_owner_pid} =
      retrieve_sandbox_id(conn)

    case {conn.method, conn.request_path} do
      {"DELETE", "/sandbox"} ->
        resolve_sandbox_checkin(conn, sandbox_id)

      {method, path} ->
        ensure_sandbox_id(conn, method, path, sandbox_id_and_owner_pid)
    end
  end

  defp checkout_sandbox(conn) do
    Conn.register_before_send(conn, fn conn ->
      sandbox_id = conn.resp_body

      Logger.warn([
        "Checked out sandbox session: ",
        sandbox_id
      ])

      conn
    end)
  end

  defp change_sandbox_mode_from_auto_to_manual_before_send(conn) do
    Ecto.Adapters.SQL.Sandbox.mode(Repo, :auto)

    Conn.register_before_send(conn, fn conn ->
      Ecto.Adapters.SQL.Sandbox.mode(Repo, :manual)
      conn
    end)
  end

  defp ensure_sandbox_id(conn, method, path, {nil, _pid}) do
    Logger.error([
      "Attempting to ",
      method,
      " ",
      path,
      " but no sandbox in headers, sent 400"
    ])

    conn
    |> Conn.send_resp(
      400,
      "Provide a valid sandbox header"
    )
    |> Conn.halt()
  end

  defp ensure_sandbox_id(conn, method, path, {sandbox_id, _repo_owner_pid}) do
    log_sandbox_id_for_request(method, path, sandbox_id)

    put_sandbox_id_in_session(conn, sandbox_id)
  end

  defp retrieve_sandbox_id(conn) do
    cond do
      sandbox_id_and_owner_pid =
          Conn.get_req_header(conn, @sandbox_header_key)
          |> List.wrap()
          |> List.first()
          |> verify_sandbox_id() ->
        {conn, sandbox_id_and_owner_pid}

      sandbox_id_and_owner_pid =
          conn
          |> Conn.fetch_query_params()
          |> get_in([
            Access.key!(:params),
            Access.key(App.e2e_sandbox_query_key())
          ])
          |> verify_sandbox_id() ->
        {sandbox_id, _pid} = sandbox_id_and_owner_pid
        conn = Conn.put_req_header(conn, @sandbox_header_key, sandbox_id)
        {conn, sandbox_id_and_owner_pid}

      true ->
        sandbox_id_and_owner_pid = {nil, nil}
        {conn, sandbox_id_and_owner_pid}
    end
  end

  defp verify_sandbox_id(nil), do: nil

  defp verify_sandbox_id(sandbox_id) do
    Logger.warn([
      "Verifying sandbox ID: ",
      sandbox_id
    ])

    case Phoenix.Ecto.SQL.Sandbox.decode_metadata(sandbox_id) do
      %{owner: repo_owner_pid} ->
        try do
          if Process.alive?(repo_owner_pid) do
            {sandbox_id, repo_owner_pid}
          end
        rescue
          _ ->
            nil
        end

      _ ->
        nil
    end
  end

  defp resolve_sandbox_checkin(conn, nil) do
    Logger.error([
      "Attempting to check in a sandbox session, ",
      "but no sandbox in headers, ",
      "or sandbox process is not alive. ",
      "Sending 400"
    ])

    conn
    |> Conn.send_resp(
      400,
      "Provide a valid sandbox header"
    )
    |> Conn.halt()
  end

  defp resolve_sandbox_checkin(conn, sandbox_id) do
    Logger.warn([
      "Checking in sandbox session: ",
      sandbox_id
    ])

    conn
  end

  defp log_sandbox_id_for_request(method, path, sandbox_id) do
    Logger.warn([
      method,
      " ",
      path,
      " [ Sandbox: ",
      sandbox_id || "none",
      " ] "
      # "[ PID: ",
      # inspect(self()),
      # " ]"
    ])
  end

  # this one is for use with live view
  defp put_sandbox_id_in_session(conn, sandbox_id) do
    opts = AppWeb.Endpoint.session_options() |> Plug.Session.init()

    conn
    |> Plug.Session.call(opts)
    |> Conn.fetch_session()
    |> Conn.put_session(:__sandbox_id, sandbox_id)
  end
end

There have been reports of successful setups of that stuff at the time the functionality was was introduced.

2 Likes

Yes we use Ecto Sandbox with test concurrency all the time, works great! Perhaps the OP has a process in the supervision tree that’s using database connections too?

Yeah, the same constraints as with any other concurrent tests (even plain exunit ones) using the sandbox still apply:

https://hexdocs.pm/ecto_sql/3.10.2/Ecto.Adapters.SQL.Sandbox.html#module-collaborating-processes

1 Like

Yes concurrent tests work in ExUnit tests all the time. But for whatever reason, we never get concurrent tests to work using cypress or playwright. If someone has an example with those end to end test frameworks will be highly appreciated.

Can you elaborate on the issues you are running into? I am not familiar with either cyress or playwright but the ecto sandbox plug mechanic doesn’t rely on any specific framework. As long as you can configure them to acquire the sandbox session and guarantee they use it on each request that should be it.

Any reason not to use Phoenix.Ecto.SQL.Sandbox — Phoenix/Ecto v4.4.2

Thanks for pointing that out. I had somehow never found that before!

I have removed our custom implementation and replaced it with the built-in version and confirmed this is working as expected when running our tests one at a time.

Interestingly, I am still seeing the same error when running in parallel.

17:36:43.759 [error] Postgrex.Protocol (#PID<0.1874.0>) disconnected: ** (DBConnection.ConnectionError) #PID<0.2720.0> checked in the connection owned by #PID<0.2720.0>

Client #PID<0.2836.0> is still using a connection from owner at location:

    :prim_inet.recv0/3
    (postgrex 0.17.2) lib/postgrex/protocol.ex:3188: Postgrex.Protocol.msg_recv/4
    (postgrex 0.17.2) lib/postgrex/protocol.ex:2213: Postgrex.Protocol.recv_bind/3
    (postgrex 0.17.2) lib/postgrex/protocol.ex:2138: Postgrex.Protocol.bind_execute/4
    (ecto_sql 3.10.1) lib/ecto/adapters/sql/sandbox.ex:375: Ecto.Adapters.SQL.Sandbox.Connection.proxy/3
    (db_connection 2.5.0) lib/db_connection/holder.ex:354: DBConnection.Holder.holder_apply/4
    (db_connection 2.5.0) lib/db_connection.ex:1432: DBConnection.run_execute/5
    (db_connection 2.5.0) lib/db_connection.ex:1527: DBConnection.run/6

The connection itself was checked out by #PID<0.2836.0> at location:

    (postgrex 0.17.2) lib/postgrex.ex:340: Postgrex.query/4

I’m not sure I fully understand what this error is telling me, but I’m sensing some form of clash/race condition maybe.

I will try to build a reproduction with playwright and update the repo I linked to in the original post.

Yes, this is true, but they’re not being referenced by the tests themselves. I could definitely disable these during end-to-end testing and nothing should change from the test’s perspective.