Intermittent ownership errors in CI, is there a way to make them crash the VM?

schnittchen · February 19, 2020, 9:55am

Hello there,

I have several tests which occasionally produce an ownership error without the test being marked as failed. I understand that race conditions can lead to these, but so far I have been unable to pin down each case. It’s annoying to have a big, red, confusing crash report in the CI log, and for the sake of a good signal-to-noise ratio I want to get rid of it.

I understand that these errors occur in the web connection process when the test process has already finished, so there is no way to mark the test as failed after the fact. What I need is a way to customize the error handling so that a message is logged and the VM is terminated with exit status 1, effectively turning irrelevant background noise into a hard failure, at least while I’m hunting these issues down.

Is there any way to accomplish this?

schnittchen · March 4, 2020, 1:54pm

Here’s what I’m trying right now. Currently it looks like all such errors have been eliminated, but the next time one occurs the test run will crash and hopefully show enough information to reproduce and fix it.

This is in our HoundCase (we use Hound for browser-integration testing). It saves the test tags where they can be found globally via the test PID (not elegant, but OK for testing…):

setup tags do
  # We save the tags globally to help find tests which, while passing,
  # show sandbox exceptions. See MyApp.Endpoint on how this is picked up.
  # Yes, this code has a race condition, but that should not bother us.

  # Store tags for debugging ownership problems, see Endpoint!
  config_key = "ex_unit_tags_#{inspect(self())}"
  Application.put_env(:my_app, config_key, tags)

  :ok
end

And this is in the Endpoint. It depends on some implementation details (for getting the test PID out of the ownership data):

if Application.get_env(:my_app, :debug_ownership?) do
  def call(conn, opts) do
    try do
      super(conn, opts)
    catch
      :error, %Plug.Conn.WrapperError{reason: %DBConnection.OwnershipError{}} = we ->
        # The sandbox metadata (including the owner PID) travels in the user-agent header.
        conn
        |> get_req_header("user-agent")
        |> List.first()
        |> Phoenix.Ecto.SQL.Sandbox.decode_metadata()
        |> case do
          %{owner: pid} ->
            config_key = "ex_unit_tags_#{inspect(pid)}"

            # Default to an empty map so Map.take/2 does not raise when no tags were stored.
            Application.get_env(:my_app, config_key, %{})
            |> Map.take([:file, :line])
            |> IO.inspect(label: "ExUnit tags of next exception (fetched by sandbox owner)")

            IO.inspect(we)
            IO.inspect(conn)

            IO.puts("Ownership debugging is on, halting the VM on purpose.")
            :erlang.halt(1)

          _else ->
            :ok
        end

        reraise we, __STACKTRACE__
    end
  end

  defoverridable call: 2
end
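
Note that the `if` around `call/2` is evaluated when the Endpoint module is compiled, so `:debug_ownership?` has to be set in compile-time configuration. A minimal sketch of how the flag could be switched on, assuming the config lives in `config/test.exs` (Elixir 1.9+ for `import Config`) and using a hypothetical `DEBUG_OWNERSHIP` environment variable:

```elixir
# config/test.exs (assumed location)
import Config

# DEBUG_OWNERSHIP is a hypothetical variable name, not something Phoenix or Ecto defines.
# When it is set to "1", the Endpoint is compiled with the halting call/2 override.
config :my_app, debug_ownership?: System.get_env("DEBUG_OWNERSHIP") == "1"
```

With this in place, something like `DEBUG_OWNERSHIP=1 mix test` would enable the halting behaviour. Changing the environment variable alone may not trigger a recompile of the Endpoint, so a forced recompile (`MIX_ENV=test mix compile --force`) may be needed when toggling it locally.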

Hope this is interesting