A linked process crashes: different behavior between test and development environments

marick · December 26, 2019, 9:11pm

When running tests, a crash in a linked process produces different behavior than it does under mix phx.server (in development mode). One symptom suggests that a process isn’t being taken down when the process it’s linked to crashes. The other has something to do with Ecto and sandboxes.

Note: I expect this is all intended behavior; I’m just curious what’s happening.

In my phoenix/ecto app, I have n “prefix server” processes, each of which handles queries to a particular Postgres “schema” in a single database/Repo. The n is only known at runtime, so there’s another process (Servers) that reads a database table of “institutions” and starts up a prefix server for each institution.

If there’s a bug in my product code that causes a prefix server to violate a Postgres constraint, it crashes, which causes the linked creating process (Servers) to crash, which causes all the remaining prefix servers to crash. This is fine with me because Servers gets restarted by its supervisor, so the whole tree gets rebuilt.
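A minimal sketch of that crash chain using plain GenServers. The LinkSketch module names are invented for illustration and stand in for Servers and the prefix servers; this is not the app's code, and the raise stands in for the constraint error.

```elixir
# Sketch only: hypothetical stand-ins for Servers and a prefix server.
defmodule LinkSketch.Child do
  use GenServer

  def start_link(prefix), do: GenServer.start_link(__MODULE__, prefix)

  @impl true
  def init(prefix), do: {:ok, prefix}

  # Stand-in for a query that violates a constraint.
  @impl true
  def handle_cast(:boom, _prefix), do: raise("simulated constraint error")
end

defmodule LinkSketch.Parent do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  def child_for(prefix), do: GenServer.call(__MODULE__, {:child_for, prefix})

  @impl true
  def init(:ok), do: {:ok, %{}}

  @impl true
  def handle_call({:child_for, prefix}, _from, children) do
    case children do
      %{^prefix => pid} ->
        {:reply, pid, children}

      _ ->
        # start_link links the new child to this process; exits are not trapped.
        {:ok, pid} = LinkSketch.Child.start_link(prefix)
        {:reply, pid, Map.put(children, prefix, pid)}
    end
  end
end

defmodule LinkSketch do
  # Polls until the supervisor has registered a replacement under `name`.
  def await_restart(name, old_pid, tries \\ 50) do
    case Process.whereis(name) do
      pid when is_pid(pid) and pid != old_pid ->
        pid

      _ when tries > 0 ->
        Process.sleep(20)
        await_restart(name, old_pid, tries - 1)

      _ ->
        nil
    end
  end
end

{:ok, _sup} = Supervisor.start_link([LinkSketch.Parent], strategy: :one_for_one)

old_parent = Process.whereis(LinkSketch.Parent)
crasher = LinkSketch.Parent.child_for("a")
bystander = LinkSketch.Parent.child_for("b")

parent_ref = Process.monitor(old_parent)
bystander_ref = Process.monitor(bystander)

# Crash one child; the exit signal travels over the links.
GenServer.cast(crasher, :boom)

parent_down =
  receive do
    {:DOWN, ^parent_ref, :process, _pid, _reason} -> true
  after
    1_000 -> false
  end

bystander_down =
  receive do
    {:DOWN, ^bystander_ref, :process, _pid, _reason} -> true
  after
    1_000 -> false
  end

new_parent = LinkSketch.await_restart(LinkSketch.Parent, old_parent)

IO.inspect(parent_down, label: "parent went down")
IO.inspect(bystander_down, label: "other child went down")
IO.inspect(is_pid(new_parent) and new_parent != old_parent, label: "parent restarted")
```

Run as a script, this also logs the GenServer crash, much like the error output further down. The restarted parent starts with an empty map, so its children are only recreated when asked for again.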

I had a bug in some test setup code that occasionally caused a test to violate a Postgres constraint. The result was that every later test would fail. Here’s an example from a test that attempts to insert two animals with the same name:

      Factory.sql_insert!(:animal, [name: "foo"], @institution)
      Factory.sql_insert!(:animal, [name: "foo"], @institution)

First the constraint error:

..........................................................................................................................*..............................................................
14:36:26.209 [error] GenServer #PID<0.374.0> terminating
** (Ecto.ConstraintError) constraint error when attempting to insert struct:

    * unique_available_names (unique_constraint)

That’s as expected.

In this particular example, the next test produces this:

    ** (exit) exited in: GenServer.call(Crit.Sql.Servers, {:server_for, "critter4us"}, 5000)
         ** (EXIT) no process: the process is not alive or 
         there's no process currently associated with 
         the given name, possibly because its application isn't started

I’m interpreting that to mean that the prefix server crash did not take Servers down with it. (Servers is essentially a registry from institution name to prefix server pid, and so it’s directing its client to a dead process.)
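One way to check that reading is to look at what GenServer.call does when nothing is registered under the name it is given. A minimal sketch; the name :no_such_server_sketch is made up for illustration:

```elixir
# Call a name that no process is registered under and capture the exit.
reason =
  try do
    GenServer.call(:no_such_server_sketch, :ping)
  catch
    :exit, reason -> reason
  end

# The exit reason names the call that failed, including the default 5000 ms timeout.
{:noproc, {GenServer, :call, [:no_such_server_sketch, :ping, 5000]}} = reason

# Render the reason the way it appears in a test failure report.
IO.puts(Exception.format_exit(reason))
```

If the formatted exit has the same shape as the failure above, then the "no process" may refer to the name Crit.Sql.Servers itself rather than to a prefix server pid.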

Note that this test is in the same test module (Crit.Usables.AnimalImpl.ReadTest) as the originally failing test.

The next test, in the same module, fails in the same way.

The next test, also in the same module, fails differently, in DBConnection.Holder.checkout:

  3) test when fetching ids, missing ids are silently ignored (Crit.Usables.AnimalImpl.ReadTest)
     test/crit/usables/animal_impl/read_test.exs:123
     ** (exit) exited in: GenServer.call(Crit.Sql.Servers, {:server_for, "critter4us"}, 5000)
         ** (EXIT) exited in: DBConnection.Holder.checkout(#PID<0.1313.0>, [log: #Function<11.92802362/1 in Ecto.Adapters.SQL.with_log/3>, source: "institutions", timeout: 15000, pool_size: 10, pool: DBConnection.Ownership])
             ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started

I am not running async tests, but I am using the out-of-the-box setup for testing clients of Ecto.

Thereafter, many tests fail:

(RuntimeError) could not lookup Ecto repo Crit.Repo because it was not started or it does not exist

It doesn’t matter if the Repo-using test is in the original test module or a different test module.

Two questions:

Why didn’t Servers crash and get restarted?
Why the Ecto checkout error, which I’ve only ever seen in this case?

Details:

DataCase hasn’t been changed:

  setup tags do
    :ok = Ecto.Adapters.SQL.Sandbox.checkout(Crit.Repo)

    unless tags[:async] do
      Ecto.Adapters.SQL.Sandbox.mode(Crit.Repo, {:shared, self()})
    end

    :ok
  end

Note that I can’t use async: true on any of my database tests because it produces this:

.......................14:54:53.671 [error] GenServer #PID<0.374.0> terminating
** (DBConnection.OwnershipError) cannot find ownership process for #PID<0.374.0>.

… presumably because it’s a prefix server process, not the test process, doing the checkout. The resulting test slowness hasn’t annoyed me enough (yet) to fix it.

Servers is started in Crit.Application.start in what I think is the usual way:

  def start(_type, _args) do
    children = [
      ...
      {Crit.Sql.Servers, name: Crit.Sql.Servers}
    ]
    opts = [strategy: :one_for_one, name: Crit.Supervisor]
    Supervisor.start_link(children, opts)
  end

Servers starts a prefix server in what I think is the usual way:

    # In Servers
    {:ok, pid} = PrefixServer.start_link(prefix)

    # In PrefixServer
    def start_link(prefix),
      do: GenServer.start_link(__MODULE__, prefix)