Ecto timeout errors when wrapping into cachex calls in tests

cjk · January 1, 2019, 1:21pm

Hi there,

I use Cachex to cache very frequent queries:

@spec get_setting_for_forum(%Forum{}) :: %Setting{} | nil
def get_setting_for_forum(%Forum{} = forum) do
  Cachex.fetch!(:cforum, "settings/forums/#{forum.forum_id}", fn ->
    from(setting in Setting, where: setting.forum_id == ^forum.forum_id)
    |> Repo.one()
  end)
end

@spec get_setting_for_user(%User{}) :: %Setting{} | nil
def get_setting_for_user(%User{} = user) do
  Cachex.fetch!(:cforum, "settings/users/#{user.user_id}", fn ->
    from(setting in Setting, where: setting.user_id == ^user.user_id)
    |> Repo.one()
  end)
end

This function loads a settings object from the database. It works fine in production and dev code, but in tests I get this error:

** (Cachex.ExecutionError) connection not available and request was dropped from queue after 602ms. You can configure how long requests wait in the queue using :queue_target and :queue_interval. See DBConnection.start_link/2 for more information

I call the getter functions like this:

def load_relevant_settings(forum, user)

def load_relevant_settings(nil, nil) do
  [get_global_setting()]
  |> Enum.reject(&is_nil(&1))
end

def load_relevant_settings(%Forum{} = forum, nil) do
  [get_global_setting(), get_setting_for_forum(forum)]
  |> Enum.reject(&is_nil(&1))
end

def load_relevant_settings(nil, %User{} = user) do
  [get_global_setting(), get_setting_for_user(user)]
  |> Enum.reject(&is_nil(&1))
end

def load_relevant_settings(%Forum{} = forum, %User{} = user) do
  [get_global_setting(), get_setting_for_forum(forum), get_setting_for_user(user)]
  |> Enum.reject(&is_nil(&1))
end

I can’t figure out what’s going wrong. Has anybody an idea?

Edit: I already tried to set the pool size to a very high value (100), but it didn’t help.

Best regards,
CK

cjk · January 1, 2019, 1:48pm

I think I figured out what happens. The full stacktrace is as follows:

      (cachex) lib/cachex.ex:1407: Cachex.unwrap_unsafe/1
       (cforum) lib/cforum/config_manager.ex:278: Cforum.ConfigManager.conf/3
       (cforum) lib/cforum/config_manager.ex:262: Cforum.ConfigManager.conf/3
       (cforum) lib/cforum/forums/message_indexer_job.ex:21: anonymous fn/2 in Cforum.Forums.MessageIndexerJob.index_message/2
       (cforum) lib/cforum/forums/messages.ex:524: Cforum.Forums.Messages.index_message/2
       (cforum) lib/cforum/forums/threads.ex:431: Cforum.Forums.Threads.create_message/5
       (ecto_sql) lib/ecto/adapters/sql.ex:814: anonymous fn/3 in Ecto.Adapters.SQL.checkout_or_transaction/4
       (db_connection) lib/db_connection.ex:1349: DBConnection.run_transaction/4
       (cforum) lib/cforum/forums/threads.ex:403: Cforum.Forums.Threads.create_thread/5
       (cforum) lib/cforum_web/controllers/thread_controller.ex:139: CforumWeb.ThreadController.create_thread/2
       (cforum) lib/cforum_web/controllers/thread_controller.ex:1: CforumWeb.ThreadController.action/2
       (cforum) lib/cforum_web/controllers/thread_controller.ex:1: CforumWeb.ThreadController.phoenix_controller_pipeline/2
       (cforum) lib/cforum_web/endpoint.ex:1: CforumWeb.Endpoint.instrument/4
       (phoenix) lib/phoenix/router.ex:275: Phoenix.Router.__call__/1
       (cforum) lib/cforum_web/endpoint.ex:1: CforumWeb.Endpoint.plug_builder_call/2
       (cforum) lib/cforum_web/endpoint.ex:1: CforumWeb.Endpoint.call/2
       (phoenix) lib/phoenix/test/conn_test.ex:235: Phoenix.ConnTest.dispatch/5
       test/cforum_web/controllers/thread_controller_test.exs:92: (test)

The application is in the middle of a database operation. Since the Ecto.Adapters.SQL.Sandbox adapter wraps every test case in a giant transaction there is only one ecto process available, right? This would then lead to a deadlock when requesting multiple times for a database connection, wouldn’t it?

How can I circumvent this?

Edit: but if this is the case, why does everything work fine when not wrapping the getter functions in Cachex.fetch?

Best regards,
CK

cjk · January 1, 2019, 8:27pm

I might have found the reason why it works when not using Cachex. Cachex spawns its own process, and the fallback function for querying the database gets executed in the context of the cachex process. And this process has no access to the database connection. Is this assumption right? If yes, how do I fix that w/o calling Ecto.Adapters.SQL.Sandbox.allow?

benwilson512 · January 1, 2019, 10:01pm

You can do:

def get_setting_for_forum(%Forum{} = forum) do
  caller = self()
  Cachex.fetch!(:cforum, "settings/forums/#{forum.forum_id}", fn ->
    from(setting in Setting, where: setting.forum_id == ^forum.forum_id)
    |> Repo.one(caller: caller)
  end)
end

cjk · January 2, 2019, 6:01am

Hi Ben,

sadly that doesn’t change anything. I still get the timeouts:

  1) test create creates a thread (CforumWeb.ThreadControllerTest)
     test/cforum_web/controllers/thread_controller_test.exs:85
     ** (Cachex.ExecutionError) connection not available and request was dropped from queue after 692ms. You can configure how long requests wait in the queue using :queue_target and :queue_interval. See DBConnection.start_link/2 for more information                                                                                                                                                                                                                                    
     code: conn = post(conn, Path.thread_path(conn, :new, forum), message: params_for(:message, forum_id: nil))
     stacktrace:
       (cachex) lib/cachex.ex:1407: Cachex.unwrap_unsafe/1
       (cforum) lib/cforum/config_manager.ex:278: Cforum.ConfigManager.conf/3
       (cforum) lib/cforum/config_manager.ex:262: Cforum.ConfigManager.conf/3
       (cforum) lib/cforum/forums/message_indexer_job.ex:21: anonymous fn/2 in Cforum.Forums.MessageIndexerJob.index_message/2
       (cforum) lib/cforum/forums/messages.ex:524: Cforum.Forums.Messages.index_message/2
       (cforum) lib/cforum/forums/threads.ex:431: Cforum.Forums.Threads.create_message/5
       (ecto_sql) lib/ecto/adapters/sql.ex:814: anonymous fn/3 in Ecto.Adapters.SQL.checkout_or_transaction/4
       (db_connection) lib/db_connection.ex:1349: DBConnection.run_transaction/4
       (cforum) lib/cforum/forums/threads.ex:403: Cforum.Forums.Threads.create_thread/5
       (cforum) lib/cforum_web/controllers/thread_controller.ex:139: CforumWeb.ThreadController.create_thread/2
       (cforum) lib/cforum_web/controllers/thread_controller.ex:1: CforumWeb.ThreadController.action/2
       (cforum) lib/cforum_web/controllers/thread_controller.ex:1: CforumWeb.ThreadController.phoenix_controller_pipeline/2
       (cforum) lib/cforum_web/endpoint.ex:1: CforumWeb.Endpoint.instrument/4
       (phoenix) lib/phoenix/router.ex:275: Phoenix.Router.__call__/1
       (cforum) lib/cforum_web/endpoint.ex:1: CforumWeb.Endpoint.plug_builder_call/2
       (cforum) lib/cforum_web/endpoint.ex:1: CforumWeb.Endpoint.call/2
       (phoenix) lib/phoenix/test/conn_test.ex:235: Phoenix.ConnTest.dispatch/5
       test/cforum_web/controllers/thread_controller_test.exs:86: (test)

This issue causes me headaches

cjk · January 2, 2019, 6:32am

I could do something like this:

  def fetch(key, fun, cache \\ :cforum) do
    case Cachex.exists?(cache, key) do
      {:ok, true} ->
        Cachex.get!(cache, key)

      {:ok, false} ->
        val = fun.()
        Cachex.put!(cache, key, val)
        val

      {:error, _} = v ->
        raise v
    end
  end

This solves the issue but this just circumvents the problem. I’d like to understand what’s happening and how to fix it the right way[tm].

benwilson512 · January 2, 2019, 2:36pm

Oh sorry it’s a timeout issue not an ownership issue, that wasn’t clear to me initially since the first stack trace didn’t include the actual error message.

To clear up a misconception though:

there is only one ecto process available, right?

no. There is a pool of connections. This makes it possible to do concurrent tests. Each concurrent test case would checkout a different connection, and then each start their own transaction. This is why the caller: argument exists to Repo calls, it can indicate to Ecto which connection process to use.

Unfortunately none of that really helps answer the timeout question. I’ll have to dig into Cachex to see how Cachex.fetch! works.

cjk · January 2, 2019, 7:48pm

The ownership issue was just an assumption I made (thus the question mark ) because it works when called from outside of the process of cachex.

Well, I will dig deeper into cachex then, thanks.

lasseebert · May 10, 2019, 11:25am

I face the same issue, but not with Cachex, just with a normal ecto transaction.

I run a bunch of db operations inside a transaction. Inside the transaction I call some logic in another process that performs a db-lookup (in other tables). This works fine when not running tests, but in the test it dies with connection not available and request was dropped from queue after 989ms.

I have not found a way around this yet.

benwilson512 · May 10, 2019, 2:24pm

https://hexdocs.pm/ecto_sql/Ecto.Adapters.SQL.Sandbox.html#module-collaborating-processes

lasseebert · May 10, 2019, 8:07pm

Thanks @benwilson512, but I already use :shared mode on the connection, so it shouldn’t make a difference. (I also tried and it didn’t).

I don’t get the cannot find ownership process error, which would be solved with allow/3 or :shared, but a connection not available.

For now, I “solved” the issue by replacing the db-call inside the existing transaction with a mock that does not hit the db.

Here is a small example that reproduces my issue:

test "calling db from another process inside a transaction" do
  Repo.transaction(fn ->
    parent = self()

    spawn_link(fn ->
      User |> Repo.one()
      send(parent, :done_in_other_process)
    end)

    receive do
      :done_in_other_process -> :ok
    end

    :ok
  end)
end

This is the setup code that is run before each db test:

:ok = Ecto.Adapters.SQL.Sandbox.checkout(MyApp.Repo)
Ecto.Adapters.SQL.Sandbox.mode(MyApp.Repo, {:shared, self()})

It will give this error:

  1) test calling db from another process inside a transaction (MyApp.MyTest)
     test/my_test.exs:153
     ** (EXIT from #PID<0.497.0>) an exception was raised:
         ** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 989ms. You can configure how long requests wait in the queue using :queue_target and :queue_interval. See DBConnection.start_link/2 for more information
             (ecto_sql) lib/ecto/adapters/sql.ex:605: Ecto.Adapters.SQL.raise_sql_call_error/1
             (ecto_sql) lib/ecto/adapters/sql.ex:538: Ecto.Adapters.SQL.execute/5
             (ecto) lib/ecto/repo/queryable.ex:153: Ecto.Repo.Queryable.execute/4
             (ecto) lib/ecto/repo/queryable.ex:18: Ecto.Repo.Queryable.all/3
             (ecto) lib/ecto/repo/queryable.ex:67: Ecto.Repo.Queryable.one/3
             test/my_test.exs:158: anonymous fn/1 in MyApp.MyTest."test calling db from another process inside a transaction"/1

benwilson512 · May 10, 2019, 9:01pm

Ah yes I see, this is different. It happens because the transaction is hogging the connection process, so there aren’t any other available connections. I’m not entirely sure how to solve that

Is it possible to do all of that lookup information before you open the transaction? This will help minimize the amount of time the transaction is open which is itself a perk, and then your transaction can focus on just doing the changes to the database.

lasseebert · May 10, 2019, 9:32pm

It would be complex to do the lookup beforehand of the transaction, since it is some logic not directly related to the code in question that does the inner db lookup in another process.

Anyway, I solved this particular issue with a mock in my tests (and it runs fine in dev and prod)

benwilson512 · May 11, 2019, 1:27pm

Good solution!

thbar · March 1, 2021, 2:40pm

Hi,

First thanks to everyone in this thread! I’ve been faced with a similar situation (using Cachex with Ecto calls), and faced the same troubles with regard to ownership - all your input has been very useful.

I’m stacking up a bit more links in case it helps someone later, because Cachex has worked reliably for us.

In my cases, it isn’t pratical to pass caller: xyz to Ecto, because it would involve changing a lot of code calls. Maybe there is a trick to set the caller for all the subsequent calls in a given block maybe, but I haven’t searched for that yet.

As described in Ecto.Adapters.SQL.Sandbox — Ecto SQL v3.11.1, if one wants to keep their tests async, then the shared mode cannot be used, and allowance must be used instead.

It can be tricky because Cachex creates a number of nested processes per cache, which one can see with:

import Cachex.Spec, only: [cache: 1]
# ...
Cachex.Services.services(Cachex.Spec.cache(name: :my_cache))

which gives:

[
  {Cachex.Services.Janitor, #PID<0.734.0>, :worker, [Cachex.Services.Janitor]},
  {Cachex.Services.Courier, #PID<0.733.0>, :worker, [Cachex.Services.Courier]},
  {Cachex.Services.Locksmith.Queue, #PID<0.732.0>, :worker,
   [Cachex.Services.Locksmith.Queue]}
]

The function passed to Cachex.fetch to lazily-compute the value also (apparently) runs in a different process.

So far the trick I’ve used is to wrap the call to “fetch” in a helper which enables the allowance during tests:

defmodule MyApp.Cache do
  require Logger

  def fetch(cache_key, value_fn) do
    parent = self()

    comp_fn = fn key ->
      if Mix.env() == :test do
        Ecto.Adapters.SQL.Sandbox.allow(DB.Repo, parent, self())
      end

      Logger.info("Generating cached value for key #{key}")
      {:commit, value_fn.()}
    end

    {operation, result} = Cachex.fetch(:my_app_cache, cache_key, comp_fn, ttl: :timer.seconds(60))
    # SNIP

Ultimately I will extract a behaviour out of this module, and use a stub in most of the tests, only keeping a real Cachex-interacting test for the test of MyApp.Cache itself.

If anyone has better solutions or more information, I’ll happily read you!

al2o3cr · March 1, 2021, 2:59pm

One gotcha with this - AFAIK there’s no way to “un-allow” a connection, and Ecto gets grumpy if you call allow with a second PID. This makes for intermittently-failing tests when there’s a worker pool, as you’ll only get an error when a worker gets reused.

We ran into this issue with :poolboy; our fix was to have workers return :stop in tests so they don’t get re-used. This works, EXCEPT for very occasionally when there’s already a process waiting for a pool worker when the call returns; a fast-path optimization in Poolboy hands the now-exiting worker PID to the waiting process, which then fails