I use Repo.insert_all to bulk insert ~100m rows in chunks. After each chunk, some outdated data based on the insert’s RETURNING is deleted with Repo.delete_all.
The query that selects the rows to delete uses join: values(...) and has around 200-600 parameters, depending on the situation. These queries are homogeneous when you factor out the VALUES list.
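The delete step looks roughly like this (a simplified sketch with placeholder names – MyApp.Cleanup, the "records" table and the :key column are made up, and values/2 needs Ecto 3.12+):

defmodule MyApp.Cleanup do
  import Ecto.Query

  # Each distinct length of `returned` yields a different number of placeholders,
  # so every batch size ends up as its own prepared statement in the Repo cache.
  def delete_outdated(returned) do
    rows = Enum.map(returned, &%{key: &1.key})
    types = %{key: :string}

    from(r in "records",
      join: v in values(rows, types),
      on: r.key == v.key
    )
    |> MyApp.Repo.delete_all()
  end
end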
After inserting all rows, the ETS Repo cache table has 1841 objects and uses a whopping ~360 MB of memory. It looks like each distinct parameter count is PREPAREd and cached individually.
It seems unnecessary to cache these highly specific DELETE queries. Can I bypass the cache for these queries, or what other options do I have?
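For reference, this is roughly how I measured the cache size (a sketch that leans on Ecto internals: lookup_meta/1 and the :cache key of the adapter meta are not a public contract and may differ between versions):

# Rough measurement sketch; relies on Ecto internals as noted above.
meta = Ecto.Adapter.lookup_meta(MyApp.Repo)
cache = meta.cache

objects = :ets.info(cache, :size)
# :ets.info/2 reports memory in machine words, so convert to bytes.
bytes = :ets.info(cache, :memory) * :erlang.system_info(:wordsize)

IO.puts("#{objects} cached queries, ~#{div(bytes, 1024 * 1024)} MB")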
This isn’t exactly what you’re asking, but if you’re in the millions-of-rows category I would strongly suggest using COPY instead of bulk inserts. COPY is more performant at the Postgres level, and it also sidesteps the query cache issue you’re having.
Here is a helper module we have for this operation:
defmodule MyApp.PostgresBulkLoader do
  require Logger

  # Streams CSV lines into `table` via COPY ... FROM STDIN inside one transaction.
  def load(repo, table, stream, columns) do
    statement = """
    COPY #{table} (#{Enum.join(columns, ", ")})
    FROM STDIN
    WITH (FORMAT csv, HEADER false)
    """

    {:ok, :ok} =
      repo.transaction(
        fn ->
          Logger.debug(statement)

          # Ecto.Adapters.SQL.stream/2 is collectable, so we can pipe the CSV
          # lines straight into the COPY in chunks of 2000.
          stream
          |> Stream.chunk_every(2000, 2000, [])
          |> Stream.into(Ecto.Adapters.SQL.stream(repo, statement))
          |> Stream.run()
        end,
        # Generous timeout so very large loads aren't cut off mid-COPY.
        timeout: 3_600_000
      )

    :ok
  end
end
It’s mildly tedious because you basically have to build CSV for the data you’re ingesting, but it’s well worth it if you’re in the 100m-rows world.
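For example, a usage sketch (hypothetical "users" table and columns; NimbleCSV is just one way to get properly escaped CSV lines):

alias NimbleCSV.RFC4180, as: CSV

# Hypothetical rows for a made-up "users" table.
rows = [["1", "alice"], ["2", "bob"]]

rows
# dump_to_stream/1 emits one CSV line (as iodata) per row, with proper quoting.
|> CSV.dump_to_stream()
|> then(&MyApp.PostgresBulkLoader.load(MyApp.Repo, "users", &1, ["id", "name"]))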
Thank you @josevalim, @fuelen, @benwilson512. The Ecto fix is highly appreciated. I will try json_to_recordset before the new Ecto version is out. Looks like I can get away with much larger batches when I can squeeze the entire data set into one parameter.
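Roughly what I have in mind for the json_to_recordset variant (an untested sketch; the "records" table and key column are placeholders, and I’m assuming Jason for encoding):

import Ecto.Query

keys = ["a", "b", "c"]

# The whole key list travels as a single JSON parameter, so the SQL text (and
# therefore the cached prepared statement) is identical for every batch size.
json = Jason.encode!(Enum.map(keys, &%{key: &1}))

query =
  from(r in "records",
    join: v in fragment("SELECT * FROM json_to_recordset(?::json) AS x(key text)", ^json),
    on: r.key == v.key
  )

MyApp.Repo.delete_all(query)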
I will also try the COPY statement and figure out a new way to retrieve the data I previously obtained via RETURNING.
I love this community – it’s great that everyone is so involved.