Repo.stream triggers more queries than expected

I need to query my users table and perform some actions on the users. Since the table may be very big, I am opting for Repo.stream instead of Repo.all, with the goal of issuing smaller queries and avoiding holding all the results in memory at once.

My surprise came when I checked the queries sent to the DB and found that there were more of them than I initially expected.

Here is an example:

iex(1)> Repo.all(User, prefix: "tenant_demo") |> Enum.count()
[debug] QUERY OK source="users" db=17.0ms decode=33.4ms
SELECT u0."id", u0."email", u0."encrypted_password", u0."username", u0."first_name", u0."last_name" FROM "users" AS u0 []
862

So, at the moment I have 862 users in the database.

Since the default batch size for Repo.stream is 500 records, I expected the following code to send only 2 queries to the database. As you can see, 4 queries were executed instead:

iex(2)> Repo.transaction(fn -> Repo.stream(User, prefix: "tenant_demo") |> Enum.to_list() |> Enum.count() end)
[debug] QUERY OK db=0.9ms
begin []
[debug] QUERY OK source="users" db=3.6ms
SELECT u0."id", u0."email", u0."encrypted_password", u0."username", u0."first_name", u0."last_name", u0."avatar", u0."role", u0."gender", u0."birth", u0."birthplace", u0."joined_at", u0."left_at", u0."phone", u0."address", u0."city", u0."company_phone", u0."salary", u0."currency", u0."offboarding_at", u0."status", u0."id_card", u0."locale", u0."last_login_at", u0."week_start", u0."inserted_at", u0."updated_at", u0."absences_setting_id", u0."absences_location_id", u0."department_id", u0."country_id", u0."headquarter_id", u0."manager_id" FROM "tenant_demo"."users" AS u0 []
[debug] QUERY OK source="users" db=10.7ms decode=23.5ms
SELECT u0."id", u0."email", u0."encrypted_password", u0."username", u0."first_name", u0."last_name", u0."avatar", u0."role", u0."gender", u0."birth", u0."birthplace", u0."joined_at", u0."left_at", u0."phone", u0."address", u0."city", u0."company_phone", u0."salary", u0."currency", u0."offboarding_at", u0."status", u0."id_card", u0."locale", u0."last_login_at", u0."week_start", u0."inserted_at", u0."updated_at", u0."absences_setting_id", u0."absences_location_id", u0."department_id", u0."country_id", u0."headquarter_id", u0."manager_id" FROM "tenant_demo"."users" AS u0 []
[debug] QUERY OK source="users" db=8.9ms decode=42.7ms
SELECT u0."id", u0."email", u0."encrypted_password", u0."username", u0."first_name", u0."last_name", u0."avatar", u0."role", u0."gender", u0."birth", u0."birthplace", u0."joined_at", u0."left_at", u0."phone", u0."address", u0."city", u0."company_phone", u0."salary", u0."currency", u0."offboarding_at", u0."status", u0."id_card", u0."locale", u0."last_login_at", u0."week_start", u0."inserted_at", u0."updated_at", u0."absences_setting_id", u0."absences_location_id", u0."department_id", u0."country_id", u0."headquarter_id", u0."manager_id" FROM "tenant_demo"."users" AS u0 []
[debug] QUERY OK source="users" db=1.2ms
SELECT u0."id", u0."email", u0."encrypted_password", u0."username", u0."first_name", u0."last_name", u0."avatar", u0."role", u0."gender", u0."birth", u0."birthplace", u0."joined_at", u0."left_at", u0."phone", u0."address", u0."city", u0."company_phone", u0."salary", u0."currency", u0."offboarding_at", u0."status", u0."id_card", u0."locale", u0."last_login_at", u0."week_start", u0."inserted_at", u0."updated_at", u0."absences_setting_id", u0."absences_location_id", u0."department_id", u0."country_id", u0."headquarter_id", u0."manager_id" FROM "tenant_demo"."users" AS u0 []
[debug] QUERY OK db=1.4ms
commit []
{:ok, 862}

I’ve tested different batch sizes, and there are always 2 more queries than expected. For example, with a batch size of 100 I would expect 9 queries to be executed, but 11 were executed instead.
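For reference, the observed counts fit a simple formula: the number of per-batch fetches plus two. A minimal sketch (`expected_messages` is a hypothetical helper for these row counts, not an Ecto API):

```elixir
# Hypothetical helper: number of debug messages observed when streaming
# `rows` results with a given `max_rows` batch size — one message per
# batch fetch, plus two extra messages.
expected_messages = fn rows, max_rows -> ceil(rows / max_rows) + 2 end

expected_messages.(862, 500)
# => 4

expected_messages.(862, 100)
# => 11

# The batch size itself can be changed via the :max_rows option:
# Repo.stream(User, prefix: "tenant_demo", max_rows: 100)
```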

Is this behaviour expected? If so, is it documented anywhere? I’ve checked the docs for Ecto.Repo.stream with no luck so far.

I don’t know if it matters, but I am running PostgreSQL 9.5.

Thank you all! :bowing_man:


I have noticed the same thing. Have you figured out the reason for this?

I see there is some misunderstanding about this functionality, so I will try to explain it.

First of all, the debug messages are not a one-to-one record of what happens under the hood. In this case the query is not executed 4 times but only once. Ecto first transforms the call into a query, from which a prepared statement is created. That statement is then used to fetch the data. Thanks to this, the query is resolved only once and can later be referred to by its statement (which matters especially for more complicated queries). To summarize: there is only one query, resulting in one prepared statement, nothing more.
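The same lifecycle can be observed by driving Postgrex directly. This is only an illustrative sketch (the connection options, statement name, and query are made up):

```elixir
{:ok, conn} = Postgrex.start_link(hostname: "localhost", database: "demo")

Postgrex.transaction(conn, fn conn ->
  # Prepared once: parsing and planning happen here.
  query = Postgrex.prepare!(conn, "all_users", "SELECT id, email FROM users")

  # Each element of the stream is one fetch against a cursor declared
  # for the prepared statement; the statement itself is not re-parsed.
  conn
  |> Postgrex.stream(query, [], max_rows: 500)
  |> Enum.each(fn %Postgrex.Result{rows: rows} -> IO.inspect(length(rows)) end)
end)
```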

You may do a little experiment by yourself to verify it:

# At the beginning, we have to start a debugger
:dbg.stop_clear()

:dbg.tracer(
  :process,
  {fn
      {:trace, _pid, :call, {module, fun, args}}, _acc ->
        IO.puts("#{module}.#{fun}/#{length(args)}")

      _term, _acc ->
        nil
    end, nil}
)

:dbg.p(:all, :call)

:dbg.tp(Postgrex.Protocol, :_, :exception_trace)
:dbg.ctp(Postgrex.Protocol, :ping)

# Now we can run our code
Repo.transaction(fn -> Repo.stream(User) |> Enum.to_list() |> Enum.count() end)

We should get the following output:

Elixir.Postgrex.Protocol.handle_begin/2
[debug] QUERY OK db=3.1ms idle=1366.3ms
begin []
Elixir.Postgrex.Protocol.handle_prepare/3
Elixir.Postgrex.Protocol.handle_declare/4 # prepares a new statement
Elixir.Postgrex.Protocol.handle_fetch/4 # fetch data using the statement
[debug] QUERY OK source="users" db=2.2ms
SELECT u0."id", u0."bio", u0."email", u0."name", u0."number_of_pets", u0."inserted_at", u0."updated_at" FROM "users" AS u0 []
[debug] QUERY OK source="users" db=9.5ms
SELECT u0."id", u0."bio", u0."email", u0."name", u0."number_of_pets", u0."inserted_at", u0."updated_at" FROM "users" AS u0 []
Elixir.Postgrex.Protocol.handle_fetch/4 # fetch data using the statement
[debug] QUERY OK source="users" db=6.8ms
SELECT u0."id", u0."bio", u0."email", u0."name", u0."number_of_pets", u0."inserted_at", u0."updated_at" FROM "users" AS u0 []
Elixir.Postgrex.Protocol.handle_deallocate/4 # destroy the statement
[debug] QUERY OK source="users" db=2.1ms
SELECT u0."id", u0."bio", u0."email", u0."name", u0."number_of_pets", u0."inserted_at", u0."updated_at" FROM "users" AS u0 []
Elixir.Postgrex.Protocol.handle_commit/2
[debug] QUERY OK db=3.3ms
commit []

When it comes to the number of debug messages, we always receive 2 additional messages because there are 2 mandatory steps: :prepare_declare and :deallocate. You may inspect this in Ecto.Adapters.SQL.log/4 (by checking entry.call).
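If you prefer not to dig into Ecto internals, you can also watch the query events from the outside with a telemetry handler. A hedged sketch: the event name depends on your repo's configured :telemetry_prefix ([:my_app, :repo] is assumed here), and the available metadata keys vary across Ecto versions.

```elixir
# Attach a handler to the repo's query telemetry event and print
# the query text and its measured query time for every event.
:telemetry.attach(
  "inspect-query-events",
  [:my_app, :repo, :query],
  fn _event, measurements, metadata, _config ->
    IO.inspect({measurements[:query_time], metadata[:query]},
      label: "query event")
  end,
  nil
)
```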


Great explanation @lovyou :raised_hands:
Thank you!