I need to query my users table and perform some actions on each user. Since the table may be very large, I am opting for Repo.stream instead of Repo.all, with the goal of issuing smaller queries and avoiding holding all the results in memory at once.
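For context, this is roughly the pattern I'm using (simplified; `User` and `Repo` are my app's schema and repo, and `process_user/1` is a placeholder for the real work):

```elixir
# Repo.stream must run inside a transaction; it fetches rows
# lazily in batches (:max_rows defaults to 500 on Postgres)
# instead of loading the whole result set into memory.
Repo.transaction(fn ->
  User
  |> Repo.stream(prefix: "tenant_demo")
  |> Stream.each(&process_user/1)  # placeholder for per-user work
  |> Stream.run()
end)
```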
My surprise came when I checked the queries sent to the DB and found that there were more of them than I initially expected.
Here is an example:
iex(1)> Repo.all(User, prefix: "tenant_demo") |> Enum.count()
[debug] QUERY OK source="users" db=17.0ms decode=33.4ms
SELECT u0."id", u0."email", u0."encrypted_password", u0."username", u0."first_name", u0."last_name" FROM "users" AS u0 []
862
So, at the moment I have 862 users in the database.
Since the default batch size for Repo.stream
is 500 records, I expected the following code to send only 2 queries to the database. As you can see, 4 queries were executed instead:
iex(2)> Repo.transaction(fn -> Repo.stream(User, prefix: "tenant_demo") |> Enum.to_list() |> Enum.count() end)
[debug] QUERY OK db=0.9ms
begin []
[debug] QUERY OK source="users" db=3.6ms
SELECT u0."id", u0."email", u0."encrypted_password", u0."username", u0."first_name", u0."last_name", u0."avatar", u0."role", u0."gender", u0."birth", u0."birthplace", u0."joined_at", u0."left_at", u0."phone", u0."address", u0."city", u0."company_phone", u0."salary", u0."currency", u0."offboarding_at", u0."status", u0."id_card", u0."locale", u0."last_login_at", u0."week_start", u0."inserted_at", u0."updated_at", u0."absences_setting_id", u0."absences_location_id", u0."department_id", u0."country_id", u0."headquarter_id", u0."manager_id" FROM "tenant_demo"."users" AS u0 []
[debug] QUERY OK source="users" db=10.7ms decode=23.5ms
SELECT u0."id", u0."email", u0."encrypted_password", u0."username", u0."first_name", u0."last_name", u0."avatar", u0."role", u0."gender", u0."birth", u0."birthplace", u0."joined_at", u0."left_at", u0."phone", u0."address", u0."city", u0."company_phone", u0."salary", u0."currency", u0."offboarding_at", u0."status", u0."id_card", u0."locale", u0."last_login_at", u0."week_start", u0."inserted_at", u0."updated_at", u0."absences_setting_id", u0."absences_location_id", u0."department_id", u0."country_id", u0."headquarter_id", u0."manager_id" FROM "tenant_demo"."users" AS u0 []
[debug] QUERY OK source="users" db=8.9ms decode=42.7ms
SELECT u0."id", u0."email", u0."encrypted_password", u0."username", u0."first_name", u0."last_name", u0."avatar", u0."role", u0."gender", u0."birth", u0."birthplace", u0."joined_at", u0."left_at", u0."phone", u0."address", u0."city", u0."company_phone", u0."salary", u0."currency", u0."offboarding_at", u0."status", u0."id_card", u0."locale", u0."last_login_at", u0."week_start", u0."inserted_at", u0."updated_at", u0."absences_setting_id", u0."absences_location_id", u0."department_id", u0."country_id", u0."headquarter_id", u0."manager_id" FROM "tenant_demo"."users" AS u0 []
[debug] QUERY OK source="users" db=1.2ms
SELECT u0."id", u0."email", u0."encrypted_password", u0."username", u0."first_name", u0."last_name", u0."avatar", u0."role", u0."gender", u0."birth", u0."birthplace", u0."joined_at", u0."left_at", u0."phone", u0."address", u0."city", u0."company_phone", u0."salary", u0."currency", u0."offboarding_at", u0."status", u0."id_card", u0."locale", u0."last_login_at", u0."week_start", u0."inserted_at", u0."updated_at", u0."absences_setting_id", u0."absences_location_id", u0."department_id", u0."country_id", u0."headquarter_id", u0."manager_id" FROM "tenant_demo"."users" AS u0 []
[debug] QUERY OK db=1.4ms
commit []
{:ok, 862}
I’ve tested different batch sizes, and there are always 2 more queries than expected. For example, with a batch size of 100 I would expect 9 queries (⌈862 / 100⌉) to be executed, but instead there were 11.
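For reference, this is how I varied the batch size, using the `:max_rows` option that Repo.stream accepts:

```elixir
# With :max_rows set to 100 and 862 rows in the table, I expected
# ceil(862 / 100) = 9 SELECTs, but 11 were logged instead.
Repo.transaction(fn ->
  User
  |> Repo.stream(prefix: "tenant_demo", max_rows: 100)
  |> Enum.count()
end)
```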
Is this behaviour expected? If so, is it documented anywhere? I’ve checked the docs for Ecto.Repo.stream/2
with no luck so far.
I don’t know if it matters, but I am running PostgreSQL 9.5.
Thank you all!