Advantages of Ecto preloads with/without joins

jmurphyweb · April 19, 2019, 1:54pm

I have a slow query as my data has grown. I wanted to test joining before preloads to see if i could speed it up.

I did a test on some mock data, first with the join and the second without. It seems like the speed gain is minimal.

Are there any other benefits to joining first before preloads?

Is it more efficient for the databases memory usage if we join before preloading?

If I did some additional nested preloads would that start to affect things?

With join

iex(1)> query = from tc in Timecard, join: tc_d in Timecards.TimecardData, on: tc.id == tc_d.timecard_id, preload: [timecard_data: tc_d]

#Ecto.Query<from t0 in Ev2.Timecards.Timecard,
 join: t1 in Ev2.Timecards.TimecardData, on: t0.id == t1.timecard_id,
 preload: [timecard_data: t1]>

Has the following output:

iex(2)> Repo.all(query)
[debug] QUERY OK source="timecards" db=1242.5ms decode=760.5ms
SELECT t0."id", t0."date", ...,  t0."offer_id", t0."timecard_setting_id", t0."inserted_at", t0."updated_at", t1."id", t1."approver_id", ... , t1."updated_at" FROM "timecards" AS t0 INNER JOIN "timecard_data" AS t1 ON t0."id" = t1."timecard_id" []

Which looks like 1243ms.

Without join

Whereas the following with no joins and super simple syntax:

iex(3)> q2 = from tc in Timecard, preload: :timecard_data
#Ecto.Query<from t in Ev2.Timecards.Timecard, preload: [:timecard_data]>

Has the following output:

iex(4)> Repo.all(q2)
[debug] QUERY OK source="timecards" db=321.0ms decode=116.9ms
SELECT t0."id", t0."date", t0."week_ending_date", t0."datetime_submitted", t0."auto_submitted?", t0."auto_generated?", t0."employee_id", t0."proxy_creator_id", t0."offer_id", t0."timecard_setting_id", t0."inserted_at", t0."updated_at" FROM "timecards" AS t0 []
[debug] QUERY OK source="timecard_data" db=965.8ms decode=322.6ms
SELECT t0."id", t0."approver_id", ... , t0."timecard_id" FROM "timecard_data" AS t0 WHERE (t0."timecard_id" = ANY($1)) ORDER BY t0."timecard_id" [[15716, 15693, ...]]

Which to me looks like a total of 321 + 965 = 1286ms

Thanks for any input!

hubertlepicki · April 19, 2019, 2:16pm

I think the issue with join-preloads can arise when you are attempting to preload has_many or many-to-many relationships. For these, the resulting SQL will output the same column values multiple times. If these are large text/blobs, the amount of data to be transferred over the wire quickly grows. Ecto deduplicates that on it’s end when returing data to the user so it is largely invisible, but joins behind the scenes still create this huge result that has to be moved from DB to Elixir in first place.

So, I think it makes total sense to preload something like belongs_to with joins, and possibly associations that just have a few related (small) records could be beneficial but for has_many associations probably you generally don’t want to do join-preloads.

check out this post: https://www.amberbit.com/blog/2019/4/16/composing-ecto-queries-filters-and-preloads/ and discussion I had with @michalmuskala on twitter: https://twitter.com/michalmuskala/status/1118120234991403009

jmurphyweb · April 19, 2019, 2:46pm

Thanks a lot for the input @hubertlepicki . I like the article . Although couldn’t access the twitter discussion?

What is the advantage of ever using join before preload then? For example in your post:

If you then execute this code, you will get the list of Posts with their Authors already loaded, but importantly you will perform that operation using single SQL command behind the scenes:

hubertlepicki · April 19, 2019, 3:39pm

This operation is very likely to be faster, especially as the number of records grow larger.

al2o3cr · April 19, 2019, 5:56pm

One reason that hasn’t been brought up yet to preload from joins: if you’re running a query that needs to check the related records anyways. For instance, in the Timecard example up top you might be looking to retrieve all Timecards that have an associated TimecardData with a specific approver_id.