Preload multiple associations with only one join?

dli · November 28, 2024, 2:24pm

I have two schemata Person and Book.
Both Book.author_id and Book.illustrator_id reference Person.id.

I want to load all books and preload all associated people. My query looks like this:

from(b in Book,
  join: a in assoc(b, :author),
  join: i in assoc(b, :illustrator),
  preload: [author: a, illustrator: i]
)
|> Repo.all()

I have a very large number of books. Would it be possible to join Person only once and let an Ecto preload function handle the matching?

Strangely enough, the following code correctly preloads the author, but illustrator is always identical to author which is wrong.

join: p in Person,
on: p.id in [b.author_id, b.illustrator_id],
preload: [author: p, illustrator: p] # ???

LostKobrakai · November 28, 2024, 5:11pm

SQL will always return tabular data, no matter how hard one tries to think of it as nested or treelike.

So given a row with book details you add columns for person data using a join. But you only get enough columns for a single joined person. If you want to join another person you need to do another join, which will one more set of columns to the resulting data.

That’s why your preload doesn’t work as you expect. You’re joining either an author or an illustator and then telling ecto to use that single set of data for both relations.

dli · November 29, 2024, 12:05pm

If you want to join another person you need to do another join, which will one more set of columns to the resulting data.

Not sure about that. I know my approach isn’t right but I’ll leave some notes for other readers who encounter a similar situation:

This query gives me two rows per book, one with the joined author and one with the joined illustrator and only one set of Person columns.

SELECT b.*, p.*
FROM books b
LEFT JOIN people p ON p.id = b.author_id OR p.id = b.illustrator_id;

b.id	b.title	b.type	b.author_id	b.illustrator_id	p.id	p.name
1	War and Art	hardcover	1	2	1	Leo Wolfe
1	War and Art	hardcover	1	2	2	Maria Chen
2	Ocean Tales	paperback	2	4	2	Maria Chen
2	Ocean Tales	paperback	2	4	4	David Singh

All the data is right there. We don’t get a guaranteed order for the joined rows but can still match the IDs from the result set.

This kind of processing is similar to Ecto preloads with separate queries (no join), but without using a separate SELECT * FROM person WHERE person.id IN (1, 2, 3, ...) query.

This is a tradeoff between:

the cost of duplicate data transferred for Book entities
the cost of looking up a bunch of specific IDs (Ecto preload w/o join, using separate pass)
the cost of joining the same table twice

In my case, it turned out that the query above (one join) is less efficient than two joins. This means that here, the standard Ecto preload with two joins is ideal:

from(b in Book,
  join: a in assoc(b, :author),
  join: i in assoc(b, :illustrator),
  preload: [author: a, illustrator: i]
)
|> Repo.all()

Your mileage may vary! EXPLAIN ANALYZE is your friend.

LostKobrakai · November 29, 2024, 12:23pm

While this is generally true I doubt there’s many cases if at all where joining distinct data as separate rows over joining as separate columns will yield a performance boost. That’s something I’d only start touching as one of the later options once a query in question was proven to be a hotspot. Foreign keys on primary keys of other tables are hella fast, because the primary key is the the primary way to look up information on that table. It’s afaik always indexed and basically the first thing any db engine will make sure is fast.