Your thoughts on best practices around Ecto preloads

Rodeoclash · March 4, 2024, 11:27am

I’m interested in how others work with preloads in Ecto - I suppose what I’m looking for is some vague best practices around how to treat them and I figured getting some insight into how others approach them might give me a few ideas I could start to apply.

Currently, I just preload in the function that works with the relationship. This works in the sense that I always have the data I need no matter where I call the function from, however, I can end up in situations where I do a lot of loads. The worst case being a function that is called from inside of a loop of some kind which preloads every iteration.

Obviously the solution is to move the preload up a level to the parent so that when the data structure is passed into the function with the preload, nothing is done as the relationship already exists.

The problem here is that remembering to call the preloads from caller functions becomes kind of a manual process (i.e. keeping one eye on the logs to see the explosion of SELECT’s come in!)

A few thoughts on this.

Should I just be avoiding the Repo.preload form of the preload and relying on using preload when I construct my queries instead?
Have I possibly missed some kind of tool that can detect looped preloads and warn me?

Apologies if this is too vague. It’s not a show stopper for how I’m programming, just very interested in getting a good “mental model” around how to handle preloads as it’s one thing I’ve never felt I’ve done very neatly when working with ecto.

sbuttgereit · March 4, 2024, 2:44pm

Here’s my rule(s) of thumb for interacting with the database and Ecto preloads. Note that these are rules of thumb and my rule of thumb for rules of thumb are that there are no absolute rules of thumb (and I mean that absolutely).

Don’t retrieve data that you don’t need.

While it can be convenient during development to have all the data and decide later if you’ll need it or not, you will inevitably end up with the kinds of issues that will lead you ask the question you are asking. Also the kinds of performance related issues induced by over-querying tend to become more pronounced with load, meaning that queries that are acceptable during development may become less acceptable once in production. Depending on the nature of the preloads, this may be more pronounced because the latency of round-trips to the database server will tend to be high compared to other sources of time spent in the application… this also tends to be one of those things masked in development as compared to production where developers frequently develop on local databases, but in production the application and database are often not running on the same host and will have additional network latencies.

I don’t worry too much about over-querying fields of any one record, though for rarely needed fields I’ve started designating them (load_in_query: false) directly in the Ecto schema definitions.
I will typically write preloads as part of the query.

When I write a query for a particular purpose, I usually known what data I need and whether or not I want the database to perform a join operation to retrieve the data or if multiple queries will be more efficient for pre-loading. If it varies or the query might be reusable, I may make the preloads optional for the caller (data = get_data(preloads: true) or something to that effect) so that I don’t over-query the database.

I can absolutely see where Repo.preload could be helpful in avoiding unnecessary data retrieval… maybe do a permission check using the data of a retrieved parent record and then if it passes, call Repo.preload to get the additional data. But I wouldn’t use it all the time since I can specify it in the query… that’s really how I see the correct use of that function… being sure you need that extra data before retrieving it.

sodapopcan · March 4, 2024, 3:01pm

This all thumbs up very thumbly with my thumby rules of thumb.

I used to make myself a :preload option where I’d just pass the named preloads. Like so:

def list_records(opts \\ []) do
  preloads = Keyword.get(opts, :preload, [])

  Record
  |> Repo.all()
  |> Repo.preload(preloads)
end

This is of course a bit leaky but the templates know these keys anyway so I don’t hate it. I would still do this in certain situations but now I often just call separate context functions. It leads to clearer and easier to managed views IMO as you have separate assigns for the pertinent data. It’s really situational, though, and honestly I’m still learning here.

There are times when one thing doesn’t make any sense without the other in which case I will just always preload. For example I had a project that had Artwork and Artist. There was no situation where fetching artwork, either a single record or in a list, would ever make sense without the artist, so that gets a perma-preloaded.

sbuttgereit · March 4, 2024, 10:00pm

Sure. But the key is that you’re consciously recognizing a fact of the data’s nature: that there isn’t a case where having a record without its associations populated makes sense. Nothing wrong with that (or at least in principle nothing wrong with that).

What I recommend against is retrieving data for which you don’t have a well considered need.

I find in practice that I don’t use the optional :preload technique much myself. Like you, I typically will have functions built for specific purposes which eliminate the need for that approach. However, once I’m there, and if the query is sufficiently complex, I’ll have some private functions to deal with things taking advantage of the fact that Ecto allows us to compose queries (one of the bigger selling points in my opinion). So, if I have a number of different, but similar queries, and there is an identifiable complex base query behind them all, I’ll have a private function to just return the base Ecto query and then I’ll add to that in the more specialized functions including pre-loads additional joins, etc. as needed before sending it to the database. That way I’m not constantly writing variations of the same query and I stand a chance of easier maintenance if things change. But to be clear, there is a threshold where the query complexity has to be high enough for that composition approach to be beneficial. Trivial queries I find are just simpler to write the query whole.

sodapopcan · March 5, 2024, 5:00pm

Yep, I certainly don’t go looking for these scenarios I feel its probably somewhat rare. My case was a non-multitenant app where all (published) artwork was visible to everyone. There were several instances of different artworks by different artists having the exact same name, so there is literally no scenario where get_artwork would make sense without also getting the artist. If it was a platform for artists to privately manage their artwork that would be a very different story.

And ya, very aligned on everything else there! I’m also a big fan of pushing a good chunk of business logic to the database. For me all calculations and the vast majority of aggregations belong in the database and Ecto is very good at this. Not all would agree, but that’s ok

EDIT: Added a pretty key “don’t” in the first sentence