Preloading, some of the time, all of the time, none of the time?

overcomeoj · February 5, 2021, 9:06am

I have a conflict.

It seems that the scaffolds imply that your repo interactions should be in a context, and your controllers operate via the contexts.

This makes sense, except that most of the time, the controller knows more about what it wants from the repo.

Do I want just the User record?
Or maybe I want the User and all their posts.
Or just the first 10 posts and the count on those posts comments.
Maybe I just want the ID and email?

I could just keep adding to my get_user!(id) function:

def get_user!(id) do
  query = from u in User, where: u.id == ^id,
    preload: [:roles, posts: :comments, credentials: :key]
  Repo.one!(query)
end

but that starts to look like a real bad idea, really quickly.

Do I start writing get_user_with_posts and get_user_with_posts_comments? Maybe pass down to those after passing an options list to get_user?

Really it feels like I should just be writing Ecto queries right in the controller, getting what I want for each action, but that feels like I’m misbehaving.

Perhaps the cleanest way would be writing a lot of my own chainable “query builders”,

def query_for_user(id) do
end
def query_user_posts(query) do
end
...

query_for_user(1) |> query_user_posts() |> query_user_roles() |> query_run()

but that almost feels like I’m just writing a bad abstraction around Ecto.Query, for maybe zero real gain.

I know the answer is most often “dO whaT fiTs youR proJect”, which isn’t a bad answer, but I guess I’m just hoping for some more idiomatic guidance from better programmers.

LostKobrakai · February 5, 2021, 9:12am

Why do you think those are all the same request, and moreover one that has to be handled by querying for a user?

Do I want just the User record?

Accounts.get_user

Or maybe I want the User and all their posts.

Accounts.get_user + Blog.fetch_posts_by_author

Or just the first 10 posts and the count on those posts comments.

Accounts.get_user + Blog.fetch_latest_posts_by_author_with_comments

Maybe I just want the ID and email?

Accounts.get_user and don’t care about the few additional bits and boops you fetch from the db.

overcomeoj · February 5, 2021, 9:33am

I guess because I’ve been wired to think in relations/associations (cough objects…), my instinct is to want to “drill down” through a root node.

But yes, just seeing it written like that makes a lot of sense. I have some data, I can get the related data, but it doesn’t have to be through that data, which also makes testing less of a setup/teardown nightmare.

I guess if I were to do something like:

# in a template

for user <- users do
   <%= user.name..
   for post <- user.posts do
      <%= post.title...

I could instead do

for {user, posts} <- [{user, posts}...] do
   <%= user.name..
   for post <- posts do
      <%= post.title...

I know it’s most often case by case, but is it common to work that way in Phoenix? Get my user, then get the posts, and any notifications, pass those to the view, instead of getting the user and relying on its structure?

I think I have to rethink some of my schemas…

Edit: I guess join+preload isn’t possible in that style though. Do you just eat the cost until it’s an issue?
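
A minimal sketch of how the {user, posts} pairs from the template idea above could be built once users and posts have been fetched separately. It uses plain maps instead of Ecto structs so it runs on its own; the module name Pairing, the function name, and the user_id field on posts are assumptions for illustration, not anything from the thread.

```elixir
defmodule Pairing do
  # Pairs each user with its posts; users without posts get an empty list.
  def pair_users_with_posts(users, posts) do
    # One pass over the posts, grouped by the owning user's id.
    posts_by_user = Enum.group_by(posts, & &1.user_id)

    Enum.map(users, fn user ->
      {user, Map.get(posts_by_user, user.id, [])}
    end)
  end
end

users = [%{id: 1, name: "ann"}, %{id: 2, name: "bob"}]

posts = [
  %{id: 10, user_id: 1, title: "Hello"},
  %{id: 11, user_id: 1, title: "Again"}
]

for {user, user_posts} <- Pairing.pair_users_with_posts(users, posts) do
  titles = Enum.map_join(user_posts, ", ", & &1.title)
  IO.puts("#{user.name}: #{titles}")
end
```

This costs two queries (one for users, one for all their posts) rather than one query per user, which is roughly what a non-join preload does as well.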

adamzapasnik · February 5, 2021, 9:49am

Whenever I have problems like that I like to look into OSS. That’s how hexpm does it, and I follow that pattern.

Edit: I realize that it doesn’t answer your question fully, but it’s worth checking how others deal with it.

https://github.com/hexpm/hexpm/blob/b6d67275e6534d7f9bc3300e67ce07afc5d03bec/lib/hexpm/accounts/users.ex#L6


defmodule Hexpm.Accounts.Users do
  use Hexpm.Context
  alias Hexpm.Accounts.{RecoveryCode, TFA}
  def get(username_or_email, preload \\ []) do
    User.get(String.downcase(username_or_email), preload)
    |> Repo.one()
  end
  def public_get(username_or_email, preload \\ []) do
    User.public_get(String.downcase(username_or_email), preload)
    |> Repo.one()
  end
  def get_by_id(id, preload \\ []) do

LostKobrakai · February 5, 2021, 9:56am

It’s likely going to be a mixture. See e.g. this hexpm code. It fetches data from various places, but within related functionality still uses preloads.

https://github.com/hexpm/hexpm/blob/b6d67275e6534d7f9bc3300e67ce07afc5d03bec/lib/hexpm_web/controllers/package_controller.ex#L118-L201


      
defp package(conn, repositories, package, releases, release, type) do
  repository = package.repository
  release = Releases.preload(release, [:requirements, :downloads, :publisher])

  latest_release_with_docs =
    Release.latest_version(releases, only_stable: true, unstable_fallback: true, with_docs: true)

  docs_assigns =
    cond do
      type == :package && latest_release_with_docs ->
        [
          docs_html_url: Hexpm.Utils.docs_html_url(repository, package, nil),
          docs_tarball_url:
            Hexpm.Utils.docs_tarball_url(repository, package, latest_release_with_docs)
        ]

      type == :release and release.has_docs ->
        [
          docs_html_url: Hexpm.Utils.docs_html_url(repository, package, release),
          docs_tarball_url: Hexpm.Utils.docs_tarball_url(repository, package, release)

This file has been truncated.

dimitarvp · February 5, 2021, 12:04pm

You are not writing an abstraction around Ecto.Query. You are making your own pipeline through which to enrich your query before it hits the DB. Pretty normal stuff, and a good practice too.

Have a granular API. Be explicit. No magic, no implicit behaviour – resist those, always. If you need to add three separate preloads and a sorting clause, pipe them all in the right endpoint code, one by one. A little more writing? Meh. At least everyone who checks the code (including your future self) will grasp it at first glance. Our time and attention are precious and scarce – their focus and efficiency must be prioritized.

Write code that’s immediately recognizable by a human.

Looking at your post, you are on the right path. Here’s how I’d slightly rewrite your code:

Users.get_query(id)
|> Users.preload_posts()
|> Users.preload_roles()
|> Repo.one()

IMO you don’t need to wrap Repo.one or Repo.all in your own functions, so your query_run() might be superfluous.
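
For illustration, here is one way the functions behind such a pipeline might look. This is a sketch under assumptions: the Post, Role and User schemas and their fields are made up, and Mix.install/1 (Elixir 1.12+) is only there so the file runs as a single script. No Repo is configured, so the script stops at building the query.

```elixir
# Run with `elixir users_query.exs`; Mix.install/1 fetches Ecto for the script.
Mix.install([{:ecto, "~> 3.0"}])

# Hypothetical schemas, defined leaf-first so the associations resolve.
defmodule Post do
  use Ecto.Schema

  schema "posts" do
    field :title, :string
    field :user_id, :integer
  end
end

defmodule Role do
  use Ecto.Schema

  schema "roles" do
    field :name, :string
    field :user_id, :integer
  end
end

defmodule User do
  use Ecto.Schema

  schema "users" do
    field :email, :string
    has_many :posts, Post
    has_many :roles, Role
  end
end

defmodule Users do
  import Ecto.Query

  # Base query for one user; nothing hits the database here.
  def get_query(id), do: from(u in User, where: u.id == ^id)

  # Each step takes a query and returns an enriched query.
  def preload_posts(query), do: preload(query, :posts)
  def preload_roles(query), do: preload(query, :roles)
end

query = 1 |> Users.get_query() |> Users.preload_posts() |> Users.preload_roles()

# In an application this would end with `|> Repo.one()` instead.
IO.inspect(query)
```

Because every step is a plain function from query to query, the controller can pick exactly the steps it needs and the context stays free of one-off get_user_with_* variants.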

olivermt · February 5, 2021, 12:55pm

I use GraphQL and let the controller (LiveView) decide all of that, without needing anything but batching support in the service/context layer.

dimitarvp · February 5, 2021, 1:00pm

Agreed. GraphQL isn’t trivial to make cache-friendly but it absolutely excels at workflows like these where your queries can vary wildly and the frontend wants more control over them.

baldwindavid · February 5, 2021, 11:10pm

I try to be explicit using this pattern…

# Context
def list_posts(queries \\ & &1) do
  Post
  |> queries.()
  |> Repo.all()
end

def filter_posts_by_author(query, author)...
def order_posts_by_published_at(query)...
def preload_post_collaborators(query)...

# Controller
posts = Blogging.list_posts(fn query ->
  query
  |> Blogging.filter_posts_by_author(user)
  |> Blogging.order_posts_by_published_at()
  |> Blogging.preload_post_collaborators()
end)

I’ll typically end up with list_*, get, and get! and other query functions that all take that optional argument. By default, the query will just be unscoped. The filter, order, and preload functions follow a consistent naming convention.

dorgan · February 5, 2021, 11:26pm

While I like this pattern because it’s quite “ergonomic” and gets the job done, doesn’t it leak the fact that you’re retrieving data from the db/working with a query?

Like, if you need to fetch some of the data before loading associations (like lateral preloads with limits, i.e. the top 2 comments per blog post, or computing virtual fields), you have to first use that pattern and then pipe outside of the lambda:

posts = Blogging.list_posts(fn query ->
  query
  |> Blogging.filter_posts_by_author(user)
  |> Blogging.order_posts_by_published_at()
  |> Blogging.preload_post_collaborators()
end)
|> Blogging.preload_top_comments(limit: 2)

Not a big deal, but I’m curious

dimitarvp · February 5, 2021, 11:39pm

Yes it does, and that’s okay.

There are ways to abstract it further and I’ve gone down that rabbit hole several times. My conclusion is that most of the time it’s not worth it.

dorgan · February 5, 2021, 11:41pm

I concluded pretty much the same. I’ve seen codebases like the one from changelog.org directly use the Repo in the controllers. So while it’s often recommended to “hide” the repo and db stuff in the contexts, it seems in many cases convenience wins over that rule. So I more or less wanted to see if that’s generally a concern.

dimitarvp · February 6, 2021, 1:35am

Realistically, you can hide the DB aspect if you really want to – namely abstract away any Repo function calls. Using all the Ecto.Changeset machinery by itself doesn’t bind you to any DB at all. People are using changesets in database-less applications all the time, with great success – me included.

So there are still two layers at play, and they are best captured by the two separate libraries: ecto and ecto_sql. As long as you are using stuff from ecto you can still make heavy use of its conveniences without being bound to a DB.
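
A small sketch of what database-less changeset use can look like, using Ecto's schemaless changesets (a {data, types} tuple passed to cast/3). The SignupForm module, its fields and its validation rules are invented for illustration; only the ecto package is installed, with no ecto_sql and no Repo.

```elixir
# Run as a single script; Mix.install/1 needs Elixir 1.12+.
Mix.install([{:ecto, "~> 3.0"}])

defmodule SignupForm do
  import Ecto.Changeset

  # Field types for the schemaless changeset; no schema module, no table.
  @types %{email: :string, age: :integer}

  # Casts and validates external params, returning
  # {:ok, map} or {:error, changeset}.
  def validate(params) do
    {%{}, @types}
    |> cast(params, Map.keys(@types))
    |> validate_required([:email])
    |> validate_format(:email, ~r/@/)
    |> validate_number(:age, greater_than_or_equal_to: 18)
    |> apply_action(:insert)
  end
end

IO.inspect(SignupForm.validate(%{"email" => "ann@example.com", "age" => "30"}))
IO.inspect(SignupForm.validate(%{"age" => "12"}))
```

The first call casts the string "30" to an integer and succeeds; the second fails validation and returns a changeset carrying the errors, all without touching a database.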

baldwindavid · February 6, 2021, 2:23am

This is certainly no silver bullet, but I don’t think this pattern necessarily lends itself to leaking knowledge of the query/db. It is just that some of the function names (preload_*) and the argument name (query) I have used do.

You’re probably more so getting at the concept of the controller needing to split that into separate operations though. I guess I’m just not bothered at all by piping to Blogging.preload_top_comments. If I was somehow thrown into this codebase and could only see the public interface for Blogging, all I’d know or care is that list_posts was a function that took an anonymous function with a pipe of other functions and that preload_top_comments takes a list of Post structs.

That being said, if some super-specialized operation/query/multi is better served by a well-named dedicated function I’ll do that rather than trying to jam it into this pattern.

I’ve also at various times tried to avoid leaking any reference to the database (token_operator for example), but the additional layer of abstraction can make it more difficult to reason about and refactor / find unused functions.

nthock · February 6, 2021, 4:50am

What I did is add a helper function to my context, something like this:

  def preload_with(%User{} = user, keys) do
    user
    |> Repo.preload(keys)
  end

Whenever I need to preload, I will call this function:

user_with_posts = 
  user_id
  |> Accounts.get_user()
  |> Accounts.preload_with([:posts])

I am still at an early stage of my project, where the behaviours are not clearly defined yet.

LostKobrakai · February 6, 2021, 8:27am

I wrote a blog post about another option yesterday:

https://kobrakai.de/kolumne/data-fetching-using-livecomponents/

aseigo · February 7, 2021, 12:08am

Something I do in certain cases (which I have seen so far in the comments …) is to pass a preload option in that takes a simple list of atoms that represent the preloads I want. In the getter function, it passes the base query and the options passed in to a function that then dispatches on the preloads requested:

  defp preloads(query, opts), do: preloads(query, opts, Keyword.get(opts, :preload, []))

  defp preloads(query, _opts, []), do: query
  defp preloads(query, opts, [:boosts | rest]) do
    preloads(from(q in query, preload: :boosts), opts, rest)
  end
  defp preloads(query, opts, [_ | rest]) do
    preloads(query, opts, rest)
  end

You get the idea. I also pass in all the options, as you may notice, since sometimes other options influence how a preload gets implemented.

The benefits of this approach, I have found, are several:

Lean: I only write the common getter functions, and don’t have to worry about every specialization as those are covered by the preload option
Stable: Adding more preloads doesn’t require changing the API at all
DRY: I can do “fancy” things in those preloads, such as enforce relevant filtering or sorting, do sub-field preloading, etc. preventing (often fragile) duplicate code at the call-sites.
User-pay: If a caller doesn’t request any preloads … it pays no performance penalties.
Explicit: It is clear at every callsite what data is being requested

I don’t do this for every schema, obviously. Just the ones that have larger sets of assocs and/or which get used a lot in the code base. It is a useful pattern that I have noticed I use regularly, however.
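
To make the call side concrete, here is a sketch of a getter wired to that dispatch function. The Boost and Item schemas and the Items.get_query/2 name are assumptions for illustration; a real getter would finish with Repo.one or similar, which is left out here so the script runs without a database.

```elixir
# Run as a single script; Mix.install/1 needs Elixir 1.12+.
Mix.install([{:ecto, "~> 3.0"}])

# Hypothetical schemas, just enough for the :boosts association to exist.
defmodule Boost do
  use Ecto.Schema

  schema "boosts" do
    field :item_id, :integer
  end
end

defmodule Item do
  use Ecto.Schema

  schema "items" do
    field :title, :string
    has_many :boosts, Boost
  end
end

defmodule Items do
  import Ecto.Query

  # Builds the query only; an application would pipe this into Repo.one/1.
  def get_query(id, opts \\ []) do
    from(i in Item, where: i.id == ^id)
    |> preloads(opts)
  end

  # Dispatch on each requested preload; unknown names are ignored.
  defp preloads(query, opts), do: preloads(query, opts, Keyword.get(opts, :preload, []))

  defp preloads(query, _opts, []), do: query

  defp preloads(query, opts, [:boosts | rest]) do
    preloads(from(q in query, preload: :boosts), opts, rest)
  end

  defp preloads(query, opts, [_ | rest]), do: preloads(query, opts, rest)
end

# No preload requested: the base query is returned untouched.
IO.inspect(Items.get_query(1))

# The caller states what it needs at the call site.
IO.inspect(Items.get_query(1, preload: [:boosts]))
```

Silently ignoring unknown preload names keeps callers from crashing, but raising in the catch-all clause instead may be preferable if typos at call sites are a concern.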

slouchpie · February 18, 2021, 3:39am

It’s a nice idea to have a separate component for different data-loading “contexts” but I would be hesitant to add a virtual field to a schema just for something like job_count.

LostKobrakai · February 18, 2021, 7:38am

Me too. But it’s what I’ve done in the past and I wanted to show a range of options even if not all are equally viable for the specific example. Take e.g. a computed value for the virtual field from the schema data itself and it might suddenly feel much more doable.