How best to write context functions for more complex cases?

TL;DR: Is this a good way to write context functions, especially when preloads are nested or there is some other complexity? If not, what is the way to go?

I was reading the topic Preloading, some of the time, all of the time, none of the time? from some years ago. It mentions a number of different approaches to implementing context functions, and also covers when to preload and why, as its title suggests.

I rewrote a get_post/2 function of mine because, after having read the mentioned topic, among others, I felt it needed improvement.

How close is this to what could be considered good practice? Am I missing something still? For example, will this approach bite me later on, when requirements shift?

One thing I noticed myself is that my context module now has a lot more private functions in it than before. Those distract a bit from the top level functions that are actually the ones that I would call from the web layer. Do you put these private functions somewhere else? For example, under the post schema in Post.ex?

Quick background: a post has many post replies, and a post reply has many post subreplies. The post and post reply schemas each have a virtual field for the (sub)reply count.
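
For concreteness, the schemas look roughly like this (a sketch; module and table names here are my shorthand):

    defmodule MyApp.Posts.Post do
      use Ecto.Schema

      schema "posts" do
        belongs_to :user, MyApp.Accounts.User
        has_many :replies, MyApp.Posts.PostReply
        # Virtual: not a column, filled in by select_merge in the queries below.
        field :reply_count, :integer, virtual: true
        timestamps()
      end
    end

    defmodule MyApp.Posts.PostReply do
      use Ecto.Schema

      schema "post_replies" do
        belongs_to :user, MyApp.Accounts.User
        belongs_to :post, MyApp.Posts.Post
        has_many :subreplies, MyApp.Posts.PostSubreply
        field :subreply_count, :integer, virtual: true
        timestamps()
      end
    end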

@doc """
  Returns the post with the given `id`.

  ## Options
  * `:preload_user` - either `:all` to preload the full user association of the post, its replies, and its subreplies, or a list of fields to select from the user table
  * `:preload_replies` - a boolean indicating whether to preload the replies association (default: false)
  * `:preload_subreplies` - a boolean indicating whether to preload the subreplies association (default: false)
  * `:with_reply_count` - a boolean indicating whether to preload `:reply_count` and `:subreply_count` virtual fields (default: false)

  ## Example

      %Post{} = Posts.get_post(
        post_id,
        preload_user: [:id, :avatar, :username],
        preload_replies: true,
        preload_subreplies: true,
        with_reply_count: true
      )

  """

  def get_post(id, opts \\ []) do
    from(p in Post, where: p.id == ^id)
    |> add_reply_count(opts)
    |> preload_user(opts)
    |> preload_replies(opts)
    |> preload_subreplies(opts)
    |> Repo.one()
  end

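  # Counts the post's replies and their subreplies together into :reply_count.
  # The :distinct on reply.id compensates for the row fan-out caused by the
  # second left join.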
  defp add_reply_count(query, opts) do
    case Keyword.get(opts, :with_reply_count, false) do
      true ->
        from post in query,
          left_join: reply in assoc(post, :replies),
          left_join: subreply in assoc(reply, :subreplies),
          group_by: [post.id],
          select_merge: %{reply_count: count(reply.id, :distinct) + count(subreply.id)}

      _ ->
        query
    end
  end

  defp preload_user(query, opts) do
    preload_user = Keyword.get(opts, :preload_user)

    case preload_user do
      :all ->
        from q in query,
          preload: [:user]

      nil ->
        query

      fields ->
        user_query =
          from u in User,
            select: ^fields

        from q in query,
          preload: [user: ^user_query]
    end
  end

  defp preload_replies(query, opts) do
    case Keyword.get(opts, :preload_replies) do
      true ->
        replies_query =
          from(PostReply)
          |> sort_by_inserted_at()
          |> add_subreply_count(opts)
          |> preload_user(opts)

        from post in query,
          preload: [replies: ^replies_query]

      _ ->
        query
    end
  end

  defp sort_by_inserted_at(query) do
    from q in query,
      order_by: [asc: q.inserted_at]
  end

  defp add_subreply_count(query, opts) do
    case Keyword.get(opts, :with_reply_count, false) do
      true ->
        from reply in query,
          left_join: subreply in assoc(reply, :subreplies),
          group_by: [reply.id],
          select_merge: %{subreply_count: count(subreply.id)}

      _ ->
        query
    end
  end

  defp preload_subreplies(query, opts) do
    case Keyword.get(opts, :preload_subreplies) do
      true ->
        subreplies_query =
          from(PostSubreply)
          |> sort_by_inserted_at()
          |> preload_user(opts)

        from post in query,
          preload: [replies: [subreplies: ^subreplies_query]]

      _ ->
        query
    end
  end
1 Like

I’m not sure there’s much consensus on how exactly to write that portion of your codebase.

The best way to prepare for future requirements is to make it easy to throw away the current code and replace it wholesale (https://www.youtube.com/watch?v=1FPsJ-if2RU). There’s no way to know what future requirements will be, so trying to cater to them is guesswork at best.

On a general note, I’d always suggest starting with less abstraction (less complex parameters) and more distinct functions rather than the other way round. It leads to the above, but also means you discover useful abstractions rather than imagining them to be useful. Simpler, more distinct functions should help against “just let this existing function do one more thing”.
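
For example, applied to the getter above, “more distinct functions” could mean something as plain as this (a sketch, assuming the usual Repo and Post aliases):

    # Separate intention-revealing getters instead of one option-driven one:
    def get_post(id), do: Repo.get(Post, id)

    def get_post_with_replies(id) do
      Post
      |> Repo.get(id)
      |> Repo.preload(replies: [:subreplies])
    end

Each new requirement then gets its own named function instead of another option.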

Your example would lend itself to extracting common query-manipulating functions into their own module. My suggestion for learning how to do layering in code (without necessarily buying into buzzword architecture) would be buying “Grokking Functional Programming”. It has a few great chapters on how to build larger stuff out of smaller pieces and how those layers should depend on each other (or not).

5 Likes

It definitely felt like I was writing an unnecessarily general get_post/2 function. In reality I currently only need to be able to do the following.

Posts.get_post(
  post_id,
  preload_user: [:id, :avatar, :username],
  preload_replies: true,
  preload_subreplies: true,
  with_reply_count: true
)

I got tempted…

:roll_eyes:

Maybe silly, but thinking of a good name for the function that would be equivalent to

Posts.get_post(
  post_id,
  preload_user: [:id, :avatar, :username],
  preload_replies: true,
  preload_subreplies: true,
  with_reply_count: true
)

threw me off.

:sweat_smile:

If it were me, I would create my queries separately from my context functions and build out the semantic context functions using the queries and schema modules.

In my query module I would expose those preload functions and let the caller (the context module) decide what they need by chaining them together. I would allow specifying options to those query functions (like fields to return), and perhaps some sort and aggregate helpers also.

Contexts should be high-level enough that, for callers (e.g. LiveView or controller actions), it would not matter if your entire backend storage layer changed. Generally I think of context functions as orchestrating actions. It shouldn’t matter if they involve multiple resources or a mix of Ecto, external APIs, sending email, or PubSub notifications.

7 Likes

Like so?

@doc """
Gets the data of a post that is necessary for rendering a dialog component.
  
## Options
  ...all (preload) options...
  ...preloading is not optional anymore, since the function is now specialized...

"""
  def get_post_for_dialog(id, opts \\ []) do
    from(p in Post, where: p.id == ^id)
    |> Posts.Query.add_reply_count()
    |> Posts.Query.preload_user(opts)
    |> Posts.Query.preload_replies(opts)
    |> Posts.Query.preload_subreplies(opts)
    |> Repo.one()
  end

Or without passing the opts to get_post_for_dialog, like this, so that reading the function becomes more semantic (it explains more completely what to expect the function to do/return)?

  def get_post_for_dialog(id) do
    from(p in Post, where: p.id == ^id)
    |> Query.add_reply_count()
    |> Query.preload_user(fields: [:id, :avatar, :username])
    |> Query.preload_replies(with_user?: true, with_subreply_count?: true)
    |> Query.preload_subreplies(with_user?: true)
    |> Repo.one()
  end

Edit: I use a UI concept in the function name. Is that cursed?
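
For reference, the Posts.Query module behind that second version could be sketched like this (the fields:, with_user?:, and with_subreply_count?: options are the ones used above; everything else is assumed, and preload_subreplies/2 would follow the same pattern):

    defmodule MyApp.Posts.Query do
      import Ecto.Query

      alias MyApp.Accounts.User
      alias MyApp.Posts.PostReply

      def add_reply_count(query) do
        from post in query,
          left_join: reply in assoc(post, :replies),
          left_join: subreply in assoc(reply, :subreplies),
          group_by: post.id,
          select_merge: %{reply_count: count(reply.id, :distinct) + count(subreply.id)}
      end

      def preload_user(query, opts) do
        # Only the requested user fields are selected, e.g. fields: [:id, :avatar].
        user_query = from u in User, select: ^Keyword.fetch!(opts, :fields)
        from q in query, preload: [user: ^user_query]
      end

      def preload_replies(query, opts) do
        replies_query =
          PostReply
          |> order_by(asc: :inserted_at)
          |> maybe_preload_user(opts[:with_user?])
          |> maybe_add_subreply_count(opts[:with_subreply_count?])

        from post in query, preload: [replies: ^replies_query]
      end

      defp maybe_preload_user(query, true), do: from(q in query, preload: [:user])
      defp maybe_preload_user(query, _), do: query

      defp maybe_add_subreply_count(query, true) do
        from reply in query,
          left_join: subreply in assoc(reply, :subreplies),
          group_by: reply.id,
          select_merge: %{subreply_count: count(subreply.id)}
      end

      defp maybe_add_subreply_count(query, _), do: query
    end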

We are doing something similar at work, but we generate all the functions and they follow the same naming convention.

The problem with

  def get_post(id, opts \\ []) do
    from(p in Post, where: p.id == ^id)
    |> add_reply_count(opts)
    |> preload_user(opts)
    |> preload_replies(opts)
    |> preload_subreplies(opts)
    |> Repo.one()
  end

is that you can’t reuse it in other contexts, as it is too specific to the Post schema.

We have something like this in all contexts:

use MyApp.Context,
    queries: MyApp.Queries.ActionQueries,
    schema: MyApp.Schemas.Action

And it generates def get_action(action_id, opts \\ []) and def list_actions(opts \\ []) and some other stuff based on the schema name.
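
A stripped-down sketch of what such a macro could look like (my own assumption of the mechanics, ignoring the queries: option and everything else the real one does):

    defmodule MyApp.Context do
      # Generates get_<name>/2 and list_<name>s/1 from the schema module name.
      defmacro __using__(opts) do
        schema = Macro.expand(Keyword.fetch!(opts, :schema), __CALLER__)
        name = schema |> Module.split() |> List.last() |> Macro.underscore()

        quote do
          def unquote(:"get_#{name}")(id, _opts \\ []) do
            MyApp.Repo.get(unquote(schema), id)
          end

          def unquote(:"list_#{name}s")(_opts \\ []) do
            MyApp.Repo.all(unquote(schema))
          end
        end
      end
    end

With that, use MyApp.Context, schema: MyApp.Schemas.Action defines get_action/2 and list_actions/1.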

3 Likes

That makes a lot of sense, though that’s easier to do when the code is quicker to write and easier to read, e.g. using something like EctoSparkles (ecto_sparkles v0.1.2, shameless plug) for joins/preloads, which is less verbose than the standard Ecto syntax.

Pretty much either of those is OK, as is what @egze provided.

The key is to be declarative: specify the what, not the how, without exposing the implementation. So specifying the fields or aggregates or order you want is fine.

But also think beyond just Ecto concerns in your contexts. You may also notify on certain events using Phoenix PubSub topics, send emails, orchestrate additional activities via Oban, call webhooks for integrations, and so on.

Some other things to think about…

Think beyond CRUD concepts. If you had, say, a ticketing system, you would have semantic actions like open ticket and close ticket (see the sketch after this list).

Think about access control enforcement in your contexts.

Aim for thin controllers and LiveViews, and fat contexts where the guts of your application and its logic live.
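
Putting those points together, a semantic context action could look roughly like this (every module and function name here is hypothetical):

    # Close a ticket: enforce access, persist the change, then notify listeners.
    def close_ticket(%Ticket{} = ticket, %User{} = actor) do
      with :ok <- authorize(actor, :close, ticket),
           {:ok, closed} <- ticket |> Ticket.close_changeset(actor) |> Repo.update() do
        Phoenix.PubSub.broadcast(MyApp.PubSub, "tickets", {:ticket_closed, closed.id})
        {:ok, closed}
      end
    end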

3 Likes

I typically create functions that are tailored to their business purpose (or, ahem, context) as opposed to generic functions that take all sorts of parameters. In your example you have way too many permutations, and I imagine you aren’t going to be using even half of them (maybe only 2 or 3?). Write getters that describe the purpose you are getting the data for (and I’m now seeing @adw632 has just said the same thing as I was writing this).

I also sometimes forgo preloads and just run separate queries from separate context functions, as in the sketch below. Contrary to popular belief, running a single query isn’t guaranteed to be more efficient than running multiple. Not only is this true of SQL, it’s especially true in Ecto, since a single query will return a single, denormalized table which must then be reduced in Elixir. This is slooow for larger datasets. In any event, this is totally situational but can work well in some cases. I would say Posts and Replies is one of these situations, depending on how you have your data set up. Like, if creating a new reply is done through Posts.update_post(params) where there are cast_assocs all the way down, then obviously you need to preload. But if you have more specific, business-named functions like Posts.reply(user, post, params) or Replies.reply_to_post(user, post, params), then you aren’t gaining much from preloading (if you’re doing the multiple-query kind) other than being able to do @post.replies instead of just @replies.
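
In that style, the caller just uses two narrow context functions instead of one preloaded getter (a sketch; function names are made up):

    post = Posts.get_post!(post_id)
    replies = Posts.list_replies(post_id)

    # where list_replies/1 is nothing more than:
    def list_replies(post_id) do
      from(r in PostReply,
        where: r.post_id == ^post_id,
        order_by: [asc: r.inserted_at]
      )
      |> Repo.all()
    end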

3 Likes

I’m glad you pointed this out. The thing to avoid is 1+N queries. Using 2 or 3 queries to resolve each level means a lot less data repetition.

Query strategy is important. If you’re returning join IDs then you have already lost: you’ve just multiplied the left table’s data by the number of rows on the right, only to get their IDs. I also dislike assembling and passing lots of IDs into the second- or third-level query, and prefer to scope the second or third level on the prior query’s constraint. I remember dealing with certain ORMs back in the day that would use the ID strategy; it’s OK with small queries but less than ideal with larger ones.
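
In Ecto terms, the difference between the two strategies is roughly this (using an assumed author_id scope as the first-level constraint):

    # Strategy 1: collect IDs from the first query and pass them into the second.
    post_ids = Repo.all(from p in Post, where: p.author_id == ^author_id, select: p.id)
    replies = Repo.all(from r in PostReply, where: r.post_id in ^post_ids)

    # Strategy 2: scope the second query on the first query's constraint,
    # so the ID list never leaves the database.
    scope = from p in Post, where: p.author_id == ^author_id, select: p.id
    replies = Repo.all(from r in PostReply, where: r.post_id in subquery(scope))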

1 Like

This is true in the broader context. When considering just the database, the naïve rule of thumb is to ask your question once if you can: ideally, this means a single query, which gives the database query planner the most information it can act on to most efficiently retrieve the information you want (so, from the Ecto perspective, do JOINs as needed). But as you point out, more broadly this isn’t always the most efficient from the application’s point of view.

The bigger trade-off we’re dealing with, end-to-end, is really between bandwidth and latency. A JOIN query, like a common header/detail query that includes both header data and detail data, will duplicate the header data coming from the database… and in the application we only need the header data once. So our bandwidth gets consumed transferring duplicated data which we’re just going to discard at the application. BUT… when we’re doing a preload without JOINs, we’re producing multiple queries in the application: this means multiple round trips from the application to the database, the database often increases its planning time because it’s initializing and planning multiple queries (PostgreSQL anyway), and we may lose the benefits of the query planner being able to more efficiently retrieve the complete dataset. These issues all contribute to increased latency in getting results you can use.
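
Ecto makes the two strategies explicit, for what it’s worth: a plain preload issues separate queries, while preloading through a join fetches everything in one round trip at the cost of the duplicated columns just described:

    # Separate queries: one per association, no row fan-out.
    post =
      Post
      |> Repo.get!(id)
      |> Repo.preload(replies: [:subreplies])

    # Single JOINed query: one round trip, but the post's columns come back
    # repeated on every reply row.
    post =
      from(p in Post,
        where: p.id == ^id,
        left_join: r in assoc(p, :replies),
        preload: [replies: r]
      )
      |> Repo.one()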

Ultimately there is an inflection point where the latency penalties from running multiple queries become smaller than the time it takes to transfer extraneous data from a single JOINed query. Where that point sits depends heavily on the exact query and the specific environmental conditions (mostly network, but DB server capability as well) where the application is running. I don’t have a great one-size-fits-all rule of thumb beyond just saying understand the principles just discussed (and maybe a little more), start with the method that feels most comfortable to code in the application, and address performance issues case by case when they make themselves known. For myself, I’ll have a hand-wavy intuition for where the breakpoint is and code accordingly; I expect I’m only marginally better than the “just do it one way and deal with issues” approach. If you’re going to err, err on the side of preloading, since I’d expect the latencies to be small over more kinds of queries.

2 Likes

Ya, that’s the crux of it. You have to benchmark. Although I mostly don’t worry about it, since I’m just doing a couple of queries per page.

In a particularly egregious case at my old job, there was a very hairy query with, I think, 9 joins. Converting all of those into separate queries took the total time down from over 30 seconds to under 1 second. Now, most of that 30 seconds was spent on the network and in Elixir, not in the db. But the db was returning a massive denormalized table which had to be sent over the network, and Elixir then had to load it into memory and reduce it.

But ya again, there is no rule of thumb. To dig into the case of Posts+Replies more, if you are going to be paginating replies in a LiveView, then it’s wasteful to preload them. If your only way of loading replies is through get_post this means you have to refetch the post each time you fetch a new page of replies. Not to mention you’ve now complicated your getter even further by having to add a :replies_page option.
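
A paginated listing as its own context function keeps the getter untouched (a sketch, assuming plain limit/offset pagination and made-up names):

    def list_replies_page(post_id, page, per_page \\ 20) do
      from(r in PostReply,
        where: r.post_id == ^post_id,
        order_by: [asc: r.inserted_at],
        limit: ^per_page,
        offset: ^((page - 1) * per_page)
      )
      |> Repo.all()
    end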

2 Likes

It should be possible with Postgres json functions to return data in a nested shape rather than fanned out with data repetition. Functions like json_build_object(), json_agg() and row_to_json() stand out.

I haven’t looked into it, but it may be something to explore; I just googled and found a 2015 post that does this to avoid 1+N queries.
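
Something along these lines, for example, collapses a post and its replies into a single row with the replies as a JSON array (table and column names assumed):

    sql = """
    SELECT p.id,
           p.title,
           COALESCE(
             json_agg(json_build_object('id', r.id, 'body', r.body))
               FILTER (WHERE r.id IS NOT NULL),
             '[]'::json
           ) AS replies
    FROM posts p
    LEFT JOIN post_replies r ON r.post_id = p.id
    WHERE p.id = $1
    GROUP BY p.id
    """

    {:ok, %{rows: rows}} = MyApp.Repo.query(sql, [post_id])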


It is, however you are also shifting part of your functionality and business logic to the database. The effort of improving and debugging it, and the overall knowledge required to understand what is going on, becomes higher.

It very much is possible and I’ve used it with success. About 10 years ago, we wanted to take structured data pretty much directly from the database and send it to the web client to render; the data was quite complex. So we had an admittedly overcomplicated query which would build this single, large JSON object when the source data changed (not too frequently) and cached it in a column (there was other memory caching as well, but the database was done at that point). Overall it worked well. And over time it worked better as this was when the JSON handling features were just starting to be added.

You’re not wrong, but from a certain scale onward it is very much not only worth it but also required. I’ve had a CTO step in to measure 5 different ways of slicing data from the SQL database and finally choose the one that was indeed closest to the DB, because we were suffering loss of customers due to pages needing 5+ seconds to load. He made sure to thoroughly document his SQL + Ecto acrobatics, to the point that the comments were 4x bigger than the query itself, but IMO it was very much worth it; our response times dropped to 0.6s-0.9s because of that alone (and we then reduced them further with other optimizations).

2 Likes

Pushing data query concerns to the DB layer is almost always the best bet. I prefer systems that use stored procedures and views to simplify the app, over ones where the app cobbles the query together. Using stored procedures is also far more secure and not susceptible to SQL injection attacks.

In this instance I would argue that representing the query response as JSON has nothing to do with business logic; it is about efficient data transfer, as there is no business domain logic involved here. Any time you have a join, you represent the result as JSON: any right-side row is represented as JSON and aggregated. We are dealing with on-the-wire data representation (not business domain logic), just like Phoenix does with LiveView.
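
Hypothetically, with a view doing the shaping, the app side shrinks to a schemaless read (view and column names are made up):

    # The view aggregates replies to JSON; the app only transports the result.
    from(v in "post_dialog",
      where: v.post_id == ^post_id,
      select: %{post_id: v.post_id, data: v.post_json}
    )
    |> Repo.all()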

1 Like

I’ve heard about this, even though I think that in some ways it is an anti-pattern, as you will slowly start moving more and more logic into stored procedures. When performance is critical and there is no other way, I totally understand it. You are also limited in development and debugging tools compared to a full-fledged language (at least that is how it felt to me), not to mention that there is specific knowledge involved about potential dangers and limitations.

Correct me if I’m wrong, however: in a modern library like Ecto, how would you do a SQL injection? IMO injections are just a sign of bad software design, where data is not clearly separated from execution.
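
For illustration: Ecto sends every pinned value as a bind parameter, never as SQL text, so the classic payload is matched literally instead of executed:

    # The ^ pin becomes a bind parameter ($1), not part of the SQL string:
    name = "'; DROP TABLE posts; --"
    Repo.all(from u in User, where: u.username == ^name)

    # The same holds for raw SQL, as long as placeholders are used:
    MyApp.Repo.query("SELECT * FROM users WHERE username = $1", [name])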

3 Likes

Nah, I wouldn’t go so far. It’s not sustainable to have one SQL wizard per team, and they become a bottleneck too (and often command insane salaries).

Elixir is not amazing when it comes to chewing through thousands of records per second coming from the database, but it also does much better than every other dynamic language except JS, which is in the same or a slightly better performance ballpark.

Throw languages like Golang and Rust into the mix (e.g. for microservices) and processing the stuff coming from the DB is almost a non-issue.

1 Like