Paginating a query with joins and preloads

Solution Edit: While preparing this question I’ve think I’ve discovered my actual issue. Originally I was just using a join: as opposed to a left_join:. I assumed it was limiting by the associations because it would return the correct data until you added speakers. I wasn’t using a distinct: parameter at the time either, returning a data set that was really hard to recognize a pattern within (at the time). It seems giving a distinct: true (which I wasn’t aware you could do) parameter as opposed to distinct: m.date returns the desired output when pass through pagination. I was preparing example data for context and my tests were… passing. :sweat_smile:

Old problem:

I’ve found a workaround but it requires an extra query to hit the database which I’d like to avoid if possible. In my experience, taking the limit/offset of a query with a preloaded association will return undesired results.

Say I have this schema:

schema "meetings" do
  field(:date, :utc_datetime)
  has_many(:speakers, Speaker)
  has_many(:attendees, Attendee)

  timestamps()
end

and this query:

@spec meetings_for_page_query() :: Ecto.Query.t()
def meetings_for_page_query() do
  from(m in Meeting,
    join: s in assoc(m, :speakers),
    where: m.date < ^DateTime.utc_now(),
    order_by: [desc: m.date],
    preload: [:speakers]
  )
end

passing it through:

defp paged_query(query, page, per_page) do
  offset = per_page * (page - 1)

  query
  |> limit(^(per_page + 1))
  |> offset(^offset)
end

will return a result where I believe it is including the preloaded associations as part of the limit.

My current solution is just to preload the associations after pagination, retrieving and replacing the map from the paginated results like so:

def index(conn, params) do
  page = params["page"] || 1
  per_page = params["per_page"] || 5
  paged_result = Pagination.paginate(Meetings.meetings_for_page_query(), page, per_page)

  preloaded_results =
    Map.get(paged_result, :results)
    |> Repo.preload(:speakers)

  paged_result_w_preload =
    paged_result
    |> Map.put(:results, preloaded_results)

  render(conn, "index.html", paged_result: paged_result)
end

I’ll be giving a short talk around a feature I added to a local Elixir meetup group’s website and wanted to see if anyone had any other solutions that I could add in and show different ways to approach the problem. Even a popular Elixir pagination library Scrivener has had this issue open for over a year (though I haven’t tried using their library to see if they’ve since updated it and just haven’t cleared the issue).

If anyone is interested, here is the powerpoint I’ll be showing. If you have any suggestions please feel free!

2 Likes

Please always post code examples in actual text and not via images. Images don’t work well on everyone’s various screen sizes, zoom levels, and so on. As importantly, you force anyone who wants to suggest a small change to manually type out the whole thing instead of just copying, pasting, and editing.

For limit / offset based queries an extra database call is probably the right way to go here. The performance you lose by another round trip is still an overall win because you return less duplicate data in the initial request, plus the limit logic actually works.

If you used a where some_col > style pagination then you can try to do it in one query with window functions but two queries is still probably easier.

4 Likes

Thanks for the advice, I’ve updated the post without pictures.

I doubt our data set will ever get to the point where the performance hit would matter, but I’d like to make sure I understand what you mean exactly. If I’m thinking correctly, preloading the speakers after paginating will return only the speakers for that page instead of the speakers for every meeting across every page. Is this what you mean by less duplicate data? How is this beneficial?

I’ve also never seen that style of pagination you’re referring to, do you have any examples?

I’ll be looking in to implementing GraphQL into an upcoming Elixir project here soon, so it’s quite a coincidence you happen to comment here. Lots of good resources under your name, to say the least. I appreciate it! I’m still trying to wrap my head around the difference between GraphQL and Phoenix. Phoenix will still control the data flow and send the query but GraphQL will return the data in a more useful/user friendly way?

These are great questions, let’s work in reverse here.

These projects are essentially orthogonal to one another. Phoenix’s job is to route and handle HTTP requests. It doesn’t care if you use GraphQL anything else to generate a response. Absinthe’s job is to run a GraphQL query. it doesn’t care if that query comes in via HTTP or a message bus or email (but don’t do that).

Absinthe.Plug is how you connect Phoenix and Absinthe. Inside your Phoenix router you’ll have a route defined which points to Absinthe.Plug and your GraphQL schema, and that’ll pass along HTTP requests containing a GraphQL query to Absinthe so it can run it against your schema.

This is sort of a larger topic, and if you spend some time google about pagination approaches the stuff covered there will be generally relevant. In Elixir there is GitHub - duffelhq/paginator: Cursor-based pagination for Elixir Ecto to help with a better approach. If you want Connection style pagination in GraphQL you can that with Paginator easily enough by using: Absinthe.Relay.Connection — absinthe_relay v1.5.2.

Fundamentally this has to do with SQL joins. A meeting with has many speakers. Let’s assume there are 2 speakers within a meeting with id 123. If you do from(m in Meeting, join: s in assoc(m, :speakers), where: m.id == 123) how many rows are returned? Even though there’s just one meeting, you actually get two rows back in the SQL result. If you had 10 speakers you’d get 10 results back. Each row contains ALL of the columns for the meeting, plus ALL of the columns for speakers. On the first row you have the values of the meeting under the meeting columns and all of the values of the first speaker under the speaker columns. On the second row you REPEAT ALL of the values of the meeting, and then under the speaker columns you have the second speaker values.

Ecto can deduplicate this information when it’s turning it back into Elixir structs, but even if you average just two speakers per meeting you’re 25% of the data you’re going to send back is just duplication of the meeting info.

This also affects limit offset. If you do limit 10, you’re not limiting the results to 10 meetings, you’re limiting yourself to 10 result rows. Each result row is a single meeting / speaker pair. If a single meeting has 10 speakers then you’ll get really just 1 meeting back, even if you intended to return ten meetings back regardless of how many speakers there are.

Distinct helps here but all of that is work arounds around removing the duplication that you yourself introduced. It’s frequently better in 1:N associations to just do the associations as a secondary database query unless you need them for filtering.

6 Likes