Pagination with scrivener, join and order by

I looked through pagination questions and I haven’t found a generic solution to the problem of pagination.

I want offset/limit style pagination with the possibility to jump to a given page and scrivener works great for that.

The only problem was that for has_many associations it returned a number of rows instead of a number of top-level entities. E.g. If I had one blog post with two comments:

Post
|> join(:inner, [p], c in assoc(p.comments)
|> where([p, c], c.inserted_at > ^yesterday)
|> Repo.paginate()

Scrivener returns two entries.

I usually worked around the issue by something like this:

Post
|> join(:inner, [p], c in assoc(p.comments)
|> where([p, c], c.inserted_at > ^yesterday)
|> preload(:comments)
|> distinct([p, c], p.id)
|> Repo.paginate

This works great. E.g. I can filter by a secondary entity (comments in here) and the DB deduplicates results at the end. I am using binding-less preload to ask Ecto to preload comments in a separate query. This pattern worked so great that I added a helper with_pagination.

def with_pagination(query) do
  query
  |> distinct([first_entity], first_entity.id)
  |> Repo.paginate
end

However, there is a problem with order_by.

Post
|> order_by([p], [asc: p.inserted_at])
|> preload(:comments)
|> with_pagination()

doesn’t work because as documentation states distinct function prepends whatever it gets to order_by. So the final query is ordered by post.id Ecto.Query — Ecto v3.5.7

What I would like to do here is to extract the order_by from query and prepend it to list from distinct.

def with_pagination(query) do
  query
  |> distinct([first_entity], __whatever is in query.order_bys__ ++ [first_entity.id])
  |> Repo.paginate
end

I am not sure it is even possible because you can use distinct only once in a query, query.order_bys is a list of QueryExpr structs and even if I extract [asc: p.inserted_at] I am not sure how to build the dynamic query in such a way that Ecto knows, that p is the same as first_entity.

We’ve ended up passing order_by arguments to with_pagination but breaks the abstraction.

How do you deal with order_by, has_many and pagination at the same time?

Bonus question: why does distinct prepend its contents to order_by instead of the other way round? In my case the reverse would make more sense. Especially given that distinct doesn’t care about order of fields and order_by does. Prepending order_by fields to distinct (if distinct is set) should work.

CC: @michalmuskala

2 Likes

This sidesteps some of your questions, but I just wonder if the root issue here is coupling the concept of distinct with pagination. It probably makes sense in your codebase, but I would not expect a function named with_pagination to automatically call distinct. I would expect it to only handle pagination rather than having opinions about the query it receives.

Regarding why distinct is prepended to order_by: If it were called outside of with_paginate I’m thinking something like |> distinct([p, c], asc: p.inserted_at, asc: p.id) would make it such that order_by is no longer necessary anyway (i.e. you usually have control over the arguments to distinct and order_by).

1 Like

Yeah, the coupling is there. If someone would like to perform a query that already uses distinct, they can’t use that function.
On the other hand when calling a function returning a page of 10 posts with preloaded comments. Regular scrivener pagination working on rows would return 10 rows (which might e.g. 3 posts with 3 comments each + 1 post with only its first comment).

I am looking for a little bit more generic way of handling that.

My dream solution would recognize if there are preloads in the query. Do the deduplication somehow. Maybe group by would be better? Or maybe do it in two queries where the first one selects only ids?
Somehow I need to tell paginator that I need n of first entity instead of n rows.

WRT distinct making order_by not necessary. That is different than how SQL handles it. Distinct is different and it looks to me like that prepending was a solution to duplicating columns in order_by and distinct. But why not the other way round? Why not make distinct no longer necessary when you have order_by?

Why not skip the join and just let ecto preload comments in a separate query? Then you won’t have problems with offset/limit based pagination. This will work out of the box if you do not add the join in the first place.

1 Like

We often need the join when the query is more complex like “order by most recently commented” or “posts where user x commented”. That is why I’d like a solution that has minimal assumptions about the query and does not interfere with joins.

For such cases you can join a subquery, which joins the relevant datapoint 1:1 to the rows of the paginated table.

Instead of querying pages and comments and filter from there:

Post | Comment
1    | 1
1    | 2
1    | 3
2    | 4
2    | 5

You’d just join the relevant piece of data for ordering:

Post | LastCommentDate
1    | 2021-01-12
2    | 2021-01-11
3    | 2021-01-10

I have to think about that approach a little more how does it fit in our usecase. I made the example with posts and comments for simplicity because everybody knows how blogs work :slight_smile: In real life we have users with pricing plans, tags, invoices and some other data and we have a big search form for that.
At present, we join 6 tables, parse query parameters and build the query from that. I’ll check if the solution with subqueries scales well. That might be a really good approach.

Just wanted to mention that it’s also possible to do a separate (more simple) query to get the total number of entries and pass that number in as an option…

my_query
|> Repo.paginate(page: page, options: %{total_entries: total_entries})

I think it is not the same thing. Even if I pass a separate query to count entries, the problem persists: e.g. I will have a wrong number of posts on each page depending on the number of comments.

@tomekowal first: thank you for opening this thread, it was what I need.

second @sweeneydavidj is right, this is just what we need:

my_query
|> Repo.paginate(page: page, options: %{total_entries: total_entries})

if you check how it’s used in the implementation
you can pass an absolute value used to do the calc in paginate

I solved having 2 queries:

  1. The main query which I have joins and it makes the default total entries lost itself.
  2. The count query
  defp form_total_entries(organizations) do
    from(f in Form,
      where: f.organization_id in ^organizations,
      select: count(f.id)
    )
    |> Repo.one()
  end

which I used this way

total_entries = form_total_entries(organizations)

query_with_joins
|> Repo.paginate(
      params
      |> Map.put(:options, %{total_entries: total_entries})
    )

I hope it helps you and who is in need.

1 Like

Since this thread just came up, I wanted to take the time to say that offset/limit style pagination is inefficient.

When you have N number of records per page and you are on page M, you are actually asking the database to load N*M records, just to discard all but N records. In more concrete terms, if you have 100 records to page and you are on page 5, you are loading 500 records, then discarding the first 400.

I will admit that this isn’t always an issue. The majority of people probably wouldn’t even go past the first or second page to try to find something if some kind of search feature is provided.

But at the end of the day, keyset pagination is more efficient.

This page does a much better job at explaining the issue. The site is also full of wonderful tips on how databases work if anyone wants to learn more.