How to get the top X results of a given category using Ecto?

The SO answer is from 2013. I’d start using lateral joins for getting a “top 3 per group” today, which is simpler to deal with than CTE and window functions. Lateral joins are essentially a custom subquery per row joined to, which imo mappes much more to the intend.

That would look something like this:

top = from top in "tops", where: top.group_id == parent_as(:group).id, limit: 3
from group im "groups", as: :group, lateral_join: top3 in subquery(top), on: true, select: …
3 Likes

The SO examples cover that. If you can show a sample using the row_number snippet, that would already help a lot.

So you would have a different table with everything ordered and grouped?

You can join the same table, if the data is just in one table.

I am not following your proposal.
I understand here that:

top = from top in "tops", where: top.group_id == parent_as(:group).id, limit: 3

You have 1 table, with all things ordered, and then you pick the first 3.
This is not exactly what I am looking for, what I am trying to do is:

  • Having everything grouped by category (where a category is home_id + path) and within each category pick the first 3 items ordered by id.

Am I missing something from your explanation?

I think you’re trying to read to much into the example. This is only meant to show the syntax (especially the parent_as). Adjust the queries itself to fetch whatever data you need.

1 Like

Reading from this post about lateral joins:

I believe what I want in SQL could be something like this:

(assuming file_info_with_counts is the inner join of all_counts with file_info):

SELECT home_id, path, first_item, other_items FROM
  (SELECT home_id, path, min(item_id) AS first_item FROM file_info_with_counts GROUP BY home_id, path) o1
  INNER JOIN LATERAL
  (SELECT id, item_id AS other_items
   FROM file_info_with_counts
   WHERE home_id = o1.home_id and path = o1.path AND item_id > o1.item_id
   ORDER BY item_id ASC LIMIT 2)
   o2 ON true;

Now, if this makes sense, I need to find a way to convert it into Ecto.



    Repo.insert_all(ExampleRecord, [
      [name: "John", value: 20],
      [name: "John", value: 7],
      [name: "John", value: 7],
      [name: "John", value: 7],
      [name: "John", value: 4]
    ])

    table =
      Ecto.Query.from(t in ExampleRecord,
        select: %{
          rn: over(row_number(), partition_by: [t.name], order_by: [desc: t.value]),
          name: t.name,
          value: t.value
        }
      )

    Ecto.Query.from(r in Ecto.Query.subquery(table), where: r.rn <= 3, order_by:  r.name)
    |> Repo.all()

The :partition_by option of over() can be a list of fields, if that helps.

2 Likes

I see, so using row_number I would have to separate the inner query from the outer one and thus separate the queries. Rather interesting, I didn’t see this !

Thanks !
I will now test both solutions and see which one gets me the results I am looking for !

I don’t think you need the first subquery here:

tops = 
  from top in "file_info_with_counts", 
    where: top.home_id == parent_as(:parent).home_id,
    where: top.path == parent_as(:parent).path,
    order_by: [asc: top.item_id],
    limit: 3,
    select: %{id: top.id}

from parent in "file_info_with_counts",
  as: :parent,
  group_by: [parent.home_id, parent.path],
  lateral_join: top in subquery(tops), 
  on: true,
  select: %{home_id: parent.home_id, path: parent.path, item_id: top.id}
2 Likes

I ended up going with @kwando solution, as I had more practice experimenting with it:

file_info_with_counts
|> select([fi], %{
        rn: over(row_number(), partition_by: [fi.home_id, fi.path], order_by: [asc: fi.item_id]),
        item_id: fi.item_id,
       # you get the idea ...
      })
|> subquery()

IO.inspect(file_info_with_counts |> where([c], c.rn <= 3) |> Repo.all()

Which prints what I wanted.

The solution from @LostKobrakai is however not without merit. However, given my familiarity with row_number I opted for that solution instead.

Unless there is a considerable performance difference between the two in favor of lateral joins, I will keep the previous solution.

There is a considerable readability difference at least :sweat_smile:

Lateral joins parse a lot easier on the human brain.

Citation needed :smiley:

Lateral joins parse a lot easier on the human brain.

Oliver Mulelid-Tynes

On a more serious note, the selected solution above requires you to parse row_number, requires you to mentally parse the paritioning etc. Then feed all of that into over().

A lateral join using ON true literally reads as “this select will happen for every row in the orignal query”.

1 Like

Agreed — I recently did some fancy Ecto composability stuff and lateral joins are awesome.

You should run the explain plan on both queries before choosing one imo. Sometimes there can be a significant difference in performance between the 2. Only start caring about the readability if the performance is roughly the same.

2 Likes

I’d argue that you should only care about the readability until you have some evidence that the performance matters. :slight_smile:

Edit to add: Joey’s advice to understand the performance characteristics of both is good! The only bit I potentially disagree with is when you should care about the performance.

Maybe when you’ve been presented with two alternatives? :wink:

I personally wouldn’t use this philosophy for writing queries. If there is a difference in performance then there’s some inefficiency in the query traversal. I wouldn’t want to waste time re-evaluating the impact of this inefficiency each time my data scaled by X%. We’re talking about SQL here so the number of ways you can turn the code into an unreadable mess is not as open as your application code.

In my opinion the difference in readability/ease of understanding between CTE/window/partition and lateral join is pretty overstated. Maybe lateral join is easier at first for most people but with a modest amount of SQL experience it shouldn’t take more than a day or two of focus learning CTE/window/partition.

If someone on my team said we should use lateral join over the CTE solution even though it is performing worse, they would be quickly told to get comfortable with CTE/window/partition. That shouldn’t even be a debate for any serious project, IMO.

A general comment, I think in this field there is way too much emphasis placed on things being easy. Or maybe it’s just a vocal minority that I keep hearing from. Look at what mathematicians, physicists, chemists etc get comfortable with in order to do their job/communicate with each other. Are we really saying computer scientists are so different that they can’t read window/partition?

That would require that your evaluation stays correct when the stored data scaled by X%, which isn’t necessarily the case. There are tradeoffs involved in how much ahead in time you plan and build for.

I do agree with this one, though mostly due to the context of ecto, where a CTE or a subquery or a lateral join can look almost the same if you compose the nested queries separately.

I don’t think you’re getting why people try to keep things easy. It’s not about being able to grasp the alternative or not. It’s much more about the fact that in software we usually work in a context of uncertainty.

Yes you can spend an hour today trying to optimize it for an imagined future or you can put the simple solution in use, get more information and spend that hour at a later time when you’re more certain about the context the optimization is needed for. Which path is more useful depends on how certain you are/can be about future requirements and not technical ability.

2 Likes

I think @LostKobrakai covered most of it.

My reply was a bit glib – I believe neither of us was actually “correct,” because what is correct is rather complicated and dependent on circumstances.

There are no true absolutes here: You shouldn’t ignore performance, nor should you ignore readability. But you should acknowledge that each has trade-offs and focusing on one over the other may make sense at times.

1 Like