How to get the top X results of a given category using Ecto?

Background

I have to queries that return a colossal amount of data on their own. I cannot use Repo.all as doing so would materialize these into memory, which would then quickly run out.

So I am trying to push as much as I can to the pSQL DB, and force the DB to do as much work as possible.

My issue starts with 2 queries.

This ones counts fruits and veggies and aggregates everything into a neat map.

all_counts =
      table_A
      |> join(:left, [item_A], item_B in table_B,
        on:
          item_A.home_id == item_B.home_id and
            item_A.path == item_B.path
      )
      |> select([unfiltered_item, filtered_item], %{
        path: item_A.path,
        item_fruits_count: coalesce(item_A.fruits, 0),
        item_veggies_count: coalesce(item_B.veggies, 0),
        dataset_id: item_A.home_id
      })
      |> subquery()

The second one, joins 2 tables as well (items and photos), with nothing fancy:

file_info =
      table_C
      |> join(:inner, [item], file in table_D,
        on:
          item.id == file.item_id and not file.deleted
      )
      |> select([item, file], %{
        item_id: item.id,
        home_id: item.home_id,
        path: item.path,
        photo_key: file.photo_key
      })
      |> subquery()

Problem

Now the problem is that I need to merge these 2 together.
At first, one would think to do something like this:

result =
      all_counts
      |> join(:inner, [c], f in ^file_info, on: c.home_id == f.home_id and c.path == f.path)
      |> select([c, f], %{
        item_id: f.item_id,
        home_id: f.home_id,
        path: f.path,
        photo_key: f.photo_key,
        # ... you get the idea
      })
      |> Repo.all()

But this creates an issue, namely, the it will return so much data, the machines will run out of memory.

Approach

The approach I am using to solve this problem is to group items by home_id and path (since that is unique for each destination) and then return only a portion of the data I need, lets say, the top 3 items ordered by id.

Source:

Here is where my difficulties begin.
I cannot use pSQL directly, I must use Ecto (for reasons beyond this post).

Normally I would use CTEs or row_number():

With ctes:

 WITH cte AS
  ( SELECT name, value,
           ROW_NUMBER() OVER (PARTITION BY name
                              ORDER BY value DESC
                             )
             AS rn
    FROM t
  )
SELECT name, value, rn
FROM cte
WHERE rn <= 3
ORDER BY name, rn ;

With row_number:

SELECT name, value, rn
FROM 
  ( SELECT name, value,
           ROW_NUMBER() OVER (PARTITION BY name
                              ORDER BY value DESC
                             )
             AS rn
    FROM t
  ) tmp 
WHERE rn <= 3
ORDER BY name, rn ; 

However, I am not familiar enough with Ecto to know how to use them.

With CTEs, I understand I should avoid them, as they serve no purpose in Ecto:
https://hexdocs.pm/ecto/Ecto.Query.html#with_cte/3

With row_number() I would need to partition by both home_id and path (2 fields) instead of one:
https://hexdocs.pm/ecto/Ecto.Query.WindowAPI.html#row_number/0

Question

How do I get the result, to return the top 3 results, grouped by home_id and path and ordered by item_id using Ecto?

How would you write your full query in plain SQL?

The SO answer is from 2013. I’d start using lateral joins for getting a “top 3 per group” today, which is simpler to deal with than CTE and window functions. Lateral joins are essentially a custom subquery per row joined to, which imo mappes much more to the intend.

That would look something like this:

top = from top in "tops", where: top.group_id == parent_as(:group).id, limit: 3
from group im "groups", as: :group, lateral_join: top3 in subquery(top), on: true, select: …
2 Likes

The SO examples cover that. If you can show a sample using the row_number snippet, that would already help a lot.

So you would have a different table with everything ordered and grouped?

You can join the same table, if the data is just in one table.

I am not following your proposal.
I understand here that:

top = from top in "tops", where: top.group_id == parent_as(:group).id, limit: 3

You have 1 table, with all things ordered, and then you pick the first 3.
This is not exactly what I am looking for, what I am trying to do is:

  • Having everything grouped by category (where a category is home_id + path) and within each category pick the first 3 items ordered by id.

Am I missing something from your explanation?

I think you’re trying to read to much into the example. This is only meant to show the syntax (especially the parent_as). Adjust the queries itself to fetch whatever data you need.

1 Like

Reading from this post about lateral joins:

I believe what I want in SQL could be something like this:

(assuming file_info_with_counts is the inner join of all_counts with file_info):

SELECT home_id, path, first_item, other_items FROM
  (SELECT home_id, path, min(item_id) AS first_item FROM file_info_with_counts GROUP BY home_id, path) o1
  INNER JOIN LATERAL
  (SELECT id, item_id AS other_items
   FROM file_info_with_counts
   WHERE home_id = o1.home_id and path = o1.path AND item_id > o1.item_id
   ORDER BY item_id ASC LIMIT 2)
   o2 ON true;

Now, if this makes sense, I need to find a way to convert it into Ecto.



    Repo.insert_all(ExampleRecord, [
      [name: "John", value: 20],
      [name: "John", value: 7],
      [name: "John", value: 7],
      [name: "John", value: 7],
      [name: "John", value: 4]
    ])

    table =
      Ecto.Query.from(t in ExampleRecord,
        select: %{
          rn: over(row_number(), partition_by: [t.name], order_by: [desc: t.value]),
          name: t.name,
          value: t.value
        }
      )

    Ecto.Query.from(r in Ecto.Query.subquery(table), where: r.rn <= 3, order_by:  r.name)
    |> Repo.all()

The :partition_by option of over() can be a list of fields, if that helps.

2 Likes

I see, so using row_number I would have to separate the inner query from the outer one and thus separate the queries. Rather interesting, I didn’t see this !

Thanks !
I will now test both solutions and see which one gets me the results I am looking for !

I don’t think you need the first subquery here:

tops = 
  from top in "file_info_with_counts", 
    where: top.home_id == parent_as(:parent).home_id,
    where: top.path == parent_as(:parent).path,
    order_by: [asc: top.item_id],
    limit: 3,
    select: %{id: top.id}

from parent in "file_info_with_counts",
  as: :parent,
  group_by: [parent.home_id, parent.path],
  lateral_join: top in subquery(tops), 
  on: true,
  select: %{home_id: parent.home_id, path: parent.path, item_id: top.id}
1 Like

I ended up going with @kwando solution, as I had more practice experimenting with it:

file_info_with_counts
|> select([fi], %{
        rn: over(row_number(), partition_by: [fi.home_id, fi.path], order_by: [asc: fi.item_id]),
        item_id: fi.item_id,
       # you get the idea ...
      })
|> subquery()

IO.inspect(file_info_with_counts |> where([c], c.rn <= 3) |> Repo.all()

Which prints what I wanted.

The solution from @LostKobrakai is however not without merit. However, given my familiarity with row_number I opted for that solution instead.

Unless there is a considerable performance difference between the two in favor of lateral joins, I will keep the previous solution.

There is a considerable readability difference at least :sweat_smile:

Lateral joins parse a lot easier on the human brain.

Citation needed :smiley:

Lateral joins parse a lot easier on the human brain.

Oliver Mulelid-Tynes

On a more serious note, the selected solution above requires you to parse row_number, requires you to mentally parse the paritioning etc. Then feed all of that into over().

A lateral join using ON true literally reads as “this select will happen for every row in the orignal query”.

1 Like

Agreed — I recently did some fancy Ecto composability stuff and lateral joins are awesome.

You should run the explain plan on both queries before choosing one imo. Sometimes there can be a significant difference in performance between the 2. Only start caring about the readability if the performance is roughly the same.

2 Likes

I’d argue that you should only care about the readability until you have some evidence that the performance matters. :slight_smile:

Edit to add: Joey’s advice to understand the performance characteristics of both is good! The only bit I potentially disagree with is when you should care about the performance.

Maybe when you’ve been presented with two alternatives? :wink:

I personally wouldn’t use this philosophy for writing queries. If there is a difference in performance then there’s some inefficiency in the query traversal. I wouldn’t want to waste time re-evaluating the impact of this inefficiency each time my data scaled by X%. We’re talking about SQL here so the number of ways you can turn the code into an unreadable mess is not as open as your application code.

In my opinion the difference in readability/ease of understanding between CTE/window/partition and lateral join is pretty overstated. Maybe lateral join is easier at first for most people but with a modest amount of SQL experience it shouldn’t take more than a day or two of focus learning CTE/window/partition.

If someone on my team said we should use lateral join over the CTE solution even though it is performing worse, they would be quickly told to get comfortable with CTE/window/partition. That shouldn’t even be a debate for any serious project, IMO.

A general comment, I think in this field there is way too much emphasis placed on things being easy. Or maybe it’s just a vocal minority that I keep hearing from. Look at what mathematicians, physicists, chemists etc get comfortable with in order to do their job/communicate with each other. Are we really saying computer scientists are so different that they can’t read window/partition?