Background
I have to queries that return a colossal amount of data on their own. I cannot use Repo.all
as doing so would materialize these into memory, which would then quickly run out.
So I am trying to push as much as I can to the pSQL DB, and force the DB to do as much work as possible.
My issue starts with 2 queries.
This ones counts fruits and veggies and aggregates everything into a neat map.
all_counts =
table_A
|> join(:left, [item_A], item_B in table_B,
on:
item_A.home_id == item_B.home_id and
item_A.path == item_B.path
)
|> select([unfiltered_item, filtered_item], %{
path: item_A.path,
item_fruits_count: coalesce(item_A.fruits, 0),
item_veggies_count: coalesce(item_B.veggies, 0),
dataset_id: item_A.home_id
})
|> subquery()
The second one, joins 2 tables as well (items and photos), with nothing fancy:
file_info =
table_C
|> join(:inner, [item], file in table_D,
on:
item.id == file.item_id and not file.deleted
)
|> select([item, file], %{
item_id: item.id,
home_id: item.home_id,
path: item.path,
photo_key: file.photo_key
})
|> subquery()
Problem
Now the problem is that I need to merge these 2 together.
At first, one would think to do something like this:
result =
all_counts
|> join(:inner, [c], f in ^file_info, on: c.home_id == f.home_id and c.path == f.path)
|> select([c, f], %{
item_id: f.item_id,
home_id: f.home_id,
path: f.path,
photo_key: f.photo_key,
# ... you get the idea
})
|> Repo.all()
But this creates an issue, namely, the it will return so much data, the machines will run out of memory.
Approach
The approach I am using to solve this problem is to group items by home_id
and path
(since that is unique for each destination) and then return only a portion of the data I need, lets say, the top 3 items ordered by id.
Source:
Here is where my difficulties begin.
I cannot use pSQL directly, I must use Ecto (for reasons beyond this post).
Normally I would use CTEs or row_number()
:
With ctes:
WITH cte AS
( SELECT name, value,
ROW_NUMBER() OVER (PARTITION BY name
ORDER BY value DESC
)
AS rn
FROM t
)
SELECT name, value, rn
FROM cte
WHERE rn <= 3
ORDER BY name, rn ;
With row_number
:
SELECT name, value, rn
FROM
( SELECT name, value,
ROW_NUMBER() OVER (PARTITION BY name
ORDER BY value DESC
)
AS rn
FROM t
) tmp
WHERE rn <= 3
ORDER BY name, rn ;
However, I am not familiar enough with Ecto to know how to use them.
With CTEs, I understand I should avoid them, as they serve no purpose in Ecto:
https://hexdocs.pm/ecto/Ecto.Query.html#with_cte/3
With row_number()
I would need to partition by both home_id
and path
(2 fields) instead of one:
https://hexdocs.pm/ecto/Ecto.Query.WindowAPI.html#row_number/0
Question
How do I get the result
, to return the top 3 results, grouped by home_id
and path
and ordered by item_id
using Ecto?