Ecto preloading with subquery for pagination with limit

rmoretto · February 3, 2023, 8:07pm

I’m trying to replicate the following PostgreSQL query with ecto:

SELECT *
FROM (SELECT * FROM event LIMIT 1000) as e
LEFT JOIN event_runner as er ON e.id = er.event_id

The idea is to limit only the FROM part of the query, returning the correct amount of events stopping the results from the JOINs interfering with the limit count, as it would occur if I used a LIMIT clause in the main query, as there is multiple event_runners for one event, defined as a has_many in the Event schema.

That said I trying to use the current ecto query:

event_query = from e in Event, limit: ^max_rows

from f in subquery(event_query),
  left_join: er in assoc(f, :event_runners),
  preload: [event_runners: er]

However when I run the above query I get the following error:

** (Ecto.QueryError) can only preload sources with a schema (fragments, binary and subqueries are not supported) in query:

     from e0 in subquery(from f0 in MyApp.Event,
       limit: 100,
       select: f0),
       join: e1 in MyApp.EventRunner,
       on: e1.event_id == e0.id,
       select: e0,
       preload: [event_runners: er]

The error is clear, I can’t do a preload with subquery as a source schema, but the use of preload from the join part would be ideal, as we are receiving a lot of request and for each entry returned a separated query is executed if we use the Repo.preload function.

Is there any way to get around this limitation? Or, is it possible to rewrite this piece of code to make the limit only affect the FROM clause?

joey_the_snake · February 3, 2023, 8:50pm

Postgres is smart enough not to join the entire tables before taking the limit so you can get rid of the subquery and just put the limit on the outside. At least it works on my version of postgres. Try out these 2 queries in a safe environment and compare how long they take to run

SELECT * from event e LEFT JOIN event_runner er on e.id = er.event_id LIMIT 1

SELECT * from event e LEFT JOIN event_runner er on e.id = er.event_id

If limit is happening after the entire tables are joined then they should take the same time. In my testing the first is happening instantaneously while the second takes a long time.

rmoretto · February 3, 2023, 9:02pm

Hey, thanks for the response, which version of postgres are you using?

joey_the_snake · February 3, 2023, 10:35pm

I am using Postgres 13.

But now I realized what I said only works if there is at most on event runner per event. Otherwise you’ll get 1000 results but not 1000 events if one event has more than one runner. What is the cardinality of the relationship ?

rmoretto · February 3, 2023, 11:10pm

Should have stated the cardinality in the post, the Event schema has a has_many relationship to the EventRunner. One event can have 0…n event_runners.

joey_the_snake · February 3, 2023, 11:17pm

Ah then you might need to use the separate preload query. I know it’s not what you want but it’s supposed to be more efficient anyways for one to many relationships. Otherwise your main table will have duplicated rows sent from the DB to Elixir.

joey_the_snake · February 3, 2023, 11:19pm

Or if you’re deadset against a separate preload query you would have to do some kind of homegrown solution like this

from e in subquery(....), left_join: er in EventRunner, on e.id == er.event_id, select {e, er}

Then perform the aggregation yourself on the results

rmoretto · February 3, 2023, 11:42pm

Correct me if I’m wrong, but using the Repo.preload(events, :event_runner) in a list of 1000 Events, wouldn’t the preload run 1000 queries to retrieve all EventRunners data? Looking at the application logs with level: :debug, it seems that this is the case.

joey_the_snake · February 3, 2023, 11:47pm

It would only run one query, something like select * from event_runner where event_id in (....).

rmoretto · February 3, 2023, 11:48pm

That is true, I was inspecting the wrong logs, thanks for the help!