How to use PostgreSQL's DISTINCT ON in Ecto Query

I have this SQL query:

SELECT DISTINCT ON (b.id) b.id AS book_id, r.id AS record_id, r.due_for_return, r.returned_at
FROM books b
LEFT JOIN records r
ON b.id = r.book_id
ORDER BY b.id, r.id DESC

Any idea how to convert it to an Ecto query, especially the DISTINCT ON part? Thank you.

Maybe part of the solution:

books =
  from(books in Mango.Books.Book,
    left_join: records in assoc(books, :records),
    distinct: books.id,
    where: books.institute_id == ^claims["institute_id"] and is_nil(records.returned_at) != ^returned?,
    order_by: [asc: books.id, desc: records.id],
    select: %{
      "record_id" => records.id,
      "title" => books.title,
      "book_id" => books.id,
      "due_for_return" => records.due_for_return,
      "returned_at" => records.returned_at
    }
  )

Any comments are still welcome :slight_smile:

I have just noticed that the above query does not return the expected results. I wanted to select from records only the row with the highest id (the newest in the database) for each book_id. It seems that using DISTINCT alone does not answer this need.

I guess for what I am trying to do I need to run a query with a subquery: first select the max id (grouped by book_id) together with the needed fields from records, and after that join books with that subquery.

Have you tried composing queries step by step:

https://hexdocs.pm/ecto/Ecto.Query.html#module-composition
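
For example, each clause can be layered onto a base query in separate steps. A rough sketch, assuming the schemas from this thread, Mango.Repo, and an institute_id variable standing in for whatever filter is needed:

import Ecto.Query

base = Mango.Books.Book

# Add the join and the shape of the result in one step...
with_records =
  from b in base,
    left_join: r in assoc(b, :records),
    select: %{book_id: b.id, record_id: r.id, returned_at: r.returned_at}

# ...then compose further conditions on top of the previous query.
for_institute =
  from [b] in with_records,
    where: b.institute_id == ^institute_id

Mango.Repo.all(for_institute)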

Some issues were related to the fact that in Postgres, when using DISTINCT ON, that column must appear first in the ORDER BY. Anyway, the working SQL for what I need is this:

SELECT b.*, r.id, r.returned_at FROM books b
LEFT JOIN (
    SELECT DISTINCT ON (book_id) book_id, id, returned_at
    FROM records
    ORDER BY book_id, id DESC
) r
ON b.id = r.book_id
ORDER BY b.title ASC

How to write this in Ecto, God knows, and hopefully the brilliant folks here :).
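
My best (untested) guess at the Ecto version, based on the docs saying that distinct: with a field expression emits DISTINCT ON on PostgreSQL; the schema and field names are the ones from above:

import Ecto.Query

# DISTINCT ON (book_id), with the newest record first per book.
latest_records =
  from r in Mango.Records.Record,
    distinct: r.book_id,
    order_by: [asc: r.book_id, desc: r.id],
    select: %{book_id: r.book_id, record_id: r.id, returned_at: r.returned_at}

# LEFT JOIN books against that subquery and order the result by title.
books_with_latest =
  from b in Mango.Books.Book,
    left_join: r in subquery(latest_records),
    on: r.book_id == b.id,
    order_by: [asc: b.title],
    select: %{book_id: b.id, title: b.title, record_id: r.record_id, returned_at: r.returned_at}

Mango.Repo.all(books_with_latest)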

How to use PostgreSQL’s DISTINCT ON in Ecto Query

http://xyproblem.info/

# in a migration (inside an Ecto.Migration `change/0`):
create table(:books) do
  add :title, :string
end

create table(:records) do
  add :book_id, :id
  add :due_for_return, :date
end
# lib/books.ex
defmodule Books do
  import Ecto.Query
  alias Books.Repo

  def init do
    Repo.insert_all(
      "books", [
        [title: "One"],
        [title: "Two"],
        [title: "Three"],
        [title: "Four"]
      ]
    )
    Repo.insert_all(
      "records", [
        [book_id: 1, due_for_return: ~D[2017-11-01]],
        [book_id: 1, due_for_return: ~D[2017-12-01]],
        [book_id: 1, due_for_return: ~D[2018-01-01]],
        [book_id: 2, due_for_return: ~D[2018-01-02]],
        [book_id: 2, due_for_return: ~D[2018-02-02]],
        [book_id: 3, due_for_return: ~D[2018-03-03]]
      ]
    )
  end

  def query() do
    # Step 1: the newest (max id) record per book, via GROUP BY.
    "books"
    |> join(:left, [b], r in "records", on: b.id == r.book_id)
    |> select([b, r], %{
         title: b.title,
         book_id: b.id,
         record_id: max(r.id)
       })
    |> group_by([b], [b.title, b.id])
    |> subquery()
    # Step 2: join back to records to fetch the fields of that newest record.
    |> join(:left, [d], r in "records",
         on: d.book_id == r.book_id and d.record_id == r.id
       )
    |> select([d, r], %{
         book_id: d.book_id,
         title: d.title,
         due: r.due_for_return
       })
    |> Repo.all()
  end
end
iex(1)> Books.init()

07:15:35.715 [debug] QUERY OK db=7.9ms
INSERT INTO "books" ("title") VALUES ($1),($2),($3),($4) ["One", "Two", "Three", "Four"]
 
07:15:35.720 [debug] QUERY OK db=1.3ms
INSERT INTO "records" ("book_id","due_for_return") VALUES ($1,$2),($3,$4),($5,$6),($7,$8),($9,$10),($11,$12) [1, {2017, 11, 1}, 1, {2017, 12, 1}, 1, {2018, 1, 1}, 2, {2018, 1, 2}, 2, {2018, 2, 2}, 3, {2018, 3, 3}]
{6, nil}
iex(2)> Books.query()

07:15:43.693 [debug] QUERY OK db=3.1ms
SELECT s0."book_id", s0."title", r1."due_for_return" FROM (SELECT b0."title" AS "title", b0."id" AS "book_id", max(r1."id") AS "record_id" FROM "books" AS b0 LEFT OUTER JOIN "records" AS r1 ON b0."id" = r1."book_id" GROUP BY b0."title", b0."id") AS s0 LEFT OUTER JOIN "records" AS r1 ON (s0."book_id" = r1."book_id") AND (s0."record_id" = r1."id") []
[
  %{book_id: 1, due: {2018, 1, 1}, title: "One"},
  %{book_id: 2, due: {2018, 2, 2}, title: "Two"},
  %{book_id: 3, due: {2018, 3, 3}, title: "Three"},
  %{book_id: 4, due: nil, title: "Four"}
]
iex(3)> 

PostgreSQL GROUP BY


@peerreynders, words cannot express how grateful I am to you. Even though I did not express clearly what I was trying to achieve, you grasped it and provided the needed solution. In the meantime, I had written working code like this:

records_max_id_query = # started with the records table first..
  from(
    records in Mango.Records.Record,
    select: %{max_id: max(records.id)},
    group_by: [records.book_id]
  )

records_query =
  from(
    records_data in Mango.Records.Record,
    join: records in subquery(records_max_id_query),
    on: records_data.id == records.max_id,
    select: %{
      record_id: records_data.id,
      book_id: records_data.book_id,
      returned_at: records_data.returned_at
    }
  )

books_query =
  from(
    books in Mango.Books.Book,
    left_join: records in subquery(records_query),
    on: records.book_id == books.id,
    select: %{
      record_id: records.record_id,
      book_id: books.id,
      title: books.title,
      returned_at: records.returned_at
    },
    where: books.institute_id == ^claims["institute_id"],
    order_by: [asc: books.title]
  )

But your code looks nicer and possibly faster than the above.

I have learned much about SQL and Ecto from this, yet at the same time I have come to question the design of my tables. Would the above approach still be fast when there are, say, 100,000 records in the database? Would it not be better to have an active field in the records table, with a unique constraint on (book_id, active), marking the active (most recent) row per book_id, rather than looking it up via max(id)?
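
Something like this is what I have in mind (untested sketch; the column name, index name and migration module are just illustrative):

defmodule Mango.Repo.Migrations.AddActiveToRecords do
  use Ecto.Migration

  def change do
    alter table(:records) do
      add :active, :boolean, default: false, null: false
    end

    # Partial unique index: at most one active record per book.
    create unique_index(:records, [:book_id], where: "active", name: :records_one_active_per_book_index)
  end
end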

Is it slow now?

I don’t know anything about PostgreSQL’s performance characteristics. And ultimately you would have to benchmark the query on your configuration.

There may be cases where an active column can be the best solution, but there is the trade-off of the additional update that needs to happen within the same transaction as the insert (both of which affect the index).

Another alternative is to formulate the query around due_for_return and put an index on that, but I don't think it's going to perform any better than using the primary key index.
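
To make that trade-off concrete, the write path would look roughly like this (a sketch only; Mango.Lending, the active column and check_out/2 are names I just made up):

defmodule Mango.Lending do
  import Ecto.Query
  alias Ecto.Multi

  # Deactivate the book's current record and insert the new active one in a
  # single transaction, so the "one active record per book" invariant holds.
  def check_out(book_id, due_for_return) do
    Multi.new()
    |> Multi.update_all(
         :deactivate,
         from(r in Mango.Records.Record, where: r.book_id == ^book_id and r.active),
         set: [active: false]
       )
    |> Multi.insert(:record, %Mango.Records.Record{
         book_id: book_id,
         due_for_return: due_for_return,
         active: true
       })
    |> Mango.Repo.transaction()
  end
end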


No, it is not slow. It would be interesting to do some benchmarks once the data gets bigger, and to compare against having an active field in records. I have generated some dummy data, ~5,000 books and ~60,000 records; results: 68 ms to count them and 134 ms to return a set of 20 records with limit and offset.
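
For reference, the paging is just composed onto books_query from above (sketch; page and page_size are my own names):

import Ecto.Query

page = 1
page_size = 20

books_query
|> limit(^page_size)
|> offset(^((page - 1) * page_size))
|> Mango.Repo.all()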

You could also move the query into a view. That way the Ecto select is extremely simple and you can change the query inside the database if needed later.

PostgreSQL Views
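
For example, something along these lines (untested sketch; the view name and selected columns are just illustrative):

defmodule Mango.Repo.Migrations.CreateBookLatestRecordsView do
  use Ecto.Migration

  def up do
    execute """
    CREATE VIEW book_latest_records AS
    SELECT b.id AS book_id, b.title, r.id AS record_id, r.due_for_return, r.returned_at
    FROM books b
    LEFT JOIN (
      SELECT DISTINCT ON (book_id) book_id, id, due_for_return, returned_at
      FROM records
      ORDER BY book_id, id DESC
    ) r ON b.id = r.book_id
    """
  end

  def down do
    execute "DROP VIEW book_latest_records"
  end
end

On the Ecto side the query then reduces to a plain select over the view, e.g. from(v in "book_latest_records", select: %{book_id: v.book_id, title: v.title, due: v.due_for_return}) |> Mango.Repo.all()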


Hi acrolink et al,
I know I’m a bit late to this party but I was reading this for inspiration on a similar issue today and I ended up just using a fragment like so:

books_query =
  from(
    b in Mango.Books.Book,
    left_join: r in Mango.Records.Record,
    on: r.book_id == b.id,
    select: %{
      book_id: fragment("distinct on (?) ?", b.id, b.id),
      record_id: r.id,
      due_for_return: r.due_for_return,
      returned_at: r.returned_at
    },
    order_by: [asc: b.id, desc: r.due_for_return]
  )

My particular case didn’t allow for the methods outlined above so I gave that a go and it worked.

Also, I've sorted by due_for_return instead of id, just in case something crazy happened and your record ids ended up out of sync with due_for_return.

I’d be interested to see how this pans out for you.

S.
