Ecto - Validate and return which items exist in the database and which don't (using their ids)

Hello. I’ve been trying to figure out how to validate a list of ids exist in a database using Ecto. We use React so I’ve been doing it in the frontend like this:

  1. Get the ids of the items to validate and send them to Phoenix (we use Absinthe). Example:
{
	"input": ["123", "456"] <- this may be ids, strings or uuids
}
  1. Get the items that exist and return a list.
  2. In React, check which ids don’t exist in the response to get the invalid ids.

This has been causing me trouble when trying to check thousands of ids (obviously this isn’t the best way). I’m sure there’s a more efficient way of doing this using Ecto.

I’d like to get in the response a list of the valid/invalid ones. So far the only thing I’ve found is using Ecto like this (but I’d still have to check all of them individually).

Repo.exists?(from i in Item, where: i.id == ^item_id)

Any help is appreciated.

Something like item_ids -- Repo.all(from i in Item, where i.id in ^item_ids, select: i.id) should do the trick?

1 Like

The database can handle this, but it isn’t really supported by the macros provided by Ecto, so you would have to drop down to raw SQL. If you’re willing to do that, something like this could work:

values = ids |> Enum.filter(&is_integer/1) |> Enum.map(fn id -> "(#{id})" end) |> Enum.join(",")
query = "select v.id from (values #{values}) as v(id) left join items i on i.id = v.id where i.id is null;"

case Repo.query(query) do
  {:ok, %{rows: rows}} ->
    for [id] <- rows, do: id

  _ ->
    []
end

If you’d rather stick to the provided macros, dom’s answer should work. If your list can get really long you may want to consider a different data structure to potentially improve the diffing though:

query = from i in Item, where i.id in ^ids, select: i.id
existing_ids = query |> Repo.all() |> MapSet.new()
Enum.reject(ids, fn id -> MapSet.member?(existing_ids, id) end)

Why wouldn’t it be possible to use the macros like so:

query =
      from i in Item,
        left_join: v in Item,
        on: [id: i.id],
        where: i.id in ^ids and is_nil(v.id),
        select: i.id

Repo.all(query)

P.S. Every time I see some form of string interpolation in SQL queries I get goosebumps.

Bit of an XY question, but where is the frontend getting IDs that might not exist?

The goosebumps are understandable. That’s why I made sure to filter the list to make sure it only contains integers, but it still brings up security worries.

The reason the macros can’t really be used is that the IDs being provided don’t actually exist in the database, and so by trying to use from i in Item we’re telling it to only select the IDs that exist in that table. As far as I know, Ecto doesn’t support querying from non-relations using its macros, so selecting IDs that don’t actually exist will have to be done with raw SQL.

The second answer I provided (as well as the one dom provided) will maintain the safety of the macros and still get the job done, as long as OP is ok with shifting the work of diffing the list from the database into their application server.

@dnsbty unnest would allow you to at least use SQL parameters. There’s definitely no need to do string interpolation.

1 Like

Right now from spreadsheets (from an administrative team). Sometimes they input thousands of ids in a spreadsheet cell. Since they copy/paste a lot sometimes they have typo, missing numbers, invalid characters, etc. It’s weird.

Haha the second bit is the way I’m doing it in the frontend. Would this be better if I’m dealing with let’s say 2K ids over using a join?

I hadn’t used unnest before, but that looks like a great solution. Thanks for pointing it out!

So that would make my above code look like this:

query = "select v.id from unnest($1::integer[]) as v(id) left join items i on i.id = v.id where i.id is null;"

case Repo.query(query, [ids]) do
  {:ok, %{rows: rows}} ->
    for [id] <- rows, do: id

  _ ->
    []
end
1 Like

Two thousand integer IDs probably won’t be a huge performance hit either way. I would tend toward having the database return the expected result, but it will all depend on the constraints of your system and your team.

1 Like