Using a Postgres VALUES list in Ecto

G'day! We need to filter a query by a very large number of UUIDs (our current upper bound is ~70k) provided by an external service.

We found that we can get acceptable query times (~120ms on a 100k-row table matching the full 70k) in psql by using a VALUES list and joining across our indexed table. We got the idea from here.
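For reference, this is roughly the shape of the query we benchmarked in psql (table and column names simplified, with only two of the ~70k rows shown):

    SELECT a.*
    FROM answers AS a
    INNER JOIN (VALUES
      ('00000000-0000-0000-0000-000000000001'::uuid),
      ('00000000-0000-0000-0000-000000000002'::uuid)
    ) AS f(id) ON a.id = f.id;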

This doesn’t seem to be supported in ecto/postgrex, however, and we are having trouble getting it to work in a fragment.

fragment("VALUE ?", ["uuid", ...]) and fragment("VALUE ?", [{"uuid", ...}]), and even interpolating the ids into a string (I felt dirty doing it) return something like:

** (Postgrex.Error) ERROR 42601 (syntax_error): syntax error at or near "$1"
[debug] QUERY ERROR source="answers" db=0.8ms
SELECT ... FROM "answers" AS a0 INNER JOIN (VALUES $1) AS f1 ON TRUE [["uuid", ...]]

Interpolating the ids straight into the string, like "VALUES ('uuid'), …", returns an ArgumentError about SQL injection, as it rightly should, and as we want it to.

Any help in constructing a fragment for a value list would be greatly appreciated. Cheers.


Trying this locally, Ecto (or … postgrex, I suppose) is turning the list into ARRAY[..]. So if you aren't using a version of Postgres that has the performance regressions around ANY(ARRAY[..]), you can instead write fragment("ANY(?)", [..]) (or even just fragment("?", [..])) and it will create a nicely formatted ARRAY string … if that's an option, it is probably the easiest.

We're on 9.6, so we were thinking the same. It only trimmed our test queries down from ~2.3s to ~1.6s, though. It was when we tried the JOIN that we saw actually acceptable performance, so it's either this or temp tables, and we'd really like to go this route if we can.

Hey @brentonannan, did you try just doing essentially from a in Answer, where: a.value in ^values ?

@benwilson512 yep, that's what we have now, and it's serving us fine when the filtering set is < ~1000 or so, but we're expecting higher volume on this particular query, and we're seeing ~2.3s in tests with an IN.

Ah, my bad. A long time ago it used to do ANY(ARRAY[..]), but evidently it does the more straightforward IN now.

In any case, if you want to use ANY you can do:

User |> where([u], u.id == fragment("ANY(?)", [1,2])) |> Repo.all
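If the list is only known at runtime, you'd pin it, and you can add a type/2 cast so Postgres receives a typed array. A small sketch, assuming integer ids:

    import Ecto.Query

    ids = [1, 2, 3]

    # Generates WHERE u0."id" = ANY($1), with the list sent as an array parameter
    User
    |> where([u], u.id == fragment("ANY(?)", type(^ids, {:array, :integer})))
    |> Repo.all()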

You may also want to consider doing an UNNEST JOIN against the list if they're IDs, something like the sketch below.
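A minimal, untested sketch of that, borrowing the Answer schema from earlier in the thread and assuming a binary_id primary key:

    import Ecto.Query

    ids = ["00000000-0000-0000-0000-000000000001"]  # placeholder; really the ~70k UUIDs

    query =
      from a in Answer,
        join: u in fragment("SELECT unnest(?) AS id", type(^ids, {:array, :binary_id})),
        on: a.id == u.id,
        select: a

    Repo.all(query)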

Is there any way we can use VALUES (1), (2) now? Even the = ANY(?) version is much slower than being able to use VALUES.

Update: d.id == fragment("ANY(SELECT unnest(?))", type(^ids, {:array, :binary_id})) works much better. The above question is still valid though :slight_smile:

Can you provide the methods you’re using to benchmark this? https://stackoverflow.com/questions/34627026/in-vs-any-operator-in-postgresql and the related links seem to indicate that they ought to perform equivalently.

IN and = ANY do perform the same.
But if the array you pass to IN (?) and = ANY(?) is long enough, it starts to be much slower compared to = ANY(VALUES (1), (2), ...) or = ANY(SELECT unnest(ARRAY[1, 2, 3])).

Update: in my case I was running a query over a 2M-row table with a 3.5k-item array passed to IN (or = ANY), and the change to unnest cut the query time from 12 seconds to 2.

Update 2: so, supposing that x is a list with 3.5k elements: where y IN x and where y = ANY(x) perform roughly the same, while where y = ANY(SELECT unnest(x)) is much faster.

Got it. Is there any downside to the unnest option? Does it add appreciable overhead for small lists?

I've tried running pgbench for a while (with a pgbench table of 100k and 10M rows, using arrays of 10, 200 and 3k elements), but I haven't been able to find a case that reproduces the speed improvement I'm seeing in my own project when switching to unnest or VALUES; instead, the unnest and VALUES methods come out slower.
As far as I know, using WHERE IN or ANY with a long list is widely considered slow and inefficient. So if these numbers are valid (I'm not good enough at PostgreSQL to understand why; maybe because pgbench re-parses the query every time, and/or because it is much faster when Ecto passes the array as a parameter instead of inlining it into the query), there's no reason to change the IN to use unnest as proposed here.
If I have time I'll try to benchmark Ecto with benchee directly, roughly like the sketch below.
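Something along these lines (borrowing the Answer schema from earlier in the thread; assumes Benchee is available and the table is realistically sized):

    import Ecto.Query

    # 3.5k random UUIDs, matching the array size in my real query
    ids = Enum.map(1..3_500, fn _ -> Ecto.UUID.generate() end)

    plain_in = from a in Answer, where: a.id in ^ids

    any_array = from a in Answer,
      where: a.id == fragment("ANY(?)", type(^ids, {:array, :binary_id}))

    any_unnest = from a in Answer,
      where: a.id == fragment("ANY(SELECT unnest(?))", type(^ids, {:array, :binary_id}))

    Benchee.run(%{
      "IN"            => fn -> Repo.all(plain_in) end,
      "= ANY(array)"  => fn -> Repo.all(any_array) end,
      "= ANY(unnest)" => fn -> Repo.all(any_unnest) end
    })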

These are the results of running pgbench for 30 seconds with 30 clients with 100k rows in the table:

With 10 elements:

  • 36ktps using IN (...)
  • 36ktps using = ANY ARRAY[...]
  • 18ktps using = ANY (select unnest(ARRAY[...]))
  • 17ktps using = ANY (VALUES (.), (.), ...)

With 200 elements:

  • 4.2ktps using IN (...)
  • 4ktps using = ANY ARRAY[...]
  • 3.5ktps using = ANY (select unnest(ARRAY[...]))
  • 3.5ktps using = ANY (VALUES (.), (.), ...)

With 3000 elements:

  • 440tps using IN (...)
  • 415tps using = ANY ARRAY[...]
  • 380tps using = ANY (select unnest(ARRAY[...]))
  • 370tps using = ANY (VALUES (.), (.), ...)

These are the results of running pgbench for 30 seconds with 30 clients with 10M rows in the table:

With 10 elements:

  • 33ktps using IN (...)
  • 33ktps using = ANY ARRAY[...]
  • 15ktps using = ANY (select unnest(ARRAY[...]))
  • 16ktps using = ANY (VALUES (.), (.), ...)

With 3000 elements:

  • 380tps using IN (...)
  • 355tps using = ANY ARRAY[...]
  • 339tps using = ANY (select unnest(ARRAY[...]))
  • 370tps using = ANY (VALUES (.), (.), ...)