I’m using Ecto with SQLite, and I noticed a performance slowdown when using the pin operator (^) in a query.
My database has 3 tables and 7M rows.
This is the query:
query =
  from(s1 in SubjectScore,
    join: s4 in SubjectScore,
    on: s1.student_id == s4.student_id,
    join: s5 in SubjectScore,
    on: s1.student_id == s5.student_id,
    where: s1.subject_id == ^1 and s4.subject_id == ^4 and s5.subject_id == ^5,
    select: %{
      id: s1.student_id,
      avg_score: fragment("(? + ? + ?) / 3.0 as avg_score", s1.score, s4.score, s5.score)
    },
    order_by: [desc: fragment("avg_score")],
    limit: 10
  )
Repo.all(query)
Result
[debug] QUERY OK source="subject_scores" db=2952.8ms idle=1213.1ms
SELECT s0."student_id", (s0."score" + s1."score" + s2."score") / 3.0 as avg_score
FROM "subject_scores" AS s0
INNER JOIN "subject_scores" AS s1 ON s0."student_id" = s1."student_id"
INNER JOIN "subject_scores" AS s2 ON s0."student_id" = s2."student_id"
WHERE (((s0."subject_id" = ?) AND (s1."subject_id" = ?)) AND (s2."subject_id" = ?))
ORDER BY avg_score DESC
LIMIT 10 [1, 4, 5]
When I removed the pin operator:
[debug] QUERY OK source="subject_scores" db=1601.0ms queue=0.2ms idle=1384.9ms
SELECT s0."student_id", (s0."score" + s1."score" + s2."score") / 3.0 as avg_score
FROM "subject_scores" AS s0
INNER JOIN "subject_scores" AS s1 ON s0."student_id" = s1."student_id"
INNER JOIN "subject_scores" AS s2 ON s0."student_id" = s2."student_id"
WHERE (((s0."subject_id" = 1) AND (s1."subject_id" = 4)) AND (s2."subject_id" = 5))
ORDER BY avg_score DESC
LIMIT 10 []
I’ve retested this multiple times and the result is the same. Does anyone know what’s causing this? I’d be grateful for any suggestions to help investigate.
The pin operator marks a value to be passed as a parameter to a query. If you’re not passing values that change between executions of the query, you wouldn’t want to use it.
Not using the pin means the values become part of the prepared query instead of being passed as parameters.
I can’t speak for how SQLite internally handles the difference, but not needing to evaluate dynamic values is generally expected to be faster than doing that work. For more details you want to use Repo.explain/3 (Ecto.Adapters.SQL — Ecto SQL v3.12.1).
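For example, something like this (a sketch reusing the query and Repo from the post above; printing the result assumes the adapter returns the plan as a string, which is the default for the SQL adapters):

# Repo.explain/3 comes from Ecto.Adapters.SQL; :all mirrors Repo.all/2.
plan = Repo.explain(:all, query)
IO.puts(plan)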
These are the query plans I got from explain:
without pin operator
[debug] QUERY OK db=0.2ms
EXPLAIN QUERY PLAN SELECT s0."student_id", (s0."score" + s1."score" + s2."score") / 3.0 as avg_score FROM "subject_scores" AS s0 INNER JOIN "subject_scores" AS s1 ON s0."student_id" = s1."student_id" INNER JOIN "subject_scores" AS s2 ON s0."student_id" = s2."student_id" WHERE (((s0."subject_id" = 1) AND (s1."subject_id" = 4)) AND (s2."subject_id" = 5)) ORDER BY avg_score DESC LIMIT 10 []
QUERY PLAN
|--SCAN s1
|--SEARCH s0 USING INDEX subject_scores_student_id_index (student_id=?)
|--SEARCH s2 USING INDEX subject_scores_student_id_index (student_id=?)
`--USE TEMP B-TREE FOR ORDER BY
with pin operator
[debug] QUERY OK db=0.1ms
EXPLAIN QUERY PLAN SELECT s0."student_id", (s0."score" + s1."score" + s2."score") / 3.0 as avg_score FROM "subject_scores" AS s0 INNER JOIN "subject_scores" AS s1 ON s0."student_id" = s1."student_id" INNER JOIN "subject_scores" AS s2 ON s0."student_id" = s2."student_id" WHERE (((s0."subject_id" = ?) AND (s1."subject_id" = ?)) AND (s2."subject_id" = ?)) ORDER BY avg_score DESC LIMIT 10 [1, 4, 5]
QUERY PLAN
|--SCAN s0
|--SEARCH s1 USING INDEX subject_scores_student_id_index (student_id=?)
|--SEARCH s2 USING INDEX subject_scores_student_id_index (student_id=?)
`--USE TEMP B-TREE FOR ORDER BY
When using the pin operator, it scans s0 first; without the pin, it scans s1. Maybe this is what leads to the performance impact.
I’d recommend rethinking the query itself. Seems odd to join to the same table 3 times. Just use subject_score once, with a where subject_id in ^subjects and group by the student_id. Then avg_score can use the built in avg function over the score column.
Well, perhaps I’m assuming SQLite has that function. If it doesn’t, ignore me.
But they are the same table, so that doesn’t seem like it would matter. There must be something weird going on with SQLite here, but I have no idea what it is.
BTW, maybe you would get better performance with something like (paraphrasing):
from row in SubjectScore,
  group_by: row.student_id,
  where: row.subject_id in ^subjects,
  select: {row.student_id, sum(row.score) / 3}
This really is the root cause. Fundamentally, Ecto doesn’t build query plans; the database does. The question to ask here is: why does SQLite perform differently for parameterized vs. hard-coded values in the same query?
And that really depends on SQLite, not on anything Ecto does.
I actually did that, but I forgot to count the subject_id, so I switched to joining the table three times.
I tried it and the performance is quite similar:
from s in SubjectScore,
  where: s.subject_id in ^subject_ids,
  group_by: s.student_id,
  having: count(s.subject_id) == 3,
  select: %{student_id: s.student_id, avg_score: fragment("avg(?) as avg_score", s.score)},
  order_by: [desc: fragment("avg_score")],
  limit: 10
[debug] QUERY OK source="subject_scores" db=1703.8ms queue=0.1ms idle=1073.2ms
SELECT s0."student_id", avg(s0."score") as avg_score FROM "subject_scores" AS s0 WHERE (s0."subject_id" IN (?,?,?)) GROUP BY s0."student_id" HAVING (count(s0."subject_id") = 3) ORDER BY avg_score DESC LIMIT 10 [1, 4, 5]
First of all, it would be interesting to see the plan for each value, pinning only one ^ at a time, to see if it happens regardless of which param is bound or only for a specific one. Another thing to try is using a prepared statement.
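For example, pinning only the first value and leaving the other two as literals could look roughly like this (a sketch reusing the SubjectScore schema and Repo from the original post; the other two variants follow the same pattern):

import Ecto.Query

first_subject = 1

query =
  from(s1 in SubjectScore,
    join: s4 in SubjectScore,
    on: s1.student_id == s4.student_id,
    join: s5 in SubjectScore,
    on: s1.student_id == s5.student_id,
    # only the first subject_id is parameterized; 4 and 5 are compiled in as literals
    where: s1.subject_id == ^first_subject and s4.subject_id == 4 and s5.subject_id == 5,
    select: %{
      id: s1.student_id,
      avg_score: fragment("(? + ? + ?) / 3.0 as avg_score", s1.score, s4.score, s5.score)
    },
    order_by: [desc: fragment("avg_score")],
    limit: 10
  )

IO.puts(Repo.explain(:all, query))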
As others suggested, this is a SQLite thing, and comes down to how SQLite handles parameterized queries.
From ChatGPT 4o:
With the pin operator, the execution plan changes.
SQLite is working as designed: it starts scanning s0 because it cannot determine the selectivity of the parameterized subject_id values during planning. This less optimal join order can lead to increased execution time.
SQLite starts with s0
Because it does not know the selectivity of the subject_id values. Since it has to plan without assuming specific values, it defaults to a generic order, usually favouring the declared table order (or whatever is heuristically cheaper).
If the subject scores table is large and indexed by subject_id, starting from the most selective (least matching) subject can drastically reduce join work.
Possible improvements:
Query refactor: Consider simplifying the query to avoid multiple joins to the same table by using GROUP BY or IN with AVG().
Indexing and ANALYZE: Ensure that the database is properly indexed, especially on student_id and subject_id, and that statistics are up to date using ANALYZE.
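A minimal sketch of the ANALYZE part, assuming the Repo from the original post (Repo.query!/2 is part of Ecto.Adapters.SQL, so it should also be available on a SQLite-backed repo):

# Refresh the planner statistics; SQLite stores them in sqlite_stat1.
Repo.query!("ANALYZE")

# Optionally inspect what the planner now knows about each index.
Repo.query!("SELECT * FROM sqlite_stat1")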
[debug] QUERY OK db=0.1ms
EXPLAIN QUERY PLAN SELECT s0."student_id", (s0."score" + s1."score" + s2."score") / 3.0 as avg_score FROM "subject_scores" AS s0 INNER JOIN "subject_scores" AS s1 ON s0."student_id" = s1."student_id" INNER JOIN "subject_scores" AS s2 ON s0."student_id" = s2."student_id" WHERE (((s0."subject_id" = ?) AND (s1."subject_id" = 4)) AND (s2."subject_id" = 5)) ORDER BY avg_score DESC LIMIT 10
[1]
QUERY PLAN
|--SCAN s0
|--SEARCH s1 USING INDEX subject_scores_student_id_index (student_id=?)
|--SEARCH s2 USING INDEX subject_scores_student_id_index (student_id=?)
`--USE TEMP B-TREE FOR ORDER BY
Second value
[debug] QUERY OK db=0.2ms
EXPLAIN QUERY PLAN SELECT s0."student_id", (s0."score" + s1."score" + s2."score") / 3.0 as avg_score FROM "subject_scores" AS s0 INNER JOIN "subject_scores" AS s1 ON s0."student_id" = s1."student_id" INNER JOIN "subject_scores" AS s2 ON s0."student_id" = s2."student_id" WHERE (((s0."subject_id" = 1) AND (s1."subject_id" = ?)) AND (s2."subject_id" = 5)) ORDER BY avg_score DESC LIMIT 10
[4]
QUERY PLAN
|--SCAN s1
|--SEARCH s0 USING INDEX subject_scores_student_id_index (student_id=?)
|--SEARCH s2 USING INDEX subject_scores_student_id_index (student_id=?)
`--USE TEMP B-TREE FOR ORDER BY
Third value
[debug] QUERY OK db=0.1ms
EXPLAIN QUERY PLAN SELECT s0."student_id", (s0."score" + s1."score" + s2."score") / 3.0 as avg_score FROM "subject_scores" AS s0 INNER JOIN "subject_scores" AS s1 ON s0."student_id" = s1."student_id" INNER JOIN "subject_scores" AS s2 ON s0."student_id" = s2."student_id" WHERE (((s0."subject_id" = 1) AND (s1."subject_id" = 4)) AND (s2."subject_id" = ?)) ORDER BY avg_score DESC LIMIT 10
[5]
QUERY PLAN
|--SCAN s1
|--SEARCH s0 USING INDEX subject_scores_student_id_index (student_id=?)
|--SEARCH s2 USING INDEX subject_scores_student_id_index (student_id=?)
`--USE TEMP B-TREE FOR ORDER BY
Huh, this is actually a pretty good guess. If that’s true it’s weird that it doesn’t use stats to plan the prepared query - maybe they cache the query plan per prepared statement as an optimization?
To the OP, how frequent is each of the three subjects you’re joining on?
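For reference, one way to check that (a sketch reusing SubjectScore and Repo from the original post):

import Ecto.Query

# Row count per subject; the smallest (most selective) subject is the best
# candidate for the outer loop of the join.
from(s in SubjectScore,
  where: s.subject_id in [1, 4, 5],
  group_by: s.subject_id,
  select: {s.subject_id, count(s.student_id)}
)
|> Repo.all()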
Yeah, I figured maybe SQLite wouldn’t handle this too well. Unfortunately it’s not very good at aggregating across millions of rows.
Obviously there could be other concerns at play here, but if this is the only sort of query you’re doing something like DuckDB would probably be a couple thousand times faster.
Yep, that’s it then! Strangely some docs I saw suggested that SQLite doesn’t gather stats unless you run ANALYZE, so I wonder if it really is the query planner being smart or just getting lucky. Either way running the less selective scan first is obviously going to be faster.
Thinking about it more, caching the query plan per statement does make a lot of sense, so that’s probably what’s going on. I suppose if you had a case where this really mattered, you could escape the values yourself and hard-code the queries the dangerous way.
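Purely as an illustration of that “dangerous way” (the subject_ids variable and the SQL string below are made up; this is only tolerable because the values are checked to be integers before being interpolated):

# DANGEROUS: interpolating values into SQL yourself. Validate/escape them first;
# insisting on integers means no quoting is needed.
subject_ids = [1, 4, 5]
true = Enum.all?(subject_ids, &is_integer/1)
[a, b, c] = subject_ids

sql = """
SELECT s0.student_id, (s0.score + s1.score + s2.score) / 3.0 AS avg_score
FROM subject_scores AS s0
JOIN subject_scores AS s1 ON s0.student_id = s1.student_id
JOIN subject_scores AS s2 ON s0.student_id = s2.student_id
WHERE s0.subject_id = #{a} AND s1.subject_id = #{b} AND s2.subject_id = #{c}
ORDER BY avg_score DESC
LIMIT 10
"""

Repo.query!(sql)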
Since it can be detrimental for queries that change a lot or are not widely used. E.g. delete statements rarely benefit from this, nor do queries with a dynamic where clause.
I have rarely seen Ecto use prepared statements; whether that is because Ecto hides it, I don’t know. I would expect to see it in the telemetry events, since you have to rewrite the query into a prepared one.
I guess the problem is which level one discusses here. Yes, Postgres supports “unnamed or anonymous prepared statements” (PostgreSQL: Documentation: 17: 53.2. Message Flow), where you don’t need to explicitly call PREPARE and thereby turn the query into a named prepared statement with all its caching behaviour. There are performance reasons to opt into those unnamed prepared statements, as you mention. At the database those queries are still prepared (parsed), though, in order to receive parameter bindings.
For SQLite it’s the same, if not more explicit. There’s no API to execute an unprepared query. SQL strings always need to be parsed into a statement object (a prepared statement), to which parameters can then optionally be bound before the statement is executed.
Ecto very much defaults to (named) prepared statements. It even caches references to reuse them across code paths. You need to opt out of that behaviour if you want it to use unnamed prepared statements.
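For the Postgres adapter that opt-out is the :prepare repo option; whether the SQLite adapter (ecto_sqlite3) honours the same option is an assumption you’d want to check against its docs. A sketch with made-up app/repo names:

# config/config.exs
import Config

config :my_app, MyApp.Repo,
  prepare: :unnamed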