I’m using Ecto with SQLite, and I noticed a performance slowdown when using the pin operator (^) in a query.
My database has 3 tables and 7M rows.
This is the query:
query =
  from(s1 in SubjectScore,
    join: s4 in SubjectScore,
    on: s1.student_id == s4.student_id,
    join: s5 in SubjectScore,
    on: s1.student_id == s5.student_id,
    where: s1.subject_id == ^1 and s4.subject_id == ^4 and s5.subject_id == ^5,
    select: %{
      id: s1.student_id,
      avg_score: fragment("(? + ? + ?) / 3.0 as avg_score", s1.score, s4.score, s5.score)
    },
    order_by: [desc: fragment("avg_score")],
    limit: 10
  )
Repo.all(query)
Result
[debug] QUERY OK source="subject_scores" db=2952.8ms idle=1213.1ms
SELECT s0."student_id", (s0."score" + s1."score" + s2."score") / 3.0 as avg_score
FROM "subject_scores" AS s0
INNER JOIN "subject_scores" AS s1 ON s0."student_id" = s1."student_id"
INNER JOIN "subject_scores" AS s2 ON s0."student_id" = s2."student_id"
WHERE (((s0."subject_id" = ?) AND (s1."subject_id" = ?)) AND (s2."subject_id" = ?))
ORDER BY avg_score DESC
LIMIT 10 [1, 4, 5]
When I removed the pin operator:
[debug] QUERY OK source="subject_scores" db=1601.0ms queue=0.2ms idle=1384.9ms
SELECT s0."student_id", (s0."score" + s1."score" + s2."score") / 3.0 as avg_score
FROM "subject_scores" AS s0
INNER JOIN "subject_scores" AS s1 ON s0."student_id" = s1."student_id"
INNER JOIN "subject_scores" AS s2 ON s0."student_id" = s2."student_id"
WHERE (((s0."subject_id" = 1) AND (s1."subject_id" = 4)) AND (s2."subject_id" = 5))
ORDER BY avg_score DESC
LIMIT 10 []
I’ve retested this multiple times and the result is the same. Does anyone know what’s causing this? I’d be grateful for any suggestions to help investigate.
The pin operator marks a value to be passed as a parameter to a query. If you’re not passing values that change between executions of the query, you wouldn’t want to use it.
Not using the pin means the values become part of the prepared query instead of being passed as parameters.
I can’t speak for how SQLite internally handles the difference, but not needing to evaluate dynamic values is generally expected to be faster than doing that work. For more details you want to use Repo.explain/3 (Ecto.Adapters.SQL — Ecto SQL v3.12.1).
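For example, something like this (a sketch reusing the query and Repo from the post above; printing the result assumes the adapter returns the plan as a string, which is the default for the SQL adapters):

# Repo.explain/3 comes from Ecto.Adapters.SQL; :all mirrors Repo.all/2.
plan = Repo.explain(:all, query)
IO.puts(plan)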
These are the query plans I got from explain:
without pin operator
[debug] QUERY OK db=0.2ms
EXPLAIN QUERY PLAN SELECT s0."student_id", (s0."score" + s1."score" + s2."score") / 3.0 as avg_score FROM "subject_scores" AS s0 INNER JOIN "subject_scores" AS s1 ON s0."student_id" = s1."student_id" INNER JOIN "subject_scores" AS s2 ON s0."student_id" = s2."student_id" WHERE (((s0."subject_id" = 1) AND (s1."subject_id" = 4)) AND (s2."subject_id" = 5)) ORDER BY avg_score DESC LIMIT 10 []
QUERY PLAN
|--SCAN s1
|--SEARCH s0 USING INDEX subject_scores_student_id_index (student_id=?)
|--SEARCH s2 USING INDEX subject_scores_student_id_index (student_id=?)
`--USE TEMP B-TREE FOR ORDER BY
with pin operator
[debug] QUERY OK db=0.1ms
EXPLAIN QUERY PLAN SELECT s0."student_id", (s0."score" + s1."score" + s2."score") / 3.0 as avg_score FROM "subject_scores" AS s0 INNER JOIN "subject_scores" AS s1 ON s0."student_id" = s1."student_id" INNER JOIN "subject_scores" AS s2 ON s0."student_id" = s2."student_id" WHERE (((s0."subject_id" = ?) AND (s1."subject_id" = ?)) AND (s2."subject_id" = ?)) ORDER BY avg_score DESC LIMIT 10 [1, 4, 5]
QUERY PLAN
|--SCAN s0
|--SEARCH s1 USING INDEX subject_scores_student_id_index (student_id=?)
|--SEARCH s2 USING INDEX subject_scores_student_id_index (student_id=?)
`--USE TEMP B-TREE FOR ORDER BY
When using the pin operator, it scans s0 first; without the pin, it scans s1. Maybe this is what leads to the performance impact.
I’d recommend rethinking the query itself. Seems odd to join to the same table 3 times. Just use subject_score once, with a where subject_id in ^subjects and group by the student_id. Then avg_score can use the built in avg function over the score column.
Well, perhaps I’m assuming SQLite has that function. If it doesn’t, ignore me.
But they are the same table, so that doesn’t seem like it would matter. There must be something weird going on with SQLite here, but I have no idea what it is.
BTW, maybe you would get better performance with something like (paraphrasing):
from row in SubjectScore,
  group_by: row.student_id,
  where: row.subject_id in ^subjects,
  select: {row.student_id, sum(row.score) / 3}
This really is the root cause. Fundamentally, Ecto doesn’t build query plans; the database does. The question to ask here is: why does SQLite perform differently for parameterized vs. hard-coded values in the same query?
And that really depends on SQLite, not on anything Ecto does.
I actually did that, but I forgot to count the subject_id, so I switched to joining the table three times.
I tried it and the performance is quite similar:
from s in SubjectScore,
  where: s.subject_id in ^subject_ids,
  group_by: s.student_id,
  having: count(s.subject_id) == 3,
  select: %{student_id: s.student_id, avg_score: fragment("avg(?) as avg_score", s.score)},
  order_by: [desc: fragment("avg_score")],
  limit: 10
[debug] QUERY OK source="subject_scores" db=1703.8ms queue=0.1ms idle=1073.2ms
SELECT s0."student_id", avg(s0."score") as avg_score FROM "subject_scores" AS s0 WHERE (s0."subject_id" IN (?,?,?)) GROUP BY s0."student_id" HAVING (count(s0."subject_id") = 3) ORDER BY avg_score DESC LIMIT 10 [1, 4, 5]
First of all, it would be interesting to see the plan for each value, pinning only one ^ at a time, to see if it happens regardless of which param is bound or only for a specific one. Another thing to try is using a prepared statement.
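For example, pinning only the first value and leaving the other two as literals could look roughly like this (a sketch reusing the SubjectScore schema and Repo from the original post; the other two variants follow the same pattern):

import Ecto.Query

first_subject = 1

query =
  from(s1 in SubjectScore,
    join: s4 in SubjectScore,
    on: s1.student_id == s4.student_id,
    join: s5 in SubjectScore,
    on: s1.student_id == s5.student_id,
    # only the first subject_id is parameterized; 4 and 5 are compiled in as literals
    where: s1.subject_id == ^first_subject and s4.subject_id == 4 and s5.subject_id == 5,
    select: %{
      id: s1.student_id,
      avg_score: fragment("(? + ? + ?) / 3.0 as avg_score", s1.score, s4.score, s5.score)
    },
    order_by: [desc: fragment("avg_score")],
    limit: 10
  )

IO.puts(Repo.explain(:all, query))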
As others suggested, this is a SQLite thing, and comes down to how SQLite handles parameterized queries.
From ChatGPT 4o:
With the pin operator, the execution plan changes.
SQLite is working as designed: it starts scanning s0 because it cannot determine the selectivity of the parameterized subject_id values during planning. This less optimal join order can lead to increased execution time.
SQLite starts with s0
Because it does not know the selectivity of the subject_id values. Since it has to plan without assuming specific values, it defaults to a generic order, usually favouring the declared table order (or whatever is heuristically cheaper).
If the subject scores table is large and indexed by subject_id, starting from the most selective (least matching) subject can drastically reduce join work.
Possible improvements:
Query refactor: Consider simplifying the query to avoid multiple joins to the same table by using GROUP BY or IN with AVG().
Indexing and ANALYZE: Ensure that the database is properly indexed, especially on student_id and subject_id, and that statistics are up to date using ANALYZE.
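A minimal sketch of the ANALYZE part, assuming the Repo from the original post (Repo.query!/2 is part of Ecto.Adapters.SQL, so it should also be available on a SQLite-backed repo):

# Refresh the planner statistics; SQLite stores them in sqlite_stat1.
Repo.query!("ANALYZE")

# Optionally inspect what the planner now knows about each index.
Repo.query!("SELECT * FROM sqlite_stat1")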
[debug] QUERY OK db=0.1ms
EXPLAIN QUERY PLAN SELECT s0."student_id", (s0."score" + s1."score" + s2."score") / 3.0 as avg_score FROM "subject_scores" AS s0 INNER JOIN "subject_scores" AS s1 ON s0."student_id" = s1."student_id" INNER JOIN "subject_scores" AS s2 ON s0."student_id" = s2."student_id" WHERE (((s0."subject_id" = ?) AND (s1."subject_id" = 4)) AND (s2."subject_id" = 5)) ORDER BY avg_score DESC LIMIT 10
[1]
QUERY PLAN
|--SCAN s0
|--SEARCH s1 USING INDEX subject_scores_student_id_index (student_id=?)
|--SEARCH s2 USING INDEX subject_scores_student_id_index (student_id=?)
`--USE TEMP B-TREE FOR ORDER BY
Second value
[debug] QUERY OK db=0.2ms
EXPLAIN QUERY PLAN SELECT s0."student_id", (s0."score" + s1."score" + s2."score") / 3.0 as avg_score FROM "subject_scores" AS s0 INNER JOIN "subject_scores" AS s1 ON s0."student_id" = s1."student_id" INNER JOIN "subject_scores" AS s2 ON s0."student_id" = s2."student_id" WHERE (((s0."subject_id" = 1) AND (s1."subject_id" = ?)) AND (s2."subject_id" = 5)) ORDER BY avg_score DESC LIMIT 10
[4]
QUERY PLAN
|--SCAN s1
|--SEARCH s0 USING INDEX subject_scores_student_id_index (student_id=?)
|--SEARCH s2 USING INDEX subject_scores_student_id_index (student_id=?)
`--USE TEMP B-TREE FOR ORDER BY
Third value
[debug] QUERY OK db=0.1ms
EXPLAIN QUERY PLAN SELECT s0."student_id", (s0."score" + s1."score" + s2."score") / 3.0 as avg_score FROM "subject_scores" AS s0 INNER JOIN "subject_scores" AS s1 ON s0."student_id" = s1."student_id" INNER JOIN "subject_scores" AS s2 ON s0."student_id" = s2."student_id" WHERE (((s0."subject_id" = 1) AND (s1."subject_id" = 4)) AND (s2."subject_id" = ?)) ORDER BY avg_score DESC LIMIT 10
[5]
QUERY PLAN
|--SCAN s1
|--SEARCH s0 USING INDEX subject_scores_student_id_index (student_id=?)
|--SEARCH s2 USING INDEX subject_scores_student_id_index (student_id=?)
`--USE TEMP B-TREE FOR ORDER BY
Huh, this is actually a pretty good guess. If that’s true it’s weird that it doesn’t use stats to plan the prepared query - maybe they cache the query plan per prepared statement as an optimization?
To the OP, how frequent is each of the three subjects you’re joining on?
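For reference, one way to check that (a sketch reusing SubjectScore and Repo from the original post):

import Ecto.Query

# Row count per subject; the smallest (most selective) subject is the best
# candidate for the outer loop of the join.
from(s in SubjectScore,
  where: s.subject_id in [1, 4, 5],
  group_by: s.subject_id,
  select: {s.subject_id, count(s.student_id)}
)
|> Repo.all()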
Yeah, I figured maybe SQLite wouldn’t handle this too well. Unfortunately it’s not very good at aggregating across millions of rows.
Obviously there could be other concerns at play here, but if this is the only sort of query you’re doing something like DuckDB would probably be a couple thousand times faster.
Yep, that’s it then! Strangely some docs I saw suggested that SQLite doesn’t gather stats unless you run ANALYZE, so I wonder if it really is the query planner being smart or just getting lucky. Either way running the less selective scan first is obviously going to be faster.
Thinking about it more, caching the query plan per statement does make a lot of sense, so that’s probably what’s going on. I suppose if you had a case where this really mattered, you could escape the values yourself and hard-code the queries the dangerous way.
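Purely as an illustration of that “dangerous way” (the subject_ids variable and the SQL string below are made up; this is only tolerable because the values are checked to be integers before being interpolated):

# DANGEROUS: interpolating values into SQL yourself. Validate/escape them first;
# insisting on integers means no quoting is needed.
subject_ids = [1, 4, 5]
true = Enum.all?(subject_ids, &is_integer/1)
[a, b, c] = subject_ids

sql = """
SELECT s0.student_id, (s0.score + s1.score + s2.score) / 3.0 AS avg_score
FROM subject_scores AS s0
JOIN subject_scores AS s1 ON s0.student_id = s1.student_id
JOIN subject_scores AS s2 ON s0.student_id = s2.student_id
WHERE s0.subject_id = #{a} AND s1.subject_id = #{b} AND s2.subject_id = #{c}
ORDER BY avg_score DESC
LIMIT 10
"""

Repo.query!(sql)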
Since it can be detrimental for queries that change a lot or are not widely used. E.g. delete statements rarely benefit from this, nor do queries with a dynamic where clause.
I have rarely seen Ecto use prepared statements; whether that is because Ecto hides it, I don’t know. I would expect to see it in the telemetry events, since you have to rewrite the query into a prepared one.
I guess the problem is which level one discusses here. Yes, Postgres supports “unnamed or anonymous prepared statements” (PostgreSQL: Documentation: 17: 53.2. Message Flow), where you don’t need to explicitly call PREPARE and thereby turn the query into a named prepared statement with all its caching behaviour. There are performance reasons to opt into those unnamed prepared statements, as you mention. At the database those queries are still prepared (parsed), though, in order to receive parameter bindings.
For SQLite it’s the same, if not more explicit. There’s no API to execute an unprepared query. SQL strings always need to be parsed into a statement object (a prepared statement), to which parameters can then optionally be bound before the statement is executed.
Ecto very much defaults to (named) prepared statements. It even caches references to reuse them across code paths. You need to opt out of that behaviour if you want it to use unnamed prepared statements.
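For the Postgres adapter that opt-out is the :prepare repo option; whether the SQLite adapter (ecto_sqlite3) honours the same option is an assumption you’d want to check against its docs. A sketch with made-up app/repo names:

# config/config.exs
import Config

config :my_app, MyApp.Repo,
  prepare: :unnamed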