(TLDR at the end)
I spent the past two days trying to optimize one of the SQLite DBs and operations on them for what would be a rather extreme scenario for us, though not impossible if we assume bugs (my thinking is if it works under these conditions, prod will be a breeze). Unfortunately my efforts resulted in little to no success…
I searched the forum and found a few good posts and articles (e.g., Ecto & sqlite3 database lock issues - #2 by victorbjorklund, and Optimizing SQLite for servers from Ecto & sqlite3 database lock issues - #7 by BartOtten). I’ve skimmed the SQLite docs (especially the PRAGMAs) plus the ecto_sqlite3 and exqlite docs too. I’ve tried a few different things (configs, DB access patterns, indexes, …), but I can’t get much more out of the DB.
So I’ve resigned from that battle for now.
However, there’s still two things bothering me. One is that the idle time is sometimes unreasonably high (e.g. 1608.5ms vs 12.9ms db time). The other is that when a given statement takes more time than the :timeout, an error is logged (suggesting the execution was canceled), but it actually continues through to the end and even returns, because no error is raised. Here are logs showing this (used timeout: 100 to force the error):
# First DB call
[info] MachineManager synchronizing... Current version=0, target=999992
[error] Exqlite.Connection (#PID<0.400.0> ("db_conn_2")) disconnected: ** (DBConnection.ConnectionError) client #PID<0.415.0> (Foo.Office.MachineAggregator) timed out because it queued and checked out the connection for longer than 100ms
#PID<0.415.0> (Foo.Office.MachineAggregator) was at location:
(exqlite 0.35.0) lib/exqlite/sqlite3.ex:274: Exqlite.Sqlite3.multi_step/3
(exqlite 0.35.0) lib/exqlite/sqlite3.ex:312: Exqlite.Sqlite3.try_fetch_all/3
(exqlite 0.35.0) lib/exqlite/sqlite3.ex:306: Exqlite.Sqlite3.fetch_all/3
(exqlite 0.35.0) lib/exqlite/connection.ex:695: Exqlite.Connection.get_rows/2
(exqlite 0.35.0) lib/exqlite/connection.ex:641: Exqlite.Connection.execute/4
(db_connection 2.9.0) lib/db_connection/holder.ex:356: DBConnection.Holder.holder_apply/4
(db_connection 2.9.0) lib/db_connection.ex:1539: DBConnection.run_execute/5
(db_connection 2.9.0) lib/db_connection.ex:1587: DBConnection.run_with_retries/5
(db_connection 2.9.0) lib/db_connection.ex:848: DBConnection.execute/4
(ecto_sqlite3 0.22.0) lib/ecto/adapters/sqlite3/connection.ex:90: Ecto.Adapters.SQLite3.Connection.query/4
(ecto_sql 3.13.5) lib/ecto/adapters/sql.ex:1073: Ecto.Adapters.SQL.execute!/5
(ecto_sql 3.13.5) lib/ecto/adapters/sql.ex:1011: Ecto.Adapters.SQL.execute/6
(ecto 3.13.5) lib/ecto/repo/queryable.ex:241: Ecto.Repo.Queryable.execute/4
(ecto 3.13.5) lib/ecto/repo/queryable.ex:19: Ecto.Repo.Queryable.all/3
(foo 0.1.0) lib/foo/office/machine_aggregator.ex:172: Foo.Office.MachineAggregator.synchronize/2
(foo 0.1.0) lib/foo/office/machine_aggregator.ex:149: Foo.Office.MachineAggregator.handle_cast/2
(stdlib 7.3) gen_server.erl:2460: :gen_server.try_handle_cast/3
(stdlib 7.3) gen_server.erl:2418: :gen_server.handle_msg/3
(stdlib 7.3) proc_lib.erl:333: :proc_lib.init_p_do_apply/3
# Result of the first DB call
[debug] QUERY OK db=1346.3ms idle=37.3ms
SELECT s0."id", s0."type", s0."arguments", s0."emitted_at", s0."inserted_at", s0."updated_at", s0."device_id", s0."row_number" FROM (SELECT ss0."id" AS "id", ss0."type" AS "type", ss0."arguments" AS "arguments", ss0."emitted_at" AS "emitted_at", ss0."inserted_at" AS "inserted_at", ss0."updated_at" AS "updated_at", ss0."device_id" AS "device_id", row_number() OVER "device_id_window" AS "row_number" FROM (SELECT sse0."id" AS "id", sse0."type" AS "type", sse0."arguments" AS "arguments", sse0."emitted_at" AS "emitted_at", sse0."inserted_at" AS "inserted_at", sse0."updated_at" AS "updated_at", coalesce(json_extract(sse0."arguments", '$.context.device_id'), json_extract(sse0."arguments", '$.arguments.device_id')) AS "device_id" FROM "events" AS sse0 WHERE (sse0."type" IN (?,?,?,?,?)) AND (sse0."id" > ?) ORDER BY sse0."emitted_at") AS ss0 WHERE (NOT (ss0."device_id" IS NULL)) WINDOW "device_id_window" AS (PARTITION BY ss0."device_id" ORDER BY ss0."emitted_at")) AS s0 WHERE (s0."row_number" <= ?) ["v0:machine:pinged", "v0:machine:reported", "v0:machine:added_pending_action", "v0:machine:canceled_pending_action", "v0:machine:sent_package", 0, 50]
↳ Foo.Office.MachineAggregator.synchronize/2, at: lib/foo/office/machine_aggregator.ex:172
# Second DB call (we can see the first really did succeed because "Current version" is now >0)
# Extra strange is the fact that this second call doesn't log any timeout error
[info] MachineManager synchronizing... Current version=999866, target=999992
# Result of the second DB call
[debug] QUERY OK db=12.9ms idle=1608.5ms
SELECT s0."id", s0."type", s0."arguments", s0."emitted_at", s0."inserted_at", s0."updated_at", s0."device_id", s0."row_number" FROM (SELECT ss0."id" AS "id", ss0."type" AS "type", ss0."arguments" AS "arguments", ss0."emitted_at" AS "emitted_at", ss0."inserted_at" AS "inserted_at", ss0."updated_at" AS "updated_at", ss0."device_id" AS "device_id", row_number() OVER "device_id_window" AS "row_number" FROM (SELECT sse0."id" AS "id", sse0."type" AS "type", sse0."arguments" AS "arguments", sse0."emitted_at" AS "emitted_at", sse0."inserted_at" AS "inserted_at", sse0."updated_at" AS "updated_at", coalesce(json_extract(sse0."arguments", '$.context.device_id'), json_extract(sse0."arguments", '$.arguments.device_id')) AS "device_id" FROM "events" AS sse0 WHERE (sse0."type" IN (?,?,?,?,?)) AND (sse0."id" > ?) ORDER BY sse0."emitted_at") AS ss0 WHERE (NOT (ss0."device_id" IS NULL)) WINDOW "device_id_window" AS (PARTITION BY ss0."device_id" ORDER BY ss0."emitted_at")) AS s0 WHERE (s0."row_number" <= ?) ["v0:machine:pinged", "v0:machine:reported", "v0:machine:added_pending_action", "v0:machine:canceled_pending_action", "v0:machine:sent_package", 999866, 50]
↳ Foo.Office.MachineAggregator.synchronize/2, at: lib/foo/office/machine_aggregator.ex:172
I wanted to know what exactly the db/idle times meant. I found the log, and their meaning (db is the :connection_time):
:connection_time- The length of time using the connection (if a connection was used):idle_time- The amount of time the connection was idle before use
If I understand it correctly, then: db is the time using the connection; and idle is the time between a process checking out a connection and starting to use it?
I also wanted to know what the timeout error meant. I found the log, but I’m having a hard time understanding where that timer is started to continue investigating – there’s two :erlang.start_timer/4 calls, but neither uses a 4-tuple message..?
TLDR: The questions I’m left with:
- Which time(s) count(s) for the timeout error? All of them? Only
db? This could explain why the second DB call doesn’t timeout; and actually further experiments point in this direction: withtimeout: 10both calls show the error log. - What influences the
idletime? Is it even worrying? (maybe not if onlydbtime counts for the timeout?) - Why is the timeout error logged if no error is raised? Is there any scenario where the error is raised (and thus the calling process dies if it doesn’t catch)?
- If these timeout errors are worrying, how can I prevent them from happening (other than setting a higher
:timeout)? - Do people using other DBs (e.g. PostgreSQL) have similar experiences? Or is this mostly only applicable/common on SQLite?
Thanks a lot in advance!






















