We recently deployed some changes that included an Oban Pro upgrade from 1.4.2 to 1.4.10. When our nightly data processing jobs ran, which include a workflow with lots of steps and dependencies, we got into a weird state: the errors below would periodically appear in the logs, the feeds Oban queue that all of these jobs run on kept disappearing from the list of queues in Oban Web, jobs stayed stuck in executing, and the dependent jobs sat in some combination of scheduled and available. Sorry, I don’t recall exactly what the breakdown looked like. I ended up cancelling all the jobs to get us back to a good state and to stop seeing the errors.
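For reference, I cancelled them from iex with something along these lines (from memory; the exact states I filtered on may have differed):

import Ecto.Query

# Cancel everything on the feeds queue that hadn't finished yet.
# Oban.cancel_all_jobs/1 accepts any Oban.Job queryable.
Oban.Job
|> where([j], j.queue == "feeds")
|> where([j], j.state in ["executing", "available", "scheduled", "retryable"])
|> Oban.cancel_all_jobs()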
Here are the errors.
[error] Postgrex.Protocol (#PID<0.3769.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.47785.0> timed out because it queued and checked out the connection for longer than 15000ms
#PID<0.47785.0> was at location:
(stdlib 6.0) gen.erl:241: :gen.do_call/4
(stdlib 6.0) gen_statem.erl:2633: :gen_statem.call/3
(ssl 11.2) ssl_gen_statem.erl:1267: :ssl_gen_statem.call/2
(postgrex 0.18.0) lib/postgrex/protocol.ex:3336: Postgrex.Protocol.rows_recv/5
(postgrex 0.18.0) lib/postgrex/protocol.ex:2300: Postgrex.Protocol.recv_execute/5
(postgrex 0.18.0) lib/postgrex/protocol.ex:2174: Postgrex.Protocol.bind_execute/4
(db_connection 2.7.0) lib/db_connection/holder.ex:354: DBConnection.Holder.holder_apply/4
(db_connection 2.7.0) lib/db_connection.ex:1558: DBConnection.run_execute/5
[error] GenServer {Oban.Registry, {Oban, {:plugin, Oban.Pro.Plugins.DynamicLifeline}}} terminating
** (Postgrex.Error) ERROR 57014 (query_canceled) canceling statement due to user request
(ecto_sql 3.11.3) lib/ecto/adapters/sql.ex:1054: Ecto.Adapters.SQL.raise_sql_call_error/1
(ecto_sql 3.11.3) lib/ecto/adapters/sql.ex:952: Ecto.Adapters.SQL.execute/6
(ecto 3.11.2) lib/ecto/repo/queryable.ex:232: Ecto.Repo.Queryable.execute/4
(ecto 3.11.2) lib/ecto/repo/queryable.ex:19: Ecto.Repo.Queryable.all/3
(oban_pro 1.4.10) lib/oban/pro/facilitator.ex:90: Oban.Pro.Facilitator.rescue_workflows/1
(oban_pro 1.4.10) lib/oban/pro/plugins/dynamic_lifeline.ex:183: anonymous fn/2 in Oban.Pro.Plugins.DynamicLifeline.handle_info/2
(telemetry 1.2.1) .../deps/telemetry/src/telemetry.erl:321: :telemetry.span/3
(oban_pro 1.4.10) lib/oban/pro/plugins/dynamic_lifeline.ex:180: Oban.Pro.Plugins.DynamicLifeline.handle_info/2
Last message: :rescue
Do you have any idea what might be causing this? I was wondering if it might be related to the second bullet in Changelog — Oban Pro v1.4.12.
Note that we have a single worker with nothing fancy in terms of config: just named queues with some rate limiting. All of the data processing jobs run on the same queue.
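For context, the config looks more or less like this (module names are placeholders and the limits are illustrative; rate_limit comes from Pro's Smart engine, so double-check the exact keys against the docs for your version):

# config/config.exs
config :my_app, Oban,
  repo: MyApp.Repo,
  engine: Oban.Pro.Engines.Smart,
  plugins: [Oban.Pro.Plugins.DynamicLifeline],
  queues: [
    # all of the nightly data processing jobs run on this one queue
    feeds: [local_limit: 10, rate_limit: [allowed: 60, period: 60]],
    default: 10
  ]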
Funny enough, you are now on the exact same versions we are: oban 2.17.12 and oban_pro 1.4.10.
For us these actually fixed the error. You can try to downgrade oban and see what happens.
Our stacktrace looked like this because we use partitioned tables, but I believe some kind of timeout / slowness underlies this error as well.
Postgrex.Error: ERROR 40001 (serialization_failure) tuple to be locked was already moved to another partition due to concurrent update
File "lib/ecto/adapters/sql.ex", line 1078, in Ecto.Adapters.SQL.raise_sql_call_error/1
File "lib/ecto/adapters/sql.ex", line 976, in Ecto.Adapters.SQL.execute/6
File "lib/ecto/repo/queryable.ex", line 232, in Ecto.Repo.Queryable.execute/4
File "lib/oban/pro/plugins/dynamic_lifeline.ex", line 214, in Oban.Pro.Plugins.DynamicLifeline.rescue_orphaned/1
File "lib/oban/pro/plugins/dynamic_lifeline.ex", line 181, in anonymous fn/2 in Oban.Pro.Plugins.DynamicLifeline.handle_info/2
File "/app/deps/telemetry/src/telemetry.erl", line 321, in :telemetry.span/3
File "lib/oban/pro/plugins/dynamic_lifeline.ex", line 180, in Oban.Pro.Plugins.DynamicLifeline.handle_info/2
File "gen_server.erl", line 2173, in :gen_server.try_handle_info/3
How much data do you have that you decided to partition the table? I find it quite strange, as the rule of thumb at a company where I worked with huge databases was to partition tables only when they are bigger than 200 GB including indexes.
That’s off topic for this question but it all depends on what performance you want.
We wanted to keep our completed / cancelled / discarded jobs for longer, so that’s why we partitioned our table.
Now we can keep them for multiple weeks without having performance problems on the oban_jobs_incomplete table.
This was most likely caused by changes to workflow rescuing added between those versions. Without a migration to add the relevant indexes, the query to find workflows is too slow, which causes the timeout.
It doesn’t matter whether you are using workflows or not; there’s no way for the lifeline to know that, so it still does the check.
FWIW, Pro v1.5 (in RC) centralizes migrations and will warn when they’re not current.
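In the meantime, the relevant index is most likely the one on the workflow_id stored in job meta. A rough sketch of what that migration looks like (the definition in the actual Pro migration may differ, so prefer running the one from the docs):

defmodule MyApp.Repo.Migrations.AddObanWorkflowIndex do
  use Ecto.Migration

  def change do
    # Partial index so only workflow jobs pay the indexing cost.
    # On a big table consider concurrently: true (with @disable_ddl_transaction true).
    create_if_not_exists index(
      :oban_jobs,
      ["(meta->>'workflow_id')"],
      where: "meta ? 'workflow_id'",
      name: :oban_jobs_workflow_id_index
    )
  end
end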
Oban changes shouldn’t be related to the issue you saw; hopefully you didn’t downgrade.
Serialization failures with partitioned tables are a different story. Those come from moving jobs between partitions, which is especially touchy with the highest Postgres serialization level.
We wouldn’t recommend partitioning (using Pro’s DynamicPartitioner) unless you’re running a very high number of jobs per day. However, it’s not about table size; it’s about the ability to prune without leaving bloat. The jobs table usually isn’t that large, but it will likely bloat under high throughput.
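For most apps the standard pruner is enough, e.g. keeping a week of completed / cancelled / discarded history (max_age is in seconds; the value here is just an example, and :my_app is a placeholder):

# Prune completed / cancelled / discarded jobs older than 7 days
config :my_app, Oban,
  plugins: [{Oban.Plugins.Pruner, max_age: 60 * 60 * 24 * 7}]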
Running the index migration seems to have fixed the problem for us. It’s embarrassing how often I miss those. A warning when they’re not current sounds like a great idea.
This is about the only instance where a module is running a query unrelated to its primary function.
With the proper index in place, the workflow rescue query is very fast, and as mentioned above, as of Pro v1.5 all the migrations are centralized and you’ll have a loud warning when an index is missing.
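Until then, one way to double-check that the index actually made it into a given environment is to list the indexes on oban_jobs from the repo (MyApp.Repo is a placeholder for your repo module):

# Look for the workflow-related index among the oban_jobs indexes
MyApp.Repo.query!(
  "SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'oban_jobs'"
)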