We had a small incident on a production system that normally processes jobs as they come in without any problems.
We noticed that a bunch of jobs (20k) had been scheduled that shouldn’t have been. Another developer and I opened Oban Web, clicked around, and then I cancelled the 20,000 jobs from an iex session using Oban.cancel_all/1. So far so good.
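For reference, a cancellation of this kind looks roughly like the sketch below. `Oban.cancel_all/1` takes an Ecto queryable over `Oban.Job`; the queue name and the state filter here are illustrative assumptions rather than the exact query that was run.

```elixir
import Ecto.Query

# Illustrative filter: the queue name and the states are assumptions.
query =
  Oban.Job
  |> where([j], j.queue == "my_queue")
  |> where([j], j.state in ["available", "scheduled"])

# Cancels every matching job, using the default Oban instance.
Oban.cancel_all(query)
```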
The jobs seemed to disappear, and we kept watching what was going on in the Oban Web panel.
I quickly noticed that nothing was happening: the jobs were not being processed. In the UI, the queue in question kept toggling between “paused” and “running”, but there were barely any jobs in the “executing” state.
Finally, the queue disappeared from the list of queues in Oban Web entirely.
I tried restarting the app, Oban itself, and the database. Nothing helped; the issue reappeared very quickly.
When I tried to query the queue status from iex, I got:
Oban.check_queue(queue: :my_queue)
** (exit) exited in: GenServer.call({:via, Registry, {Oban.Registry, {Oban, {:producer, "my_queue"}}}}, :check, 5000)
** (EXIT) time out
(elixir 1.13.4) lib/gen_server.ex:1030: GenServer.call/3
At the same time, CPU usage on our Google Cloud SQL PostgreSQL instance was at 100%, and we were getting a lot of disconnected errors from Postgrex, including from other parts of the app.
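When the database is pegged like this, one way to see which queries are actually running is to look at `pg_stat_activity`. A minimal sketch from iex, assuming a hypothetical Ecto repo named `MyApp.Repo` and a database role that is allowed to see other sessions' query text:

```elixir
# MyApp.Repo is a hypothetical repo module; substitute your own.
sql = """
SELECT pid, state, now() - query_start AS runtime, left(query, 120) AS query
FROM pg_stat_activity
WHERE datname = current_database()
  AND state <> 'idle'
ORDER BY query_start
"""

%{rows: rows} = MyApp.Repo.query!(sql)

# Each row is [pid, state, runtime, query]; the longest-running queries come first.
Enum.each(rows, &IO.inspect/1)
```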
I finally closed the tab with Oban Web and asked the other developer to do the same. The issue resolved itself almost immediately, and the jobs started being executed. Once the queue emptied, CPU usage dropped back down as well.
I think what was happening is that Oban.Queue.Producer was getting stuck, timing out, or crashing as a result of something in Oban Web hammering either the database or the producer itself. At no point did we have more than 25k jobs in the system, so even with the moderately small database instance we run in production (a 6GB one), I don't think that should happen.
Are we missing some indexes that Oban Web needs in order to query the system efficiently? @sorentwo, any ideas?
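One quick way to check is to list the indexes that actually exist on the jobs table and compare them with what the Oban migrations are expected to create. A sketch, assuming the default `oban_jobs` table name (no custom prefix) and a hypothetical `MyApp.Repo`:

```elixir
# MyApp.Repo is a hypothetical repo module; oban_jobs is the default table name.
sql = """
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'oban_jobs'
ORDER BY indexname
"""

%{rows: rows} = MyApp.Repo.query!(sql)

# Print one line per index with its full definition.
for [name, definition] <- rows do
  IO.puts("#{name}: #{definition}")
end
```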