Hi! We have following oban versions:
Oban v2.17.12
Oban.Web v2.9.7
Oban.Pro v1.4.13
Elixir:
erlang 27.0.1
elixir 1.17.2-otp-27
We have 20.8 million of messages in available state, all from same queue.
And those are stuck in that state.
Application restart does not help.
In oban web, we can not load available jobs, because we got connection error:
13:39:59.043 [error] Ch.Connection (#PID<0.12344.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.94175.0> timed out because it queued and checked out the connection for longer than 15000ms
Hi, thank you for reply.
Database instance is on rds, setup up for production. Yes, traffic is too much due to database connection errors.
Actually my question is what does job status available means. Is it for jobs that failed to successfully finish?
There are several possible reasons why jobs won’t run, but it seems specifically due to those timeout errors. Some immediate changes you can make to try and get the system unstuck:
Rebuild the indexes (reindex table oban_jobs), force analyze (vacuum analyze oban_jobs)
Increase the size of your RDS instance (it may be fine for normal usage, but not for a backlog of 20m jobs)
Manually move most of those jobs to a scheduled state to space them out. Something like this could work to reduce the available count down to 100k jobs:
update oban_jobs set state = 'scheduled', scheduled_at = now() + '1 hour'::interval
where id < (select min(id) from oban_jobs where state = 'available') - 100000
Beyond that, here are some other suggestions once the system is moving again:
Upgrade to Oban v2.19 and Pro v1.5+, primarily for the new check_available/1 implementation that optimizes the query to check for available jobs (benchmark info in the linked changelog)
Upgrade to Oban Web v2.10+ (v2.11 if possible), because it uses Oban.Met for much less database impact (especially count estimates)
This doesn’t appear to be the same issue. This issue is from database queries timing out, most likely the staging query that checks for available jobs. Restarting the app can’t fix the performance of that query; it either needs a more powerful DB or improved queries.
Since restarting the app fixed your scenario, it looks more like a pubsub failure. If you’re using the Postgres notifier and postgrex prior to v0.20, there is a known issue with silent disconnects that broke pubusb.
@sorentwo, thank you very much for this advice. We first did 1 and 3, and that helped.
We changed 3. to:
WITH jobs_to_update AS (
SELECT id
FROM oban_jobs
WHERE state = 'available'
LIMIT 100000
)
UPDATE oban_jobs
SET state = 'scheduled',
scheduled_at = now() + INTERVAL '1 hour'
WHERE id IN (SELECT id FROM jobs_to_update);
Because for first query we got UPDATED 0 records.
I have a question for status discarded. When I click on a discarded job, I can see 20/20 retries with status completed. What does it mean completed? That we depleted max number of retirees and that job was not actually done?
In that case, could we also apply advice from point 3?
Also, I checked Oban docs, but could not found what is job lifecycle between states.
We also followed instructions for official scaling guide: