Oban jobs stuck at "scheduled" and then "available" if I move them there manually

@sorentwo, or anyone really, I may need some help with Oban.

My setup:

oban 2.19.4
oban_pro 1.6.2
oban_web 2.11.1

For the past couple of days, some jobs have been getting stuck in “scheduled”. I think this is related to the number of jobs scheduled in a particular queue. I currently have 3.5k jobs in “scheduled”, and for most of them the scheduled_at time has already passed, but they have not executed.

In fact, they did not move from “scheduled” to “available”, and when I click “Run now” in Oban Web, they move to “available” but do not get executed either.

Other queues are executing just fine.
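
One way to confirm the symptom from iex is to count overdue scheduled jobs per queue. This is only a minimal sketch: it assumes the DB.Repo module from the config in this post and that jobs live in the default table behind the Oban.Job schema.

```elixir
import Ecto.Query

# Count jobs that are still "scheduled" although their scheduled_at
# time has already passed, grouped by queue.
now = DateTime.utc_now()

Oban.Job
|> where([j], j.state == "scheduled" and j.scheduled_at <= ^now)
|> group_by([j], j.queue)
|> select([j], {j.queue, count(j.id)})
|> DB.Repo.all()
```

A queue with a large count here, while other queues return nothing, matches the behaviour described above.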

I have already tried reindexing the Oban tables, but the jobs are still stuck in “scheduled”.

Any ideas much appreciated.

My Oban Config:

  defp oban_config do
    [
      log: false,
      repo: DB.Repo,
      engine: Oban.Pro.Engines.Smart,
      plugins: [
        {
          Oban.Pro.Plugins.DynamicPruner,
          state_overrides: [
            cancelled: {:max_age, {5, :days}},
            completed: {:max_age, {5, :days}},
            discarded: {:max_age, {7, :days}}
          ]
        },
        {
          Oban.Pro.Plugins.DynamicCron,
          timezone: "America/Chicago", crontab: []
        },
        {Oban.Plugins.Cron,
         crontab: [
           # 10 crontab entries removed here
         ]},
        Oban.Pro.Plugins.DynamicLifeline,
        Oban.Plugins.Reindexer
      ],
      queues: [
         my_queue1: [limit: 3]
         # 20+ queues with limit 1-10 below
      ]
    ] 
  end

This happened yesterday on one queue, and I rescheduled the jobs (again around 3.5k) by inserting them in smaller batches. Now it’s happening with a different queue with about the same number of scheduled jobs.

I suspect something breaks with a high number of jobs, but honestly 3.5k is not that much.

Related: I managed to “fix” the issue once, but I’m not sure why it worked.

Here is what I did:

  1. From psql, I moved the jobs to a different queue. This did not work; they still were not executing.
  2. From psql, I updated their state to ‘available’. This didn’t help either.
  3. Then another job that normally executes on the other queue was inserted. This triggered all of these available jobs to be executed.

Maybe I have something messed up with pg_notify or similar, but I thought these queues were supposed to do polling too.

I also have quite a few of these in the database logs:

duplicate key value violates unique constraint "oban_jobs_unique_index"

But I suspect these are expected if one uses unique jobs (?). I have thousands of these.

Do these happen to use the chain worker feature?

No, I don’t use chain worker feature.

There’s a possibility this is contributing to the problem if you’re using postgrex prior to v0.20, as there’s a bug with reconnections that makes the notifier get stuck in a disconnected state.

This is almost certainly the reason processing got stuck. Here’s the tl;dr to unstick it with a SQL query:

update oban_jobs set meta = meta - 'uniq_key' where state in ('retryable', 'scheduled') and meta ? 'uniq_key';
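
Before running the update, it may be worth checking how many rows it would touch. The following is a sketch under assumptions: it reuses DB.Repo from the config earlier in the thread and assumes the oban_jobs table is in the default schema (no custom Oban prefix). It only reads data.

```elixir
# Count the rows the UPDATE above would modify, grouped by state.
# Nothing is changed by this query.
sql = """
select state, count(*)
from oban_jobs
where state in ('retryable', 'scheduled')
  and meta ? 'uniq_key'
group by state
"""

%{rows: rows} = Ecto.Adapters.SQL.query!(DB.Repo, sql, [])
rows
```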

This happens because of partial unique states. If you have something like unique: [states: [:available, :executing]], and it doesn’t apply to :retryable or :scheduled, then you can end up in a situation where jobs try to transition from scheduled -> available and there’s a unique conflict. Postgres only raises a single conflict exception at once, and that’s what the engine tries to use to fix the unique issue. However, with enough conflicts, it gets stuck in a loop and the jobs don’t progress.

That’s why the next Oban release has unique “groups”, rather than encouraging people to use individual states (see guides/learning/unique_jobs.md on the main branch of the oban-bg/oban repository on GitHub).

It shows up after Pro v1.5 because Pro uses unique indexes, which actually enforce uniqueness all the time. OSS and older Pro versions used a combination of queries during insert, which wasn’t transactionally safe and made it easy to write broken state combinations.

@sorentwo thank you so much.

I do have postgrex v0.20.0 so that’s not it.

On the unique jobs, I think you’re right to pinpoint that as the cause. I have this in my jobs: states: [:available, :scheduled, :executing]

I can run the command to unblock the jobs, but would the long-term fix be adding :retryable to the states list, or is a different fix needed?

You’ve got it! That’s the correct long term fix.
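
For reference, the change discussed above could look roughly like this in a worker. This is an illustrative sketch, not the poster’s actual code: the module name MyApp.ExampleWorker is hypothetical, the queue name is borrowed from the config earlier in the thread, and all other unique options (such as period or keys) are left at their defaults and should match whatever the real worker already uses.

```elixir
defmodule MyApp.ExampleWorker do
  # Hypothetical worker; only the :states list is the point here.
  # :retryable is added so a retrying job is still covered by the
  # same uniqueness rule as available, scheduled and executing jobs.
  use Oban.Worker,
    queue: :my_queue1,
    unique: [states: [:available, :scheduled, :executing, :retryable]]

  @impl Oban.Worker
  def perform(%Oban.Job{}) do
    # Placeholder body; the real work goes here.
    :ok
  end
end
```

Jobs that were inserted before the change and are already stuck would still need the one-off SQL update given earlier in the thread.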
