1.5 Chain worker backing up in 'scheduled' state

Hey folks! We’ve switched to using the new chain worker, but we’re ending up in a situation where the jobs are all landing in a 'scheduled' state with a scheduled date way out in the future.

What’s going on with this? How can we unstick it? We’re at 600k+ jobs sitting in 'scheduled' right now. For the moment I’ve been running:

update oban_jobs
set scheduled_at = now(),
    state = 'available',
    meta = jsonb_set(meta, '{on_hold}', 'false', true)
where id in (
  select distinct on (args['shipment_id']) id
  from oban_jobs
  where queue = 'firehose' and state = 'scheduled'
  order by args['shipment_id'], args['event_id'] asc
)

to basically get the earliest event for each shipment and force enqueue the job. This seems pretty hacky though.

EDIT: This doesn’t actually work all that well to unstick things. Not sure what’s going on, but the runtimes on the jobs are quite high (~2 seconds for a ~40ms push to Cloud Pub/Sub), and the number of running jobs at any moment seems low.

EDIT 2: Forgot to remove the global partitioning config from the queue. Problem solved!
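
For posterity, the culprit was a partitioned global limit on the queue that overlapped with what the chain worker already does. Roughly this shape, going from memory of the SmartEngine partition options (the app name and limit values here are placeholders, not our real config):

# config/config.exs: illustrative sketch only. With the chain worker already
# serializing per shipment, the partitioned global_limit below is redundant
# and was what left jobs parked in 'scheduled' for us.
config :my_app, Oban,
  engine: Oban.Pro.Engines.Smart,
  queues: [
    # before: a partitioned global limit on the same key the chain uses
    # firehose: [
    #   local_limit: 10,
    #   global_limit: [allowed: 1, partition: [fields: [:args], keys: [:shipment_id]]]
    # ],
    # after: a plain concurrency limit; the chain worker handles the ordering
    firehose: 10
  ]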


Hey ya! Wen Bilson! :wink:

This is akin to a test that a cheeky teacher gives where the final question is:
Don’t answer a single question on this test.

SO glad you solved your issue! Global partitioning and its interplay with the chain worker is definitely the issue.

Hey @sorenone!

While this did generally clear out the queue, we’re still getting a bit of a backup for reasons that aren’t clear to me. Given a specific job that’s in this on-hold / scheduled-far-in-the-future state, is there any way to query what it’s waiting on?

It is possible to query it. I don’t want to say, “it could be a bug…” @sorentwo :eyes:
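
Something along these lines should surface it. Purely a sketch, assuming your chains partition on shipment_id the way your unsticker query does, and that the held jobs are the ones with meta.on_hold set (ChainDebug and blocking_jobs are made-up names, not an Oban Pro API):

# Hypothetical helper: for a given held job, list the other jobs on the same
# shipment that haven't finished yet, i.e. whatever the chain is presumably
# still waiting on. Assumes args carry "shipment_id" and "event_id" as above.
defmodule MyApp.ChainDebug do
  def blocking_jobs(repo, held_job_id) do
    sql = """
    select j.id, j.state, j.args->'event_id' as event_id, j.attempted_at
    from oban_jobs held
    join oban_jobs j
      on j.args->'shipment_id' = held.args->'shipment_id'
     and j.id <> held.id
    where held.id = $1
      and j.state not in ('completed', 'cancelled', 'discarded')
    order by j.args->'event_id'
    """

    repo.query!(sql, [held_job_id])
  end
end

Run it from iex as MyApp.ChainDebug.blocking_jobs(MyApp.Repo, job_id). If it comes back empty while the job is still marked on_hold, that would suggest the job is stuck rather than legitimately waiting.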

It can happen from race conditions but it shouldn’t be a frequent occurrence. Will investigate.

If there’s some way to sync up with you all on this, that might be good. We’re rolling back to the previous chain worker because we can’t get this to stop backing up.

We’ll connect and figure out what’s going on. We’ll email you today to get a time set.

Hate to hear that you had to roll back but we understand.

As a quick follow-up for posterity: we did not end up rolling back. There’s a race condition in Oban with the chain worker where the query to find the next job to execute can race with a job being inserted at that moment. We have the unsticker code above running every 10 seconds or so, which keeps things moving on our end. The Oban team is investigating solutions to the race condition.
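
In case it helps anyone else hitting the same thing, the "every 10 seconds" bit is nothing fancier than a small GenServer firing the unsticker query. A trimmed-down sketch (module names are made up; swap in your own repo, queue, and args):

# Periodically re-enqueues the earliest held job per shipment as a stopgap
# for the chain-worker race described above. A sketch, not production code.
defmodule MyApp.ChainUnsticker do
  use GenServer

  @interval :timer.seconds(10)

  @unstick_sql """
  update oban_jobs
  set scheduled_at = now(),
      state = 'available',
      meta = jsonb_set(meta, '{on_hold}', 'false', true)
  where id in (
    select distinct on (args['shipment_id']) id
    from oban_jobs
    where queue = 'firehose' and state = 'scheduled'
    order by args['shipment_id'], args['event_id'] asc
  )
  """

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    schedule()
    {:ok, nil}
  end

  @impl true
  def handle_info(:unstick, state) do
    MyApp.Repo.query!(@unstick_sql)
    schedule()
    {:noreply, state}
  end

  defp schedule, do: Process.send_after(self(), :unstick, @interval)
end

It just sits in the application supervision tree next to Oban.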


This issue should be resolved in Pro v1.5.5 (which seems likely, based on outside feedback from @benwilson512). It will also be fixed in the upcoming Pro v1.6.0-rc.6 release.
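
For anyone following along, picking up the fix should just be a dependency bump, roughly like this (assumes the private "oban" hex repo from the Oban Pro install instructions is already configured):

# mix.exs: illustrative only
defp deps do
  [
    {:oban_pro, "~> 1.5.5", repo: "oban"}
  ]
end

followed by mix deps.update oban_pro.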

Thanks for helping us get this sorted out!


Can confirm, everything has been running happily all day. Thanks for the quick turnaround!
