I’ve got some jobs that are “stuck”. E.g., one was scheduled 20 hours ago and is still in `available` on attempt 0 of 15. It’s part of a workflow and has deps, but searching `meta.workflow_id:0198ec97-ad63-783d-9c6b-12f1c6a15728` across all states finds only this stuck job, in `available`.
The queue has config `global_limit: [allowed: 1, partition: [fields: [:args], keys: [:regression_model_id]]], local_limit: 5`. Yesterday I got some of these jobs to run by temporarily bumping `allowed` to 2, but that’s not very safe. This is on oban_pro 1.6.3, oban 2.20.1, oban_met 1.0.3, and oban_web 2.11.4, with `notifier: Oban.Notifiers.PG` in the config.
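For reference, here's a sketch of how that partitioned limit might look in full config. The queue name `:models` and the app name are placeholders; the limits mirror the settings described above, and the Smart engine is required for `global_limit`:

```elixir
# In config/config.exs (or runtime.exs). Queue name :models is hypothetical.
import Config

config :my_app, Oban,
  engine: Oban.Pro.Engines.Smart,
  notifier: Oban.Notifiers.PG,
  queues: [
    models: [
      # At most one running job cluster-wide per regression_model_id
      global_limit: [
        allowed: 1,
        partition: [fields: [:args], keys: [:regression_model_id]]
      ],
      # Up to five concurrent jobs on any single node
      local_limit: 5
    ]
  ]
```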
The Oban version looks good, but a fix in Pro v1.6.4 may help with the stuck processing.
The fact that there’s a job in a workflow with deps that is marked `available` means all of its deps have already completed (or were cancelled/discarded, depending on your config).
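One way to sanity-check that is to list every job in the workflow and its state directly from the database. This is just a sketch using a raw `meta` fragment against the `Oban.Job` schema, not a Pro-specific helper; `MyApp.Repo` is a placeholder for your repo:

```elixir
import Ecto.Query

workflow_id = "0198ec97-ad63-783d-9c6b-12f1c6a15728"

# Every job in the workflow, regardless of state, with its dep name
MyApp.Repo.all(
  from j in Oban.Job,
    where: fragment("?->>'workflow_id' = ?", j.meta, ^workflow_id),
    select: {j.id, j.state, j.meta["name"]}
)
```

If this returns only the single `available` job, the deps it was waiting on are no longer in the table at all, which narrows the search.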
I just ran into this, where jobs queued yesterday sat in the queue untouched. The root problem looked like the `partition_key` on the job records was nil. I'm not sure how it could get into this scenario. Running Oban Pro 1.6.2.
Was this following an upgrade from v1.5, and if so, did you upgrade to v1.5.4 first to have it pre-generate the partition keys as suggested in the upgrade guide?
No, I had done the upgrade a while ago, actually. I resolved it by turning off the global limit so it could process the jobs with no `partition_key` (I wasn't sure how else to get through it). It also happened locally with one job, but I have yet to figure out what scenario triggered it.
@sorentwo actually, before I upgrade: even new jobs added to the queue have a null `partition_key`. Is there anything I should debug before making any changes?
So these records in particular are actually created via the AshOban extension
I just tried creating a new job in that queue without using the AshOban extension and it does have a `partition_key`. Coincidentally, creating one via the AshOban extension using `AshOban.run_triggers(record, trigger)` now gets a `partition_key` as well.
So it seems like it's working now and I'll try the upgrade, but I'm wondering why it would even get into this scenario in the first place?
We run multiple instances of our application; a few instances have the queues configured to run on them, and a few instances don’t, but we still start up Oban in the application with queues: []
When the job is inserted by a non-worker node there is no `partition_key`, but if I insert a job via a worker node (one with queues) there is a `partition_key`.
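For context, here's a sketch of the kind of split we run. The `OBAN_WORKER` env var and app name are our conventions, not anything Oban requires; every node still boots Oban so inserts and the notifier work:

```elixir
# In config/runtime.exs. Worker nodes set OBAN_WORKER=true;
# web-only nodes leave it unset and start Oban with no queues.
import Config

queues =
  if System.get_env("OBAN_WORKER") == "true" do
    [
      models: [
        global_limit: [
          allowed: 1,
          partition: [fields: [:args], keys: [:regression_model_id]]
        ],
        local_limit: 5
      ]
    ]
  else
    []
  end

config :my_app, Oban,
  notifier: Oban.Notifiers.PG,
  queues: queues
```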
That’s strange. That’s a common setup, and Pro is designed to check for partition information on other node types.
Will you please share the result of Oban.config() on the web and worker node? You can omit all the plugins and all of the queue information aside from the partitioned one causing trouble.
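For anyone following along, one way to pull just the relevant parts on each node (assuming the default `Oban` instance name) is:

```elixir
# Run in a remote console on each node; redact queue details as needed.
Oban.config()
|> Map.take([:engine, :notifier, :node, :queues])
```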
Unfortunately I had to get production going so I went with the upgrade and it has things working again.
I can reproduce this locally, though: if I start up an iex shell without my OBAN_WORKER flag enabled, with no other instances running, inserting a job into a partitioned queue yields no `partition_key`. Would the config of this instance still help?