I’m relatively new to Oban, and I’m trying to get a grasp of guaranteed execution for jobs, ensuring they definitely execute across node restarts/shutdowns/cluster shuffles etc.
I was hoping there would be an easy way to do this accurately. I understand the DynamicLifeline plugin may be a good solution, but I don’t have access to Oban Pro.
The regular Oban Lifeline plugin seems to handle this with a timeout: any job stuck in executing for longer than the rescue window gets set back to available.
My problem with this is that it scales with the maximum execution time of a job. If your jobs might run for 30 minutes, you suddenly have to set the rescue timeout to 35 minutes or so, which means orphaned jobs could take a very long time to restart.
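To make that concrete, this is roughly the configuration I mean (app and repo names are placeholders, and I believe `rescue_after` is the relevant option):

```elixir
# config/config.exs -- illustrative values only
config :my_app, Oban,
  repo: MyApp.Repo,
  queues: [default: 10],
  plugins: [
    # Jobs stuck in `executing` longer than this window get rescued back to
    # `available`, so it has to exceed the longest expected job runtime.
    {Oban.Plugins.Lifeline, rescue_after: :timer.minutes(35)}
  ]
```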
Assuming persistent node IDs and rolling updates where the old pod is deleted before the new one starts (something like a Kubernetes StatefulSet), why can’t the application simply query, on startup, for any jobs left in executing for the current node ID and set them back to available?
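Something like this is what I have in mind, as a rough sketch only (the module name is made up, and it assumes the node name Oban stores as the first element of `attempted_by` matches what the restarted instance computes for itself):

```elixir
defmodule MyApp.ObanRescuer do
  @moduledoc """
  Rough sketch: on startup, rescue jobs a previous instance of *this* node
  left in `executing`. Assumes persistent node IDs and that the old
  instance is guaranteed to be dead before this runs.
  """

  import Ecto.Query

  def rescue_orphaned_jobs(repo \\ MyApp.Repo) do
    # Oban records the executing node as the first element of `attempted_by`;
    # this assumes the Oban `node` option matches what we derive here.
    node_name = to_string(node())

    {count, _} =
      repo.update_all(
        from(j in Oban.Job,
          where: j.state == "executing",
          where: fragment("?[1] = ?", j.attempted_by, ^node_name)
        ),
        set: [state: "available"]
      )

    count
  end
end
```

The idea would be to call this from `Application.start/2` before the Oban supervisor starts, so nothing on this node can legitimately be executing yet when it runs.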
I’m wondering if there are some complexities or race conditions I may not be thinking about.