For Oban jobs that last several minutes, maybe even an hour, what is the best practice for handling them during queue shutdown, which happens for instance when pushing a new version to prod?
Giving them a few seconds of grace is pointless. Ideally they should be killed and re-enqueued, maybe with a chance to clean up if need be.
I wonder why those jobs are running for so long and why you are stopping the queues in the middle of an operation. Maybe Oban isn’t a good fit for the task?
Anyway, you can trap exit signals from the queue supervisor to clean up and terminate promptly. To do that, though, the job process must be able to call receive frequently.
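Something along these lines is the shape of it. This is only a sketch: the helper functions are placeholders, and exactly which exit signal (if any) reaches the job process during a graceful stop, and whether the snooze gets recorded before the grace period expires, depends on Oban’s internals.

```elixir
defmodule MyApp.Workers.LongRunner do
  use Oban.Worker, queue: :long_running

  @impl Oban.Worker
  def perform(%Oban.Job{args: args}) do
    # Turn exit signals from linked processes into messages we can
    # check for between chunks of work.
    Process.flag(:trap_exit, true)

    do_work(args, work_chunks(args))
  end

  defp do_work(_args, []), do: :ok

  defp do_work(args, [chunk | rest]) do
    receive do
      # An :EXIT message means something linked to us is going down;
      # clean up and snooze so the job runs again after the restart.
      {:EXIT, _from, _reason} ->
        cleanup(args)
        {:snooze, 5}
    after
      0 ->
        process_chunk(chunk)
        do_work(args, rest)
    end
  end

  # work_chunks/1, process_chunk/1, and cleanup/1 stand in for real work.
  defp work_chunks(_args), do: []
  defp process_chunk(_chunk), do: :ok
  defp cleanup(_args), do: :ok
end
```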
There was a thread in the Oban Slack channel similar to this one.
The biggest problem I was facing with these jobs is that they would be stuck in the executing state when the queue shut down, because they didn’t have time to finish.
I don’t think there’s a good answer here. I’ve more or less decided to just not use long-running jobs, because I haven’t seen a good way to avoid that outside of maybe the Oban Pro Lifeline plugin.
Usually the queue should not be interrupted, because those jobs are infrequent, but a push to prod can always happen, and then the new version of the app will gracefully shut down and terminate all the queues with it.
I’d like all the guarantees and traceability of Oban regardless.
Maybe a better idea would be to run that job on a different node, which could keep working while the main Phoenix app restarts?
It’s perfectly normal for jobs to take a long time to run. You usually can’t predict when the system is going to shut down; that’s just part of running production systems. It’s also one of the reasons to use a persistent background job system: the jobs won’t disappear after a shutdown.
There are two mechanisms in Oban to deal with this: the shutdown grace period and the lifeline plugins. Both are outlined in the Ready for Production guide.
Pro’s DynamicLifeline is much better at rescuing orphans and restarting them, but the regular Lifeline will get the job done too.
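For reference, a minimal config sketch; the repo, queue names, and timings are illustrative, not a recommendation:

```elixir
config :my_app, Oban,
  repo: MyApp.Repo,
  # Give running jobs up to two minutes to finish before the queues stop.
  shutdown_grace_period: :timer.minutes(2),
  plugins: [
    # Rescue jobs left in the executing state by a crash or hard shutdown.
    # rescue_after must exceed the longest expected job runtime.
    {Oban.Plugins.Lifeline, rescue_after: :timer.minutes(90)}
  ],
  queues: [default: 10, long_running: 2]
```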
Doesn’t the rescue plugin take a while to do this, though?
For example, if I have a job that I expect to run for an hour, don’t I have to set rescue_after to one hour plus some small amount?
So if a job expected to take an hour starts and the system dies and comes back up, the regular Lifeline plugin won’t rescue it for roughly an hour, right? Effectively doubling the execution time in the worst-ish case.
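To put rough numbers on it (assuming the OSS Lifeline plugin and a made-up 70-minute rescue_after):

```elixir
# Worst case for a job that normally runs ~60 minutes:
#
#   rescue_after: 70 min       # must exceed the job's normal runtime
#   crash just before finish   # ~60 min of work lost
#   Lifeline rescues the job   # ~70 min after it was first attempted
#   full re-run                # ~60 min
#
# so roughly 130 minutes from the original start instead of 60.
config :my_app, Oban,
  plugins: [{Oban.Plugins.Lifeline, rescue_after: :timer.minutes(70)}]
```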
There are some good suggestions already, but I also want to add that you could look at breaking these long-running jobs down into workflows or a collection of smaller jobs. That way you can save intermediary progress, and even if you have to kill a job you don’t have to start from the beginning.
I don’t have Oban Pro, so I just created a self-inserting job (sketched below). It processes the same big pile of data, but in batches of N, and at the end of the current batch it inserts another job to process the next N.
I’ve done this in 5 programming languages, Elixir included, and to me it’s the only sensible way to make sure you never lose progress across 1M+ jobs, some of which can take minutes, while using free-as-in-beer and/or non-“enterprise” software (free or not). The main thing is to make sure you never enter an infinite loop, which is achieved via deduplication, always saving intermediary progress, and of course proper job deadlines.
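Roughly the shape of it; the module name, batch size, and MyApp.Data.process_batch/2 are made up for illustration:

```elixir
defmodule MyApp.Workers.BatchWorker do
  @moduledoc "Processes a big dataset in batches of N, re-inserting itself for the next batch."

  use Oban.Worker,
    queue: :batches,
    # Uniqueness guards against the same batch being enqueued twice,
    # which is the deduplication mentioned above.
    unique: [period: 300, fields: [:worker, :args]]

  @batch_size 1_000

  @impl Oban.Worker
  def perform(%Oban.Job{args: args}) do
    offset = Map.get(args, "offset", 0)

    # process_batch/2 is a stand-in for whatever handles one batch and
    # reports whether more data remains.
    case MyApp.Data.process_batch(offset, @batch_size) do
      :more ->
        # Persist progress by enqueueing the next batch before returning.
        %{"offset" => offset + @batch_size}
        |> new()
        |> Oban.insert!()

        :ok

      :done ->
        :ok
    end
  end
end
```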
That’s a great approach whether you have Pro or not. It makes it much easier to track progress and not lose work after a restart.
Not all tasks can be subdivided. For example, CPU/GPU intensive tasks like video encoding, generating archives, or running complex models can only be broken down so far.
We’ve also focused on the normal shutdown flow here, but plenty of situations cause nodes or queues to crash. Unexpected load, network traffic, sub-optimal queries, missing indexes, and so on can crash a queue and leave jobs hanging.
Some safeguard is needed to recover orphans, even if you’ve modeled your jobs to avoid them during standard operation.
Sadly true. I remember a thread here not long ago about Oban jobs generating PDFs; the OP asked whether those could be broken down further into one Oban job per PDF page. I don’t think there was a satisfying answer (though my memory might fail me).
I find it a good example in relation to the topic here.