For Oban jobs that last several minutes, maybe even an hour, what is the best practice for handling them during queue shutdown, which happens for instance when pushing a new version to prod?
Giving them a few seconds of grace is pointless. Ideally they should be killed and re-enqueued, maybe with a chance to clean up if need be.
I wonder why those jobs are running for so long and why you are stopping the queues in the middle of an operation. Maybe Oban isn’t a good fit for the task?
Anyway, you can trap exit signals from the queue supervisor to clean up and terminate promptly. To do that, though, the job process must be able to call receive frequently.
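Something along these lines is the shape of it. This is only a sketch: the helper functions are placeholders, and exactly which exit signal (if any) reaches the job process during a graceful stop, and whether the snooze gets recorded before the grace period expires, depends on Oban’s internals.

```elixir
defmodule MyApp.Workers.LongRunner do
  use Oban.Worker, queue: :long_running

  @impl Oban.Worker
  def perform(%Oban.Job{args: args}) do
    # Turn exit signals from linked processes into messages we can
    # check for between chunks of work.
    Process.flag(:trap_exit, true)

    do_work(args, work_chunks(args))
  end

  defp do_work(_args, []), do: :ok

  defp do_work(args, [chunk | rest]) do
    receive do
      # An :EXIT message means something linked to us is going down;
      # clean up and snooze so the job runs again after the restart.
      {:EXIT, _from, _reason} ->
        cleanup(args)
        {:snooze, 5}
    after
      0 ->
        process_chunk(chunk)
        do_work(args, rest)
    end
  end

  # work_chunks/1, process_chunk/1, and cleanup/1 stand in for real work.
  defp work_chunks(_args), do: []
  defp process_chunk(_chunk), do: :ok
  defp cleanup(_args), do: :ok
end
```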
There was a thread in the Oban Slack channel similar to this one.
The biggest problem I was facing with these jobs is that they would be stuck in the executing state when the queue shut down, because they didn’t have time to finish.
I don’t think there’s a good answer here. I’ve more or less decided to just not use long-running jobs, because I haven’t seen a good way to avoid that outside of maybe the Oban Pro Lifeline plugin.
Usually the queue should not be interrupted, because those jobs are infrequent, but a push to prod can always happen, and then the new version of the app will gracefully shut down and terminate all the queues with it.
I’d like all the guarantees and traceability of Oban regardless.
Maybe a better idea would be to run that job on a different node, which could keep working while the main Phoenix app restarts?
It’s perfectly normal for jobs to take a long time to run. You usually can’t predict when the system is going to shut down; that’s just part of running production systems. It’s also one of the reasons to use a persistent background job system: the jobs won’t disappear after a shutdown.
There are two mechanisms in Oban to deal with this: the shutdown grace period and the lifeline plugins. Both are outlined in the Ready for Production guide.
Pro’s DynamicLifeline is much better at rescuing orphans and restarting them, but the regular Lifeline will get the job done too.
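For reference, a minimal config sketch; the repo, queue names, and timings are illustrative, not a recommendation:

```elixir
config :my_app, Oban,
  repo: MyApp.Repo,
  # Give running jobs up to two minutes to finish before the queues stop.
  shutdown_grace_period: :timer.minutes(2),
  plugins: [
    # Rescue jobs left in the executing state by a crash or hard shutdown.
    # rescue_after must exceed the longest expected job runtime.
    {Oban.Plugins.Lifeline, rescue_after: :timer.minutes(90)}
  ],
  queues: [default: 10, long_running: 2]
```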
Doesn’t the rescue plugin take a while to do this, though?
For example, if I have a job that I expect to run for an hour, don’t I have to set rescue_after to one hour plus some small amount?
So if a job expected to take an hour starts and the system dies and comes back up, the regular Lifeline plugin won’t rescue it for roughly an hour, right? Effectively doubling the execution time in the worst-ish case.
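To put rough numbers on it (assuming the OSS Lifeline plugin and a made-up 70-minute rescue_after):

```elixir
# Worst case for a job that normally runs ~60 minutes:
#
#   rescue_after: 70 min       # must exceed the job's normal runtime
#   crash just before finish   # ~60 min of work lost
#   Lifeline rescues the job   # ~70 min after it was first attempted
#   full re-run                # ~60 min
#
# so roughly 130 minutes from the original start instead of 60.
config :my_app, Oban,
  plugins: [{Oban.Plugins.Lifeline, rescue_after: :timer.minutes(70)}]
```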
There are some good suggestions already, but I also want to add that you could look at breaking these long-running jobs down into workflows or a collection of smaller jobs. That way you can save intermediary progress, and even if you have to kill a job you don’t have to start from the beginning.
I don’t have Oban Pro, so I just created a self-inserting job (sketched below). It processes the same big pile of data, but in batches of N, and at the end of the current batch it inserts another job to process the next N.
I’ve done this in 5 programming languages, Elixir included, and to me it’s the only sensible way to make sure you never lose progress across 1M+ jobs, some of which can take minutes, while using free-as-in-beer and/or non-“enterprise” software (free or not). The main thing is to make sure you never enter an infinite loop, which is achieved via deduplication, always saving intermediary progress, and of course proper job deadlines.
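Roughly the shape of it; the module name, batch size, and MyApp.Data.process_batch/2 are made up for illustration:

```elixir
defmodule MyApp.Workers.BatchWorker do
  @moduledoc "Processes a big dataset in batches of N, re-inserting itself for the next batch."

  use Oban.Worker,
    queue: :batches,
    # Uniqueness guards against the same batch being enqueued twice,
    # which is the deduplication mentioned above.
    unique: [period: 300, fields: [:worker, :args]]

  @batch_size 1_000

  @impl Oban.Worker
  def perform(%Oban.Job{args: args}) do
    offset = Map.get(args, "offset", 0)

    # process_batch/2 is a stand-in for whatever handles one batch and
    # reports whether more data remains.
    case MyApp.Data.process_batch(offset, @batch_size) do
      :more ->
        # Persist progress by enqueueing the next batch before returning.
        %{"offset" => offset + @batch_size}
        |> new()
        |> Oban.insert!()

        :ok

      :done ->
        :ok
    end
  end
end
```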
That’s a great approach whether you have Pro or not. It makes it much easier to track progress and not lose work after a restart.
Not all tasks can be subdivided. For example, CPU/GPU intensive tasks like video encoding, generating archives, or running complex models can only be broken down so far.
We’ve also focused on the normal shutdown flow here, but plenty of situations cause nodes or queues to crash. Unexpected load, network traffic, sub-optimal queries, missing indexes, and so on can crash a queue and leave jobs hanging.
Some safeguard is needed to recover orphans, even if you’ve modeled your jobs to avoid them during standard operation.
Sadly true. I remember a thread here not long ago about Oban jobs generating PDFs; the OP asked whether those could be broken down further into one Oban job per PDF page. I don’t think there was a satisfying answer (though my memory might fail me).
I find it a good example in relation to the topic here.