Oban: why are failed jobs pruned?

thojanssens1 · November 17, 2022, 9:16am

When a job fails, I expected to find the full error in the database. However, jobs that are discarded (e.g. after a failure once the max attempts are reached) are pruned. From the documentation:

Pruning is only applied to jobs that are completed, cancelled or discarded. It’ll never delete a new job, a scheduled job or a job that failed and will be retried.

Source: Oban v2.9.2 documentation (I’m using Oban 2.9.2 at the moment).

I was wondering why that choice was made: we want to find the errors, but they get deleted by the pruner.

I have such a case where something wrong happened, and I lost all trace of the error.

So first question: why isn’t there an option to keep discarded jobs because of failure in the table (and why wouldn’t that even be the default)?
Second question: what is the best way to keep track of errors happening in Oban workers?

csadewa · November 17, 2022, 9:36am

Probably because discarded is a terminal state (like completed and cancelled), that’s why it is deleted by pruning.

isn’t there an option to keep discarded jobs because of failure in the table (and why wouldn’t that even be the default)?

Most likely because there hasn’t been a pull request for the feature yet. The Oban project seems to welcome contributions; you can reach the maintainers here:

Request an invitation and join the oban channel on [Slack](https://elixir-lang.slack.com/)

Ask questions and discuss Oban on the Elixir Forum

cc @sorentwo

thojanssens1 · November 17, 2022, 10:09am

I rather think that I didn’t get the basic idea behind handling failed jobs. Oban was designed this way for a reason and is already very mature, so I don’t think it’s just an oversight waiting for a PR; it’s more likely a lack of understanding of Oban’s error handling on my side.

For me it seems natural to have an option to keep failed jobs in the database for debugging purposes, but apparently that’s not the choice made.

csadewa · November 17, 2022, 11:16am

Based on the Oban.Plugins.Pruner docs (Oban v2.13.5, hexdocs.pm), it seems Oban offers a DynamicPruner plugin as part of the paid Oban Pro package, which enables more granular, custom behaviour:

This plugin treats all jobs the same and only retains by time. To retain by length or provide custom rules for specific queues, workers and job states see the DynamicPruner plugin in Oban Pro.

Alternatively, thanks to Oban’s plugin design, you could extend the Pruner or write a similar plugin that handles your custom behaviour.
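
Since the built-in Pruner retains jobs by time, one low-effort way to keep discarded jobs around for debugging is to raise its `max_age`. A minimal config sketch; `:my_app`, `MyApp.Repo` and the 30-day window are placeholders chosen for illustration, not values from this thread:

```elixir
# config/config.exs
import Config

# Assumption: :my_app and MyApp.Repo stand in for your own app and repo.
config :my_app, Oban,
  repo: MyApp.Repo,
  queues: [default: 10],
  plugins: [
    # max_age is in seconds. Completed, cancelled and discarded jobs
    # older than 30 days are pruned; younger ones stay in oban_jobs.
    {Oban.Plugins.Pruner, max_age: 60 * 60 * 24 * 30}
  ]
```

Note that this applies to all terminal states alike, so completed jobs are kept for the same period and the table grows accordingly.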

sorentwo · November 17, 2022, 6:43pm

You’re correct that this is by design. By default, each job has 20 attempts, and the standard backoff extends to 12 days. You can increase the number of attempts or tweak the backoff to extend that backoff period. If a job has failed 20 times over the course of nearly two weeks, chances are further attempts won’t succeed either.
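
To get a feel for where the "nearly two weeks" figure comes from, here is a rough back-of-the-envelope calculation. It assumes a delay of `15 + 2^attempt` seconds after each failed attempt and ignores jitter; the exact formula may differ between Oban versions, and `BackoffEstimate` is a made-up module name:

```elixir
defmodule BackoffEstimate do
  @moduledoc """
  Rough estimate of how long a job keeps retrying before it is discarded.
  Assumes a delay of 15 + 2^attempt seconds after each failed attempt and
  ignores jitter; check the backoff used by your Oban version.
  """

  # Delay in seconds scheduled after the given failed attempt.
  def delay(attempt), do: 15 + Integer.pow(2, attempt)

  # Total seconds spent waiting between attempts. The final attempt is
  # not followed by a retry, so only max_attempts - 1 delays are summed.
  def total_seconds(max_attempts) when max_attempts > 1 do
    1..(max_attempts - 1)
    |> Enum.map(&delay/1)
    |> Enum.sum()
  end

  def total_seconds(_max_attempts), do: 0

  def total_days(max_attempts), do: total_seconds(max_attempts) / 86_400
end

# Prints 12.1 under the assumed formula.
IO.puts(Float.round(BackoffEstimate.total_days(20), 1))
```

Raising `max_attempts` on the worker, or overriding the worker's `backoff/1` callback, stretches that window further.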

As for keeping track of errors: either through logging (https://hexdocs.pm/oban/Oban.html#module-instrumentation-and-logging) or telemetry-powered error reporting (see the Oban v2.17.1 docs).
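
A telemetry handler for this could look roughly like the sketch below. `MyApp.ObanErrorLogger` is a hypothetical module name, and the metadata keys are an assumption: the failure is read from `:reason` with a fallback to `:error`, since the key has differed between Oban releases. Check the telemetry docs for your version.

```elixir
defmodule MyApp.ObanErrorLogger do
  # Hypothetical module: logs every failed job attempt so the error
  # survives even after the job row has been pruned.
  require Logger

  # Call once at startup, e.g. from Application.start/2.
  def attach do
    :telemetry.attach(
      "oban-job-errors",
      [:oban, :job, :exception],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event([:oban, :job, :exception], _measurements, meta, _config) do
    job = meta.job
    # Assumption: the failure is under :reason (older releases used :error).
    reason = Map.get(meta, :reason) || Map.get(meta, :error)

    Logger.error(
      "Oban job #{job.id} (#{job.worker}) failed on attempt " <>
        "#{job.attempt}/#{job.max_attempts}: #{inspect(reason)}"
    )
  end
end
```

The same handler shape works for forwarding to an error tracker instead of the logger.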

thojanssens1 · November 23, 2022, 8:29am

After further reflecting on this, I thought that it might still be interesting to have a simple option to leave discarded jobs in the DB.

Some apps may not be actively monitored, and failed jobs may stay for several weeks in the db unnoticed (until max attempt is reached). When max attempt is reached, the jobs are lost forever. Maybe we really need to send that data to the customer, or whatever the job was supposed to do, but it’s just gone.

What are your thoughts on that?

benwilson512 · November 23, 2022, 8:41am

The jobs should never be the canonical store of business state. If an invoice or data needs to go to the customer, that need should be modeled and stored in your regular tables. You lose the reason the job failed with pruning, but you shouldn’t be losing the need or intent.

Ultimately though because oban_jobs is just a table, you can sort of do whatever you want here. You could not run the built in pruner and write / use your own pruner, you could copy failed jobs over to a different table at some interval, there’s a lot of options.
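
As a rough illustration of the "copy failed jobs to a different table" idea, here is an Ecto sketch. Everything app-specific in it is hypothetical: `MyApp.Repo`, the `discarded_job_archive` table (which would need its own migration, with columns matching the selected keys and a unique index on `job_id`), and the module name. It is untested against a real database.

```elixir
defmodule MyApp.DiscardedJobArchiver do
  # Sketch only: MyApp.Repo and the "discarded_job_archive" table are
  # hypothetical and not part of Oban.
  import Ecto.Query

  alias MyApp.Repo

  def archive do
    rows =
      Oban.Job
      |> where([j], j.state == "discarded")
      |> select([j], %{
        job_id: j.id,
        worker: j.worker,
        queue: j.queue,
        args: j.args,
        errors: j.errors,
        discarded_at: j.discarded_at
      })
      |> Repo.all()

    # on_conflict: :nothing skips jobs that were already copied,
    # assuming a unique index on job_id.
    Repo.insert_all("discarded_job_archive", rows, on_conflict: :nothing)
  end
end
```

This would have to run more often than the pruner's retention window, otherwise discarded jobs can be deleted before they are copied.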

thojanssens1 · November 23, 2022, 8:48am

Indeed I can do anything but the question is more about best practices and reasoning.

Makes a lot of sense, thank you.