Storing error reason for when job retries

Hey! We’re trying to handle a retry where we’d like to attach some metadata about why the job is retrying (the reason it failed). How would we most easily go about this?
We’re thinking of creating a database table outside of Oban to store that associated information, but wanted to check if there’s a preferred way of doing this.

We found out from #oban in the Elixir Slack that the error array can be used for this purpose, but from the error array we don’t get the reason, just an error string (which includes the reason). We’d rather have the reason. Is there an easy way to do this?

Thanks!

The original, unformatted reason is available in the job’s unsaved_error field. That field is populated with the kind, reason, and stacktrace of an exception or crash after execution. That’s what is formatted and stored in the errors array, but it isn’t stored in a raw form.

You can retrieve and store the raw unsaved_error with a telemetry handler:

def handle_event([:oban, :job, :exception], _timing, %{job: job}, _conf) do
  reason = job.unsaved_error.reason
  ...
end
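
For context, here is a minimal sketch of how such a handler might be wired up. The module name `MyApp.ObanErrorHandler`, the handler id, and the `persist/3` placeholder are assumptions for illustration, not part of Oban; it also assumes the `:telemetry` dependency that Oban already pulls in.

```elixir
defmodule MyApp.ObanErrorHandler do
  # Hypothetical module; call attach/0 once, e.g. from your application's start/2.
  def attach do
    :telemetry.attach(
      "my-app-oban-job-exception",
      [:oban, :job, :exception],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  # unsaved_error holds the raw kind, reason, and stacktrace after execution.
  def handle_event([:oban, :job, :exception], _measurements, %{job: job}, _config) do
    %{kind: kind, reason: reason} = job.unsaved_error
    persist(job.id, kind, reason)
  end

  # Placeholder: replace with whatever storage you choose.
  # Here it only returns the values so the sketch stays self-contained.
  defp persist(job_id, kind, reason) do
    {job_id, kind, reason}
  end
end
```

Passing the handler as `&__MODULE__.handle_event/4` rather than an anonymous function follows the recommendation in the telemetry docs.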

Storing the raw exception in a separate table seems like a correlation and maintenance burden. Perhaps you could raise a custom error instead?

defmodule MyWorker do
  use Oban.Worker

  defmodule Error do
    defexception [:message]
  end

  @impl Oban.Worker
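  def perform(%{args: args}) do
    # do_something_with_args/1 stands in for your own logic.
    with {:error, reason} <- do_something_with_args(args) do
      # With the default defexception, reason must be a string; it becomes the message.
      raise Error, reason
    end
  end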
end

I work with @graborg and can expand on the use case. When performing the job, we call another service that may result in a partial error. On partial failures, this service returns an opaque string token that should be passed along if the call is retried. The token is used on the other end to exclude the already-performed work.

If we return {:error, token} in our worker, we see that the token is stored in the error array, but it seems unsafe to retrieve it back. I’ve hardcoded perform/1 to return {:error, "hello"} and see this string in the relevant “error” field:

"** (Oban.PerformError) MyWorker failed with {:error, \"hello\"}"

It would be possible to parse this string, but that might break if formatting changes in the future.

I see a few ways:

  • Parsing the token from the error string. Feels unsafe…
  • Picking up the error reason through telemetry. Seems like a misuse of telemetry data.
  • Storing the token in a table with a reference to the Oban job, with delete cascade set on the foreign key so the rows clean up. Still might be unsafe, as the job is updated by Oban outside perform/1, and so in a different transaction.
  • We can write manually to the job’s meta by updating it with Ecto while we’re still in perform/1.
  • Discarding the current job and enqueuing a different job specific to retries of partial failures, with the token in args. A bit of cognitive overhead, but perhaps not so bad
  • Adding functionality to Oban to extract the error reason safely from the string stored in errors array. That would add a maintenance burden, but we’d be happy to contribute

Do you see a better way to retrieve context from an error when retrying jobs?

That’s the error Oban uses when you return an {:error, reason} tuple. If you raise an exception instead, it will use the formatted output of that exception. That’s why I suggested defining a custom error: you control the formatting. For example:

defmodule MyApp.Error do
  defexception [:message]
end

Exception.format(:error, MyApp.Error.exception("Token <<123-abc>>"), [])
# => "** (MyApp.Error) Token <<123-abc>>"

Then you’re in complete control of the format and can reliably parse out the token, if necessary.
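
As a rough sketch of what that parsing could look like: the exception below carries the token in its own field and defines the message format next to the parser, so the two can't drift apart. The module name `MyApp.PartialFailure`, the `parse_token/1` helper, and the `Token <<...>>` format are all assumptions for illustration, and the regex assumes the token itself never contains `>>`.

```elixir
defmodule MyApp.PartialFailure do
  # Hypothetical exception that carries the retry token.
  defexception [:token]

  # We own this format, so parse_token/1 below can rely on it.
  @impl true
  def message(%__MODULE__{token: token}), do: "Token <<#{token}>>"

  # Extracts the token from a formatted error string, such as the
  # string stored for a failed attempt.
  def parse_token(formatted) when is_binary(formatted) do
    case Regex.run(~r/Token <<(.+?)>>/, formatted) do
      [_, token] -> {:ok, token}
      _ -> :error
    end
  end
end

formatted = Exception.format(:error, %MyApp.PartialFailure{token: "123-abc"}, [])
# => "** (MyApp.PartialFailure) Token <<123-abc>>"

MyApp.PartialFailure.parse_token(formatted)
# => {:ok, "123-abc"}
```

In perform/1 you would `raise MyApp.PartialFailure, token: token` on a partial failure, and on the next attempt pass the most recent stored error string to `parse_token/1`.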

Picking up the error through telemetry is how error reporting from Oban works already, so it’s not much of a misuse.

There’s no reliable way to extract the reason from Oban itself because it will store the output of any exception. Errors are normalized as much as possible, but the format is unpredictable.