Issues with Oban Upgrade and Telemetry Events

lucasaas98 · July 1, 2024, 11:47am

Hey, there!

We’re using Oban and Oban Pro. We have a main worker that pulls data from third parties daily.

Here’s how we define it:

defmodule Integrations.Workers.Workflow do
  use Oban.Pro.Worker,
    queue: :integrations,
    max_attempts: 5,
    # Only one job per integration/service pair may exist in these states.
    unique: [
      period: :infinity,
      keys: [:integration_id, :service],
      states: [:available, :scheduled, :executing, :retryable]
    ]

  # process/1 omitted
end

This worker then triggers a Batch which uses Oban.Pro.Workers.Batch.

The idea is that we always generate a new Integrations.Workers.Workflow job and schedule it 24 hours later. There are 3 potential outcomes:

we have errors: in the ErrorHandler we schedule an Integrations.Workers.Workflow after 24 hours.
Here’s how we attach the telemetry:

    :telemetry.attach(
      "oban-job-error",
      [:oban, :job, :exception],
      &Integrations.Telemetry.ErrorHandler.handle_event_wrapper/4,
      []
    )

we have a discard that requires a reschedule: in the StopHandler we schedule an Integrations.Workers.Workflow after 24 hours.
Here’s how we attach the telemetry:

    :telemetry.attach(
      "oban-job-stop",
      [:oban, :job, :stop],
      &Integrations.Telemetry.StopHandler.handle_event_wrapper/4,
      nil
    )

all is well: in the custom callback worker for the Batch we schedule a new Integrations.Workers.Workflow for 24 hours later as well.
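The split between the three outcomes can be sketched as a single pure function over the telemetry event name and metadata. This is only an illustration: RescheduleSketch is a hypothetical module, not the real handler code, and which states should trigger a reschedule is an assumption (the real handlers only reschedule some discards). It relies on the state key that Oban includes in job event metadata.

```elixir
defmodule RescheduleSketch do
  @moduledoc "Hypothetical helper: decides whether a telemetry handler should insert a follow-up Workflow."

  # Error path: the job raised or returned an error (assumed to always reschedule).
  def reschedule?([:oban, :job, :exception], %{state: state})
      when state in [:failure, :discard],
      do: true

  # Stop path: the job finished without raising but was discarded.
  # Assumption: every discard is rescheduled; real code would filter further.
  def reschedule?([:oban, :job, :stop], %{state: :discard}), do: true

  # Everything else (including success, which the Batch callback handles).
  def reschedule?(_event, _metadata), do: false
end
```

For example, RescheduleSketch.reschedule?([:oban, :job, :stop], %{state: :success}) returns false, since the success case is left to the Batch callback.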

All of this scheduling is done like this:

    args
    |> Integrations.Workers.Workflow.new(
      queue: args["queue"],
      schedule_in: args["schedule_update_in"]
    )
    |> Oban.insert!()

Recently we updated Oban from 2.16.2 to 2.17.3 and Oban Pro from 1.1.4 to 1.3.0 because we wanted to use the DynamicCron plugin and its scheduling guarantees.

This caused issues: the ErrorHandler and StopHandler are no longer able to reschedule Integrations.Workers.Workflow workers. We log after the insert and the log line appears, but the worker does not get rescheduled.

Skimming through the changelogs, we noticed that the ack_async option had been added between those versions. Setting it to false lets the ErrorHandler and the StopHandler schedule the Workflows again.
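A minimal sketch of what disabling it for one queue might look like. The exact shape is an assumption (ack_async passed as a per-queue option next to local_limit under the Smart engine), and :my_app, MyApp.Repo and the limit of 10 are placeholders rather than values from this setup, so check the Oban Pro docs for your version before copying it.

```elixir
# config/config.exs (sketch; :my_app, MyApp.Repo and the limit are placeholders)
import Config

config :my_app, Oban,
  engine: Oban.Pro.Engines.Smart,
  repo: MyApp.Repo,
  queues: [
    # Assumed shape: ack_async sits alongside the queue's other options.
    integrations: [local_limit: 10, ack_async: false]
  ]
```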

In our sandbox we also tried upgrading oban from 2.17.3 to 2.17.10 and oban_pro from 1.3.0 to 1.4.9, but the issue is the same. (It also raised the DB to 100% in prod, but we’re not sure whether that’s related to using ack_async: false.)

Has anyone faced a similar problem? If not, do you have tips on how we can change our flow in a way that prevents these issues?

Thanks!

sorenone · July 1, 2024, 2:54pm

Hey @lucasaas98,

Some people have faced a similar problem after upgrading to v1.3+ due to the async changes. Forcing the queue to disable ack_async is one way around it, but you can also disable the unique check temporarily in your telemetry hook:

args
|> Integrations.Workers.Workflow.new(
  queue: args["queue"],
  schedule_in: args["schedule_update_in"],
  unique: nil
)
|> Oban.insert!()

The unique: nil bit is a slight hack and you’ll be able to use unique: false in the next Oban version.

On a related note, have you considered using worker hooks rather than your own telemetry handlers?

lucasaas98 · July 1, 2024, 5:06pm

Trying the unique: nil worked!

We will try switching to worker hooks as you suggested, but that will come later because it requires more refactoring.

Thank you!

sorenone · July 2, 2024, 12:08am

@lucasaas98 marvelous!