Oban — Reliable and Observable Job Processing

Oban v1.2.0 is out today!

This release includes a helpful bugfix to unblock queues after killing jobs, as well as a few requested features. The most exciting addition is a new telemetry event, [:oban, :started], which is emitted when a job starts. Between the :started and :success | :failure events it is now possible to build tracing spans for Oban workers. This is very helpful if you use a tool like New Relic or AppSignal and want complete traces of worker activity.
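For a rough idea of what that looks like, here is a sketch of a handler built on the lifecycle events (the handler id, the metadata keys, and the measurement shapes are assumptions; check the Oban.Telemetry docs for the exact payloads):

```elixir
defmodule MyApp.ObanSpans do
  # Attach one handler for all three job lifecycle events.
  def attach do
    events = [[:oban, :started], [:oban, :success], [:oban, :failure]]
    :telemetry.attach_many("oban-span-tracer", events, &__MODULE__.handle_event/4, nil)
  end

  def handle_event([:oban, :started], _measurements, meta, _config) do
    # Open a span here, e.g. with your APM client's tracer API.
    IO.inspect(meta[:worker], label: "span started")
  end

  def handle_event([:oban, event], measurements, meta, _config)
      when event in [:success, :failure] do
    # Close the span and record the outcome and duration.
    IO.inspect({meta[:worker], event, measurements}, label: "span finished")
  end
end
```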

From the CHANGELOG:

Added

  • [Oban.Telemetry] Emit an [:oban, :started] event when a job begins executing, making it possible to build tracing spans between :started and the :success | :failure events.
Fixed

  • [Oban] Handle the :shutdown message when a job is killed purposefully. Previously the message was ignored, which caused the producer to keep a reference to the discarded job and prevented dispatching additional jobs.
10 Likes

I’m going to chime in that I’ve got a need for Oban.delete/1 as well, or the ability to delete by job args. For now I’ll just use a normal Repo.delete/2 call to try to remove the job before it runs.

1 Like

There isn’t an issue for adding Oban.delete/1, so it hasn’t been on my radar. If it is something you’d like to see, please open an issue. Otherwise, it is perfectly fine to use Repo.delete/2 to delete jobs; they are just Ecto structs.

1 Like

Perhaps the right move would be to create some concrete examples or guides. If people should feel comfortable manipulating Oban.Job structs as ordinary Ecto schemas, some official guides or examples would go a long way toward creating that comfort, I think.

@sorentwo This does raise a question, however: what happens if a job is deleted while it is in progress? This is perhaps an area where some Oban functions, whether query helpers or full functions, would come in handy. Specifically, users may want to say “delete this job, which I think is still pending, but if it has in fact already started to run, don’t delete it, BUT also don’t retry it if it fails”. I know enough about how Oban works that I’m fairly sure I could write the Ecto query to make that happen, but it is slightly more complicated than just calling Repo.delete.
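For the “delete only if it hasn’t started” part, a sketch of the kind of query involved (MyApp.Repo and job_id are assumed names; "available" and "scheduled" are the pre-execution states in Oban’s jobs table):

```elixir
import Ecto.Query

# Delete the job only if it has not begun executing yet.
# A job that is already "executing" (or finished) is left alone.
{count, _} =
  MyApp.Repo.delete_all(
    from j in Oban.Job,
      where: j.id == ^job_id,
      where: j.state in ["available", "scheduled"]
  )
```

If `count` is 0, the job either already started or never existed, and the caller can decide what to do next.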

4 Likes

You’re telling me! Some guides are toward the top of my list of things to work on. Writing takes even more time than coding.

That would cause a little trouble. There would be an error when the executor goes to update the job’s state after it runs. The backoff wrapper would attempt the update several more times before finally raising an error and crashing the parent task. After that the job would be gone, so that would be the end of it.

I think that is a good argument in favor of providing an Oban.delete function. It reduces the amount of internal knowledge required to gracefully handle “safe” job deletion. Combined with some docs that explain how to use Ecto to query the jobs table that would provide a complete solution.

4 Likes

Hi, I am unsure about something. Are jobs marked as discarded only after all of their retries have been exhausted?
Thanks

Yes, that is how the flow works currently. Jobs are discarded when they have exhausted all retries. The only other official way, currently, is to discard through the UI.

There is a PR in progress to add a discard function, and another that supports returning a :discard tuple from perform/2.

1 Like

Hello, I just came across Oban, and I don’t know much about it yet, but I already have a few questions.

I will use Bamboo or Swoosh to send out emails. Bamboo, for example, supports sending emails in the background:
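A minimal sketch of Bamboo’s background delivery (the mailer and email modules here are assumed names):

```elixir
# Compose the email struct, then hand it to Bamboo's delivery
# strategy, which sends it outside the caller process by default.
user
|> MyApp.Email.welcome_email()
|> MyApp.Mailer.deliver_later()
```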


But I see that it doesn’t use Oban. Is delivering emails in the background not a good use case for Oban? Swoosh doesn’t have a built-in mechanism for background sending; it simply mentions Task.start/1 without mentioning Oban. Is Oban overkill for these operations? When would I typically want to use Oban?

Another question: what are the DB triggers actually for? Can someone explain in layman’s terms :sweat_smile:? I know that a trigger allows executing code after a DB operation, but why use that?

2 Likes

I think that’s because deliver_later appeared in Bamboo long before Oban was created.

It’s a perfect use-case for Oban. Personally, I use Oban to compose and deliver emails in background with Swoosh.

2 Likes

It is, but not everyone is using or wants to use Oban, so those libraries either come with their own async API or point to Elixir core options.

3 Likes

What do you mean by “composing emails” with Oban?

By “composing emails” I mean building a Bamboo.Email (or Swoosh.Email) struct (e.g. by YourApp.UserEmail.welcome_email) with all the data (to:, subject:, body:) necessary to send the email:

user
|> YourApp.UserEmail.welcome_email() # => composing an email here
|> YourApp.Mailer.deliver_later() # => delivering the email here, maybe asynchronously

Composing an email often involves making additional calls to the database or third-party APIs to gather necessary data to render the body of the email. Both in Bamboo and in Swoosh, this is done in the caller process by default, effectively blocking it. And if you’re composing an email inside a database transaction (which I believe you shouldn’t) and some error occurs, the transaction might be rolled back, which can be undesirable.

The other approach is to introduce a dedicated Oban worker for composing and sending emails.

defmodule YourApp.Workers.Mailer do
  use Oban.Worker, queue: "mailers"

  @impl Oban.Worker
  def perform(%{"module" => module, "name" => name, "args" => args}, _job) do
    # Job args are stored as JSON, so the module and function names
    # arrive as strings; convert them back to existing atoms only.
    module = String.to_existing_atom(module)
    name = String.to_existing_atom(name)

    # Compose the email struct and deliver it synchronously
    # inside the job, so failures are retried by Oban.
    module
    |> apply(name, args)
    |> YourApp.Mailer.deliver_now()
  end
end

With YourApp.Workers.Mailer, you can send emails like this:

%{"module" => YourApp.UserEmail, "name" => :welcome_email, "args" => [user.id]}
|> YourApp.Workers.Mailer.new()
|> Oban.insert()

So now if composing an email fails with a transient error (e.g., a network error), it will eventually succeed because Oban will retry the job. If it fails because of a bug in the code, you can always fix the bug and retry the failed job manually (assuming your Oban pruning settings allow it). You can also safely enqueue this job inside a database transaction now.
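Enqueueing inside a transaction can be sketched with Ecto.Multi (the changesets and repo name are assumed; Oban.insert/3 participates in the multi, so the job row is only inserted if the transaction commits):

```elixir
alias Ecto.Multi

job_changeset =
  YourApp.Workers.Mailer.new(%{
    "module" => YourApp.UserEmail,
    "name" => :welcome_email,
    "args" => [user.id]
  })

# If the user update fails and rolls back, no job is enqueued either.
Multi.new()
|> Multi.update(:user, user_changeset)
|> Oban.insert(:email_job, job_changeset)
|> YourApp.Repo.transaction()
```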

4 Likes

@smaximov Great example of a simple multi-purpose mail worker :+1:

You’ll always be able to retry a job, provided there are more attempts available. Oban only prunes completed or discarded jobs and will never prune retryable jobs. There isn’t any chance of losing your mailer jobs this way.

2 Likes

Yes, this makes sense. But I was thinking about a situation when a job consistently fails because of a code bug. Assuming it may take some time for the user to fix the bug, the job can eventually reach the max number of retries and enter the discarded state, which makes the job subject to pruning. Though I admit with the default backoff and retries settings it will take a considerable amount of time.

2 Likes

Hey @sorentwo!
I have one striking issue regarding job execution in a multi-node environment.
Please correct me if I’m wrong, but I assumed that Oban guarantees a single job will not be executed by multiple nodes, i.e. every job will run only once.
In our case, during cluster startup, each node creates an Oban unique job.
We observe that only one job is inserted into the Oban jobs table (which is as expected), but then we also sometimes observe that multiple nodes execute that job. Hence, we have a job that runs multiple times!
Isn’t it Oban’s responsibility to prevent such scenarios with a locking mechanism?
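For context, a unique job is declared on the worker roughly like this (a sketch; the queue name and uniqueness period are assumptions, see the Oban.Worker docs for the full set of unique options):

```elixir
defmodule MyApp.Workers.ClusterStartup do
  # Jobs from this worker are considered unique for 60 seconds, so
  # concurrent inserts from several nodes should collapse into one row.
  use Oban.Worker, queue: "default", unique: [period: 60]

  @impl Oban.Worker
  def perform(_args, _job), do: :ok
end
```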

1 Like

That shouldn’t be possible. There are various locking mechanisms in play to enforce uniqueness, all handled within the database queries. If there is some situation that makes it possible, I definitely want to track it down. Will you open an issue and include some more details (Oban version, PG version, worker options, etc.)?

1 Like

Sure! I believe it may be some misunderstanding on my side, so I’ve opened an issue with tech details and configuration: Job processed by multiple nodes · Issue #250 · sorentwo/oban · GitHub

Thanks!

2 Likes

Oban 2.0.0-rc.1 is out today, along with some big news about the introduction of Oban Web+Pro!

Highlights from the Oban CHANGELOG are in the post along with a breakdown of what’s gone into the new Oban Pro. Oban 2.0.0-rc.0 was silently released last week as part of the Web+Pro development process and didn’t get any fanfare :tada:

The changeset for 2.0.0 is massive and seems like too much to drop in this thread, so I’ll leave it at a link to the CHANGELOG.

Please leave any questions about the blog post, changes in Oban 2.0, or the Web+Pro package here. Alternatively, you can find me in #oban on elixir-slack.

13 Likes

Congratulations! :tada:

One question about Batch. Let’s say we want to run a batch where we retry each subtask N times, then give up if it still fails. At the end of the batch, we’d send an email report showing which tasks succeeded and which failed.

Would it make sense to use the discarded state and Batch callbacks to implement this? It sounds like we would need an equivalent to handle_completed that fires once all tasks are either completed or permanently discarded; I get the impression it can’t be done with the current callbacks.

Or would it be better to have a catch-all within the task (or do the actual work in a separate process), so that it “completes” (from Oban’s POV) no matter what, and store the fact that it failed ourselves in DB?

1 Like

You’re correct, that can’t be done with the current callbacks. The best way to accomplish what you’re requesting is with a different callback, say handle_exhausted, which fires when all jobs are either completed or discarded. I can see the utility of an additional callback here because there isn’t anything else to hook into at that point. I’ll add it as a feature request.

Note that depending on how aggressively you are pruning discarded jobs, or how long you’re waiting before discarding, that callback may never fire.

1 Like