Oban — Reliable and Observable Job Processing

> If it doesn’t change the code much, could be something you could list as a suggestion if available.

This is purely a comparison of how the libraries are structured and their features. Between building Kiq and Oban I investigated nearly every library in the ecosystem (EctoJob, Exq, Honeydew, Que, Rihanna and Verk). I learned something from each one and owe all of the authors a debt of gratitude.

There are far too many differences between the various libraries to summarize them here. Therefore I’m going to cheat a bit and make a table highlighting the differences between only the libraries you asked about.

| Feature | Oban | EctoJob | Rihanna |
| --- | --- | --- | --- |
| perform return | output doesn’t matter | multi transaction | success tuple |
| job scheduling | triggered, polled | triggered, polled | polled |
| args storage | jsonb | jsonb | erlang terms |
| error retention | all historic errors | none | last error |
| execution time | unlimited | configured timeout | unlimited |
| orphaned jobs | rescued, guarded by locks | inside a transaction | guarded by locks |
| queues | multiple with single table | one per table | single |
| queue limits | configured limit per queue | configured limit per queue | configured globally |
| queue changes | pause, resume, scale | none | none |
| graceful shutdown | worker draining | no | no |
| job cancelling | yes | no | no |
| runtime metrics | with telemetry | no | no |
| historic metrics | with retained jobs | no | no |
| integrations | with telemetry, pubsub | no | no |

This is all based on my understanding of the other libraries through docs, issues and source code. It may not be entirely accurate! If I got anything wrong please let me know (@mbuhot, @lpil).

23 Likes

Congrats on this excellent work :slightly_smiling_face:

I want to call your attention to the fact that you are leaking emails from your real customers in the screenshot of the web UI, and that can cause you legal issues, like with GDPR.

1 Like

Can you elaborate on the usage of advisory locks? One serious downside to advisory locks is that they are effectively a shared namespace. We use them within our own application, and it’s entirely possible that the ids in the job row will conflict with the values locked for some other purpose.

1 Like

Ah, I should have mentioned that everything in the dashboard is generated from “faker” data. There isn’t anything sensitive in the screenshot—in fact the only real part is that my laptop is called SorenBook =)

The host app where the UI live view is mounted generates a constant stream of fake jobs in various queues. This has been really helpful for testing with pseudo-production data.

3 Likes

Very cool! A few notes on Rihanna:

  • The latest error is retained in the job’s database record.
  • Orphaned jobs are not possible: when a node goes down the lock is released and the jobs will be picked up on the next poll.
  • Jobs can be deleted, though outside the Rihanna UI project this feature isn’t made overly accessible.

Non-global configuration, graceful shutdown, and multiple queues within a single table are all pretty straightforward to add and have open issues. If there’s a demand for them they can be implemented :slight_smile:

Rihanna uses them and allows the user to specify a custom namespace. I would prefer it if Postgres allowed a wider range of advisory locks though!

2 Likes

Is there any estimate of when the Dashboard will be available? Not trying to be pushy, only curious as that is such an awesome feature/addon.

I really like how you’ve designed this from a user (/developer) point of view, and will definitely be adding it to the app I’m currently building. I’m currently using Que (and like it) but I do have some use cases for a DB backed job queue which will necessitate switching.

2 Likes

Right, there are some major limitations to advisory locks. Part of the reason I used bigint primary keys instead of guid was so that they could double as lock values. For int you can use the pg_try_advisory_lock(key1 int, key2 int) variant and essentially namespace the lock. That doesn’t work with bigint though, in that case we only have pg_try_advisory_lock(key bigint). Initially I used the two int variety with the table oid and truncated the id to 32 bits, but it seemed messy so I ditched it for a simple bigint lock.
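
For illustration, here is a sketch of how a namespaced two-int lock key could be derived (the `LockKey` module, the oid-as-namespace choice, and the 32-bit truncation are assumptions for this example, not Oban’s actual code):

```elixir
# Hypothetical sketch: namespace an advisory lock by pairing the table's oid
# with the bigint job id truncated to a signed 32-bit value, suitable for
# SELECT pg_try_advisory_lock($1::int, $2::int).
defmodule LockKey do
  import Bitwise

  def for_job(table_oid, job_id) do
    truncated = band(job_id, 0xFFFFFFFF)

    # Postgres int is signed, so wrap values above 2^31 - 1 into the
    # negative range instead of overflowing.
    key = if truncated > 0x7FFFFFFF, do: truncated - 0x100000000, else: truncated

    {table_oid, key}
  end
end
```

The obvious trade-off is that two distinct bigint ids can truncate to the same 32-bit key, which is presumably part of why it felt messy.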

That’s a nice solution, definitely something I should look into implementing! For Oban it was critical that jobs use bigint instead of int, or at least have a mechanism to grow beyond 32 bits.

1 Like

Sorry, somehow I missed that. I’ll update the table accordingly.

Good to know. That’s an important use-case that I misinterpreted based on this statement in the README:

> One thing to be aware of is that if you restart your application (e.g. because you deployed) then all running jobs on that node will be exited. For this reason it is probably sensible not to make your jobs take an extremely long time.

I’ll update the table for this one too :+1:

The bit about “job cancelling” refers to cancelling jobs that are currently running more so than deleting them. Though it is funny, with a Postgres backed queue it is really easy to remove a job. In a Redis backed queue it is a major pain!

1 Like

Makes sense, just trying to understand the role the advisory locks play. Is there a writeup within the repo I’m missing?

1 Like

Ah, now I follow. I don’t think there is a comprehensive description of how the advisory locks are used within the repo. It is strewn between the query module and some of the integration tests.

Essentially advisory locks are used to keep track of which jobs are actively executing and which ones are in the executing state, but actually belong to a dead node.

Earlier in this thread I described it in some detail. I’ll get that worked into the README or primary module docs.

3 Likes

Ah I see now, if the node that first tries the job goes down while the job is in progress, other nodes can determine that the job is abandoned by observing that there is a job in the "executing" state that lacks a corresponding lock.
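
As a rough illustration of what that check implies (this query is hypothetical, not Oban’s actual internals, and it simplifies by assuming job ids small enough that the whole id lands in `pg_locks.objid`):

```elixir
# Hypothetical query: find jobs stuck in "executing" whose advisory lock
# no longer exists, i.e. the node that held the lock has gone away.
rescue_sql = """
SELECT j.id
FROM oban_jobs AS j
LEFT JOIN pg_locks AS l
  ON l.locktype = 'advisory'
 AND l.objid = j.id
WHERE j.state = 'executing'
  AND l.objid IS NULL
"""

# Such a query could be run periodically, e.g. via Ecto.Adapters.SQL.query/3,
# to move abandoned jobs back to an available state.
```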

I wonder if this could be achieved by a dedicated table that tracks cluster membership. Each node, when started, inserts a new row into a "nodes" table that contains basically just an id, the node name for debugging, and an alive_at timestamp. The node needs a process that updates the timestamp at some regular interval. When you start a job, record the id of the node row. Abandoned jobs are any job rows where the alive_at timestamp on the corresponding node row is older than now() minus the interval by some appreciable amount. You can garbage collect the "nodes" table when the alive_at value is old and there are no associated jobs in the running or queued states.
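
The liveness test at the heart of this scheme could be sketched like so (the `Heartbeat` module and the interval/slack numbers are made up for the example):

```elixir
# Hypothetical heartbeat check for the proposed "nodes" table: a node is
# considered dead once its alive_at is older than the heartbeat interval
# plus some slack, at which point its "executing" jobs can be rescued.
defmodule Heartbeat do
  # Assumed values: nodes touch alive_at every 15s; allow 30s of slack
  # for scheduler hiccups before declaring a node dead.
  @interval_seconds 15
  @slack_seconds 30

  def node_dead?(alive_at, now \\ DateTime.utc_now()) do
    DateTime.diff(now, alive_at) > @interval_seconds + @slack_seconds
  end
end
```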

Thoughts?

P.S.

To be clear, I don’t have any concerns about the correctness of the current implementation. However, I’m in the unfortunate position of already using advisory locks on 64-bit auto-incrementing ids. I think advisory locks can reasonably be used by an application’s own logic, but in a library it’s tough because it is easy to become incompatible with any other library that also uses them.

3 Likes

Precisely :+1:

I’m sure it could, this is essentially how Redis based tools work. They use two lists (sidekiq, verk) or a list and a hash (kiq) to hold a backup of each job while it executes.

What you’ve described is exactly how sidekiq and kiq handle tracking live nodes. It definitely works in that environment. I eschewed that in favor of advisory locks and pubsub to simplify bookkeeping.

This is an excellent point. Based on your points and some of the other discussion in this thread I’m moving to the “namespaced” double int version.

Thanks for all the thought and feedback.

3 Likes

This seems like it would work, just make sure to let people pick the namespace so that they can ensure it doesn’t happen to be the same namespace they’ve picked for some other thing. This is also important if you had multiple copies of Oban running on the same database for some reason.

4 Likes

I just saw the docs for implementing migrations for oban:

defmodule MyApp.Repo.Migrations.AddObanJobsTable do
  use Ecto.Migration

  defdelegate up, to: Oban.Migrations
  defdelegate down, to: Oban.Migrations
end

Generally I’m an advocate for immutable migrations, so I’m wondering what would happen if you release a new version with different database needs. Is everything in there idempotent so I can at least add another migration doing the same? But really I’d prefer something, where I can lock my migration to a certain version of your migration script.

8 Likes

Excellent point. While the project is pre-1.0 I hadn’t expected the migrations to be treated as immutable, but I can see how that is a concern.

I’m modifying the migration mechanism to support versions for the next release. There won’t be any breaking migrations though you may need to update some names in the older migrations.

5 Likes

I’ve made some changes to namespace the advisory locks. The namespace is based on the oid of the oban_jobs table, which is unique per database (and actually changes if you create/drop a table repeatedly). This may not be the final solution as I’m working to make the prefix and possibly the table name more flexible, but it eliminates the likely intersection with application level advisory locks.

Here is the commit for the curious. There is a bit more detail in the comments and CHANGELOG: https://github.com/sorentwo/oban/commit/461060fa6bfbdbed7d0aa7594277ad83b7b22a51

3 Likes

Oban v0.3.0 has been released. It includes a number of fixes and improvements that came directly from the conversations in this thread, so thank you all! :heart:

Directly from the CHANGELOG:

Added

  • [Oban] Allow setting queues: false or queues: nil to disable queue
    dispatching altogether. This makes it possible to override the default
    configuration within each environment, i.e. when testing.

    The docs have been updated to promote this mechanism, as well as noting that
    pruning must be disabled for testing. (@yogodoshi)

  • [Oban.Testing] The new testing module provides a set of helpers to make
    asserting and refuting enqueued jobs within tests much easier. (@bamorim)

Changed

  • [Oban.Migrations] Explicitly set id as a bigserial to avoid mistakenly
    generating a uuid primary key. (@arfl)

  • [Oban.Migrations] Use versioned migrations that are immutable. As database
    changes are required a new migration module is defined, but the interface of
    Oban.Migrations.up/0 and Oban.Migrations.down/0 will be maintained.

    From here on all releases with database changes will indicate that a new
    migration is necessary in this CHANGELOG. (@LostKobrakai)

  • [Oban.Query] Replace use of (bigint) with (int, int) for advisory locks.
    The first int acts as a namespace and is derived from the unique oid value
    for the oban_jobs table. The oid is unique within a database and even
    changes on repeat table definitions.

    This change aims to prevent lock collision with application level advisory
    lock usage and other libraries. Now there is a 1 in 2,147,483,647 chance of
    colliding with other locks. (@benwilson512)

  • [Oban.Job] Automatically remove leading “Elixir.” namespace from stringified
    worker name. The prefix complicates full text searching and reduces the score
    for trigram matches.

Note: When upgrading a migration is required.

v0.3.0 Docs
Testing Changes

12 Likes

I’d love to use the UI as well, and contribute where I can.

@sorentwo any plans on releasing a preview soon?

Thanks for your work on this: I just switched some stuff over to Oban in a couple of hours. :+1:

Any plans on adding batches, like Sidekiq’s batches? I think this is one of the most powerful abstractions I’ve used for years.

2 Likes

That’s great to hear! I’m glad it was a smooth process.

Yes, I hope to have a preview version of the UI ready by mid June. There are some essential features that are lacking currently. Once those are implemented and a few bugs are worked out it will be ready to try.

There are a few other features that I plan on tackling first (most of which overlap with the Sidekiq Enterprise feature set):

  • Expiring Jobs
  • Periodic Jobs (like cron jobs)
  • Rate Limiting
  • Dampeners (automatic queue scaling based on mem/cpu usage)

Batches are a great addition to the list!

3 Likes