Oban — Reliable and Observable Job Processing

brightball · May 17, 2019, 3:44am

If it doesn’t change the code much, it could be something you list as a suggestion, if available.

sorentwo · May 20, 2019, 7:26pm

This is purely a comparison of how the libraries are structured and their features. Between building Kiq and Oban I investigated nearly every library in the ecosystem (EctoJob, Exq, Honeydew, Que, Rihanna and Verk). I learned something from each one and owe all of the authors a debt of gratitude.

There are far too many differences between the various libraries to summarize them here. Therefore I’m going to cheat a bit and make a table highlighting the differences between only the libraries you asked about.

Feature	Oban	EctoJob	Rihanna
perform return	output doesn’t matter	multi transaction	success tuple
job scheduling	triggered, polled	triggered, polled	polled
args storage	jsonb	jsonb	erlang terms
error retention	all full historic errors	none	last error
execution time	unlimited	configured timeout	unlimited
orphaned jobs	rescued, guarded by locks	inside a transaction	guarded by locks
queues	multiple with single table	one per table	single
queue limits	configured limit per queue	configured limit per queue	configured globally
queue changes	pause, resume, scale	none	none
graceful shutdown	worker draining	no	no
job cancelling	yes	no	no
runtime metrics	with telemetry	no	no
historic metrics	with retained jobs	no	no
integrations	with telemetry, pubsub	no	no

This is all based on my understanding of the other libraries through docs, issues and source code. It may not be entirely accurate! If I got anything wrong please let me know (@mbuhot @lpil)

Exadra37 · May 20, 2019, 8:50pm

Congrats on this excellent work

I want to call your attention to the fact that you are leaking emails from your real customers in the screenshot of the web UI, and that can cause you legal issues, for example under GDPR.

benwilson512 · May 20, 2019, 8:53pm

Can you elaborate on the usage of advisory locks? One serious downside to advisory locks is that they are effectively a shared namespace. We use them within our own application, and it’s entirely possible that the ids in the job row will conflict with the values locked for some other purpose.

sorentwo · May 20, 2019, 9:33pm

Ah, I should have mentioned that everything in the dashboard is generated from “faker” data. There isn’t anything sensitive in the screenshot—in fact the only real part is that my laptop is called SorenBook =)

The host app where the UI live view is mounted generates a constant stream of fake jobs in various queues. This has been really helpful for testing with pseudo-production data.

lpil · May 20, 2019, 10:12pm

Very cool! A few notes on Rihanna:

The latest error is retained in the job’s database record.
Orphaned jobs are not possible: when a node goes down the lock is released and the jobs will be picked up on the next poll.
Jobs can be deleted, though outside the Rihanna UI project this feature isn’t made overly accessible.

Non-global configuration, graceful shutdown, and multiple queues within a single table are all pretty straightforward to add and have open issues. If there’s a demand for them they can be implemented

Rihanna uses advisory locks as well and allows the user to specify a custom namespace. I would prefer it if Postgres allowed a wider range of advisory locks though!

phoffer · May 20, 2019, 10:40pm

Is there any estimate of when the Dashboard will be available? Not trying to be pushy, only curious as that is such an awesome feature/addon.

I really like how you’ve designed this from a user (/developer) point of view, and will definitely be adding it to the app I’m currently building. I’m currently using Que (and like it) but I do have some use cases for a DB backed job queue which will necessitate switching.

sorentwo · May 20, 2019, 10:41pm

Right, there are some major limitations to advisory locks. Part of the reason I used bigint primary keys instead of guid was so that they could double as lock values. For int you can use the pg_try_advisory_lock(key1 int, key2 int) variant and essentially namespace the lock. That doesn’t work with bigint though, in that case we only have pg_try_advisory_lock(key bigint). Initially I used the two int variety with the table oid and truncated the id to 32 bits, but it seemed messy so I ditched it for a simple bigint lock.
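To make the trade-off concrete, here is a minimal sketch of the two-int approach described above. It assumes the namespace is the table oid and the key is the job id truncated to 32 bits. The module and function names are hypothetical and are not part of Oban.

```elixir
defmodule LockKeySketch do
  # Hypothetical helper, not Oban's code. It shows how a 64-bit job id could
  # be squeezed into the pg_try_advisory_lock(key1 int, key2 int) form.

  # Keep only the low 32 bits of the value and reinterpret them as a signed
  # 32-bit integer, which is the range of a Postgres `int`.
  def to_int32(value) when is_integer(value) do
    <<signed::signed-integer-size(32)>> = <<value::integer-size(32)>>
    signed
  end

  # Build the {namespace, key} pair that would be passed as
  # pg_try_advisory_lock(namespace, key).
  def namespaced_key(table_oid, job_id) do
    {to_int32(table_oid), to_int32(job_id)}
  end
end
```

With a made-up oid of 16402, `LockKeySketch.namespaced_key(16_402, 1)` and `LockKeySketch.namespaced_key(16_402, 4_294_967_297)` both return `{16402, 1}`. Ids that are 2^32 apart share a lock key after truncation, which is part of why the truncated version felt messy.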

That’s a nice solution, definitely something I should look into implementing! For Oban it was critical that jobs use bigint instead of int, or at least have a mechanism to grow beyond 32 bits.

sorentwo · May 20, 2019, 10:45pm

Sorry, somehow I missed that. I’ll update the table accordingly.

Good to know. That’s an important use-case that I misinterpreted based on this statement in the README:

One thing to be aware of is that if you restart your application (e.g. because you deployed) then all running jobs on that node will be exited. For this reason it is probably sensible not to make your jobs take an extremely long time.

I’ll update the table for this one too

The bit about “job cancelling” refers to cancelling jobs that are currently running more so than deleting them. Though it is funny, with a Postgres backed queue it is really easy to remove a job. In a Redis backed queue it is a major pain!

benwilson512 · May 21, 2019, 1:16am

Makes sense, just trying to understand the role the advisory locks play. Is there a writeup within the repo I’m missing?

sorentwo · May 21, 2019, 1:54am

Ah, now I follow. I don’t think there is a comprehensive description of how the advisory locks are used within the repo. It is strewn between the query module and some of the integration tests.

Essentially advisory locks are used to keep track of which jobs are actively executing and which ones are in the executing state, but actually belong to a dead node.
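As a rough illustration of that idea (not the actual query module), orphan detection boils down to a set difference. The sketch assumes we already have the ids of jobs in the executing state and the ids that currently have an advisory lock held. Postgres releases session-level advisory locks when the session ends, so a dead node's locks disappear with its connections. The module name is hypothetical.

```elixir
defmodule OrphanSketch do
  # Hypothetical illustration, not Oban's query code.
  # `executing_ids`: ids of jobs whose state is "executing".
  # `locked_ids`: ids that currently have an advisory lock held by a live session.
  # Returns the executing jobs that nobody holds a lock for.
  def orphaned(executing_ids, locked_ids) do
    locked = MapSet.new(locked_ids)
    Enum.reject(executing_ids, &MapSet.member?(locked, &1))
  end
end
```

For example, `OrphanSketch.orphaned([1, 2, 3], [1, 3])` returns `[2]`: job 2 is marked as executing but no live session holds its lock, so it can be rescued.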

Earlier in this thread I described it in some detail. I’ll get that worked into the README or primary module docs.

benwilson512 · May 21, 2019, 1:39pm

Ah I see now, if the node that first tries the job goes down while the job is in progress, other nodes can determine that the job is abandoned by observing that there is a job in the "executing" state that lacks a corresponding lock.

I wonder if this could be achieved by a dedicated table that tracks cluster membership. Each node, when started, inserts a new row into a "nodes" table that contains basically just an id, the node name for debugging, and an alive_at timestamp. The node needs a process that updates the timestamp at some regular interval. When you cut a job, record the id of the node row. Abandoned jobs are any job rows where the alive_at timestamp on the corresponding node row is older than now() minus the interval by some appreciable amount. You can garbage collect the "nodes" table when the alive_at value is old and there are no associated jobs in the running or queued states.
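Here is a minimal in-memory sketch of that check, using plain maps in place of table rows. The field names (`alive_at`, `node_id`, `state`), the "executing" state name, and the separate grace period are assumptions for illustration only.

```elixir
defmodule HeartbeatSketch do
  # Hypothetical in-memory model of the proposed "nodes" table.
  # nodes: list of %{id: integer, name: String.t(), alive_at: DateTime.t()}
  # jobs:  list of %{id: integer, node_id: integer, state: String.t()}

  # A node counts as dead when its last heartbeat is older than
  # now - (heartbeat interval + grace period), both in seconds.
  def dead_node_ids(nodes, now, interval_s, grace_s) do
    cutoff = DateTime.add(now, -(interval_s + grace_s), :second)

    for row <- nodes, DateTime.compare(row.alive_at, cutoff) == :lt, into: MapSet.new() do
      row.id
    end
  end

  # Abandoned jobs are executing jobs that were recorded against a dead node.
  def abandoned_jobs(jobs, nodes, now, interval_s, grace_s) do
    dead = dead_node_ids(nodes, now, interval_s, grace_s)
    Enum.filter(jobs, &(&1.state == "executing" and MapSet.member?(dead, &1.node_id)))
  end
end
```

In a real implementation this would presumably be a single SQL join between the jobs and nodes tables. The point of the sketch is only the cutoff comparison.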

Thoughts?

P.S.

To be clear, I don’t have any concerns about the correctness of the current implementation. However I’m just in the unfortunate position of already using advisory locks on 64bit auto incrementing ids. I think advisory locks can be reasonably used by people’s application logic, but in a library it’s tough because it is easy to become incompatible with any other library that also uses them.

sorentwo · May 21, 2019, 2:20pm

Precisely

I’m sure it could, this is essentially how Redis based tools work. They use two lists (sidekiq, verk) or a list and a hash (kiq) to hold a backup of each job while it executes.

What you’ve described is exactly how sidekiq and kiq handle tracking live nodes. It definitely works in that environment. I eschewed that in favor of advisory locks and pubsub to simplify bookkeeping.

This is an excellent point. Based on your points and some of the other discussion in this thread I’m moving to the “namespaced” double int version.

Thanks for all the thought and feedback.

benwilson512 · May 21, 2019, 2:29pm

This seems like it would work, just make sure to let people pick the namespace so that they can ensure it doesn’t happen to be the same namespace they’ve picked for some other thing. This is also important if you had multiple copies of Oban running on the same database for some reason

LostKobrakai · May 23, 2019, 9:43am

I just saw the docs for implementing migrations for oban:

defmodule MyApp.Repo.Migrations.AddObanJobsTable do
  use Ecto.Migration

  defdelegate up, to: Oban.Migrations
  defdelegate down, to: Oban.Migrations
end

Generally I’m an advocate for immutable migrations, so I’m wondering what would happen if you release a new version with different database needs. Is everything in there idempotent, so I can at least add another migration doing the same? But really I’d prefer something where I can lock my migration to a certain version of your migration script.

sorentwo · May 23, 2019, 12:34pm

Excellent point. While it was pre-1.0 I hadn’t expected the migrations to be immutable, but I can see how that is a concern.

I’m modifying the migration mechanism to support versions for the next release. There won’t be any breaking migrations, though you may need to update some names in the older migrations.
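As a rough sketch of what versioned migrations can mean in general (this is not the mechanism that was released): each version is an immutable step, and moving between versions is just a matter of selecting which steps to run. The module name and the version list are hypothetical.

```elixir
defmodule VersionedMigrationsSketch do
  # Hypothetical: each number stands for an immutable migration step that is
  # never edited after release. New database changes append a new version.
  @versions [1, 2, 3]

  # Versions to run, oldest first, when moving up from `current` to `target`.
  def pending_up(current, target) do
    Enum.filter(@versions, &(&1 > current and &1 <= target))
  end

  # Versions to roll back, newest first, when moving down from `current` to `target`.
  def pending_down(current, target) do
    @versions
    |> Enum.filter(&(&1 <= current and &1 > target))
    |> Enum.reverse()
  end
end
```

For example, `VersionedMigrationsSketch.pending_up(1, 3)` returns `[2, 3]` and `VersionedMigrationsSketch.pending_down(3, 1)` returns `[3, 2]`. An application migration that pins a target version keeps producing the same result even after newer versions are added to the library.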

sorentwo · May 24, 2019, 5:41pm

I’ve made some changes to namespace the advisory locks. The namespace is based on the oid of the oban_jobs table, which is unique per database (and actually changes if you create/drop a table repeatedly). This may not be the final solution as I’m working to make the prefix and possibly the table name more flexible, but it eliminates the likely intersection with application level advisory locks.

Here is the commit for the curious. There is a bit more detail in the comments and CHANGELOG: https://github.com/sorentwo/oban/commit/461060fa6bfbdbed7d0aa7594277ad83b7b22a51

sorentwo · May 29, 2019, 3:59pm

Oban v0.3.0 has been released. It includes a number of fixes and improvements that came directly from the conversations in this thread, so thank you all!

Directly from the CHANGELOG:

Added

[Oban] Allow setting queues: false or queues: nil to disable queue dispatching altogether. This makes it possible to override the default configuration within each environment, i.e. when testing.

The docs have been updated to promote this mechanism, as well as noting that pruning must be disabled for testing. (@yogodoshi)

[Oban.Testing] The new testing module provides a set of helpers to make asserting and refuting enqueued jobs within tests much easier. (@bamorim)

Changed

[Oban.Migrations] Explicitly set id as a bigserial to avoid mistakenly generating a uuid primary key. (@arfl)

[Oban.Migrations] Use versioned migrations that are immutable. As database changes are required a new migration module is defined, but the interface of Oban.Migrations.up/0 and Oban.Migrations.down/0 will be maintained.

From here on all releases with database changes will indicate that a new migration is necessary in this CHANGELOG. (@LostKobrakai)

[Oban.Query] Replace use of (bigint) with (int, int) for advisory locks. The first int acts as a namespace and is derived from the unique oid value for the oban_jobs table. The oid is unique within a database and even changes on repeat table definitions.

This change aims to prevent lock collision with application level advisory lock usage and other libraries. Now there is a 1 in 2,147,483,647 chance of colliding with other locks. (@benwilson512)

[Oban.Job] Automatically remove leading “Elixir.” namespace from stringified worker name. The prefix complicates full text searching and reduces the score for trigram matches.

Note: When upgrading a migration is required.

v0.3.0 Docs
Testing Changes

jc00ke · May 30, 2019, 9:49pm

I’d love to use the UI as well, and contribute where I can.

@sorentwo any plans on releasing a preview soon?

Thanks for your work on this: I just switched some stuff over to Oban in a couple of hours.

Any plans on adding batches, like Sidekiq’s batches? I think this is one of the most powerful abstractions I’ve used for years.

sorentwo · May 31, 2019, 2:10pm

That’s great to hear! I’m glad it was a smooth process.

Yes, I hope to have a preview version of the UI ready by mid June. There are some essential features that are lacking currently. Once those are implemented and a few bugs are worked out it will be ready to try.

There are a few other features that I plan on tackling first (most of which overlap with the Sidekiq Enterprise feature set):

Expiring Jobs
Periodic Jobs (like cron jobs)
Rate Limiting
Dampeners (automatic queue scaling based on mem/cpu usage)

Batches are a great addition to the list!