If it doesn’t change the code much, it could be something you list as a suggestion if available.
This is purely a comparison of how the libraries are structured and their features. Between building Kiq and Oban I investigated nearly every library in the ecosystem (EctoJob, Exq, Honeydew, Que, Rihanna and Verk). I learned something from each one and owe all of the authors a debt of gratitude.
There are far too many differences between the various libraries to summarize them here. Therefore I’m going to cheat a bit and make a table highlighting the differences between only the libraries you asked about.
| Feature | Oban | EctoJob | Rihanna |
|---|---|---|---|
| perform return | output doesn’t matter | multi transaction | success tuple |
| job scheduling | triggered, polled | triggered, polled | polled |
| args storage | jsonb | jsonb | erlang terms |
| error retention | all full historic errors | none | last error |
| execution time | unlimited | configured timeout | unlimited |
| orphaned jobs | rescued, guarded by locks | inside a transaction | guarded by locks |
| queues | multiple with single table | one per table | single |
| queue limits | configured limit per queue | configured limit per queue | configured globally |
| queue changes | pause, resume, scale | none | none |
| graceful shutdown | worker draining | no | no |
| job cancelling | yes | no | no |
| runtime metrics | with telemetry | no | no |
| historic metrics | with retained jobs | no | no |
| integrations | with telemetry, pubsub | no | no |
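For context on the “perform return” row, a minimal Oban worker looks roughly like the sketch below. The exact `use Oban.Worker` options and `perform` signature have shifted between versions, and `MyApp.Mailer` is a made-up module, so treat this as illustrative only.

```elixir
defmodule MyApp.Workers.Mailer do
  # Hypothetical worker; the queue name is an arbitrary example.
  use Oban.Worker, queue: "mailers"

  @impl true
  def perform(%{"email" => email}) do
    # Oban ignores the return value — only a raised exception or crash
    # marks the job as failed and schedules a retry.
    MyApp.Mailer.deliver(email)
  end
end
```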
This is all based on my understanding of the other libraries through docs, issues and source code. It may not be entirely accurate! If I got anything wrong please let me know (@mbuhot @lpil)
Congrats on this excellent work
I want to call your attention to the fact that you are leaking emails from your real customers in the screenshot of the web UI, and that can cause you legal issues, like with GDPR.
Can you elaborate on the usage of advisory locks? One serious downside to advisory locks is that they are effectively a shared namespace. We use them within our own application, and it’s entirely possible that the ids in the job row will conflict with the values locked for some other purpose.
Ah, I should have mentioned that everything in the dashboard is generated from “faker” data. There isn’t anything sensitive in the screenshot—in fact the only real part is that my laptop is called SorenBook =)
The host app where the UI live view is mounted generates a constant stream of fake jobs in various queues. This has been really helpful for testing with pseudo-production data.
Very cool! A few notes on Rihanna:
- The latest error is retained in the job’s database record.
- Orphaned jobs are not possible: when a node goes down the lock is released and the jobs are picked up on the next poll.
- Jobs can be deleted, though outside the Rihanna UI project this feature isn’t made overly accessible.
- Non-global configuration, graceful shutdown, and multiple queues within a single table are all pretty straightforward to add and have open issues. If there’s demand for them they can be implemented.
Rihanna uses them and allows the user to specify a custom namespace. I would prefer it if Postgres allowed a wider range of advisory locks though!
Is there any estimate of when the Dashboard will be available? Not trying to be pushy, only curious as that is such an awesome feature/addon.
I really like how you’ve designed this from a user (/developer) point of view, and will definitely be adding it to the app I’m currently building. I’m currently using Que (and like it) but I do have some use cases for a DB backed job queue which will necessitate switching.
Right, there are some major limitations to advisory locks. Part of the reason I used `bigint` primary keys instead of `guid` was so that they could double as lock values. For `int` you can use the `pg_try_advisory_lock(key1 int, key2 int)` variant and essentially namespace the lock. That doesn’t work with `bigint` though; in that case we only have `pg_try_advisory_lock(key bigint)`. Initially I used the two-`int` variety with the table `oid` and truncated the id to 32 bits, but it seemed messy so I ditched it for a simple `bigint` lock.
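To make the two-`int` variant concrete, here is a minimal sketch (not Oban’s actual implementation; the repo, namespace value, and function are all assumptions):

```elixir
# A minimal sketch, not Oban's actual code. Assumes MyApp.Repo and picks
# an arbitrary integer namespace for the first lock key.
defmodule MyApp.Locks do
  @namespace 1234  # hypothetical namespace, chosen by the application

  # Try to take a session-level advisory lock for a job id. The id is
  # truncated to 32 bits because pg_try_advisory_lock(int, int) only
  # accepts 4-byte keys.
  def try_lock(repo, job_id) when is_integer(job_id) do
    key = rem(job_id, 2_147_483_647)

    case repo.query("SELECT pg_try_advisory_lock($1, $2)", [@namespace, key]) do
      {:ok, %{rows: [[true]]}} -> :ok
      {:ok, %{rows: [[false]]}} -> :locked
      {:error, error} -> {:error, error}
    end
  end
end
```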
That’s a nice solution, definitely something I should look into implementing! For Oban it was critical that jobs use `bigint` instead of `int`, or at least have a mechanism to grow beyond 32 bits.
Sorry, somehow I missed that. I’ll update the table accordingly.
Good to know. That’s an important use-case that I misinterpreted based on this statement in the README:
> One thing to be aware of is that if you restart your application (e.g. because you deployed) then all running jobs on that node will be exited. For this reason it is probably sensible not to make your jobs take an extremely long time.
I’ll update the table for this one too.
The bit about “job cancelling” refers to cancelling jobs that are currently running more so than deleting them. Though it is funny, with a Postgres backed queue it is really easy to remove a job. In a Redis backed queue it is a major pain!
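As a trivial illustration of that point, removing a row from a Postgres backed queue is a one-liner (an Ecto sketch against an assumed `oban_jobs` table, not an Oban API):

```elixir
import Ecto.Query

job_id = 123  # example id of the job to remove

# Delete the enqueued job directly from the table; purely illustrative.
MyApp.Repo.delete_all(from(j in "oban_jobs", where: j.id == ^job_id))
```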
Makes sense, just trying to understand the role the advisory locks play. Is there a writeup within the repo I’m missing?
Ah, now I follow. I don’t think there is a comprehensive description of how the advisory locks are used within the repo. It is strewn between the query module and some of the integration tests.
Essentially advisory locks are used to keep track of which jobs are actively executing and which ones are in the `executing` state but actually belong to a dead node.
Earlier in this thread I described it in some detail. I’ll get that worked into the README or primary module docs.
Ah, I see now: if the node that first tries the job goes down while the job is in progress, other nodes can determine that the job is abandoned by observing that there is a job in the `executing` state that lacks a corresponding lock.
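Conceptually the check might look like the sketch below; the exact way Oban maps lock keys to `pg_locks` columns is an assumption here, not a copy of its query:

```elixir
# Find jobs stuck in the "executing" state whose advisory lock is no
# longer held by any connection (i.e. the node died). The mapping of the
# job id onto pg_locks.objid is assumed, not Oban's exact scheme.
orphaned_sql = """
SELECT j.id
FROM oban_jobs AS j
LEFT JOIN pg_locks AS l
  ON l.locktype = 'advisory'
 AND l.objid = (j.id % 2147483647)::oid
WHERE j.state = 'executing'
  AND l.objid IS NULL
"""

{:ok, %{rows: rows}} = MyApp.Repo.query(orphaned_sql)
orphaned_ids = List.flatten(rows)
```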
I wonder if this could be achieved by a dedicated table that tracks cluster membership. Each node, when started, inserts a new row into a `nodes` table that contains basically just an `id`, the node name for debugging, and an `alive_at` timestamp. The node needs a process that updates the timestamp at some regular interval. When you cut a job, record the `id` of the node row. Abandoned jobs are any job rows where the `alive_at` timestamp on the corresponding node row is older than `now() - interval` by some appreciable amount. You can garbage collect the `nodes` table when the `alive_at` value is old and there are no associated jobs in the running or queued states.
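A rough sketch of the heartbeat half of that idea (every module, table, and column name here is hypothetical, it isn’t part of Oban):

```elixir
defmodule MyApp.NodeHeartbeat do
  @moduledoc """
  Sketch of the heartbeat process described above. The "nodes" table and
  all names here are hypothetical.
  """
  use GenServer

  @interval :timer.seconds(15)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl GenServer
  def init(_opts) do
    send(self(), :beat)
    {:ok, %{name: to_string(node())}}
  end

  @impl GenServer
  def handle_info(:beat, state) do
    # Upsert this node's row and refresh alive_at. Assumes a unique index
    # on nodes.name so ON CONFLICT can match.
    MyApp.Repo.query!(
      "INSERT INTO nodes (name, alive_at) VALUES ($1, now()) " <>
        "ON CONFLICT (name) DO UPDATE SET alive_at = now()",
      [state.name]
    )

    Process.send_after(self(), :beat, @interval)
    {:noreply, state}
  end
end
```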
Thoughts?
P.S.
To be clear, I don’t have any concerns about the correctness of the current implementation. However, I’m just in the unfortunate position of already using advisory locks on 64-bit auto-incrementing ids. I think advisory locks can be reasonably used by people’s application logic, but in a library it’s tough because it is easy to become incompatible with any other library that also uses them.
Precisely
I’m sure it could, this is essentially how Redis based tools work. They use two lists (sidekiq, verk) or a list and a hash (kiq) to hold a backup of each job while it executes.
What you’ve described is exactly how sidekiq and kiq handle tracking live nodes. It definitely works in that environment. I eschewed that in favor of advisory locks and pubsub to simplify bookkeeping.
This is an excellent point. Based on your points and some of the other discussion in this thread I’m moving to the “namespaced” double int version.
Thanks for all the thought and feedback.
This seems like it would work, just make sure to let people pick the namespace so that they can ensure it doesn’t happen to be the same namespace they’ve picked for some other thing. This is also important if you had multiple copies of Oban running on the same database for some reason.
I just saw the docs for implementing migrations for oban:
```elixir
defmodule MyApp.Repo.Migrations.AddObanJobsTable do
  use Ecto.Migration

  defdelegate up, to: Oban.Migrations
  defdelegate down, to: Oban.Migrations
end
```
Generally I’m an advocate for immutable migrations, so I’m wondering what would happen if you release a new version with different database needs. Is everything in there idempotent so I can at least add another migration doing the same? But really I’d prefer something where I can lock my migration to a certain version of your migration script.
Excellent point. While the library is pre-1.0 I hadn’t expected the migrations to be treated as immutable, but I can see how that is a concern.
I’m modifying the migration mechanism to support versions for the next release. There won’t be any breaking migrations, though you may need to update some names in the older migrations.
I’ve made some changes to namespace the advisory locks. The namespace is based on the `oid` of the `oban_jobs` table, which is unique per database (and actually changes if you create/drop a table repeatedly). This may not be the final solution as I’m working to make the prefix and possibly the table name more flexible, but it eliminates the likely intersection with application level advisory locks.
Here is the commit for the curious. There is a bit more detail in the comments and CHANGELOG: https://github.com/sorentwo/oban/commit/461060fa6bfbdbed7d0aa7594277ad83b7b22a51
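As a small illustration of where that namespace comes from, the table `oid` can be looked up with a plain query. A minimal sketch, assuming the default `oban_jobs` table name and a single schema:

```elixir
# Fetch the oid of the oban_jobs table to use as the advisory lock
# namespace. The oid is stable for the life of the table, but changes
# if the table is dropped and recreated.
{:ok, %{rows: [[oid]]}} =
  MyApp.Repo.query("SELECT oid FROM pg_class WHERE relname = 'oban_jobs'")
```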
Oban v0.3.0 has been released. It includes a number of fixes and improvements that came directly from the conversations in this thread, so thank you all!
Directly from the CHANGELOG:
Added

- [Oban] Allow setting `queues: false` or `queues: nil` to disable queue dispatching altogether. This makes it possible to override the default configuration within each environment, i.e. when testing. The docs have been updated to promote this mechanism, as well as noting that pruning must be disabled for testing. (@yogodoshi)
- [Oban.Testing] The new testing module provides a set of helpers to make asserting and refuting enqueued jobs within tests much easier. (@bamorim)

Changed

- [Oban.Migrations] Explicitly set `id` as a `bigserial` to avoid mistakenly generating a `uuid` primary key. (@arfl)
- [Oban.Migrations] Use versioned migrations that are immutable. As database changes are required a new migration module is defined, but the interface of `Oban.Migrations.up/0` and `Oban.Migrations.down/0` will be maintained. From here on all releases with database changes will indicate that a new migration is necessary in this CHANGELOG. (@LostKobrakai)
- [Oban.Query] Replace use of `(bigint)` with `(int, int)` for advisory locks. The first `int` acts as a namespace and is derived from the unique `oid` value for the `oban_jobs` table. The `oid` is unique within a database and even changes on repeat table definitions. This change aims to prevent lock collision with application level advisory lock usage and other libraries. Now there is a 1 in 2,147,483,647 chance of colliding with other locks. (@benwilson512)
- [Oban.Job] Automatically remove the leading “Elixir.” namespace from the stringified worker name. The prefix complicates full text searching and reduces the score for trigram matches.

Note: When upgrading a migration is required.
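To illustrate the `queues: false` option mentioned in the changelog above, disabling dispatch for the test environment might look roughly like this; how the config is read and passed to Oban’s supervisor depends on the application, so treat it as a sketch:

```elixir
# config/test.exs — assumes the application passes this config to Oban
# when starting it in the supervision tree.
use Mix.Config

config :my_app, Oban,
  repo: MyApp.Repo,
  queues: false
```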
I’d love to use the UI as well, and contribute where I can.
@sorentwo any plans on releasing a preview soon?
Thanks for your work on this: I just switched some stuff over to Oban in a couple of hours.
Any plans on adding batches, like Sidekiq’s batches? I think this is one of the most powerful abstractions I’ve used for years.
That’s great to hear! I’m glad it was a smooth process.
Yes, I hope to have a preview version of the UI ready by mid June. There are some essential features that are lacking currently. Once those are implemented and a few bugs are worked out it will be ready to try.
There are a few other features that I plan on tackling first (most of which overlap with the Sidekiq Enterprise feature set):
- Expiring Jobs
- Periodic Jobs (like cron jobs)
- Rate Limiting
- Dampeners (automatic queue scaling based on mem/cpu usage)
Batches are a great addition to the list!