It is kind of by design. I have a separate package called “oban_cron” that I haven’t published yet, but it does exactly what you’re asking about. I’ve debated making it part of oban itself, but it works nicely as a separate package.
Quantum is really powerful, but it doesn’t make sense in every situation IMO. There are a couple of downsides that I hoped to address with a focused package:
Work without connected nodes. Quantum relies on node communication to elect a leader and prevent double-enqueuing jobs.
Have a smaller footprint. Quantum is a lot of code when all you want to specify is {"0 14 * * MON", MyApp.SomeWorker} in some configuration.
I remember in the Rails days using Sidekiq but then also lugging around Clockwork just for scheduling. When glancing at your readme, one of the first things I looked for was support for repeated scheduled tasks.
I wonder what others think about putting oban_cron into the main repo.
Informally I’ll consider “hearts” on this post to be in favor of including it directly in Oban. If anybody has arguments against inclusion please post!
I think that it would be a great addition to Oban.
The current version of Quantum is unstable when clustering in certain cases. We’ve experienced some of them at the company I work for, which caused undesirable effects such as tasks not running at all or running once on every node in the cluster.
This is acknowledged by the Quantum maintainers, and they are working on a solution.
Oban has been rock solid and reliable since we started using it in the project. So it could be a wonderful alternative.
Nice work. Looking forward to replacing Quantum with this. Just glancing at your commit made me really happy. I had nothing against Quantum, but being able to drop a whole library with a ton of code for a few dozen lines of code is a huge win.
One question. In your new docs you wrote “Jobs are considered unique for most of each minute”. What race conditions or edge cases should we be aware of that could make a job non-unique?
Jobs are marked as unique for 59 seconds, not 60 seconds. There has to be some wiggle room between the unique period and the next enqueue cycle. That leaves a one-second window where a theoretical double-enqueue is possible. That situation would only happen if you could restart your node fast enough that it enqueued a job at the start of one second, booted back up, and enqueued again at the end of that same second.
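For illustration, the same 59-second window can be expressed with Oban’s unique option on a hand-written worker. This is only a sketch; the module name is hypothetical, and the perform/2 signature assumes the 0.x-era worker callback:

```elixir
defmodule MyApp.SomeWorker do
  # Jobs are unique for 59 seconds, leaving a one-second gap
  # before the next minute's enqueue cycle.
  use Oban.Worker, queue: :default, unique: [period: 59]

  @impl Oban.Worker
  def perform(_args, _job), do: :ok
end
```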
If I’m understanding correctly, this supports running a job on a single node in a cluster periodically.
It might be beyond the scope of Oban, but is there a way to run a job on every node in a cluster periodically? I’ve used Quantum that way for things like refreshing credentials periodically.
Oban v0.11.0 is published with a variety of bug fixes and the addition of CRON jobs. There is an optional migration that will prevent issues recording beats when job ids grow beyond the 32-bit integer range.
Anybody using the UI beta should upgrade; the oban_update notification change is essential to keeping stats updated.
Job ids greater than 2,147,483,647 (the Postgres int limit) can’t be inserted into the running array on oban_beats. The array column that Ecto defines uses int rather than bigint, which can’t store the larger values. This migration changes the column type to bigint[], a locking operation that may take a few seconds.
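The optional migration ships with Oban, but the equivalent column change can be sketched by hand, assuming the standard oban_beats table (the module name here is illustrative):

```elixir
defmodule MyApp.Repo.Migrations.WidenObanBeatsRunning do
  use Ecto.Migration

  def change do
    alter table(:oban_beats) do
      # bigint[] can hold job ids beyond the 32-bit integer limit
      modify :running, {:array, :bigint}
    end
  end
end
```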
Added
[Oban] Added crontab support for automatically enqueuing jobs on a fixed schedule. A combination of transactional locks and unique jobs prevents scheduling duplicate jobs.
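Putting the pieces of the thread together, a minimal crontab setup looks roughly like this. The app, repo, and worker names are assumptions drawn from the examples above:

```elixir
# In config/config.exs
config :my_app, Oban,
  repo: MyApp.Repo,
  queues: [default: 10],
  crontab: [
    # Run MyApp.SomeWorker every Monday at 14:00
    {"0 14 * * MON", MyApp.SomeWorker}
  ]
```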
Fixed
[Oban.Migrations] Add a comment when migrating oban_jobs to V5 and when rolling back down to V4.
[Oban.Query] Apply the configured log level to unique queries.
[Oban.Notifier] Prevent open connections from accumulating when the circuit is tripped during the connection phase. This change may leave notifications in a state where they aren’t listening to all channels.
Changed
[Oban.Notifier] Replay oban_update notifications to subscribed processes.
Oban v0.12.0 is out with some fun features, testing improvements, bug fixes and a helpful (optional) migration for large pruning operations. Thanks to all of the contributors who made this one possible!
The queries used to prune by limit and age are written to utilize a single partial index for a huge performance boost on large tables. The new V7 migration will create the index for you—but that may not be ideal for tables with millions of completed or discarded jobs because it can’t be done concurrently.
If you have an extremely large jobs table you can add the index concurrently in a dedicated migration:
defmodule MyApp.Repo.Migrations.AddObanJobsAttemptedAtIdIndex do
  use Ecto.Migration

  # Concurrent index builds can't run inside a transaction
  @disable_ddl_transaction true

  def change do
    create index(:oban_jobs, ["attempted_at desc", :id],
             where: "state in ('completed', 'discarded')",
             name: :oban_jobs_attempted_at_id_index,
             concurrently: true
           )
  end
end
Added
[Oban] Add start_queue/3 and stop_queue/2 for dynamically starting and stopping supervised queues across nodes.
[Oban] Expose circuit_backoff as a “twiddly” option that controls how long tripped circuit breakers wait until re-opening.
[Oban.Testing] Accept a value/delta tuple for testing timestamp fields. This allows more robust testing of timestamps such as scheduled_at.
[Oban.Telemetry] Emit [:oban, :trip_circuit] and [:oban, :open_circuit] events for circuit breaker activity. Previously an error was logged when the circuit was tripped, but there wasn’t any way to monitor circuit breakers.
Circuit breaker activity is logged by the default telemetry logger (both :trip_circuit and :open_circuit events).
Fixed
[Oban.Query] Avoid using prepared statements for all unique queries. This forces Postgres to use a custom plan (which utilizes the compound index) rather than falling back to a generic plan.
[Oban.Job] Include all permitted fields when converting a Job to a map, preserving any optional values that were either specified by the user or came via Worker defaults.
[Oban.Migrations] Guard against missing migration modules in federated environments.
Changed
[Oban] Allow the multi name provided to Oban.insert/3,4 to be any term, not just an atom.
[Oban.Query] Use a consistent and more performant set of queries for pruning. Both pruning methods are optimized to utilize a single partial index.
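To illustrate the Oban.insert/3 change, here is a sketch of enqueuing a job inside an Ecto.Multi using a tuple rather than an atom as the multi name. The schema, changeset, and worker are hypothetical:

```elixir
alias Ecto.Multi

Multi.new()
|> Multi.insert(:user, user_changeset)
# The multi name can now be any term, such as this tuple
|> Oban.insert({:job, :welcome_email}, MyApp.SomeWorker.new(%{}))
|> MyApp.Repo.transaction()
```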
Redis is excellent in many capacities but persistent storage isn’t one of them. When background jobs are doing work that is important to your business then you should treat them like the rest of your data.
You’d have to do a massive amount of background work to overload a production PG database. Redis is fast, but it is single-threaded, and when you do complex work on the server (e.g. with Lua scripts) it can get overloaded as well—a medium-traffic system I work on has seen 70k+ timeout errors from Redis over the past several months. Application-level pooling doesn’t help much either, because the server on the other end is still single-threaded.
This one might be application specific. I am cautious about extracting meaning about Redis based on this. I do agree, though, that having persistence through Postgres for jobs is really useful and is one of the best parts of using PG for a job server. My application does about 10k+ inserts/job processes per second to Redis (through Sidekiq LUA scripts) at peak and I haven’t seen these timeouts before.
Application pooling is pretty significant with Redis, and I would say is almost required for production usage, because it parallelizes the network traffic with Redis. The work is still single-threaded, but the network time is significantly reduced. Here is an image that demonstrates the same amount of work in a single-connection vs pooled Redis environment. The pooling would have a significant impact here.
None of this has much to do with Oban, though. Postgres works great for 99% of applications, and could be tuned with something like Citus for extremely high-throughput environments. Many people would rather not add another tool to their stack, so they’ll choose Oban.
Totally fair. That is definitely application specific and Redis is being used as the glue between a lot of things in that system (in addition to Sidekiq LUA and Kiq LUA scripts). I only meant to illustrate that Redis can have timeouts too when it gets enough load.