Oban — Reliable and Observable Job Processing

I had a feeling you might ask something about that. The moment I added unlocked job ID tracking in producers, I thought about that discussion.

An alternative that doesn’t generate a lot of garbage (dead tuples) and automatically cleans up on disconnect would be fantastic. Neither of those seems possible purely with userland tables. Maybe the garbage wouldn’t be that big an issue, though. I would need to benchmark to find out.

1 Like

I mean, advisory locks are ultimately rows too. Managing a job ownership table isn’t gonna produce a whole lot more garbage than dead advisory lock rows. Both will produce some, but ultimately we need to just trust VACUUM to do its job properly. It may also be possible to simply put the node information in the job row itself; I need to think through the details some.

EDIT: One option that would help with this specific issue is to just have a single dedicated process per node hold the locks. That probably isn’t performant enough, however. I can get away with it in Fable because locks are only acquired on boot and then held for the whole time the node is up. The locks here are per job, right?

2 Likes

Right, they are taken whenever a job is moved to the executing stage.
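
Conceptually it’s nothing more than taking a per-job advisory lock keyed on the job’s id when it starts executing. A rough sketch of that technique, not the actual implementation (`MyApp.Repo` is a placeholder Ecto repo):

```elixir
defmodule MyApp.JobLocks do
  # Sketch only: a session-level advisory lock keyed on the job id.
  alias MyApp.Repo

  # Returns true if this connection acquired the per-job lock.
  def acquire(job_id) when is_integer(job_id) do
    %{rows: [[locked?]]} = Repo.query!("SELECT pg_try_advisory_lock($1)", [job_id])
    locked?
  end

  # Releases the lock. This only works on the connection that acquired it,
  # which is exactly the complication discussed later in this thread.
  def release(job_id) when is_integer(job_id) do
    %{rows: [[released?]]} = Repo.query!("SELECT pg_advisory_unlock($1)", [job_id])
    released?
  end
end
```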

An important additional note/requirement is that the solution works across languages. I will implement Ruby and Python versions (and possibly others), and they all need to be compatible with each other.

Ruby / Python job runners? Or do they simply need to be able to enqueue jobs?

EDIT: I guess it doesn’t really matter, all of this is just bare Postgres anyway. Honestly, a set of node tracking tables is probably easier to manage than the advisory locks in other languages, because it’s hard to deal with the locks when you start using something like PgBouncer.

1 Like

Full job runners, meaning they would take locks, gossip, everything.

Enqueuing jobs is really straightforward, but that only works when all of the jobs are in Elixir.

I’ve released Oban v0.8.0, AKA “Ben Wilson Edition”. This is a pretty big release that contains one critical change: the introduction of a table to track heartbeats, dropping the use of advisory locks entirely. This will reduce the amount of log noise (rampant warnings) that we all see, and it improves the reliability of job recovery as well.

If all goes well this is the last big set of changes before 1.0.0 :sunny:

From the CHANGELOG

Added

  • [Oban.Job] Add an attempted_by column used to track the node, queue and producer nonce that attempted to execute a job.

  • [Oban.Beat] Add a new oban_beats table, used by producers to publish “pulse” information including the node, queue, running jobs and other information
    previously published by gossip notifications.

    Beat records older than one hour are pruned automatically. Beat pruning respects the :disabled setting, but it ignores length and age configuration (a configuration sketch follows the changelog). The goal is to prevent bloating the table with useless historic information: each queue generates one beat a second, or 3,600 beat records per hour, even when the queue is idle.

Changed

  • [Oban.Executor] Don’t swallow an ArgumentError when raised by a worker’s backoff function. @balexand

  • [Oban.Notifier] Remove gossip notifications entirely, superseded by pulse activity written to oban_beats.

  • [Oban.Query] Remove all use of advisory locks!

  • [Oban.Producer] Periodically attempt to rescue orphans, not just at startup. By default a rescue is attempted once a minute and it checks for any executing jobs belonging to a producer that hasn’t had a pulse for more than a minute.

Fixed

  • [Oban.Worker] Validate worker options after the module is compiled. This allows dynamic configuration of compile time settings via module attributes, functions, Application.get_env/3, etc.

  • [Oban.Query] Remove scheduled_at check from job fetching query. This could prevent available jobs from dispatching when the database’s time differed from the system time. @seanwash

  • [Oban.Migrations] Fix off-by-one error when detecting the version to migrate up from.
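
For reference, pruning in the 0.x series is configured through the top-level :prune option. The values below are from memory, so double-check them against the docs for your version:

```elixir
# config/config.exs: a rough sketch of an 0.x-era Oban config. Per the changelog
# above, beat pruning only honors `prune: :disabled` and ignores the length/age
# limits, which apply to jobs rather than to `oban_beats` rows.
import Config

config :my_app, Oban,
  repo: MyApp.Repo,
  queues: [default: 10, mailers: 5],
  prune: {:maxlen, 100_000}
  # or prune: {:maxage, 60 * 60 * 24} # age in seconds
  # or prune: :disabled               # turns off job pruning and beat pruning
```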

8 Likes

How different is this compared to using something like Broadway?

:heart_eyes:

Oban in general? I mean, it’s a persistent job queue in the classical sense, whereas Broadway is more of a data processing pipeline. The difference is that each item in a job queue is usually a discrete piece of work, whereas individual items in a data processing pipeline are usually part of some data stream. This has a lot of implications with respect to API, implementation, performance, etc.
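
To make the “discrete piece of work” part concrete, a job is roughly a module plus an args map. Treat this as a sketch: the module names are made up and the perform callback’s signature has shifted a bit between Oban versions.

```elixir
defmodule MyApp.Workers.Mailer do
  # Each enqueued job is one discrete unit of work with its own args, retries
  # and state, unlike an item flowing through a Broadway pipeline.
  use Oban.Worker, queue: "mailers", max_attempts: 3

  def perform(%{"email" => email}) do
    # `MyApp.Mailer.deliver/1` is a placeholder for the real work.
    MyApp.Mailer.deliver(email)
  end
end

# Enqueue a single job; it is persisted as a row in `oban_jobs`.
%{"email" => "user@example.com"}
|> MyApp.Workers.Mailer.new()
|> Oban.insert()
```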

3 Likes

Is this (heartbeat/pulse tracking) an established/well-known technique? I tried to google it but didn’t find relevant results (btw, the search query “postgres pulse tracking” yields the Oban GitHub repo as the first hit :slight_smile:)

2 Likes

Many systems use heartbeats to indicate that they are still alive (e.g. Phoenix Channels), but they aren’t typically persistent. The technique used in Oban is derived from how Kiq/Sidekiq use heartbeats to track activity. Those libraries are built on Redis and use a complex system of hash objects and timeouts, while Oban is able to use a single table for tracking.
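
If it helps to see the shape of it, here’s a conceptual sketch of persistent heartbeats. This is not Oban’s actual code; the module, repo and table names are all made up.

```elixir
defmodule MyApp.Heartbeat do
  # Conceptual sketch: one process per queue records a "beat" row every second.
  # Anything that stops beating for long enough can be presumed dead.
  use GenServer

  @interval :timer.seconds(1)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl GenServer
  def init(opts) do
    schedule_beat()
    {:ok, %{node: to_string(node()), queue: Keyword.fetch!(opts, :queue)}}
  end

  @impl GenServer
  def handle_info(:beat, state) do
    # `my_beats` is a hypothetical table with node, queue and inserted_at columns.
    MyApp.Repo.query!(
      "INSERT INTO my_beats (node, queue, inserted_at) VALUES ($1, $2, now())",
      [state.node, state.queue]
    )

    schedule_beat()
    {:noreply, state}
  end

  defp schedule_beat, do: Process.send_after(self(), :beat, @interval)
end
```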

2 Likes

Since I started using Oban in a couple of apps, my Postgres logs have grown by tens of gigabytes per day. I suppose it may be due to the heartbeat. The logs actually consumed all disk space and crashed one of my servers. Now that I’m aware of it, it’s easy to prevent. Oban works great, but pay attention to the size of your Postgres logs!

1 Like

The log size issue was due to advisory locks, which are no longer used, largely for that exact reason!

Advisory locks themselves are very lightweight, but they are owned by the connection that took the lock initially. When any other connection attempts to release the lock, Postgres logs a big fat WARNING. This situation happens constantly because the connection checked out from the db pool is random: the larger the pool, the less likely it is that the checked-out connection owns the advisory lock.
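
You can reproduce the warning outside of Oban with nothing but a pooled repo (`MyApp.Repo` here is a placeholder and the lock key 42 is arbitrary):

```elixir
# Each query checks out whichever connection the pool hands back, so the
# unlock usually runs on a connection that never owned the lock.
MyApp.Repo.query!("SELECT pg_advisory_lock($1)", [42])

# ...later, almost certainly on a different pooled connection...
MyApp.Repo.query!("SELECT pg_advisory_unlock($1)", [42])
# Postgres log: WARNING:  you don't own a lock of type ExclusiveLock
```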

The new system writes rows to an oban_beats table and operates warning free. Definitely upgrade to 0.8.0 if you haven’t already!

6 Likes

Aah, I didn’t read the notice! Just upgraded. From midnight to 06:00 with the old Oban, Postgres generated 13 GB of log data. After upgrading and watching the system for 30 minutes with two live apps, I see zero heartbeat data in the Postgres logs. Yahoo! :slightly_smiling_face:

3 Likes

Really pleased to hear that it worked as expected! :yellow_heart:

As a side benefit, the new approach is more reliable: if a producer crashes, the jobs can be recovered right away on startup, whereas with advisory locks they wouldn’t be recovered until the system restarted. (I wrote about that in more detail on Slack in a thread with @chrismccord, but it isn’t available anymore.)

Same! Just FYI here’s a screenshot from one of my Oban apps - a WIP cron-style job scheduler used internally. Oban and LiveView work well together!

Here’s the source: GitHub - andyl/jobex: Cron-like Workflow

9 Likes

Is there a way to use Oban.insert in a multi with function passing, so the job can be built from a previous insert’s result?

1 Like

Calling it within Ecto.Multi.run should work just fine.

1 Like

@sorentwo is there a way to check out the Web UI that is in development?

1 Like

@LostKobrakai is correct, you can call it with Multi.run just fine. That’s all that Oban’s Multi support is really doing anyhow.
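
For anyone who finds this later, a sketch of the Multi.run approach. The module names are made up; the only real APIs used are Ecto.Multi, the worker-generated new/1 and Oban.insert/1:

```elixir
defmodule MyApp.Accounts do
  alias Ecto.Multi

  # `MyApp.Workers.WelcomeEmail` is a hypothetical Oban worker.
  def create_user_with_welcome_email(user_changeset) do
    Multi.new()
    |> Multi.insert(:user, user_changeset)
    |> Multi.run(:welcome_job, fn _repo, %{user: user} ->
      # The previous step's result is available here, so the job args can
      # depend on it. Oban.insert/1 returns {:ok, job} or {:error, changeset},
      # which satisfies Multi.run's contract.
      %{"user_id" => user.id}
      |> MyApp.Workers.WelcomeEmail.new()
      |> Oban.insert()
    end)
    |> MyApp.Repo.transaction()
  end
end
```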

Okay, thank you. That is what I was doing, and it’s working fine, but I wasn’t sure if that was the correct method.