Oban — Reliable and Observable Job Processing

I had a feeling you might ask something about that. The moment I added unlocked job ID tracking in producers, I thought about that discussion.

An alternative that doesn’t generate a lot of garbage (dead tuples) and automatically cleans up on disconnect would be fantastic. Neither of those seems possible purely with userland tables. Maybe the garbage wouldn’t be that big an issue, though. I would need to benchmark to find out.

1 Like

I mean, advisory locks are ultimately rows too. Managing a job ownership table isn’t gonna produce a whole lot more garbage than dead advisory lock rows. Both will produce some, but ultimately we need to just trust VACUUM to do its job properly. It may also be possible to simply put the node information in the job row itself; I need to think through the details some.

EDIT: One option that would help with this specific issue is to just have a single dedicated process per node hold the locks. That probably isn’t performant enough, however. I can get away with it in Fable because locks are only acquired on boot and then held for the whole time the node is up. The locks here are per job, right?

2 Likes

Right, they are taken whenever a job is moved to the executing stage.
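
Conceptually it’s nothing more than taking a per-job advisory lock keyed on the job’s id when it starts executing. A rough sketch of that technique, not the actual implementation (`MyApp.Repo` is a placeholder Ecto repo):

```elixir
defmodule MyApp.JobLocks do
  # Sketch only: a session-level advisory lock keyed on the job id.
  alias MyApp.Repo

  # Returns true if this connection acquired the per-job lock.
  def acquire(job_id) when is_integer(job_id) do
    %{rows: [[locked?]]} = Repo.query!("SELECT pg_try_advisory_lock($1)", [job_id])
    locked?
  end

  # Releases the lock. This only works on the connection that acquired it,
  # which is exactly the complication discussed later in this thread.
  def release(job_id) when is_integer(job_id) do
    %{rows: [[released?]]} = Repo.query!("SELECT pg_advisory_unlock($1)", [job_id])
    released?
  end
end
```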

An important additional note/requirement is that the solution works across languages. I will implement Ruby and Python versions (and possibly others), and they all need to be compatible with each other.

Ruby / Python job runners? Or do they simply need to be able to enqueue jobs?

EDIT: I guess it doesn’t really matter, all of this is just bare Postgres anyway. Honestly, a set of node tracking tables is probably easier to manage than the advisory locks in other languages, because it’s hard to deal with the locks when you start using something like PgBouncer.

1 Like

Full job runners, meaning they would take locks, gossip, everything.

Enqueuing jobs is really straightforward, but that only works when all of the jobs are in Elixir.

I’ve released Oban v0.8.0, AKA “Ben Wilson Edition”. This is a pretty big release that contains one critical change: the introduction of a table to track heartbeats, dropping the use of advisory locks entirely. This will reduce the amount of log noise (rampant warnings) that we all see, and it improves the reliability of job recovery as well.

If all goes well this is the last big set of changes before 1.0.0 :sunny:

From the CHANGELOG

Added

  • [Oban.Job] Add an attempted_by column used to track the node, queue and producer nonce that attempted to execute a job.

  • [Oban.Beat] Add a new oban_beats table, used by producers to publish “pulse” information including the node, queue, running jobs and other information
    previously published by gossip notifications.

    Beat records older than one hour are pruned automatically. Beat pruning respects the :disabled setting, but it ignores length and age configuration (a configuration sketch follows the changelog). The goal is to prevent bloating the table with useless historic information: each queue generates one beat a second, or 3,600 beat records per hour, even when the queue is idle.

Changed

  • [Oban.Executor] Don’t swallow an ArgumentError when raised by a worker’s backoff function. @balexand

  • [Oban.Notifier] Remove gossip notifications entirely, superseded by pulse activity written to oban_beats.

  • [Oban.Query] Remove all use of advisory locks!

  • [Oban.Producer] Periodically attempt to rescue orphans, not just at startup. By default a rescue is attempted once a minute and it checks for any executing jobs belonging to a producer that hasn’t had a pulse for more than a minute.

Fixed

  • [Oban.Worker] Validate worker options after the module is compiled. This allows dynamic configuration of compile time settings via module attributes, functions, Application.get_env/3, etc.

  • [Oban.Query] Remove scheduled_at check from job fetching query. This could prevent available jobs from dispatching when the database’s time differed from the system time. @seanwash

  • [Oban.Migrations] Fix off-by-one error when detecting the version to migrate up from.
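
For reference, pruning in the 0.x series is configured through the top-level :prune option. The values below are from memory, so double-check them against the docs for your version:

```elixir
# config/config.exs: a rough sketch of an 0.x-era Oban config. Per the changelog
# above, beat pruning only honors `prune: :disabled` and ignores the length/age
# limits, which apply to jobs rather than to `oban_beats` rows.
import Config

config :my_app, Oban,
  repo: MyApp.Repo,
  queues: [default: 10, mailers: 5],
  prune: {:maxlen, 100_000}
  # or prune: {:maxage, 60 * 60 * 24} # age in seconds
  # or prune: :disabled               # turns off job pruning and beat pruning
```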

8 Likes

How different is this compared to using something like Broadway?

:heart_eyes:

Oban in general? I mean, it’s a persistent job queue in the classical sense, whereas Broadway is more of a data processing pipeline. The difference is that each item in a job queue is usually a discrete piece of work, whereas individual items in a data processing pipeline are usually part of some data stream. This has a lot of implications with respect to API, implementation, performance, etc.
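
To make the “discrete piece of work” part concrete, a job is roughly a module plus an args map. Treat this as a sketch: the module names are made up and the perform callback’s signature has shifted a bit between Oban versions.

```elixir
defmodule MyApp.Workers.Mailer do
  # Each enqueued job is one discrete unit of work with its own args, retries
  # and state, unlike an item flowing through a Broadway pipeline.
  use Oban.Worker, queue: "mailers", max_attempts: 3

  def perform(%{"email" => email}) do
    # `MyApp.Mailer.deliver/1` is a placeholder for the real work.
    MyApp.Mailer.deliver(email)
  end
end

# Enqueue a single job; it is persisted as a row in `oban_jobs`.
%{"email" => "user@example.com"}
|> MyApp.Workers.Mailer.new()
|> Oban.insert()
```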

3 Likes

Is this (heartbeat/pulse tracking) an established/well-known technique? I tried to google it but didn’t find relevant results (btw, the search query “postgres pulse tracking” yields the Oban GitHub repo as the first hit :slight_smile:)

2 Likes

Many systems use heartbeats to indicate that they are still alive (e.g. Phoenix Channels), but they aren’t typically persistent. The technique used in Oban is derived from how Kiq/Sidekiq use heartbeats to track activity. Those libraries are built on Redis and use a complex system of hash objects and timeouts, while Oban is able to use a single table for tracking.
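
If it helps to see the shape of it, here’s a conceptual sketch of persistent heartbeats. This is not Oban’s actual code; the module, repo and table names are all made up.

```elixir
defmodule MyApp.Heartbeat do
  # Conceptual sketch: one process per queue records a "beat" row every second.
  # Anything that stops beating for long enough can be presumed dead.
  use GenServer

  @interval :timer.seconds(1)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl GenServer
  def init(opts) do
    schedule_beat()
    {:ok, %{node: to_string(node()), queue: Keyword.fetch!(opts, :queue)}}
  end

  @impl GenServer
  def handle_info(:beat, state) do
    # `my_beats` is a hypothetical table with node, queue and inserted_at columns.
    MyApp.Repo.query!(
      "INSERT INTO my_beats (node, queue, inserted_at) VALUES ($1, $2, now())",
      [state.node, state.queue]
    )

    schedule_beat()
    {:noreply, state}
  end

  defp schedule_beat, do: Process.send_after(self(), :beat, @interval)
end
```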

2 Likes

Since I started using Oban in a couple of apps, my Postgres logs have grown by tens of gigabytes per day. I suppose it may be due to the heartbeat. The logs actually consumed all disk space and crashed one of my servers. Now that I’m aware of it, it’s easy to prevent. Oban works great, but pay attention to the size of your Postgres logs!

1 Like

The log size issue was due to advisory locks, which are no longer used, largely for that exact reason!

Advisory locks themselves are very lightweight, but they are owned by the connection that took the lock initially. When any other connection attempts to release the lock, Postgres logs a big fat WARNING. This situation happens constantly because the connection checked out from the db pool is random: the larger the pool, the less likely it is that the checked-out connection owns the advisory lock.
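
You can reproduce the warning outside of Oban with nothing but a pooled repo (`MyApp.Repo` here is a placeholder and the lock key 42 is arbitrary):

```elixir
# Each query checks out whichever connection the pool hands back, so the
# unlock usually runs on a connection that never owned the lock.
MyApp.Repo.query!("SELECT pg_advisory_lock($1)", [42])

# ...later, almost certainly on a different pooled connection...
MyApp.Repo.query!("SELECT pg_advisory_unlock($1)", [42])
# Postgres log: WARNING:  you don't own a lock of type ExclusiveLock
```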

The new system writes rows to an oban_beats table and operates warning free. Definitely upgrade to 0.8.0 if you haven’t already!

6 Likes

Aah, I didn’t read the notice! Just upgraded. From midnight to 06:00 with the old Oban, Postgres generated 13 GB of log data. After upgrading and watching the system for 30 minutes with two live apps, I see zero heartbeat data in the Postgres logs. Yahoo! :slightly_smiling_face:

3 Likes

Really pleased to hear that it worked as expected! :yellow_heart:

As a side benefit, the new approach is more reliable: if a producer crashes, the jobs can be recovered right away on startup, whereas with advisory locks they wouldn’t be recovered until the system restarted. (I wrote about that in more detail on Slack in a thread with @chrismccord, but it isn’t available anymore.)

Same! Just FYI here’s a screenshot from one of my Oban apps - a WIP cron-style job scheduler used internally. Oban and LiveView work well together!

Here’s the source: GitHub - andyl/jobex: Cron-like Workflow

9 Likes

Is there a way to use Oban.insert in a multi with function passing, so the job can be built from a previous insert’s result?

1 Like

Calling it within Ecto.Multi.run should work just fine.

1 Like

@sorentwo is there a way to check out the Web UI that is in development?

1 Like

@LostKobrakai is correct, you can call it with Multi.run just fine. That’s all that Oban’s Multi support is really doing anyhow.
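
For anyone who finds this later, a sketch of the Multi.run approach. The module names are made up; the only real APIs used are Ecto.Multi, the worker-generated new/1 and Oban.insert/1:

```elixir
defmodule MyApp.Accounts do
  alias Ecto.Multi

  # `MyApp.Workers.WelcomeEmail` is a hypothetical Oban worker.
  def create_user_with_welcome_email(user_changeset) do
    Multi.new()
    |> Multi.insert(:user, user_changeset)
    |> Multi.run(:welcome_job, fn _repo, %{user: user} ->
      # The previous step's result is available here, so the job args can
      # depend on it. Oban.insert/1 returns {:ok, job} or {:error, changeset},
      # which satisfies Multi.run's contract.
      %{"user_id" => user.id}
      |> MyApp.Workers.WelcomeEmail.new()
      |> Oban.insert()
    end)
    |> MyApp.Repo.transaction()
  end
end
```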

Okay, thank you. That is what I was doing, and it’s working fine, but I wasn’t sure if that was the correct method.