Improving Oban throughput on Aurora RDS

:wave: I’d appreciate some help proving (or disproving) a hypothesis I have about Oban’s overhead when run on multiple nodes.

Background

  • oban 2.17.12

  • oban_pro 1.14.10

  • oban_notifiers_phoenix 0.1.0

  • engine: Oban.Pro.Engines.Smart

  • notifier: Oban.Notifiers.Phoenix

  • database: AWS Aurora RDS PostgreSQL

Caveat: we’re aware that using Aurora with Oban is not ideal. Moving Oban off Aurora isn’t a short-term option, so we’re investigating opportunities to improve performance on Aurora.

Aurora provides this dashboard showing a breakdown of database load by SQL statement. The yellow category of load is IO:XactSync. My understanding is that transaction commits force Aurora to synchronously flush to its distributed storage layer, so commits are particularly expensive in Aurora compared to vanilla PG (this may be one of the issues with using Oban with Aurora).

Conclusions from that screenshot:

  • our top single source of database load is from a COMMIT issued by Oban
    • we’ve auto-injected comments into some statements to see where they originate, hence the -- Switchboard.Repo.ObanWrapper.transaction/2 (L14) comment
  • That commit has a very high Calls/sec, dwarfing any of our other top statements in QPS.
  • There’s another source of load with a similar QPS: a query beginning WITH "subset" AS (SELECT ... I found it in the Smart engine; it’s called by fetch_jobs and is used to fetch a batch of jobs and mark them as executing.

Here’s my hypothesis: for every job insertion (or insertion batch), Oban notifies every k8s pod running Oban. Upon receiving the notification, each pod calls fetch_jobs (possibly once per queue, though I’m not sure). Each fetch_jobs call generates a transaction commit. Committing transactions is quite expensive with Aurora, so what should be a relatively lightweight (albeit high-frequency) operation is disproportionately expensive in Aurora-land. Furthermore, we’ve been using k8s pods as a convenient unit for scaling job-processing throughput, but if this hypothesis is correct, adding pods significantly increases load on our DB, creating a bottleneck.
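To make the fan-out concrete, here’s a back-of-envelope sketch. Every number is invented, and the per-queue multiplier is the part of the hypothesis I’m least sure about:

```elixir
# Illustrative arithmetic only -- all values here are made up.
pods = 12                  # k8s pods running Oban
queues = 5                 # queues woken per insert notification (if per-queue)
notifications_per_sec = 50 # insert-trigger notifications per second

# Each notification wakes every pod, and each woken producer runs a
# fetch_jobs transaction (one COMMIT) even if it wins no jobs.
commits_per_sec = pods * queues * notifications_per_sec
IO.puts("~#{commits_per_sec} fetch_jobs commits/sec") # ~3000
```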

To test this hypothesis, I played with our staging environment’s pod count and saw notable fluctuations in commit volume that matched changes in pod count (even though the rate of job processing did not change).

My conclusion is that if we want to keep the low-latency triggering afforded by Oban Notifiers, we should try running fewer, beefier pods. We could tune the system to have the same total number of workers and CPU cores spread across fewer pods. Fewer pods would lead to a lower rate of fetch_jobs calls, and fewer commits to Aurora. A sketch of what I mean follows below.
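With made-up app, queue names, and limits: collapsing 8 small pods into 2 pods with 4x the per-queue concurrency keeps the total worker count constant while quartering the number of producers issuing fetch_jobs transactions:

```elixir
# Hypothetical before/after -- app and queue names are placeholders.

# Before: 8 pods, each configured like this (80 default / 40 mailer workers total)
config :my_app, Oban,
  engine: Oban.Pro.Engines.Smart,
  queues: [default: 10, mailers: 5]

# After: 2 beefier pods with 4x the per-queue limits (same worker totals,
# but 1/4 as many producers calling fetch_jobs per notification)
config :my_app, Oban,
  engine: Oban.Pro.Engines.Smart,
  queues: [default: 40, mailers: 20]
```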

Alternatively, we could investigate some sort of debounce in the Notifier, or switch from the notifier to polling.

Would love to read thoughts on this. Am I misunderstanding anything about how Oban or Aurora work? Does it make sense that running fewer Oban nodes would reduce our commit volume (by running fewer of those WITH "subset"...FOR UPDATE SKIP LOCKED transactions)? Are we missing any key performance improvements from later versions of Oban?

Besides this, does anyone have thoughts to share on optimizing Oban throughput on Aurora?

Thanks for the help!

2 Likes

That’s fascinating. It explains a great deal about why Aurora instances have such IO issues with Oban compared to other database types. Previously, I thought it was due to a lack of unlogged tables and IO throttling.

That’s true if the insert_trigger option is enabled, which is the default. You can disable the trigger at the expense of less responsive insert handling (up to a second instead of sub-second).
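Something like this (the app name is a placeholder):

```elixir
config :my_app, Oban,
  engine: Oban.Pro.Engines.Smart,
  # Skip the NOTIFY on insert; newly inserted jobs are picked up by the
  # periodic staging pass instead (up to ~1s of extra latency).
  insert_trigger: false
```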

There’s also the dispatch_cooldown option to control how long a producer waits between subsequent fetches. The default is 5ms, and the goal is to prevent thrashing the database with rapid fetch_jobs requests. You can tune the value to force fewer fetch requests at the expense of lower throughput.
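For example (the value is arbitrary and worth tuning against your own workload):

```elixir
config :my_app, Oban,
  # Make each producer wait longer between fetch_jobs calls so work is
  # batched into fewer, larger transactions (the default is 5ms).
  dispatch_cooldown: 25
```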

That seems like a great idea, and worthy of being included in the Troubleshooting guide for Aurora users.

That makes sense to me. Fewer nodes, combined with tweaks to insert notifications and dispatch cooldown, could reduce the total number of COMMIT operations tremendously. Something important to note is that SKIP LOCKED requires a transaction or it has no effect.

No, you aren’t. I’m not sure there’s much else Oban can do here (and not much more Pro can do either; it already has much more optimized acking and fetching compared to OSS).

Please report back if/when you make changes and let us know how it goes!

6 Likes

Adjacent to this, Postgrex recommends setting the :endpoints parameter when using Aurora. This may be useful in the Oban Troubleshooting guide as well, as we have found that Oban requires manual intervention to recover from an Aurora failover. When using this Postgrex :endpoints feature, however, Oban recovers automatically.
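In case it helps anyone else, the Repo config looks roughly like this (hostnames and credentials are placeholders for your cluster’s instance endpoints):

```elixir
# Hypothetical runtime.exs snippet -- everything here is a placeholder.
config :my_app, MyApp.Repo,
  database: "my_app_prod",
  username: System.fetch_env!("DB_USERNAME"),
  password: System.fetch_env!("DB_PASSWORD"),
  pool_size: 10,
  # Instead of a single :hostname, list each instance endpoint. Postgrex
  # walks the list until it finds a server it can connect to, which is
  # what lets the pool find the new writer after an Aurora failover.
  endpoints: [
    {"my-cluster-instance-1.abc123.us-east-1.rds.amazonaws.com", 5432},
    {"my-cluster-instance-2.abc123.us-east-1.rds.amazonaws.com", 5432}
  ]
```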

6 Likes

That’s cool, thank you very much for sharing! I didn’t know about this and had manually built a GenServer a while ago that checks the session status for a failover and restarts the repo :smiley:
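For anyone curious, a simplified sketch of that kind of watchdog (module names are placeholders, and the real failover check and supervision details may differ):

```elixir
defmodule MyApp.FailoverWatcher do
  @moduledoc "Sketch: bounce the repo if we're stuck on a read-only replica."
  use GenServer

  @interval :timer.seconds(5)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    schedule()
    {:ok, nil}
  end

  @impl true
  def handle_info(:check, state) do
    case MyApp.Repo.query("SELECT pg_is_in_recovery()", []) do
      {:ok, %{rows: [[true]]}} ->
        # After a failover the pool can stay connected to the demoted,
        # read-only instance; restart the repo so it reconnects to the writer.
        Supervisor.terminate_child(MyApp.Supervisor, MyApp.Repo)
        Supervisor.restart_child(MyApp.Supervisor, MyApp.Repo)

      _ ->
        :ok
    end

    schedule()
    {:noreply, state}
  end

  defp schedule, do: Process.send_after(self(), :check, @interval)
end
```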

1 Like

I only found it after working on failover this past week. It’s not as fast as detecting failover and resetting the conn pool, but it seems to handle the issue very gracefully.

2 Likes

Quick update:

4 Likes