I’d appreciate some help proving (or disproving) a hypothesis I have about Oban’s overhead when run on multiple nodes.
Background

- oban 2.17.12
- oban_pro 1.14.10
- oban_notifiers_phoenix 0.1.0
- engine: Oban.Pro.Engines.Smart
- notifier: Oban.Notifiers.Phoenix
- database: AWS Aurora RDS PostgreSQL
Caveat: we’re aware that using Aurora with Oban is not ideal. Moving Oban off of Aurora isn’t a short-term option, so we’re investigating opportunities to improve performance on Aurora.
Aurora provides a dashboard (Performance Insights) that shows a breakdown of database load by SQL statement. The yellow category of load in the screenshot is IO:XactSync. My understanding is that transaction commits force Aurora to flush to its distributed storage layer, so commits are particularly expensive in Aurora compared to vanilla Postgres (this may be one of the issues with using Oban with Aurora).
Conclusions from that screenshot:
- Our top single source of database load is a COMMIT issued by Oban. We’ve auto-injected comments into some statements to see where they originate, hence the -- Switchboard.Repo.ObanWrapper.transaction/2 (L14) comment.
- That COMMIT has a very high Calls/sec, dwarfing any of our other top statements in QPS (a rough way to cross-check this outside the dashboard is sketched after this list).
- There’s another source of load with a similar QPS: WITH "subset" AS (SELECT .... I found this query in the Smart engine; it’s called by fetch_jobs and is used to fetch a batch of jobs and mark them as executing.
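To cross-check those numbers outside the Aurora dashboard, something like the following could be run from iex against pg_stat_statements. This is a sketch under a few assumptions: the extension is enabled, Switchboard.Repo is the repo these statements run through, and the ILIKE filters roughly match how the statements are normalized.

```elixir
# Rough cross-check of calls and total execution time for the two hot statements.
# Assumes pg_stat_statements is enabled and readable by the Repo's DB user.
# On PostgreSQL < 13 the column is total_time rather than total_exec_time.
Switchboard.Repo.query!("""
SELECT query, calls, total_exec_time
FROM pg_stat_statements
WHERE query ILIKE 'COMMIT%'
   OR query ILIKE '%WITH "subset"%'
ORDER BY calls DESC
LIMIT 10
""")
```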
Here’s my hypothesis: for every job insertion (or insertion batch), Oban notifies every k8s pod running Oban. Upon receiving the notification, each pod running Oban calls fetch_jobs (possibly once per queue, though I’m not sure). Each fetch_jobs call generates a transaction commit. Committing transactions is quite expensive with Aurora, so what should be a relatively lightweight (albeit high-frequency) operation is disproportionately expensive in Aurora-land. Furthermore, we’ve been using k8s pods as a convenient unit of scaling throughput for job processing. But if this hypothesis is correct, adding pods significantly increases load on our DB, creating a bottleneck.
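For concreteness, the relevant slice of our setup looks roughly like the config below. The engine and notifier are the real ones from the background above; the app name, queue names, and limits are placeholders for illustration. As I understand it, every pod running this config starts one producer per queue, and each producer issues its own fetch_jobs transaction when it’s told there is work.

```elixir
# config/runtime.exs -- illustrative shape only; app name, queue names, and limits are made up
config :switchboard, Oban,
  repo: Switchboard.Repo,
  engine: Oban.Pro.Engines.Smart,
  notifier: Oban.Notifiers.Phoenix,
  queues: [
    default: 10,
    mailers: 5,
    events: 20
  ]
```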
To test this hypothesis, I played with our staging environment’s pod count and saw notable fluctuations in commit volume that matched changes in pod count (even though the rate of job processing did not change).
My conclusion is that if we want to keep the low-latency triggering afforded by Oban notifiers, we should try running fewer, beefier pods. We could tune the system to have the same total number of workers and CPU cores spread across fewer pods. Fewer pods would mean a lower rate of fetch_jobs calls, and therefore fewer commits to Aurora.
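In config terms, "fewer, beefier pods" would look something like the sketch below (numbers are illustrative, not tuned): halving the pod count while doubling each queue’s limit keeps total concurrency constant but halves the number of producers issuing fetch_jobs commits.

```elixir
# Today: 4 pods, each with these limits -> 4 producers per queue (illustrative)
config :switchboard, Oban,
  queues: [default: 10, mailers: 5, events: 20]

# Proposed: 2 beefier pods with doubled limits -> same total concurrency,
# but half as many producers issuing fetch_jobs transactions
config :switchboard, Oban,
  queues: [default: 20, mailers: 10, events: 40]
```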
Alternatively, we could investigate some sort of debounce in the notifier, or switch from notifications to polling.
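Two knobs that look relevant here, both of which I’d verify against the Oban 2.17 docs before relying on them (I haven’t tested either): stage_interval, which controls how often the stager runs, and an insert_trigger option that I believe 2.17 added to suppress the per-insert notification so nodes aren’t woken on every insert.

```elixir
# Sketch only -- both options should be double-checked against the Oban 2.17 docs
config :switchboard, Oban,
  # stage less often than the 1s default, trading trigger latency for fewer wake-ups
  stage_interval: :timer.seconds(5),
  # reportedly added in 2.17: skip the insert-time notification entirely
  # (assumption -- verify the option name and behaviour before use)
  insert_trigger: false
```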
Would love to read thoughts on this. Am I misunderstanding anything about how Oban or Aurora work? Does it make sense that running fewer Oban nodes would reduce our commit volume (by running fewer of those WITH "subset" ... FOR UPDATE SKIP LOCKED transactions)? Are we missing any key performance improvements from later versions of Oban?
Besides this, does anyone have thoughts to share on optimizing Oban throughput on Aurora?
Thanks for the help!