Improving Oban throughput on Aurora RDS

:wave: I’d appreciate some help proving (or disproving) a hypothesis I have about Oban’s overhead when run on multiple nodes.

Background

  • oban 2.17.12

  • oban_pro 1.14.10

  • oban_notifiers_phoenix 0.1.0

  • engine: Oban.Pro.Engines.Smart

  • notifier: Oban.Notifiers.Phoenix

  • database: AWS Aurora RDS PostgreSQL

Caveat: we’re aware that using Aurora with Oban is not ideal. Moving Oban off Aurora isn’t a short-term option, so we’re investigating opportunities to improve performance on Aurora.

Aurora provides this dashboard showing a breakdown of database load by SQL statement. The yellow category of load is IO:XactSync. My understanding is that transaction commits force Aurora to synchronously flush to its distributed storage layer, so commits are particularly expensive in Aurora compared to vanilla PG (this may be one of the issues with using Oban with Aurora).

Conclusions from that screenshot:

  • our top single source of database load is from a COMMIT issued by Oban
    • we’ve auto-injected comments into some statements to see where they originate, hence the -- Switchboard.Repo.ObanWrapper.transaction/2 (L14) comment
  • That commit has a very high Calls/sec, dwarfing any of our other top statements in QPS.
  • There’s another source of load with a similar QPS: a query beginning WITH "subset" AS (SELECT ... I found it in the Smart engine; it’s called by fetch_jobs and is used to fetch a batch of jobs and mark them as executing.

Here’s my hypothesis: for every job insertion (or insertion batch), Oban notifies every k8s pod running Oban. Upon receiving the notification, each pod calls fetch_jobs (possibly once per queue, though I’m not sure). Each fetch_jobs call generates a transaction commit. Committing transactions is quite expensive with Aurora, so what should be a relatively lightweight (albeit high-frequency) operation is disproportionately expensive in Aurora-land. Furthermore, we’ve been using k8s pods as a convenient unit for scaling job-processing throughput, but if this hypothesis is correct, adding pods significantly increases load on our DB, creating a bottleneck.
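To make the fan-out concrete, here’s a back-of-envelope sketch. Every number is invented, and the per-queue multiplier is the part of the hypothesis I’m least sure about:

```elixir
# Illustrative arithmetic only -- all values here are made up.
pods = 12                  # k8s pods running Oban
queues = 5                 # queues woken per insert notification (if per-queue)
notifications_per_sec = 50 # insert-trigger notifications per second

# Each notification wakes every pod, and each woken producer runs a
# fetch_jobs transaction (one COMMIT) even if it wins no jobs.
commits_per_sec = pods * queues * notifications_per_sec
IO.puts("~#{commits_per_sec} fetch_jobs commits/sec") # ~3000
```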

To test this hypothesis, I played with our staging environment’s pod count and saw notable fluctuations in commit volume that matched changes in pod count (even though the rate of job processing did not change).

My conclusion is that if we want to keep the low-latency triggering afforded by Oban Notifiers, we should try running fewer, beefier pods. We could tune the system to have the same total number of workers and CPU cores spread across fewer pods. Fewer pods would lead to a lower rate of fetch_jobs calls, and fewer commits to Aurora. A sketch of what I mean follows below.
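With made-up app, queue names, and limits: collapsing 8 small pods into 2 pods with 4x the per-queue concurrency keeps the total worker count constant while quartering the number of producers issuing fetch_jobs transactions:

```elixir
# Hypothetical before/after -- app and queue names are placeholders.

# Before: 8 pods, each configured like this (80 default / 40 mailer workers total)
config :my_app, Oban,
  engine: Oban.Pro.Engines.Smart,
  queues: [default: 10, mailers: 5]

# After: 2 beefier pods with 4x the per-queue limits (same worker totals,
# but 1/4 as many producers calling fetch_jobs per notification)
config :my_app, Oban,
  engine: Oban.Pro.Engines.Smart,
  queues: [default: 40, mailers: 20]
```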

Alternatively, we could investigate some sort of debounce in the Notifier, or switch from the notifier to polling.

Would love to read thoughts on this. Am I misunderstanding anything about how Oban or Aurora work? Does it make sense that running fewer Oban nodes would reduce our commit volume (by running fewer of those WITH "subset"...FOR UPDATE SKIP LOCKED transactions)? Are we missing any key performance improvements from later versions of Oban?

Besides this, does anyone have thoughts to share on optimizing Oban throughput on Aurora?

Thanks for the help!

2 Likes

That’s fascinating. It explains a great deal about why Aurora instances have such IO issues with Oban compared to other database types. Previously, I thought it was due to a lack of unlogged tables and IO throttling.

That’s true if the insert_trigger option is enabled, which is the default. You can disable the trigger at the expense of less responsive insert handling (up to a second instead of sub-second).
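Something like this (the app name is a placeholder):

```elixir
config :my_app, Oban,
  engine: Oban.Pro.Engines.Smart,
  # Skip the NOTIFY on insert; newly inserted jobs are picked up by the
  # periodic staging pass instead (up to ~1s of extra latency).
  insert_trigger: false
```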

There’s also the dispatch_cooldown option to control how long a producer waits between subsequent fetches. The default is 5ms, and the goal is to prevent thrashing the database with rapid fetch_jobs requests. You can tune the value to force fewer fetch requests at the expense of lower throughput.
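For example (the value is arbitrary and worth tuning against your own workload):

```elixir
config :my_app, Oban,
  # Make each producer wait longer between fetch_jobs calls so work is
  # batched into fewer, larger transactions (the default is 5ms).
  dispatch_cooldown: 25
```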

That seems like a great idea, and worthy of being included in the Troubleshooting guide for Aurora users.

That makes sense to me. Fewer nodes, combined with tweaks to insert notifications and dispatch cooldown, could reduce the total number of COMMIT operations tremendously. Something important to note is that SKIP LOCKED requires a transaction or it has no effect.

No, you aren’t. I’m not sure there’s much else Oban can do here (and not much more Pro can do either; it already has much more optimized acking and fetching compared to OSS).

Please report back if/when you make changes and let us know how it goes!

6 Likes

Adjacent to this, Postgrex recommends setting the :endpoints parameter when using Aurora. This may be useful in the Oban Troubleshooting guide as well, as we have found that Oban requires manual intervention to recover from an Aurora failover. When using this Postgrex :endpoints feature, however, Oban recovers automatically.
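In case it helps anyone else, the Repo config looks roughly like this (hostnames and credentials are placeholders for your cluster’s instance endpoints):

```elixir
# Hypothetical runtime.exs snippet -- everything here is a placeholder.
config :my_app, MyApp.Repo,
  database: "my_app_prod",
  username: System.fetch_env!("DB_USERNAME"),
  password: System.fetch_env!("DB_PASSWORD"),
  pool_size: 10,
  # Instead of a single :hostname, list each instance endpoint. Postgrex
  # walks the list until it finds a server it can connect to, which is
  # what lets the pool find the new writer after an Aurora failover.
  endpoints: [
    {"my-cluster-instance-1.abc123.us-east-1.rds.amazonaws.com", 5432},
    {"my-cluster-instance-2.abc123.us-east-1.rds.amazonaws.com", 5432}
  ]
```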

6 Likes

That’s cool, thank you very much for sharing! I didn’t know about this and had manually built a GenServer a while ago that checks the session status for a failover and restarts the repo :smiley:
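For anyone curious, a simplified sketch of that kind of watchdog (module names are placeholders, and the real failover check and supervision details may differ):

```elixir
defmodule MyApp.FailoverWatcher do
  @moduledoc "Sketch: bounce the repo if we're stuck on a read-only replica."
  use GenServer

  @interval :timer.seconds(5)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    schedule()
    {:ok, nil}
  end

  @impl true
  def handle_info(:check, state) do
    case MyApp.Repo.query("SELECT pg_is_in_recovery()", []) do
      {:ok, %{rows: [[true]]}} ->
        # After a failover the pool can stay connected to the demoted,
        # read-only instance; restart the repo so it reconnects to the writer.
        Supervisor.terminate_child(MyApp.Supervisor, MyApp.Repo)
        Supervisor.restart_child(MyApp.Supervisor, MyApp.Repo)

      _ ->
        :ok
    end

    schedule()
    {:noreply, state}
  end

  defp schedule, do: Process.send_after(self(), :check, @interval)
end
```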

1 Like

I only found it after working on failover this past week. It’s not as fast as detecting failover and resetting the conn pool, but it seems to handle the issue very gracefully.

2 Likes

Quick update:

4 Likes