Run all background jobs on a different server

I have a dilemma that I would like to hear your input on:
We have a server that serves a decently high rate of API calls and needs to stay as fast as possible.
In our business logic there are a bunch of background jobs that need to run but are not time-critical (there is no big rush to finish them): order exports, notifications, scheduled tasks like generating invoices, etc. They can take significant CPU, though.
Currently we run them via Oban inside the same servers that serve the API and the rest of our business logic. My concern is that they can impact performance and starve the CPU, which will affect the performance of the APIs and websockets.

Would it be more appropriate to create another umbrella app or something that I can package differently via mix releases and deploy to a separate server that is in no rush to finish the tasks, where it's not a problem if it saturates the CPU?
How do you guys do it??

There are a few questions that need to be answered. What does "a decently high rate of API calls" mean? Does the API request handling put high load on the CPU relative to RAM? It usually doesn't, but history has seen many different cases. Maybe a load/stress test can show how many requests and background tasks your service can take?

There are 2 ways to go about this:

  1. Rate limit the processing - use the Oban rate limiter (not sure how it works, and it seems you need Oban Pro) or Broadway or GenStage; this will save a lot of complexity and time in general. Even open-source Oban lets you cap per-queue concurrency, as the sketch after this list shows.
  2. Use RabbitMQ or built-in Erlang message passing from node to node to pass events to another server that will do the processing. In this case you need to decide whether losing events is acceptable and have a strategy for what happens when something goes down; much more complex compared to the first solution.
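
For option 1, a minimal sketch of what capping concurrency looks like in open-source Oban (the app name, repo, and queue names here are placeholders; true rate limiting, as opposed to a concurrency cap, is an Oban Pro feature as far as I know):

```elixir
# config/config.exs - placeholder app/repo/queue names
config :my_app, Oban,
  repo: MyApp.Repo,
  queues: [
    # Each value is the max number of jobs that queue runs concurrently,
    # so CPU-heavy work can be kept on a small budget.
    default: 10,
    exports: 2,
    invoices: 1
  ]
```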

I get where you are coming from and I am a paranoid prepper myself, but I'd advise you to measure first before going into a potential rabbit hole.

2 Likes

I agree that I may be getting into a potential rabbit hole. I will leave things as is for now and instead try to add more metrics and limit background worker concurrency to a certain level.

Something to remember when you are developing in the BEAM: The scheduler is preemptive, operating on a time-sharing principle. This means that the VM will actually pause long-running processes (such as your background jobs) from time to time in order to let other processes (such as your web requests, which should be very short-lived) have a turn. This is done by allocating a specific number of reductions (function calls) to each process, after which the scheduler will switch to the next process in the queue.

The primary advantage of this approach is that it ensures every process gets some CPU time even under very heavy load, promoting fairness and responsiveness. This is not to say that your background jobs will have no negative impact on the time it takes to finish processing a request, but it's nowhere near as detrimental as a cooperative scheduler, which would not pause processes and would just let them run until they finish what they are doing. That can lead to issues like process monopolization, where a single process takes over the CPU and doesn't allow other processes to run.

In Erlangā€™s system, the preemptive scheduling helps in distributing resources evenly, maintaining system responsiveness, and avoiding the risk of any single process overwhelming the system. This design reflects Erlangā€™s focus on concurrency, fault tolerance, and distributed computing, making it particularly suitable for scalable and highly available systems.
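
A toy sketch of that preemption (the module and messages are just for illustration): even a process stuck in a hot infinite loop does not stop a newly spawned process from getting scheduled and replying.

```elixir
defmodule SchedulerDemo do
  # Never yields voluntarily; the BEAM preempts it anyway once its
  # reduction budget for the time slice is used up.
  def busy_loop, do: busy_loop()

  def run do
    spawn(fn -> busy_loop() end)

    parent = self()
    spawn(fn -> send(parent, :pong) end)

    # This round-trip still completes promptly despite the busy process.
    receive do
      :pong -> IO.puts("still responsive")
    after
      1_000 -> IO.puts("starved")
    end
  end
end
```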

3 Likes

Yep, absolutely add metrics and/or telemetry.

GitHub - openobserve/openobserve (an Elasticsearch/Splunk/Datadog alternative for logs, metrics, and traces) is a pretty good self-hosted solution. I already love it.

1 Like

We run our Oban workers on separate VMs (Heroku dynos) that don't serve web traffic. While the BEAM scheduler is great, it's useful to have a separate CPU/memory environment for each workload that can be scaled independently.

We do this by disabling the queues on the web workers with an environment variable that is read by the runtime.exs config, along the lines of the sketch below.
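
Roughly like this (the OBAN_QUEUES variable name and queue list below are made up for the sketch):

```elixir
# config/runtime.exs
queues =
  if System.get_env("OBAN_QUEUES") == "enabled" do
    [default: 10, exports: 2, invoices: 1]
  else
    # Web dynos: Oban still starts, so jobs can be enqueued, but no
    # queues are processed here.
    []
  end

config :my_app, Oban,
  repo: MyApp.Repo,
  queues: queues
```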

2 Likes

Interesting approach. Do you have any kind of autoscaling in place?
For example, how do you scale your Heroku dynos based on the number of jobs in the queue?

I had not heard about it before. I will definitely try it.
What kind of metrics do you track?

No, our main bottleneck is our database, and autoscaling up more dynos would just increase the load on it to the point where it would affect web users. Autoscaling down during quiet periods might save us some money, but it's not a cost that's meaningful right now.

Not the person you are replying to, but we also use the same approach.

There are a series of web servers that just serve HTTP traffic, and then there are a series of background servers that just run Oban jobs. The server kind is identified by an environment variable that we check to decide whether to start the Oban queues.

We haven't needed any autoscaling yet, so we don't have any solution for that. Anyway, Oban Pro has the DynamicScaler plugin that can do autoscaling for you.

1 Like

I did not know about that plugin. It may be useful sometime when there are jobs tied to the main DB, like importing external data to S3 or media processing.
Thanks

This approach is very valid. I know from financial companies that they usually separate their core systems, which have high availability requirements, from other tasks such as background batch jobs, analytics, reporting, and additional services.

To do this you basically have to copy your database and use the copy for everything you don't want to run in the core system. Copying the database proves to be a challenging task in itself, one that needs a dedicated team doing the loading on a daily basis and fixing the load issues that regularly arise. Additionally, the database copy will never be in complete sync with the core database.

@JeyHey you brought up an interesting subject about duplicating the main database, or restoring the daily backup to another server, and using that for reporting or data-extraction tasks. It cannot be used by workers that ingest data into the main database, but for data extraction that does not need to be real-time it can be a good solution.
As @dimitarvp said though, we need to be careful not to go into all these kinds of rabbit holes just for premature optimizations; only if the metrics tell us there is something we need to do about it.
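
For the data-extraction case, a hedged sketch of that pattern in Ecto would be a second, read-only repo pointed at the copy (the module name and the REPLICA_DATABASE_URL variable are assumptions):

```elixir
defmodule MyApp.Repo.Replica do
  # Read-only repo aimed at the copied/reporting database; all writes
  # still go through the primary MyApp.Repo.
  use Ecto.Repo,
    otp_app: :my_app,
    adapter: Ecto.Adapters.Postgres,
    read_only: true
end

# config/runtime.exs
config :my_app, MyApp.Repo.Replica,
  url: System.get_env("REPLICA_DATABASE_URL")
```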

For the moment, only the duration of a span. Just the fact that you're starting and then ending a span already automatically gives you a duration. And they can be nested (meaning you can measure the duration of a web response and the DB time that's part of it, for example). You can also attach all sorts of data to the traces / spans. Pretty awesome stuff.
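
For instance, with the OpenTelemetry Elixir API (assuming the opentelemetry packages are installed and exporting to OpenObserve; the span, module, and schema names are made up), nesting looks like this:

```elixir
require OpenTelemetry.Tracer, as: Tracer

Tracer.with_span "handle_request" do
  # The nested span records the DB time as part of the request span,
  # so both durations show up in the trace.
  Tracer.with_span "db_query" do
    MyApp.Repo.all(MyApp.Order)
  end
end
```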

Dunno if you checked the GitHub repo or their website. Please do and check out the screenshots (they recently added dark mode btw).