Data processing: aim for 100% scheduler utilization?

david_ex · June 5, 2020, 8:11pm

This is kind of a noob question, but I can’t find any info/guidelines online…

In a data-processing application [1], what should the “ideal” scheduler utilization be? Should the aim be for 100% utilization so that all resources are being used for the processing, banking on the fact that any user interaction will still be able to be served due to the BEAM’s preemptive scheduling (i.e. accept slow user responses to increase data-processing throughput)? Or would it be better to aim for, say, 85% utilization so the system has some “breathing room”?

I realize the answer is probably going to be some form of “it depends”, but on what criteria? In my particular case, the load would be essentially stable as it would be processing data in a DB via GenStage pipelines (i.e. leveraging back pressure mechanisms).

[1] i.e. with limited user interaction, and those users would only be “admin”-type users sometimes querying the DB to verify everything is ok

dimitarvp · June 6, 2020, 4:39pm

In my experience the BEAM stays responsive even if you actively try to break it. That’s not entirely true: if you use NIFs that run for long stretches without yielding back to the BEAM (well beyond the roughly 1 ms that the Erlang NIF documentation recommends), you will manage to degrade its low-latency behaviour somewhat. But otherwise it will still hum along and be responsive just fine.

But if you are still worried, and you are using e.g. Task.async_stream for your data-processing workflows, you can tune it to not use all CPU cores. For example, you can pass max_concurrency: System.schedulers_online() - 1 so that you leave about one scheduler thread’s worth of headroom for the not-very-active web app that is supposed to run in the same VM. And if you have actual backpressure (via GenStage et al.) then you need to do exactly nothing.
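A minimal sketch of that Task.async_stream tuning, for anyone who wants to try it. The module and function names (ThrottledProcessing, max_workers/0, process/2) are made up for illustration and are not from this thread. Note that max_concurrency only caps how many tasks run at once; it does not pin anything to a core, so the "reserved" scheduler is headroom rather than a hard guarantee.

```elixir
defmodule ThrottledProcessing do
  # Hypothetical helper module, for illustration only.

  # Leave one scheduler's worth of headroom, but never go below one worker
  # (on a single-scheduler machine schedulers_online() - 1 would be 0,
  # which Task.async_stream does not accept).
  def max_workers do
    max(System.schedulers_online() - 1, 1)
  end

  # Applies `fun` to every item with at most max_workers() tasks in flight.
  # Results come back in input order (the Task.async_stream default).
  def process(items, fun) do
    items
    |> Task.async_stream(fun, max_concurrency: max_workers())
    |> Enum.map(fn {:ok, result} -> result end)
  end
end
```

Usage would look like `ThrottledProcessing.process(rows, &transform/1)`, where `rows` and `transform/1` stand in for your own data and processing function. The default per-task timeout of 5 seconds still applies, so long-running items would need an explicit `timeout:` option.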

Beyond all that I probably don’t need to tell you that you should take special care not to exhaust all possible DB connections. If your Postgres server allows at most 100 connections and your data processing uses all of them, then obviously the web app will hang and/or error with a timeout eventually.
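One possible way to keep the pipeline from starving the admin UI, assuming the app uses Ecto (the thread does not say so): give each side its own repo with its own pool, sized so the total stays under the server limit. The app name, repo modules, and numbers below are all hypothetical; only the pool_size option itself is a real Ecto repo setting.

```elixir
# config/config.exs (sketch; :my_app and both repo modules are hypothetical)
import Config

# Pool used by the data-processing pipeline.
config :my_app, MyApp.PipelineRepo,
  pool_size: 80

# Small separate pool for the admin-facing web app, so its queries
# never wait behind pipeline work for a connection.
config :my_app, MyApp.WebRepo,
  pool_size: 10

# 80 + 10 = 90, leaving 10 of the assumed 100 server connections
# free for migrations, psql sessions, and similar.
```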