Hi everyone,
I am working on a project where I have to spawn many processes (50k+).
Each process represents an entity that must move on a map.
The behavior of an entity is simple:
- Each entity chooses a random coordinate within a radius around it.
- It then calculates the path to get there.
- Once it arrives at its destination, it pauses for a random time between 850 and 1150 ms.
- It then chooses a new random coordinate.
- etc…
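To make the behavior above concrete, here is a minimal sketch of one entity's cycle. It is written in Python purely for illustration and is not my actual Elixir code: `pick_target`, `find_path`, `entity_step` and the `RADIUS` value are hypothetical, and `find_path` is only a straight-line placeholder standing in for the Rust NIF, not a real A* implementation.

```python
import random

PAUSE_MIN_MS = 850   # from the description above
PAUSE_MAX_MS = 1150  # from the description above
RADIUS = 5           # assumed value, for illustration only


def pick_target(pos, radius, rng):
    """Pick a random integer coordinate within `radius` of `pos`."""
    while True:
        dx = rng.randint(-radius, radius)
        dy = rng.randint(-radius, radius)
        # Reject points in the corners of the square that lie outside the circle.
        if dx * dx + dy * dy <= radius * radius:
            return (pos[0] + dx, pos[1] + dy)


def find_path(start, goal):
    """Placeholder for the pathfinding NIF: a straight-line walk, not A*."""
    path = [start]
    x, y = start
    while (x, y) != goal:
        # Move one step toward the goal on each axis (-1, 0 or +1).
        x += (goal[0] > x) - (goal[0] < x)
        y += (goal[1] > y) - (goal[1] < y)
        path.append((x, y))
    return path


def entity_step(pos, rng):
    """One cycle: choose a target, compute the path, return new position and pause."""
    target = pick_target(pos, RADIUS, rng)
    path = find_path(pos, target)
    pause_ms = rng.randint(PAUSE_MIN_MS, PAUSE_MAX_MS)
    return path[-1], pause_ms


if __name__ == "__main__":
    rng = random.Random(42)
    pos = (0, 0)
    for _ in range(3):
        pos, pause_ms = entity_step(pos, rng)
        # A real worker would sleep for pause_ms here before looping again.
        print(pos, pause_ms)
```

In the real project each entity is its own process running this cycle forever, so with 10k entities there are roughly 10k path calculations and 10k timers per second.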
The pathfinding algorithm is written in Rust (as a NIF) for performance reasons.
I started my first tests with only 10k processes.
Unfortunately, I found that just spawning 10,000 processes that each look for a path and then wait was already constantly using 45-55% CPU on my Windows machine.
Curiously, when I ran the same tests on WSL2, the same algorithm used only 5 to 10% CPU.
This huge difference in CPU consumption is the first thing I can’t understand/explain.
I then tried to benchmark my pathfinding function using Benchee to see if that was the source of my problems.
I got these results:
Name            ips        average  deviation         median         99th %
astar      355.56 K        2.81 μs   ±863.72%        2.20 μs       18.60 μs
According to these results, if I take the average execution time of the function (2.81 μs), calling the pathfinding function 10,000 times should take only about 28.1 ms.
So normally, the CPU should not even reach 1% (except maybe when launching the application).
This is the second thing I don’t understand: why the CPU is permanently busy.
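For reference, my back-of-the-envelope estimate can be written out as follows. This is only a rough sketch in Python: `estimated_core_load` is a hypothetical helper, and it assumes each entity makes one pathfinding call per pause interval, takes the mean pause as 1000 ms (the midpoint of 850 and 1150 ms), and ignores travel time, timers and scheduler overhead entirely.

```python
def estimated_core_load(n_entities, avg_call_us, mean_pause_ms):
    """Fraction of ONE core spent inside the pathfinding function.

    Assumes one call per entity per pause interval; travel time,
    timers and scheduler overhead are ignored.
    """
    calls_per_second = n_entities / (mean_pause_ms / 1000.0)
    busy_seconds_per_second = calls_per_second * avg_call_us / 1_000_000.0
    return busy_seconds_per_second


# Total time for 10,000 calls at the Benchee average of 2.81 μs: about 28.1 ms.
total_ms = 10_000 * 2.81 / 1000.0

# Share of a single core if those 10,000 calls repeat about once per second:
# about 0.028, i.e. roughly 2.8% of one core.
load = estimated_core_load(10_000, 2.81, 1000.0)

if __name__ == "__main__":
    print(f"total: {total_ms:.1f} ms, load: {load:.2%} of one core")
```

So by this estimate the pathfinding itself accounts for roughly 2.8% of a single core, and an even smaller share of total CPU on a multi-core machine, which is nowhere near the 45-55% I observe.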
Having never worked with so many processes, I don’t know where to start in order to debug such problems.
The code used for my tests and benchmark is available here: GitHub - ImNotAVirus/elixir_nif_example
This code has been simplified to include only spawning the 10k processes, calling the pathfinding function, and pausing the workers.
Thanks in advance