I’ve been building an open-source multiplayer game server framework in Elixir, and I wanted to share the results.
The Challenge: Can the BEAM handle MMO-scale real-time simulation? The conventional wisdom says “use C++ for game servers.” I wanted to prove otherwise.
The Results:

| Metric | Target | Achieved |
| --- | --- | --- |
| Tick time (10K entities) | < 29ms | 8ms avg, 13ms max |
| P99 tick time | < 29ms | 11.1ms |
| Bytes per entity | < 20 | 18 (template bitpacking) |
| Per-player bandwidth | < 1 MB/s | 264 KB/s |
| Full broadcast bandwidth | < 100 MB/s | 5.2 MB/s |
| Crash recovery | < 100ms | 4.6ms |
Key Techniques:
Bucket-parallel tick loop — Chunk entities across schedulers, not Task-per-entity
Fused encoding — Tick + serialize in one pass while data is hot in L1 cache
:maps.from_list — 9x faster than Map.put in a reduce (TIL!)
IO lists everywhere — Never concatenate binaries, let writev gather fragments
Time-travel debugging — Circular buffer with structural sharing for replay
Cross-zone shadows — PubSub-based visibility across zone boundaries
Async persistence — Snapshot + journal with atomic writes
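To make the first couple of techniques concrete, here is a minimal sketch of a bucket-parallel tick (illustrative code, not the framework's actual loop): entities are chunked across the available schedulers, each bucket is ticked in one task, and the entity map is rebuilt with a single `:maps.from_list` call instead of repeated `Map.put`s.

```elixir
defmodule TickSketch do
  # Tick all entities by chunking them across schedulers.
  # One task per bucket (not per entity) keeps scheduling overhead low.
  def tick(entities, tick_fun) do
    schedulers = System.schedulers_online()
    chunk_size = max(div(map_size(entities), schedulers), 1)

    entities
    |> Map.to_list()
    |> Enum.chunk_every(chunk_size)
    |> Task.async_stream(fn chunk ->
      Enum.map(chunk, fn {id, entity} -> {id, tick_fun.(entity)} end)
    end, ordered: false)
    |> Enum.flat_map(fn {:ok, chunk} -> chunk end)
    # Rebuild the map in one pass instead of N Map.put calls.
    |> :maps.from_list()
  end
end
```

The `ordered: false` option matters here: the tick doesn't care which bucket finishes first, so there's no head-of-line blocking on a slow chunk.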
The “aha moment”: Killing the 68ms sequential serialization bottleneck. Fused parallel encoding brought it down to 2ms for a typical player’s visible area (500 entities).
Architecture highlights:
Hybrid entity model (players as processes, NPCs as data in zone state)
ETS spatial grid for O(1) neighbor queries
Binary protocol with reliable_seq gap detection
Dead reckoning for shadow entity interpolation
Hysteresis on zone boundaries to prevent oscillation
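For flavor, an ETS spatial grid along these lines can be sketched as follows (cell size and names are illustrative, not the engine's actual values): each entity is bagged under its grid cell, so a neighbor query is a handful of constant-time cell lookups rather than a scan.

```elixir
defmodule GridSketch do
  @cell 32

  def new, do: :ets.new(:grid, [:bag, :public])

  # Store the entity id under its grid cell.
  def insert(grid, id, {x, y}) do
    :ets.insert(grid, {cell({x, y}), id})
  end

  # Neighbor query: look up the 3x3 block of cells around the position.
  def neighbors(grid, {x, y}) do
    {cx, cy} = cell({x, y})

    for dx <- -1..1, dy <- -1..1,
        {_cell, id} <- :ets.lookup(grid, {cx + dx, cy + dy}),
        do: id
  end

  defp cell({x, y}), do: {div(trunc(x), @cell), div(trunc(y), @cell)}
end
```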
Test suite: 340 tests, 0 failures
Still early days, but the engine handles the “10K goblin stress test” without breaking a sweat. Next up: TypeScript client SDK and a simple LiveView visualizer.
Curious if others have pushed the BEAM for real-time simulation workloads. What’s your experience?
Hey, this is interesting, as I haven’t seen many performance benchmarks for the BEAM, but what exactly is being tested here? I can see the metrics but don’t quite understand the test scenario.
Is it a simulation of ‘10k players in the area’? If so, how many actions are they taking on each tick? I’ve read the comment three times and don’t see that info.
I’m also curious what scaling this solution up would look like (i.e., what changes/effort would be needed to run 10k simulations of 10k entities each) and how much you could delegate to Zig/Rust if need be.
Nice! I’ve been (slowly) working on a World of Warcraft server, Thistle Tea, that has some similarities.
In Thistle Tea, players and NPCs are processes, but I’ve abstracted the interface to use GUIDs. The idea is that the boundary process layer will be straightforward enough to swap for something that groups entities by zone instead if it ever becomes necessary, without needing to change (much) game logic code.
IO lists are a good idea; they’ve been on my list to benchmark sometime, but I haven’t gotten around to it. Right now, building packets is just binary concatenation. Bandwidth measurements are a smart idea, too: I have latency metrics, but bandwidth would be helpful for understanding whole-system performance.
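For reference, switching from concatenation to iolists is usually a small change. A sketch of both styles (the packet layout here is made up): the iolist version never copies the accumulated packet, and `:gen_tcp.send/2` accepts iodata directly so the kernel can gather the fragments.

```elixir
defmodule PacketSketch do
  @opcode 0x1F

  # Concatenation: each <> may copy the accumulated binary.
  def build_concat(fields) do
    Enum.reduce(fields, <<@opcode>>, fn f, acc -> acc <> <<f::32>> end)
  end

  # IO list: O(1) appends; the binary is only materialized (if ever)
  # at the socket boundary.
  def build_iolist(fields) do
    [<<@opcode>>, Enum.map(fields, fn f -> <<f::32>> end)]
  end
end
```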
I’m also using ETS for a spatial grid, and it’s been working really well. I recently started using ETS for per-entity metadata that other processes need in hot loops as well; that seems to work decently for avoiding message-passing overhead where necessary.
I’ve mostly been focusing on building out functionality, recently got a rudimentary NPC AI wired up using behavior trees to get the basics of combat working. Stuff’s tricky because the client really likes crashing if packets are even slightly malformed, and some of the expected packet layouts aren’t very clear, lol.
It uses a Rust NIF for pathfinding, which would’ve been a pain to implement from scratch. Still a bit of wonkiness there for me to debug sometime, though. Mobs like to stutter around a bit when repathing, which makes me think the server isn’t perfectly simulating movement the same way the client is (or some other bug).
Here are a few blog posts I wrote about the process if you’re interested.
I spent some time reading your devlogs on pikdum.dev—the work you’re doing with Recast/Detour is top-tier.
I noticed you mentioned ‘wonkiness’ and ‘stuttering’ during repathing with the Rust NIF.
We actually agonized over the NIF vs. Sidecar decision for Phase 4 of Realm. We ultimately chose a Sidecar-via-Unix-Domain-Socket approach, and I wanted to share some ‘honest’ statistics that might be interesting for your consolidation move.
The ‘No-Stutter’ SIGKILL Test:
We were scared that the marshalling tax of a Sidecar would kill our 30Hz budget, but we prioritized Fault Isolation. We just ran a ‘Zombie Recovery’ test with 10,000 entities. We executed a SIGKILL on the Rust Sidecar mid-compute.
Result: The Elixir Zone detected the failure and fell back to our internal Grid-based collision in 31.6ms.
The Win: The world didn’t stutter. The tick stayed within the 33.3ms budget, and the players never knew the ‘brain’ had died.
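The detection-and-fallback pattern itself is just a monitor plus a tick-budget timeout. A minimal sketch (names and numbers are illustrative, not Realm’s actual code):

```elixir
defmodule ZoneSketch do
  # Ask the sidecar for a physics result, but never stall the tick:
  # fall back to the internal grid if the sidecar dies or blows the budget.
  def compute_physics(sidecar_pid, entities) do
    ref = Process.monitor(sidecar_pid)
    send(sidecar_pid, {:compute, self(), entities})

    receive do
      {:result, result} ->
        Process.demonitor(ref, [:flush])
        {:sidecar, result}

      {:DOWN, ^ref, :process, _pid, _reason} ->
        # Sidecar died mid-compute: fall back to the internal grid.
        {:fallback, internal_grid_collision(entities)}
    after
      30 ->
        # Budget exceeded (illustrative 30ms): don't wait any longer.
        Process.demonitor(ref, [:flush])
        {:fallback, internal_grid_collision(entities)}
    end
  end

  # Stand-in for the pure-Elixir grid-based collision pass.
  defp internal_grid_collision(entities), do: Enum.count(entities)
end
```

Because the monitor fires even when the sidecar is SIGKILLed (no cooperation required from the dying process), the zone sees `:DOWN` on the very next receive.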
The ‘Profitability’ Crossover:
We also found a fascinating crossover point in our ‘Apples-to-Apples’ benchmark between pure Elixir (Optimized Grid) and the Rust Sidecar (UDS):
< 500 entities: Pure Elixir (O(N²)) is faster.
500 - 5,000 entities: Elixir’s Grid-based O(N) is surprisingly competitive (~1.8ms for 2k entities).
> 5,000 entities: This is where the Sidecar’s parallel rayon compute finally pays for the UDS marshalling tax.
Architecture Move:
To kill the serial bottleneck of the Port, we’re using a Dynamic Socket Pool (24 sockets for 24 schedulers) to maintain scheduler affinity. It keeps the L2 cache hot and the marshalling under 2ms for 10k entities.
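The affinity part of that is a one-liner. A sketch of the checkout path (pool setup elided; names are illustrative): the calling process asks which scheduler it is currently running on and always uses the socket bound to that slot.

```elixir
defmodule PoolSketch do
  # Pick the socket bound to the current scheduler, so each core
  # keeps reusing the same socket (and its warm buffers).
  def checkout(sockets) when is_tuple(sockets) do
    # 1-based id of the scheduler this process is running on right now.
    id = :erlang.system_info(:scheduler_id)
    elem(sockets, rem(id - 1, tuple_size(sockets)))
  end
end
```

The caveat is that the scheduler id is only a hint: the process can migrate between the read and the send, so this buys cache locality on average, not a hard guarantee.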
Curious if you’ve measured the ‘Tail Latency’ of your NIF? We were worried that a long-running NIF call might be causing the stuttering you see by starving the BEAM schedulers, even with Dirty Schedulers.
Would love to compare notes on your Marshalling overhead vs. ours!
Thank you for sharing your experience and findings. I’m interested to hear why you went with a plain sidecar versus a C node? If you even considered the latter, of course. C nodes give some benefits regarding integration. You can treat your Rust/Zig process as a BEAM node. It can create process identifiers, register global processes and communicate with other processes. It’s all asynchronous using file descriptors. My experience with C nodes is related to sandboxing, not simulations and that’s why it’s interesting to hear other opinions on the topic.
Thanks for the question! C nodes were definitely on the table — they’re the proven OTP approach and the integration benefits are real.
In our case the bottleneck was raw throughput: 10K entities at 30Hz means we’re pushing binary blobs every 33ms. The Erlang distribution protocol adds term encoding overhead that was hard to justify at that frequency.
We ended up with a pool of 24 Unix Domain Sockets (one per scheduler) sending raw binaries with pre-allocated buffers. It’s simpler but it’s fast enough for the workload.
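To make the encoding-overhead point concrete (this is just `term_to_binary` arithmetic, not our benchmark): shipping even a small value as an Erlang term costs noticeably more bytes than a hand-packed binary, because every term carries version and type tags.

```elixir
# A position as a hand-packed binary: exactly 8 bytes.
raw = <<1.0::float-32, 2.0::float-32>>

# The same pair as an external-format Erlang term: version byte,
# tuple header, and a tagged 8-byte float for each element.
term = :erlang.term_to_binary({1.0, 2.0})

IO.puts("raw: #{byte_size(raw)} bytes, term: #{byte_size(term)} bytes")
```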
That said, for lower-frequency scenarios or multi-machine setups, C nodes make a lot of sense. Curious about your sandboxing experience — did you run into any latency constraints with the distributed protocol?
Round 2 - lessons from the metal: optimizing 10,000 entities on a DL380 (Xeon vs i7)
The Setup
After my last post about 10k entities on a modern i7, I moved the project to a refurbished HP DL380 Gen9 (2014), with dual Xeon E5-2650 v3 CPUs @ 2.30GHz. The goal was to see how the BEAM handles high-frequency simulation on older, high-core-count server hardware running under Docker.
The “Xeon Wall”
The first run in Docker was a failure. Code that ran in 11ms on my 5GHz desktop workstation hit 240ms+ on the server. The lower single-core clock speed (2.3GHz) exposed serial bottlenecks that were previously hidden:
The QPI/NUMA Gap: Docker was straddling both physical CPU sockets. Moving data between Socket 0 and Socket 1 (Distance 21) added massive jitter to the memory-heavy AI passes.
The Socket Tax: Moving 320KB of entity data through Linux kernel UDS sockets every 33ms to our Rust physics sidecar was eating ~90ms in context-switching and marshalling.
The Fixes
We had to move from standard “Web” patterns to “Systems” patterns to get under the 33ms budget:
Shared Memory Slab: We replaced Unix Domain Sockets with a shared memory segment in /dev/shm. Elixir and Rust now communicate via a 64-byte aligned “Foundry Slab.” Marshalling time dropped from 88ms to < 1ms.
Docker NUMA Pinning: We used --cpuset-cpus and --cpuset-mems to lock the container to Node 0 (Cores 0-9 and their hyperthreads). This kept the BEAM schedulers and the memory local to the same physical silicon.
The N-1 Pipeline: We deferred the physics merge to the start of the next tick. The workers now “gulp” the previous physics result while they are already reading from ETS for the AI pass, reducing ETS write-pressure.
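The N-1 pipeline is easier to see in code. A minimal sketch (names are illustrative, not the actual zone loop): each tick consumes the physics result computed during the previous tick, then kicks off the next physics pass asynchronously, so physics latency overlaps with the AI pass instead of extending the tick.

```elixir
defmodule PipelineSketch do
  # State is {entities, in-flight physics task (nil on the first tick)}.
  def tick({entities, physics_task}) do
    # 1. Gulp last tick's physics result, one frame late by design.
    prev_physics = physics_task && Task.await(physics_task)
    entities = merge_physics(entities, prev_physics)

    # 2. Kick off physics for *this* frame; it resolves next tick.
    next_task = Task.async(fn -> compute_physics(entities) end)

    # 3. Run the AI pass while physics crunches in the background.
    {run_ai(entities), next_task}
  end

  defp merge_physics(entities, nil), do: entities
  defp merge_physics(entities, phys), do: Map.merge(entities, phys)

  # Stand-ins for the real passes.
  defp compute_physics(entities), do: Map.new(entities, fn {id, v} -> {id, v + 1} end)
  defp run_ai(entities), do: entities
end
```

The trade-off is explicit: physics is always one frame stale, which is fine at 30Hz for collision but would need care for anything latency-critical.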
For comparison, the i7-13700 workstation runs the same “Foundry Slab” architecture at 7-10ms P99.
Takeaway
Running high-performance Elixir in Docker is totally viable, but hardware is not a transparent layer. On older server metal, the “Postman” (I/O) is often the bottleneck, not the “Engineer” (The Logic). By moving to Zero-Copy SHM and respecting NUMA topology, we reclaimed over 200ms of tick time.