Anyone pushed the BEAM for real-time simulation workloads? I built a multiplayer game server framework pushing 10,000 Entities @ 30Hz

I’ve been building an open-source multiplayer game server framework in Elixir, and I wanted to share the results.

The Challenge: Can the BEAM handle MMO-scale real-time simulation? The conventional wisdom says “use C++ for game servers.” I wanted to prove otherwise.

The Results:

| Metric | Target | Achieved |
| --- | --- | --- |
| Tick time (10K entities) | < 29ms | 8ms avg, 13ms max |
| P99 tick time | < 29ms | 11.1ms |
| Bytes per entity | < 20 | 18 (template bitpacking) |
| Per-player bandwidth | < 1 MB/s | 264 KB/s |
| Full broadcast bandwidth | < 100 MB/s | 5.2 MB/s |
| Crash recovery | < 100ms | 4.6ms |

Key Techniques:

  • Bucket-parallel tick loop — Chunk entities across schedulers, not Task-per-entity

  • Fused encoding — Tick + serialize in one pass while data is hot in L1 cache

  • :maps.from_list — 9x faster than Map.put in a reduce (TIL!)

  • IO lists everywhere — Never concatenate binaries, let writev gather fragments

  • Time-travel debugging — Circular buffer with structural sharing for replay

  • Cross-zone shadows — PubSub-based visibility across zone boundaries

  • Async persistence — Snapshot + journal with atomic writes
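
The `:maps.from_list` point above can be sketched like this (the entity shape is assumed for illustration):

```elixir
# Building the map key-by-key updates the map on every iteration:
by_id = Enum.reduce(entities, %{}, fn e, acc -> Map.put(acc, e.id, e) end)

# Collecting {key, value} pairs and calling :maps.from_list once lets the
# VM construct the whole map in a single pass, which is much faster:
by_id =
  entities
  |> Enum.map(&{&1.id, &1})
  |> :maps.from_list()
```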

The “aha moment”: Killing the 68ms sequential serialization bottleneck. Fused parallel encoding brought it down to 2ms for a typical player’s visible area (500 entities).
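
A minimal sketch of what fused parallel encoding can look like (the chunking, the `step/1` update, and the tiny binary layout here are hypothetical, not the framework’s actual code):

```elixir
# Update and serialize in the same pass, per chunk, so each entity's
# fields are encoded while still hot in cache. Workers return binaries
# that can be gathered into an iolist and handed to writev.
def tick_and_encode(entities, chunk_size \\ 512) do
  entities
  |> Enum.chunk_every(chunk_size)
  |> Task.async_stream(fn chunk ->
    for e <- chunk do
      e = step(e)                          # assumed behavior/position update
      {e, <<e.id::32, e.x::16, e.y::16>>}  # encode immediately after update
    end
  end, ordered: false)
  |> Enum.flat_map(fn {:ok, pairs} -> pairs end)
end
```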

Architecture highlights:

  • Hybrid entity model (players as processes, NPCs as data in zone state)

  • ETS spatial grid for O(1) neighbor queries

  • Binary protocol with reliable_seq gap detection

  • Dead reckoning for shadow entity interpolation

  • Hysteresis on zone boundaries to prevent oscillation
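
For readers unfamiliar with the ETS spatial-grid pattern, here is a rough sketch (cell size, table name, and record shape are assumptions):

```elixir
# A :bag table keyed by {cell_x, cell_y}; each neighbor query touches at
# most the 3x3 block of cells around a position, independent of world size.
:ets.new(:grid, [:bag, :named_table, :public, read_concurrency: true])

def cell({x, y}, cell_size \\ 64),
  do: {div(trunc(x), cell_size), div(trunc(y), cell_size)}

def insert(id, pos), do: :ets.insert(:grid, {cell(pos), id})

def neighbors(pos) do
  {cx, cy} = cell(pos)
  for dx <- -1..1, dy <- -1..1,
      {_cell, id} <- :ets.lookup(:grid, {cx + dx, cy + dy}),
      do: id
end
```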

Test suite: 340 tests, 0 failures

Still early days, but the engine handles the “10K goblin stress test” without breaking a sweat. Next up: TypeScript client SDK and a simple LiveView visualizer.

Curious if others have pushed the BEAM for real-time simulation workloads. What’s your experience?

30 Likes

Hey, this is interesting, as I haven’t seen many benchmarks of the BEAM for performance, but what exactly is being tested here? I can see the metrics but don’t exactly understand the test scenario.
Is it a simulation of ‘10k players in the area’? If so, how many actions are they taking on each tick? I’ve read the post three times and don’t see that info.
I’m also curious what scaling this solution would look like (i.e., how much change and effort it would take to run 10k simulations of 10k entities each), and how much you could delegate to Zig/Rust if need be.

1 Like

Hello @nxy7

Yes, you’re right, the test setup isn’t well explained. Let me fill in the details below:

The test scenario:

  • 10,000 NPC entities (not players) in a single zone

  • Each entity runs behaviors every tick (wander, chase, spatial awareness)

  • Each tick: query neighbors, update position, serialize state

  • 30 ticks per second

Player model:

  • Tested with 100 simulated players

  • Each player has AOI (Area of Interest) ~500 visible entities

  • Per-player delta snapshot: ~2ms

  • Full 10K broadcast (worst case): 22ms

What entities DO per tick:

1. Spatial query: "Who's near me?" (ETS grid lookup)
2. Behavior: Decide action (chase/flee/wander)
3. Update: New position/velocity
4. Encode: Serialize to binary (18 bytes)
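
As a concrete illustration of step 4, an 18-byte record falls out naturally from Elixir’s binary syntax. The field widths below are assumptions chosen to sum to 18 bytes (144 bits), not the actual wire format:

```elixir
# Hypothetical layout:
# id:32 | template:8 | x:24 | y:24 | vx:16 | vy:16 | state:8 | hp:16
def encode(e) do
  <<e.id::32, e.template::8, e.x::24, e.y::24,
    e.vx::signed-16, e.vy::signed-16, e.state::8, e.hp::16>>
end
```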

Scaling to 10K × 10K:

Split zones. Each zone handles 10K. Shadows (PubSub) connect boundaries. Topology scales horizontally—add nodes, add zones.
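
A sketch of the shadow mechanism, assuming Phoenix.PubSub and hypothetical topic/message names:

```elixir
# Each zone publishes its boundary entities; neighboring zones subscribe
# and keep read-only "shadow" copies for cross-zone visibility.
Phoenix.PubSub.subscribe(MyApp.PubSub, "zone:7:boundary")

def broadcast_boundary(zone_id, entities) do
  Phoenix.PubSub.broadcast(
    MyApp.PubSub,
    "zone:#{zone_id}:boundary",
    {:shadows, zone_id, entities}
  )
end

# In the neighboring zone's GenServer:
def handle_info({:shadows, from_zone, entities}, state) do
  {:noreply, put_in(state.shadows[from_zone], entities)}
end
```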

Zig/Rust sidecar:

Architecture supports it via Ports. Heavy math (pathfinding, UMAP, physics) can offload. BEAM handles state + networking, sidecar handles computation.
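
For reference, the Port wiring for such a sidecar might look like this (the executable path, framing choice, and helper names are assumptions):

```elixir
# 4-byte length-prefixed frames keep message boundaries intact over the pipe.
port =
  Port.open({:spawn_executable, "/opt/realm/physics_sidecar"},
            [:binary, {:packet, 4}, :exit_status])

Port.command(port, encoded_batch)   # ship a batch of entity state

receive do
  {^port, {:data, result}}    -> apply_physics(result)
  {^port, {:exit_status, _n}} -> fall_back_to_internal_grid()
end
```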

Best,

8 Likes

Nice! I’ve been (slowly) working on a World of Warcraft server, Thistle Tea, that has some similarities.

In Thistle Tea, players and NPCs are processes, but I’ve abstracted the interface to use GUIDs. The idea is that the boundary process layer will be straightforward enough to swap for something that groups entities by zone instead if it ever becomes necessary, without needing to change (much) game logic code.

IO lists are a good idea; they’ve been on my list to benchmark sometime, but I haven’t got around to it. Right now, building packets is just binary concatenation. Bandwidth measurements are a smart idea, too; I have latency metrics, but bandwidth would be helpful for understanding whole-system performance.

I’m also using ETS for a spatial grid and it’s been working really well. Recently I also started using ETS for per-entity metadata that other processes need access to in hot loops; that seems to work decently to avoid message-passing overhead where necessary.

I’ve mostly been focusing on building out functionality, recently got a rudimentary NPC AI wired up using behavior trees to get the basics of combat working. Stuff’s tricky because the client really likes crashing if packets are even slightly malformed, and some of the expected packet layouts aren’t very clear, lol.

It uses a Rust NIF for pathfinding, which would’ve been a pain to implement from scratch. Still a bit of wonkiness there for me to debug sometime, though. Mobs like to stutter around a bit when repathing, which makes me think the server isn’t perfectly simulating movement the same way the client is (or some other bug).

Here are a few blog posts I wrote about the process if you’re interested.

4 Likes

Hey @pikdum, thanks for the detailed response!

I see you’re already down the rabbit hole :slight_smile:

I spent some time reading your devlogs on pikdum.dev—the work you’re doing with Recast/Detour is top-tier.

I noticed you mentioned ‘wonkiness’ and ‘stuttering’ during repathing with the Rust NIF.

We actually agonized over the NIF vs. Sidecar decision for Phase 4 of Realm. We ultimately chose a Sidecar-via-Unix-Domain-Socket approach, and I wanted to share some ‘honest’ statistics that might be interesting for your consolidation move.

The ‘No-Stutter’ SIGKILL Test:
We were scared that the marshalling tax of a Sidecar would kill our 30Hz budget, but we prioritized Fault Isolation. We just ran a ‘Zombie Recovery’ test with 10,000 entities. We executed a SIGKILL on the Rust Sidecar mid-compute.

  • Result: The Elixir Zone detected the failure and fell back to our internal Grid-based collision in 31.6ms.

  • The Win: The world didn’t stutter. The tick stayed within the 33.3ms budget, and the players never knew the ‘brain’ had died.

The ‘Profitability’ Crossover:
We also found a fascinating crossover point in our ‘Apples-to-Apples’ benchmark between pure Elixir (Optimized Grid) and the Rust Sidecar (UDS):

  • < 500 entities: Pure Elixir (O(N²)) is faster.

  • 500 - 5,000 entities: Elixir’s Grid-based O(N) is surprisingly competitive (~1.8ms for 2k entities).

  • > 5,000 entities: This is where the Sidecar’s parallel rayon compute finally pays for the UDS marshalling tax.

Architecture Move:
To kill the serial bottleneck of the Port, we’re using a Dynamic Socket Pool (24 sockets for 24 schedulers) to maintain scheduler affinity. It keeps the L2 cache hot and the marshalling under 2ms for 10k entities.
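
The affinity trick can be sketched like this (the pool representation is assumed):

```elixir
# Pick the socket "owned" by the scheduler the calling process is running
# on, so each scheduler keeps reusing the same socket and its warm cache.
def socket_for_current_scheduler(pool) when is_tuple(pool) do
  sid = :erlang.system_info(:scheduler_id)  # 1-based id of current scheduler
  elem(pool, sid - 1)
end
```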

Curious if you’ve measured the ‘Tail Latency’ of your NIF? We were worried that a long-running NIF call might be causing the stuttering you see by starving the BEAM schedulers, even with Dirty Schedulers.

Would love to compare notes on your Marshalling overhead vs. ours!

Best,

3 Likes

Thank you for sharing your experience and findings. I’m interested to hear why you went with a plain sidecar versus a C node, if you even considered the latter. C nodes give some benefits regarding integration: you can treat your Rust/Zig process as a BEAM node, and it can create process identifiers, register global processes, and communicate with other processes, all asynchronously over file descriptors. My experience with C nodes is related to sandboxing, not simulations, which is why it’s interesting to hear other opinions on the topic.

1 Like

@krasenyp

Thanks for the question! C nodes were definitely on the table — they’re the proven OTP approach and the integration benefits are real.

In our case the bottleneck was raw throughput: 10K entities at 30Hz means we’re pushing binary blobs every 33ms. The Erlang distribution protocol adds term encoding overhead that was hard to justify at that frequency.

We ended up with a pool of 24 Unix Domain Sockets (one per scheduler) sending raw binaries with pre-allocated buffers. It’s simpler but it’s fast enough for the workload.

That said, for lower-frequency scenarios or multi-machine setups, C nodes make a lot of sense. Curious about your sandboxing experience — did you run into any latency constraints with the distributed protocol?

Best,

Round 2: lessons from the metal - optimizing 10,000 entities on a DL380 (Xeon vs i7)

The Setup
After my last post about 10k entities on a modern i7, I moved the project to a refurbished 2014 HP DL380 Gen9 (dual Xeon E5-2650 v3 @ 2.30GHz). The goal was to see how the BEAM handles high-frequency simulation on older, high-core-count server hardware running in Docker.

The “Xeon Wall”
The first run in Docker was a failure. Code that ran in 11ms on my 5GHz desktop workstation hit 240ms+ on the server. The lower single-core clock speed (2.3GHz) exposed serial bottlenecks that were previously hidden:

  1. The QPI/NUMA Gap: Docker was straddling both physical CPU sockets. Moving data between Socket 0 and Socket 1 (Distance 21) added massive jitter to the memory-heavy AI passes.

  2. The Socket Tax: Moving 320KB of entity data through Linux kernel UDS sockets every 33ms to our Rust physics sidecar was eating ~90ms in context-switching and marshalling.

The Fixes
We had to move from standard “Web” patterns to “Systems” patterns to get under the 33ms budget:

  • Shared Memory Slab: We replaced Unix Domain Sockets with a shared memory segment in /dev/shm. Elixir and Rust now communicate via a 64-byte aligned “Foundry Slab.” Marshalling time dropped from 88ms to < 1ms.

  • Docker NUMA Pinning: We used --cpuset-cpus and --cpuset-mems to lock the container to Node 0 (Cores 0-9 and their hyperthreads). This kept the BEAM schedulers and the memory local to the same physical silicon.

  • The N-1 Pipeline: We deferred the physics merge to the start of the next tick. The workers now “gulp” the previous physics result while they are already reading from ETS for the AI pass, reducing ETS write-pressure.
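
The N-1 pipeline bullet can be sketched as a tick handler (all helper names are hypothetical; the sidecar’s reply would be stored into `state.physics_result` by a separate `handle_info` clause):

```elixir
# Tick N merges the physics result computed during tick N-1, then fires
# off the next physics request before running the AI pass, so the sidecar
# computes in parallel with the BEAM's ETS reads.
def handle_info(:tick, %{physics_result: prev} = state) do
  state = merge_physics(state, prev)   # "gulp" last tick's result
  request_physics_async(state)         # non-blocking handoff to the slab
  state = run_ai_pass(state)           # AI reads ETS while Rust computes
  Process.send_after(self(), :tick, 33)
  {:noreply, state}
end
```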

The Results

  • DL380 (Xeon E5-2650 v3): 22-24ms P99 (solid 30Hz in Docker)

  • i7-13700 Workstation: 7-10ms P99 (Running the same “Foundry Slab” architecture)

Takeaway
Running high-performance Elixir in Docker is totally viable, but hardware is not a transparent layer. On older server metal, the “Postman” (I/O) is often the bottleneck, not the “Engineer” (The Logic). By moving to Zero-Copy SHM and respecting NUMA topology, we reclaimed over 200ms of tick time.

5 Likes

Super interested to read the code, if you are open-sourcing it.

I remember this being a thing 20+ years ago with the Java debugging protocol, btw.

2 Likes

Do you have source available anywhere? I would be interested in taking a peek.

4 Likes

Sounds cool! Would love to follow and learn as well

1 Like

Hi @dimitarvp and @hauleth, it’s still not released; I’m doing some refactoring to reduce the complexity of some modules. I’ll keep you posted.

2 Likes