PartitionSupervisor architecture/optimizations

Hey folks! :waving_hand:

I think I have an interesting optimization challenge, let me know what you think…

A few months ago, I created my own ECS (Entity Component System) implementation based on a different interpretation of the pattern (because why not!?). Here’s the link to the repo in case you are curious. It’s ok if you haven’t heard of the ECS pattern before. I will try to describe the problem more generically:

I have a list of “objects” (which are just numeric identifiers), and those “objects” have “behaviors” attached to them. The only thing those “behaviors” do is handle events sent to the “objects” they are attached to. A really basic example:

defmodule Ping do
  def handle_event(:play, object, _args), 
    do: IO.puts("Ping for #{object}")
end

Each “behavior” is just a module that implements a handle_event callback; it is not a GenServer. To make things easier to visualize, imagine you have two “behaviors”: Ping and Pong:

object1 = System.unique_integer([:positive])  # 1314
object2 = System.unique_integer([:positive])  # 1416

Ping.attach(object1)
Pong.attach(object1)

Ping.attach(object2)
Pong.attach(object2)

World.send(object1, :play) 
# Ping for 1314
# Pong for 1314

World.send(object2, :play) 
# Ping for 1416
# Pong for 1416
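
For illustration, attach/detach can be thought of as writes to a table keyed by the object id. Something like this (a simplified sketch, not exactly what the repo does; World.Behaviors is just a stand-in name):

defmodule World.Behaviors do
  @table :world_behaviors

  # Call once at startup: a public bag table mapping object id -> behavior module.
  def init, do: :ets.new(@table, [:bag, :public, :named_table])

  def attach(behavior, object), do: :ets.insert(@table, {object, behavior})
  def detach(behavior, object), do: :ets.delete_object(@table, {object, behavior})

  # All behaviors currently attached to an object.
  def for_object(object), do: for({_obj, behavior} <- :ets.lookup(@table, object), do: behavior)
end

Ping.attach(object) would then boil down to something like World.Behaviors.attach(Ping, object).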

There are a few interesting properties about this problem:

  • A “behavior” can also send additional events

  • The event-handling logic is synchronous and can potentially block.

  • Events sent to the same “object” should always be processed in order:

    Enum.each([Ping, Pong, Other], &(&1.handle_event(event, object, args)))
    
  • If you send another event to the same “object”, it should be queued up and processed only after the current event has been handled by all of its “behaviors”:

    (object1) :play -> [Ping] -> [Pong] -> :forfeit -> [Other]
    
  • Events sent to different “objects” can be parallelized as they don’t affect each other:

    (object1) :play   -> [Ping] -> [Pong]
    (object2) :play   -> [Ping] -> [Pong]
    (object3) :forfeit -> [Other]
    

The current solution: there’s a single GenServer called World that is responsible for dispatching messages to registered objects. It sits in front of a PartitionSupervisor, which “shards” the “objects” by hashing their ids, so events for the same object are always processed sequentially on the same partition, while events sent to different “objects” are processed in parallel:

[World]
|----(send :play)-----> [Router 1] (object1, Ping -> Pong)
|----(send :play)-----> [Router 2] (object2, Ping -> Pong)
|----(send :forfeit)--> [Router 3] (object3, Ping -> Pong -> Other)
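
In code, the routing looks roughly like this (simplified; World.Routers and World.Router are stand-in names, and World.Behaviors.for_object/1 stands in for however the attached behaviors are actually looked up):

defmodule World do
  # One Router process per partition; PartitionSupervisor hashes on the routing
  # key, so the same object id always lands on the same Router.
  def start_link(_opts) do
    PartitionSupervisor.start_link(child_spec: World.Router, name: World.Routers)
  end

  def send(object, event, args \\ []) do
    GenServer.cast(
      {:via, PartitionSupervisor, {World.Routers, object}},
      {:dispatch, object, event, args}
    )
  end
end

defmodule World.Router do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts), do: {:ok, opts}

  @impl true
  def handle_cast({:dispatch, object, event, args}, state) do
    # Synchronous inside the partition: events for the same object are handled
    # one at a time, and later events queue up in this process's mailbox.
    for behavior <- World.Behaviors.for_object(object) do
      behavior.handle_event(event, object, args)
    end

    {:noreply, state}
  end
end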

The challenge with the current solution is that there is only a fixed number of partitions, and those processes can quickly become a bottleneck if, say, you have multiple LiveViews sending messages at the same time: the load increases, the partitions’ reductions climb, and they slow to a crawl. I could potentially work on resizing the partitions, but that would require some additional logic to detect when those processes are being overwhelmed (which might already be too late).
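
As an aside, a quick way to confirm the partitions are the bottleneck is to sample each partition’s mailbox and reductions from iex (assuming the World.Routers name from the sketch above):

for {partition, pid, _type, _modules} <- PartitionSupervisor.which_children(World.Routers),
    is_pid(pid) do
  # message_queue_len shows how far behind a partition is;
  # reductions roughly tracks how much work it has done so far.
  {partition, Process.info(pid, [:message_queue_len, :reductions])}
end

A large message_queue_len concentrated on one or two partitions would also reveal “hot” objects that happen to hash to the same partition.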

I briefly explored GenStage, which would change the topology to something like:

[World]
|----(send :play)-----> [Ping] -> [Pong] -> [Other]
|----(send :play)-----> [Ping] -> [Pong] -> [Other]
|----(send :forfeit)--> [Ping] -> [Pong] -> [Other]

This problem almost looks like the perfect use case for GenStage; however, there’s one additional aspect of this problem that makes things a little trickier: “behaviors” and events are completely dynamic! They can be attached to or removed from an object at any point in time, and this affects which stages need to execute.

So, even though the order of the stages is always consistent, not all objects need to go through all the stages. For instance, “object1” might need to handle Ping → Pong → Other, while “object2” needs just Pong → Other, or even Ping → Other.

I haven’t given up on GenStage yet, but I still have to figure out whether a demand of 1 makes sense to keep up with the influx of events, and how to properly handle partitions (GenStage.PartitionDispatcher, perhaps) that can grow/shrink.
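
In case it helps frame the discussion, this is roughly the shape I’d expect the producer side to take with GenStage.PartitionDispatcher, hashing on the object id (just a sketch; the module names and the fixed 8 partitions are placeholders):

defmodule World.Producer do
  use GenStage

  def start_link(_opts), do: GenStage.start_link(__MODULE__, :ok, name: __MODULE__)

  def dispatch(object, event, args \\ []),
    do: GenStage.cast(__MODULE__, {:event, {object, event, args}})

  @impl true
  def init(:ok) do
    # Hash on the object id so every event for the same object goes to the
    # same partition (and therefore the same consumer), preserving order.
    dispatcher =
      {GenStage.PartitionDispatcher,
       partitions: 8,
       hash: fn {object, _event, _args} = ev -> {ev, :erlang.phash2(object, 8)} end}

    {:producer, :no_state, dispatcher: dispatcher}
  end

  @impl true
  def handle_cast({:event, ev}, state) do
    # Leaning on GenStage's internal buffer when there is no pending demand;
    # a production-grade producer would track demand and queue events itself.
    {:noreply, [ev], state}
  end

  @impl true
  def handle_demand(_demand, state), do: {:noreply, [], state}
end

Each consumer would then subscribe with something like subscribe_to: [{World.Producer, partition: 0, max_demand: 1}], which is exactly where the “demand of 1” question comes in.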


To finish this up, I think an ideal solution (in my mind, at least) would be something like a dynamic pool of partitions that grows to fit the demand and shrinks when there’s a lull in events. So:

  1. An object gets a new event.
  2. Check whether there’s an available process to handle the event for this object.
    2.1. If yes, send the event to it to be processed (potentially blocking).
    2.2. If not, spawn a new process and register it as the object’s “handler” (see the sketch after this list).
  3. Processes periodically report their queue length somewhere, so we can decide whether the demand is too high and the pool needs to grow or shrink (because events for the same object must always be handled by the same partition, we might also need some logic to shed/redirect messages between partitions if one particular object is taking too long).
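
Here is a rough sketch of steps 1–2 using a Registry plus a DynamicSupervisor (all names are made up, and World.Handler would be a hypothetical per-object GenServer that pops events off its mailbox and runs the attached behaviors):

defmodule World.Dispatcher do
  # Assumes the supervision tree starts:
  #   {Registry, keys: :unique, name: World.Registry}
  #   {DynamicSupervisor, name: World.HandlerSup}

  def send(object, event, args \\ []) do
    case Registry.lookup(World.Registry, object) do
      [{pid, _value}] ->
        GenServer.cast(pid, {:event, event, args})

      [] ->
        # Nobody is handling this object yet: start a handler on demand.
        # Another caller may race us, so treat :already_started as success.
        spec = {World.Handler, object: object, name: via(object)}

        case DynamicSupervisor.start_child(World.HandlerSup, spec) do
          {:ok, pid} -> GenServer.cast(pid, {:event, event, args})
          {:error, {:already_started, pid}} -> GenServer.cast(pid, {:event, event, args})
        end
    end
  end

  defp via(object), do: {:via, Registry, {World.Registry, object}}
end

Handlers that stop themselves after an idle timeout would give the “shrink” part for free, and step 3 could be each handler periodically reporting its message_queue_len to whatever process decides on growing or shedding load.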

This is all theoretical; I haven’t tried it yet. So, what are your thoughts? How would you solve this?
