Why is this system (running only a thousand simple GenServers) slow?

Qqwy · June 4, 2017, 8:53am

Recently, a colleague made a simple ‘online game’. The player state consists of a money counter, the amounts of three different resources, and a number of miners (more miners means more resources excavated). This state is updated every so often and shown to the player in the browser, and the player can interact with it (buy more miners, sell all of a resource) with the goal of making as much money as possible.

It’s a very simple concept. The proof-of-concept implementation my colleague wrote in Ruby on Rails keeps the player state in the database at all times, runs a cron task to update it, and uses AJAX calls to buy miners, sell resources and ask for updates.

I rewrote this concept in Elixir, to show off what Elixir is good at.

The Source Code can be found here.

Important design decisions:

All communication is done over Websockets + Phoenix Channels.
Player states are maintained each in a GenServer, which sends itself a ‘tick’ message every couple of seconds (rather than using a background job).
Persistence is only done when required (the database is not ‘part of the loop’). In fact, there is no database in use right now: per the Dependency Inversion principle, I wrote a Persistence behaviour and implemented it for a very simple FileSystem storage. Loading only happens on application startup, and writing only happens when a player performs a game action.
Updates to player states are broadcast using Phoenix.PubSub; they are received by the Phoenix Channel and then forwarded to the Browser.
The Game is therefore completely separated from the Web-layer.
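
As a rough illustration of the self-ticking design described above, here is a minimal sketch of a GenServer that schedules its own ‘tick’ message. It is not the code from the linked repository: the module name, the state fields and the 2-second interval are all assumptions made for the example, and the PubSub broadcast and persistence steps are left out.

```elixir
defmodule TickSketch.PlayerServer do
  # Hypothetical module for illustration; not the project's actual PlayerServer.
  use GenServer

  # Assumed interval; the thread only says "every couple of seconds".
  @tick_interval_ms 2_000

  def start_link(player_id) do
    GenServer.start_link(__MODULE__, player_id)
  end

  # Returns the current player state of the given server.
  def state(pid), do: GenServer.call(pid, :state)

  @impl true
  def init(player_id) do
    schedule_tick()
    {:ok, %{id: player_id, miners: 1, resources: 0}}
  end

  @impl true
  def handle_call(:state, _from, state), do: {:reply, state, state}

  @impl true
  def handle_info(:tick, state) do
    # Schedule the next tick first, then apply one game step.
    schedule_tick()
    {:noreply, tick(state)}
  end

  # Pure state transition: every miner excavates one resource per tick.
  def tick(state), do: %{state | resources: state.resources + state.miners}

  defp schedule_tick do
    Process.send_after(self(), :tick, @tick_interval_ms)
  end
end
```

Keeping `tick/1` a pure function means the game step can be tested without starting a process at all.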

At its core, I feel it works great, and is readable and extendable. However, when I tried starting 1000 players, my computer started to whine. System performance degrades gracefully and all eight of my CPU cores are at 100% usage, but I am a bit baffled that running 1000 of these PlayerServer GenServers brings my computer to its knees.

Therefore, I expect that I am doing something in a very sub-optimal way from an Actor-Model perspective. I have no idea how to properly introspect it, however. I tried running :observer (which is also rather unresponsive while the 1000 player servers are running) and I see that IO is not a problem, that all my cores are indeed being used, and that no messages seem to be stuck in message queues anywhere. But as for why it can only handle so few players: no clue.

Help is greatly appreciated!

voughtdq · June 4, 2017, 8:59am

(this is just a guess)

It’s Timex.

outlog · June 4, 2017, 9:06am

yeah, remove the timex stuff.

I’m not sure how much File.io you are doing - but perhaps add an alternative :ets persistence layer, just to see.
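
A minimal sketch of what such an ETS-backed layer could look like, purely for experimentation. The module name, the table name and the `user_id` key in the player state are assumptions; only the function names `persist_player` and `load_player` are taken from code quoted in this thread, and the real Persistence behaviour may define other callbacks.

```elixir
defmodule EtsPersistenceSketch do
  # Hypothetical in-memory stand-in for the file-based persistence layer.
  # ETS data is lost when the owning process or the VM stops, so this is
  # only useful for ruling out file IO as the bottleneck.
  @table :player_states_sketch

  # Creates the named table if it does not exist yet.
  # The calling process becomes the owner of the table.
  def init do
    if :ets.whereis(@table) == :undefined do
      :ets.new(@table, [:named_table, :public, :set])
    end

    :ok
  end

  # Stores the whole player state under its (assumed) user_id key.
  def persist_player(%{user_id: user_id} = state) do
    true = :ets.insert(@table, {user_id, state})
    :ok
  end

  # Mirrors the {:ok, state} | :error shape of the file-based loader.
  def load_player(user_id) do
    case :ets.lookup(@table, user_id) do
      [{^user_id, state}] -> {:ok, state}
      [] -> :error
    end
  end
end
```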

Qqwy · June 4, 2017, 9:07am

@outlog The one thing I was able to read from Observer’s summary page was that there was barely any file IO happening. The only time the Persistence layer is used is when the servers start, and after that only when a player makes a change. (So not on every update!)

I will try out what happens when not using Timex.

outlog · June 4, 2017, 10:46am

It’s only the first “tick” that is slow as you are replaying all ticks from last_player_tick_datetime up until now (in tick_until_updated(state)) … those stamps are from around 2017-05-31 19:43:27.753034Z - so that makes for a lot of operations times 1000 (~350 million “ticks” - which have sub ops) - and then if timex is a bit on the slow side…
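
To get a feel for the size of that backlog, here is a small sketch that counts the ticks a single player has to replay. The 1-second tick interval and the ‘now’ timestamp (the time of this post, read as UTC) are assumptions; only the starting timestamp comes from the persisted data mentioned above.

```elixir
defmodule ReplayCostSketch do
  # Number of whole ticks between the last persisted tick and now.
  def missed_ticks(last_tick, now, interval_ms) do
    div(DateTime.diff(now, last_tick, :millisecond), interval_ms)
  end
end

{:ok, last_tick, 0} = DateTime.from_iso8601("2017-05-31T19:43:27Z")
# Assumed "now": the moment this reply was posted, treated as UTC.
{:ok, now, 0} = DateTime.from_iso8601("2017-06-04T10:46:00Z")

# Assumed tick interval of 1000 ms.
per_player = ReplayCostSketch.missed_ticks(last_tick, now, 1_000)
IO.puts("ticks per player: #{per_player}, for 1000 players: #{per_player * 1000}")
```

Under these assumptions each player replays a few hundred thousand ticks, so 1000 players land in the hundreds of millions, the same order of magnitude as the estimate above.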

Quick and dirty fix in Game.PlayerServer: persist the data after it has been replayed up until now, so future launches will be faster. By your design the launch will replay ticks, which I suppose is fine.
Otherwise you have to update last_player_tick_datetime to now when loading the player.

add:

  defhandleinfo :first_tick!, state: state do
    updated_state = tick_until_updated(state)
    GamePersistence.Persistence.persist_player(updated_state)
    send_next_tick()
    broadcast_update(updated_state)
    new_state(updated_state)
  end

and then in the init replace send_next_tick() with :erlang.send_after(1, self(), :first_tick!)

Then wait out a first launch and subsequent ones will be faster.

sasajuric · June 4, 2017, 11:04am

Given that CPU usage is 100% and you don’t observe I/O load, it’s possible that the time is being spent inside the PlayerServer processes themselves.

You could verify this by experimentally finding a smaller number of player servers which puts your CPU below 100% (say at 90% or so). Then you would have enough CPU available to work with tools such as observer. If you constantly see your player servers at the top of the processes tab, that is good evidence that these are the processes consuming your CPU.

Going further, you could use eprof to get some pointers about where you spend most of your time. This SO answer by Fred gives a quick start.
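
For reference, a minimal sketch of driving eprof from Elixir. The `work` function is a made-up workload standing in for whatever code you suspect; in a real session you would profile a call into your own modules instead. It assumes the Erlang `tools` application is available, which is normally the case in a development install.

```elixir
# Hypothetical workload standing in for the suspicious code path.
work = fn ->
  Enum.reduce(1..100_000, 0, fn i, acc -> acc + rem(i, 7) end)
end

# Start the profiler, run the function under it, print the per-function
# time breakdown, and shut the profiler down again.
:eprof.start()
{:ok, result} = :eprof.profile(work)
:eprof.analyze()
:eprof.stop()

IO.inspect(result, label: "workload result")
```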

If you’re able to find a sequential piece of code which causes your problem, you can drill into it further with the fprof Task.

IME reading the output of these profilers will usually require some meditation, so don’t be surprised if you’re not immediately able to find the cause. But most often you should be able to get to the root cause of your bottlenecks, or at least narrow down the problematic area.

Combining these techniques with some cheap trickery, such as commenting or stubbing out suspicious pieces of the code should help you find the problematic parts of your code.

outlog · June 4, 2017, 11:57am

I didn’t wait for the replays, but did a single launch with the Map.put line in effect (it is shown commented out below), together with the first_tick changes above:

  def load_player(user_id) do
    case File.read(filepath(user_id)) do
      {:ok, data} ->
        user_state = :erlang.binary_to_term(data)
        #user_state = Map.put(user_state, :last_player_tick_datetime, Timex.now())
        {:ok, user_state}
      _ ->
        :error
    end
  end

then added 10000 players and it’s running fine at 16% cpu…

Qqwy · June 4, 2017, 4:18pm

@outlog This change was indeed all that was needed. Depending on how long ago the application was last started, CPU load will stay at 100% for a while, but after a few minutes the system will have caught up with the current time. Ensuring that this does not need to be recomputed again is definitely a smart idea.

I am absolutely amazed. I am now running the system with 10_000 players, and after the initial updates are done, CPU load drops to much lower levels. I expect that I could easily run twenty or thirty thousand players (and possibly even more if the mix environment were set to production and the application deployed to an otherwise empty virtual private server). I am extremely happy! This is definitely Fast Enough.

@sasajuric Thank you for your very detailed description of how to profile a system. For good measure, I have run eprof to check what is going on during startup, and indeed all the time is spent in the startup procedure of the PlayerServers.

outlog · June 4, 2017, 4:36pm

great, you still need to add a “now and then” write to persistence function - so the persisted data doesn’t drift that far behind as the server runs… (and refactor my “fixes” when you are at it)
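
One possible shape for such a “now and then” write, as a sketch only: a second timer message next to the tick that hands the current state to a persistence function. The module name, the 60-second default and the injected `persist_fun` are assumptions made so the example stands on its own; ticking and game actions are omitted.

```elixir
defmodule PeriodicPersistSketch do
  # Hypothetical module showing only the periodic-persist timer.
  use GenServer

  # Assumed default: write the state out once a minute.
  @persist_interval_ms 60_000

  def start_link(player, persist_fun, interval_ms \\ @persist_interval_ms) do
    GenServer.start_link(__MODULE__, {player, persist_fun, interval_ms})
  end

  @impl true
  def init({player, persist_fun, interval_ms}) do
    schedule_persist(interval_ms)
    {:ok, %{player: player, persist_fun: persist_fun, interval_ms: interval_ms}}
  end

  @impl true
  def handle_info(:persist, server) do
    # Write the current player state, then arm the timer again.
    server.persist_fun.(server.player)
    schedule_persist(server.interval_ms)
    {:noreply, server}
  end

  defp schedule_persist(interval_ms) do
    Process.send_after(self(), :persist, interval_ms)
  end
end
```

In the project itself, the function passed in could be something like `&GamePersistence.Persistence.persist_player/1`, the call used in the first_tick fix earlier in this thread.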

I would look at not using Timex, but hey if it works and is fast enough;-)