Can a GenServer state be too "big"? (and general application architecture)

Hoping the old adage that no question is a dumb question holds true with this one…

When building an application that requires maintaining state, should that state be maintained by a single GenServer process? Or should it be split up across multiple processes?
At what point does the state maintained within a single GenServer become too big? Is that even possible?
Does this just boil down to semantics and code organization?

I guess poor performance would be one indicator that the state should be split across more processes. Are there more defined best practices in this regard?

Do mnesia or ETS have a role to play in applications which have a lot of state to maintain, but do not require persistence?

Hope the answers to this question will not only help me better understand and build Elixir applications, but other beginners as well!

Thanks all


We have a saying in Germany that there are no dumb questions, only dumb answers :wink:

I think this is very similar to a discussion we recently had in a programming exercise (imperative) about a global super "object"¹ to hold and maintain state, versus having multiple small distinct global variables. There wasn’t a real conclusion, but after half an hour everyone decided to use multiple small states, to reduce the levels of indirection.

I don’t know about best practices here, and I don’t have that much battle-tested experience with the BEAM, but depending on how your super-state is structured I fear it will massively influence garbage collection (and generation ;))

Haven’t used mnesia so far, but I use :ets in my erlang applications all the time. Reads and writes are much faster than a GenServer keeping a Map as state. There are also optimisations in :ets for concurrent reads and writes (even interleaved), which you can’t get from a corresponding GenServer holding a Map.

¹: Super-sized-struct with accessors would be a better wording here, since we were using C.
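To make the :ets comparison concrete, here is a minimal sketch (the table name :my_cache is purely illustrative): a public table can be read directly from any process, with no message passing through a single GenServer mailbox.

```elixir
# A public ETS table: any process may read or write it directly.
# read_concurrency: true optimises for many simultaneous readers.
table = :ets.new(:my_cache, [:set, :public, read_concurrency: true])

:ets.insert(table, {:answer, 42})

# Lookup returns a list of matching tuples.
[{:answer, value}] = :ets.lookup(table, :answer)
# value == 42
```

With a GenServer holding a Map, every one of those lookups would instead be a `GenServer.call/2` serialized through the owner process.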


I wouldn’t say there is one true way. It really varies from case to case, depending on what you want to achieve.

The benefit of keeping the state in a single process is that you have strong consistency. At any point in time, there can be only one process accessing the state. On the flip side, since all the requests are serialized, that process may become a sequential bottleneck if used frequently by many different processes. Another important downside is that you lose the entire state when the process fails. The consequences of a single error might be larger than you want.

Splitting the state/responsibilities over multiple processes gives you a reversal of the properties above. There is no strong consistency: you can’t read a frozen snapshot of states from multiple processes, nor can you atomically change states in them. On the upside, you have possible performance improvements, and better failure isolation: if one process crashes, the others are still running, so you get to keep most of your service.

Hence, I’d say that choosing one or the other depends on the problem at hand. The work that needs to be done atomically and consistently should reside in a single process. It’s better off splitting independent tasks into multiple processes, for better error isolation as well as performance and scalability. It’s of course never as simple or as black/white as this, but these are the general guidelines I’d recommend. Occasionally diverging from that path for good reasons is fine.

One point about this:

Does this just boil down to semantics and code organization?

Processes are not the tool for code organization, but rather the organization of runtime. If you need to organize the code, modules are the way to do it. Let’s say for example that the state of your process becomes quite complex, but you still want to keep it in the same process. Then, consider implementing the state in one or more separate, (usually) purely data-oriented modules, and have GenServer functions just use those modules.
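As a sketch of that idea (module names here are hypothetical), the data logic lives in a pure, data-oriented module, and the GenServer callbacks only delegate to it:

```elixir
defmodule Inventory do
  # Purely functional state module: no process, just data transformations.
  def new, do: %{}
  def add(inv, item), do: Map.update(inv, item, 1, &(&1 + 1))
  def count(inv, item), do: Map.get(inv, item, 0)
end

defmodule InventoryServer do
  # The GenServer only handles the runtime concerns (serialization,
  # lifecycle); all state logic is delegated to Inventory.
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  def init(:ok), do: {:ok, Inventory.new()}

  def handle_call({:add, item}, _from, inv),
    do: {:reply, :ok, Inventory.add(inv, item)}

  def handle_call({:count, item}, _from, inv),
    do: {:reply, Inventory.count(inv, item), inv}
end
```

A nice side effect is that `Inventory` can be unit-tested as plain functions, without starting any process at all.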


Great thoughtful responses, thanks @sasajuric & @NobbZ !

Going to take the advice and try to split up state when it makes sense.

ETS seems like a good solution for when things start to get out of hand with GenServers. I’m going to try to keep as much as I can within processes for now, mainly for simplicity.

One thing I was not considering was more dynamic naming for "object"-like processes, as is explained in this blog post. This will allow each "object" to maintain its own state.

Which brought to mind this article, The Most Object-Oriented Language?

Aha! I was misunderstanding process naming. It is pretty common practice to name a GenServer process after the module from which it is launched, which led me to think that module == process, which is incorrect!
A GenServer module defines the API of a process, but as you mention during runtime a single module could actually make up multiple processes. Modules can also simply be groupings of functions used to transform data, and not a process at all.

Understanding this much better now, correct me if I am wrong though.


Right, so a module is essentially just a bunch of functions, while a process is a sequential computation. A process can call arbitrary functions from many different modules. Similarly, a module (its functions) can be invoked from many different processes.

Those are really two orthogonal concepts. Modules are used to organize the code, while processes are used to run different tasks separately, and thus get benefits such as scalability and fault-tolerance. And that was the main point of my earlier statement. You shouldn’t reach for multiple processes to organize your code. If your code feels somehow complex, or you maybe want to extract some common abstractions, modules/functions would be the tool for the job.
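A tiny illustration of that orthogonality (the module name is made up): one module’s function running concurrently in several processes at once.

```elixir
defmodule Doubler do
  # A module is just a bag of functions; it has no process of its own.
  def double(n), do: n * 2
end

parent = self()

# The same function runs in three separate, concurrent processes.
for n <- 1..3, do: spawn(fn -> send(parent, Doubler.double(n)) end)

# Collect the three results from the mailbox (arrival order may vary).
results = for _ <- 1..3, do: (receive do result -> result end)
Enum.sort(results)
# => [2, 4, 6]
```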


ETS is very good for managing a large amount of state when you don’t require persistence. The thing to remember is that it is a datastore and not a database, so it does not support transactions, or only very, very limited ones. So if you need to control access to an ETS table, you need to wrap it with a process.

Depending on what you need, a GenServer might be the right way to go. Since you are keeping the data in an ETS table, the GenServer process itself will not get very large. This is basically what Mnesia does, except that it provides a much larger set of features, for example transactions, replication, distribution, and persistence. It gives you a distributed ACID database.

TANSTAAFL which means you have to decide what you need and what you are willing to pay for.

One thing to remember with ETS tables is that as they are stored “outside” all processes accessing them means copying data between the process and the table.

Robert
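That wrapping pattern could look something like this sketch (module and table names are hypothetical): the GenServer owns the table and serializes writes through its mailbox, while reads go straight to ETS and bypass the process entirely.

```elixir
defmodule TableOwner do
  use GenServer

  @table :table_owner_cache

  def start_link(_), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  # Writes are controlled: they funnel through the owning process.
  def put(key, value), do: GenServer.call(__MODULE__, {:put, key, value})

  # Reads hit the table directly, avoiding the single-process bottleneck.
  def get(key) do
    case :ets.lookup(@table, key) do
      [{^key, value}] -> {:ok, value}
      [] -> :error
    end
  end

  def init(:ok) do
    :ets.new(@table, [:named_table, :public, read_concurrency: true])
    {:ok, nil}
  end

  def handle_call({:put, key, value}, _from, state) do
    :ets.insert(@table, {key, value})
    {:reply, :ok, state}
  end
end
```

Because the table is owned by the GenServer, it also lives and dies with that process, so a supervisor restarting the owner starts you over with a fresh table.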


Because I had to look that up, I thought I would share that it means, “There ain’t no such thing as a free lunch.”


You had to look it up? Never read Heinlein? Hand in your nerd card. :smiley:


I got here as I was about to start a similar post in relation to a game I am working on (Phoenix-based). I am currently prototyping my application, and the first thing I am working on is adjusting my player’s health over time. To keep my explanation simple, I will break my programming task down into the following sentence.

"The longer they spend in the channel, the more their health is affected."

I guess in my case the GenServer process is the channel, and I am using socket.assigns to keep a reference to my player. My player’s parameters (e.g. their health) are modified over time, and their status is pushed back down the socket to my frontend.

Coming from an OO background, I was immediately breaking everything up into processes. My first line of thinking was that I would need a GenServer (call it MyApp.Player) to represent my player (and link this to the Phoenix channel). Then I would need another GenServer (call it MyApp.HealthModifier) linked to the relevant MyApp.Player process. The MyApp.HealthModifier process could then deteriorate my MyApp.Player’s health by sending messages back with the amount their health should be decremented by.

After reading Sasa Juric’s article “to spawn or not to spawn” (great article title by the way), I started going in a different direction. I am still not sure if it’s the right direction, but time will tell.

I have now started treating the Phoenix channel as the only GenServer that I will need (at this point in my application’s infancy) and keeping track of how long they have spent in the channel in a socket.assigns variable. Then, when it comes time to push the player’s state back to the front end, I can “modify” my player by working out how long they have been in the channel, doing the appropriate calculations, and rebinding the result to my socket.assigns reference.

Which is completely different to how I was going to approach the problem.
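That approach boils down to a pure calculation over elapsed time, something like this sketch (the module name and drain rate are entirely made up):

```elixir
defmodule PlayerState do
  # Hypothetical drain rate: one health point per second in the channel.
  @drain_per_second 1

  # Derive current health from the join timestamp instead of having a
  # separate modifier process mutate it over time.
  def health(base_health, joined_at, now) do
    elapsed = now - joined_at
    max(base_health - elapsed * @drain_per_second, 0)
  end
end

PlayerState.health(100, 0, 30)
# => 70
```

Because the function is pure, the channel process just calls it with the stored join time whenever it pushes state to the client.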

My only concern is that I am now looking at adding other “modifiers” to the Player. E.g. I might have a MyApp.ArmorModifier that will increase the player’s armor over time, and taking my second approach I would be serializing all this code, as I would call the MyApp.HealthModifier and then the MyApp.ArmorModifier (all based on the time spent in the channel). In a nutshell, my Player has a map of modifiers, and I currently iterate over that map and apply them one by one.
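The “map of modifiers applied one by one” idea might look like this sketch (all names and formulas are hypothetical):

```elixir
# Each modifier is a function of the player state and the seconds
# spent in the channel; they are applied sequentially with a reduce.
modifiers = %{
  health: fn player, seconds -> Map.update!(player, :health, &(&1 - seconds)) end,
  armor: fn player, seconds -> Map.update!(player, :armor, &(&1 + div(seconds, 2))) end
}

player = %{health: 100, armor: 10}
seconds_in_channel = 30

updated =
  Enum.reduce(modifiers, player, fn {_name, modifier}, acc ->
    modifier.(acc, seconds_in_channel)
  end)
# => %{health: 70, armor: 25}
```

Note that because each modifier here only depends on the player state and the elapsed time, the sequential reduce is cheap; separate processes would only pay off if the modifiers did independent, expensive work.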

I am starting to wonder if it would be better to have processes that independently report back to the player as to how much their Health or Armor should be affected (and other processes for other modifiers down the track).

Maybe it’s too early in my application’s lifecycle to be considering this and I should just keep punching it out the way I am until I hit a bottleneck! But I am also concerned about what will happen if the channel process fails and how I would restore my Player state, whereas I might be able to control that better with individual processes.

Let me know if this would be better in another thread… this ended up being longer than I planned, but I feel it is related.

I give 2 thumbs up to Sasa’s article you linked to (and would give more if I had more thumbs).

You want to think of processes as something new, something you’re not used to. They’re ActiveObjects (if you’ve come across that in your OO career). Processes should be used for each unit of concurrency in your system. If you try to think of them like objects, you are very likely to create bottlenecks for yourself and limit your ability to scale. Normally I think of one player as one unit of concurrency. If, however, you were to need to refresh some particular modifier at different intervals than the others, then a case could be made for splitting it out. Otherwise, start out with one process per player, and potentially other utility processes.

You can always create different modules that know how to operate on different subsets of the state, and call those appropriately from your GenServer. This keeps your code organized by feature, but preserves your scaling potential.

Great, thanks Greg.

I think I will stick with my current route for now (in a similar vein to Sasa’s article) and see if that works out for me. I figure that way I still get the concurrency benefits of Elixir, as each Player runs inside its own channel process, without spawning processes willy-nilly (as per my first approach)…

I hope I am visualising that correctly… Based on my observations of my Phoenix app in :observer, I believe I am… :slight_smile: