Elixir enables stateful web applications, is it wrong to think like this?

qwerescape · December 5, 2017, 4:28am

Elixir’s concurrency model is really fascinating; together with immutability, it is one of my favourite things about Elixir. Recently I’ve been thinking about how Elixir can help me manage state in a distributed web application. I am looking for people to enlighten me, whether they agree or disagree with my thinking.

In my opinion, stateful applications are a lot easier to reason about than “stateless” ones. I am putting “stateless” in quotes because I haven’t really seen a useful application that doesn’t have any side effects or state. I think the reason people choose stateless is that there is no good technology stack that allows them to do stateful safely. In my past experience with web applications in other languages, there are really 3 ways to manage state:

Client side state: basically the entire state is passed from front to back in every request. The problem with this is that if I have a web front end and a mobile front end, they will end up overwriting each other. Synchronizing them is difficult, and I’d say any technique can only shrink the window in which the race condition happens instead of preventing it.
Server side state: this is very problematic in a clustered environment, because you might end up with multiple states on different servers that all represent the same user. You can kind of solve it with sophisticated routing based on a cookie/request param to make sure the same user (even with different devices) always ends up on the same server. On top of that you need to make sure you have some thread safe data structure on the server side that can load/save the user state.
Database state, stateless server/client side: the most used case, unfortunately also the slowest one because in most cases it means a network hop. If combined with a stateless frontend + server side, you can get into race conditions: the husband is trying to check out on the website while the wife is deleting the same cart on the mobile app; a customer tries to add the same product on both web and mobile at the same time, but the business rule is that 1 customer can only buy 1. To solve those issues, we often resort to adding database constraints; unfortunately any logic that we put in the database layer is not unit testable or easily understandable.

So all those problems, I feel like I can solve them by Elixir!
In an Elixir app, my mental model is processes interacting with each other, processes are globally addressable, and they don’t all have to live in the same server. So if I have:

one process per user to hold state
that process will periodically/asynchronously persist the state to the database just in case
any front end request can hit any server in the cluster, but since processes are globally addressable, I can always route the message to the right process
if a process crashes, the supervisor will restart it with the last known persisted state.
since all the state is in memory, I’d imagine it will be very fast.
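As a rough illustration of the list above, here is a minimal sketch of one process per user that recovers its last persisted state on start and persists periodically. The names `UserState` and `UserRegistry`, the 5 second interval, and the use of an Agent as a stand-in for the database are all assumptions made for the example. Note also that `Registry` is local to one node, so this sketch does not cover the "globally addressable" part.

```elixir
defmodule UserState do
  # Hypothetical sketch: one GenServer per user, looked up by user ID.
  use GenServer

  # Arbitrary persistence interval (milliseconds), chosen for the example.
  @persist_every 5_000

  # `store` stands in for the database: an Agent holding a map of id => data.
  def start_link({user_id, store}) do
    GenServer.start_link(__MODULE__, {user_id, store}, name: via(user_id))
  end

  def put(user_id, key, value), do: GenServer.call(via(user_id), {:put, key, value})
  def get(user_id, key), do: GenServer.call(via(user_id), {:get, key})

  # Registry is node-local; a cluster-wide registry would be needed across servers.
  defp via(user_id), do: {:via, Registry, {UserRegistry, user_id}}

  @impl true
  def init({user_id, store}) do
    # On (re)start, recover the last persisted state, if there is any.
    data = Agent.get(store, &Map.get(&1, user_id, %{}))
    schedule_persist()
    {:ok, %{id: user_id, store: store, data: data}}
  end

  @impl true
  def handle_call({:put, key, value}, _from, state) do
    {:reply, :ok, %{state | data: Map.put(state.data, key, value)}}
  end

  def handle_call({:get, key}, _from, state) do
    {:reply, Map.get(state.data, key), state}
  end

  @impl true
  def handle_info(:persist, state) do
    # Write the in-memory state to the stand-in store, then schedule the next write.
    Agent.update(state.store, &Map.put(&1, state.id, state.data))
    schedule_persist()
    {:noreply, state}
  end

  defp schedule_persist, do: Process.send_after(self(), :persist, @persist_every)
end
```

To try it, start `Registry.start_link(keys: :unique, name: UserRegistry)` and an `Agent.start_link(fn -> %{} end)` as the store, then call `UserState.start_link({"u1", store})`. In a real application the user processes would presumably live under a `DynamicSupervisor`, and anything written since the last persist would be lost on a crash.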

I am not very experienced with Elixir, please share what you think, am I missing something that will prevent this from working well?

Thanks

kelvinst · December 5, 2017, 2:02pm

In my opinion you’re totally right in the title!

Well, I do actually agree with this statement, because I think state itself is inevitable, there is no such fully “stateless” application. I guess what people might want to say when they say “stateless” is kind of a temporary stateful application, not relying deeply on a given state.

Yes, you’re right, but you also can solve with other things. Maybe Elixir makes it a little easier to solve because of the agent model, but let’s not forget there are other solutions for the same problems, and actually some of them are better in some aspects, like performance.

Still, if it’s a feature that one user can affect the other user state, you’ll have to make your own code to manage one user (process) updating the other user data. This can be very complex, and using a db you get it done quickly. I mean, it’s not a bad idea, I am just showing the trade-off’s, and maybe, for some applications, this is actually the best way to do it.

Again, more code for you. But maybe it would be a good idea to make a PoC of it. It could actually be the birth of a new framework, why not?

This is actually a very nice idea!

Sure, but will be a little bit more expensive. It depends on the size of each user states and the simultaneous users of course.

Well, summing up, I really like the idea. All I can do is encourage you to prove the concept and turn it into a framework like plug somehow! I would also really like to be part of this if I have some time.

Thanks for sharing!

qwerescape · December 5, 2017, 2:23pm

thanks for the reply @kelvinst, all very good points, I will try to build on some of them.

Still, if it’s a feature that one user can affect the other user state, you’ll have to make your own code to manage one user (process) updating the other user data. This can be very complex, and using a db you get it done quickly. I mean, it’s not a bad idea, I am just showing the trade-off’s, and maybe, for some applications, this is actually the best way to do it.

I agree that it’s a trade off. The “states between user” part is interesting, I didn’t think of that. I am thinking since each user is a process, the users can communicate among themselves using message passing as well.

Again, more code for you. But maybe it would be a good idea to make a PoC of it. It could actually be the birth of a new framework, why not?

I am hoping to use an existing library for a global process registry, the client side will have to identify itself through an ID, and the server side code will identify the process through that ID in that registry, so you are right, it is extra code.
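One way this lookup by ID could be sketched without any extra library is Erlang's built-in `:global` name registry, which is cluster-wide. The module name `GlobalUser` and the `{:user, id}` key shape are assumptions made for the example, and an Agent is used only to keep it short.

```elixir
defmodule GlobalUser do
  # Hypothetical sketch: a per-user process registered cluster-wide via :global.
  use Agent

  def start_link(user_id) do
    Agent.start_link(fn -> %{} end, name: {:global, {:user, user_id}})
  end

  # Returns the pid registered for this user anywhere in the cluster, or nil.
  def whereis(user_id) do
    case :global.whereis_name({:user, user_id}) do
      :undefined -> nil
      pid -> pid
    end
  end

  # Looks the process up by ID, starting it if it does not exist yet.
  # If two callers race, the loser gets :already_started and reuses the winner.
  def ensure_started(user_id) do
    case start_link(user_id) do
      {:ok, pid} -> pid
      {:error, {:already_started, pid}} -> pid
    end
  end
end
```

A request handler on any node could then call `GlobalUser.ensure_started(id)` and talk to the process with `Agent.get/2` and `Agent.update/2` using `{:global, {:user, id}}` as the name. How well `:global` behaves with many nodes or many names is something to measure rather than assume.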

Well, summing up, I really like the idea. All I can do is encourage you to prove the concept and turn it into a framework like plug somehow! I would also really like to be part of this if I have some time.

thank you for the encouragement! I am thinking of building an ecommerce platform with this idea, I feel the ecommerce platform that I am working with right now has quite a few of those problems. I am open to suggestions.

Thanks

JEG2 · December 5, 2017, 2:34pm

qwerescape:

So all those problems, I feel like I can solve them by Elixir!
In an Elixir app, my mental model is processes interacting with each other, processes are globally addressable, and they don’t all have to live in the same server. So if I have:

one process per user to hold state

that process will periodically/asynchronously persist the state to the database just in case

any front end request can hit any server in the cluster, but since processes are globally addressable, I can always route the message to the right process

if a process crashes, the supervisor will restart it with the last known persisted state.

since all the state is in memory, I’d imagine it will be very fast.

I am not very experienced with Elixir, please share what you think, am I missing something that will prevent this from working well?

I think you’ve got some great ideas. I gave a talk about some of this a while back that you may enjoy.

josevalim · December 5, 2017, 3:22pm

Orleans is a framework for .NET that provides that out of the box. The Elixir ecosystem has all of the building blocks but nothing that provides the full experience (yet).

Here is another good talk from Caitie McCaffrey covering some of those patterns: https://www.youtube.com/watch?v=H0i_bXKwujQ

kelvinst · December 5, 2017, 3:23pm

Is this yet another encouragement, or do you mean there is a work in progress on this?

josevalim · December 5, 2017, 3:24pm

Haha, it is an encouragement.

peerreynders · December 5, 2017, 3:37pm

This reminds me a bit of the argument that immutability is pointless because you have to change something somewhere to have an effect.

The way you get something that is “easier to reason about” is by pushing side effects or in this case state to the edge of the system.

So the issue with state isn’t that it needs to be completely eliminated, but that it needs to be reduced to the essential minimum and located where it belongs.

Also there are many kinds of “state”. For example REST makes a distinction between application state which lives on the client and resource state which lives on the server (so domain state lives on the server while session state resides largely with the client (some concepts like transactions are modelled as resources)).

one process per user to hold state

Somehow this smells like mapping OO onto processes, i.e. use processes as containers of state. Have you looked at “To spawn or not to spawn” yet?

“Goto” is a useful tool but people abused it because it was easy to do - so it was abolished and replaced by the more constrained concepts of for and while loops and if conditional. When it comes to state we can’t abolish it because it is essential to what we need to accomplish - but we need to be very disciplined about how and where we use state.

Ultimately I find a process “easier to reason about” when its behaviour does not depend on internal state and this applies even more so to a group of cooperating processes. State has to live somewhere but that is no excuse to let it appear everywhere.

One has to be careful to not confuse “this is easy to accomplish with state” with actually being “easier to reason about”. In my experience systems embracing immutability and statelessness for the most part are in fact “easier to reason about” even if they require a bit more code.

michalmuskala · December 5, 2017, 3:46pm

We have erleans/erleans (Erlang Orleans) on GitHub, which is very promising. I’m not sure, though, how ready for production it is.

qwerescape · December 5, 2017, 10:20pm

@peerreynders thanks for your reply. I actually agree with almost everything you said about immutability, pushing side effects to the edge of the system, keeping state logic small and contained… There are certain places where I feel you generalized my statements in ways that I didn’t intend.

You touched on a great point of “easy vs simple” with goto, and I do want to discuss why I think it’s simpler to reason about a system in which processes hold state. I believe systems should handle state in the right context, and absolutely not let the state leak. An example is a pure function: it could introduce variables that hold state, but that state is local. I think another good example of this is gen_server.

I want to take an analogy to explain my original thinking of “stateful” vs “stateless”, hopefully it could explain my thinking better and spark more discussions.

Take the example of an analog booth that sells analog (paper) movie tickets. There is 1 agent sitting in the booth with 100 tickets in his drawer, and there is a line of people waiting outside the booth. The agent serves the customers 1 by 1, by taking the money and handing out the tickets. A while later the agent realizes that he is low on tickets, so he asks his supervisor for more, and his supervisor goes to the ticket vault, fetches 50 more tickets and gives them to the agent. As the booth gets busier, the supervisor calls another agent over, opens another window and hands the new agent 100 tickets to sell. Since there are 2 windows now, customers are getting served faster. Eventually every agent sells out their tickets, and when they ask their supervisor for more tickets, the supervisor checks the vault and says “everything sold out! congrats, go home”, and the agents all go home (or maybe leave one agent behind to let future customers know that we are sold out).

I think the parallels that we can draw between that example and OTP are clear. Now the version of that example with a stateless backend that only persists state in the database is something like this:

There is a ticket booth that doesn’t have any mechanism for people to line up, but the supervisor has a pool of agents. For every customer that shows up, the supervisor asks an agent in the pool to deal with that customer. The agent doesn’t know anything about the tickets (how many are left in the vault, how many he is allowed to sell), so he runs over to the ticket vault, grabs the tickets if available, runs back to the booth, and hands the customer the tickets. The situation gets sticky when 2 agents arrive at the vault at the same time but find only 1 ticket left; or when the tickets are sold out, the agents wouldn’t know because they are not allowed to remember anything about the ticket vault.

In my opinion, the first scenario is a lot simpler to reason about. Even though the agents hold state, it is a local state that is completely opaque from the rest of the system, and in a way I’d consider the agents to be the edge of the system because the ticket vault is another system.

Thanks

qwerescape · December 6, 2017, 4:12am

@JEG2 amazing talk, that is almost exactly what I had in mind and you put it so eloquently.

peerreynders · December 6, 2017, 2:56pm

Scenario A: Ticket Agent with a 100 ticket stash
Scenario B: Ticket Agent fetches Customer order from the Ticket Vault.

What puzzles me is that you don’t realize after delivering the narratives that Scenario A is in fact much more complicated and therefore will be much more difficult to reason about because it has many more possible system states (and edge cases).
#### Ticket Agents:

Scenario B:

Just has to fetch tickets for one Customer from the Ticket Vault.
May have to “line up” at the Ticket Vault. In a message based system this is a non-issue as the Ticket Vault can be modelled by a single process and all the ticket requests can be served in order. But it has to be acknowledged that the Ticket Vault can become a bottleneck and that the Ticket Agent is blocked while it is waiting for the requested tickets.
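The "single process serving requests in order" idea for Scenario B can be sketched as below. This is only an illustration: the module name `TicketVault` and the return values are assumptions, and an Agent is used for brevity.

```elixir
defmodule TicketVault do
  # Hypothetical sketch: the vault is one process, so ticket requests are
  # handled strictly one at a time in the order they arrive in its mailbox.
  use Agent

  def start_link(count), do: Agent.start_link(fn -> count end, name: __MODULE__)

  # The caller blocks until the vault has answered, which is the
  # "Ticket Agent is blocked while waiting" cost mentioned above.
  def take(qty) when is_integer(qty) and qty > 0 do
    Agent.get_and_update(__MODULE__, fn
      left when left >= qty -> {{:ok, qty}, left - qty}
      left -> {{:error, :not_enough}, left}
    end)
  end
end
```

With one ticket left and several Ticket Agents calling `TicketVault.take(1)` concurrently, exactly one of them receives `{:ok, 1}` and the rest receive `{:error, :not_enough}`, because the check and the decrement happen inside the same process.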

Scenario A:

Ticket Agent has to manage its own ticket stash. It has to consider additional actions (request for more tickets) based on the fill level of the stash.
While less likely there is still the possibility (edge case) that more than one Ticket Agent needs the Agent Supervisor to fetch tickets, so the possible bottleneck has shifted from the Ticket Vault to the Agent Supervisor. An asynchronous/stash approach would make it less likely that a Ticket Agent runs out of tickets before the Agent Supervisor resupplies it - but it can still happen and therefore needs to be accounted for regardless. So just like for the Scenario B Ticket Agent this Ticket Agent also has a (possible) “wait for tickets to become available” state.

#### Agent Supervisor:

Scenario B:

One responsibility: Assigning a Ticket Agent to a Customer.

Scenario A:

Has multiple responsibilities 1.) fetching tickets when asked 2.) deciding whether to deploy (or recover) Ticket Agents

#### Customer:

Scenario B:

Deals with the Ticket Agent assigned. When that Ticket Agent responds “there aren’t enough tickets” that is the end of it.

Scenario A:

Has to decide which Ticket Agent to access, provided there is more than one. Also there is the possibility that one Ticket Agent has to serve most of the requests while others sit idle.
A Customer can be lined up at an agent who runs out of tickets when the vault is empty. Meanwhile another Ticket Agent may still have some tickets left. Therefore the customer needs to be prepared to line up multiple times. What if at the end the agent doesn’t have enough tickets? Does the customer buy the ones that are available hoping to get the remaining tickets through the remaining agents OR does the customer abort the transaction and try the other agents for the complete quantity? Decisions, decisions, decisions…

And ultimately the comparison focuses on the wrong details. What is important is that the interface between the Customer and Ticket Service is specified in such a way that the Service can be run by a single agent or by an army of agents - without the Customer knowing the difference. In a message-based system that is easily accomplished by sending the initial request to a “known name” identifying the service while all follow-up negotiations are handled via the “reply-to name (PID)” specified on each response to the Customer. That way it doesn’t matter whether the Customer:

deals with a different Agent after the initial contact (request)
deals with different Agents (each specialized on a particular aspect) throughout the entire negotiation process to complete the transaction
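A minimal sketch of that "known name plus reply-to PID" interface, using plain processes, might look like this. The registered name `:ticket_service`, the message shapes, the fixed price of 10 per ticket and the 5 second timeout are all assumptions made for the example.

```elixir
defmodule TicketService do
  # Hypothetical sketch: the Customer only knows the name :ticket_service.
  # Each request is handed to a fresh worker, and every reply carries the
  # worker's own pid as the reply-to address for follow-up messages.
  def start do
    pid = spawn(fn -> loop() end)
    Process.register(pid, :ticket_service)
    pid
  end

  defp loop do
    receive do
      {:request, customer, qty} ->
        spawn(fn -> negotiate(customer, qty) end)
        loop()
    end
  end

  # Runs in the worker process; self() here is the reply-to pid.
  defp negotiate(customer, qty) do
    send(customer, {:offer, self(), qty, qty * 10})

    receive do
      {:accept, ^customer} -> send(customer, {:tickets, self(), qty})
      {:decline, ^customer} -> :ok
    after
      5_000 -> :ok
    end
  end
end
```

From the Customer's side, the first message goes to the known name with `send(:ticket_service, {:request, self(), 2})`, and the follow-up goes to whatever pid arrived in the `{:offer, reply_to, qty, price}` reply, for example `send(reply_to, {:accept, self()})`. Whether one Agent or several handled the exchange is invisible to the Customer.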

Once you adopt the unified interface Scenario A becomes simply an optimization of Scenario B. That means that the state that is taken on in Scenario A is non-essential - it’s voluntarily accepted complexity in the hopes of increasing concurrency internal to the Ticket Service.

But in the end that “optimization” adds complexity as it increases the potential number of internal states that the Ticket Service can take on as a whole - so in the end the internals of the Ticket Service become more difficult to reason about under Scenario A. Meanwhile it’s the responsibility of the Ticket Service interface to keep everything nice and simple for the Ticket Customer who doesn’t care whether the Ticket Service is internally organized according to Scenario A or Scenario B.

In a way your narrative showed more concern about “how something is accomplished” rather than “what needs to be accomplished” - which I associate with “imperative problem solving” rather than “declarative (functional) problem solving”. So while non-essential state can be a legitimate optimization tactic (think cache), it also often seems to be a by-product of “imperative problem solving”.

### Stateful vs. Stateless
Now when it comes to stateful vs. stateless - let’s imagine that our Ticket Customers need to provide a “billing address”:

#### Stateless approach:

Simple: Customer provides billing address during the transaction of purchasing tickets
Con: Customer needs to provide billing address for every purchase even when it hasn’t changed the last ten times…

#### Stateful approach:

Simple: Customer with a Customer ID doesn’t have to supply the billing address, as it is available via “State”

Cons

State needs to be stored somewhere where it can be found (because pertinent information isn’t submitted with the rest of the transaction/message)
State needs to be created - i.e customer needs an ID before making a purchase.
State needs to be maintained (i.e. the infrastructure for changing it needs to exist), e.g. Customer moves, so the billing address needs to be updated before the next purchase, or it doesn’t get updated so it will be incorrect, or Customer remembers that it needs to be updated mid-purchase (i.e. sharing and consistency of state between Customer update and ticket purchase).

So state needs to be managed and that adds complexity. However local state is convenient when it is co-located with the decision making logic (justification for Object/Class) but local state becomes problematic when it needs to be shared or its value has some other non-local consequences.

qwerescape · December 6, 2017, 4:26pm

@peerreynders your imperative vs declarative programming statement at the end actually clarified something that had been fuzzy for me for a while, thank you for that.

I see the value in your stateless arguments and I agree with them, however I still feel that the stateful server version is simpler to reason about, and it might be because I have a different (wrong) definition of simpler? So far in your posts, you haven’t acknowledged at all any benefits of the stateful approach, seems like you are seeing something that is just fundamentally wrong with this approach, and I am still failing to see the same thing.

To me, a big part of “simpler to reason about” is autonomy, share nothing, free of dependencies, able to perform a task end to end, able to handle its own errors. When it absolutely needs dependencies it should favour “shout” over “ask” like “I just did this, and this is the result” vs “hey, my dependency, please take this result and do thing X with it”.

So with that definition, I feel scenario A is simpler than scenario B because in scenario A there is no dependency between the ticket agent and the vault; the ticket agents don’t know a vault exists. Granted, the supervisor knows about the agents as well as the vault, but it is the supervisor’s job to manage the dependencies! If the vault is down for whatever reason, the supervisor can probably fetch tickets from a backup vault, and the agents are oblivious to the change. In scenario B, the ticket agents know that there is a vault, and they need to be told what happens when the vault goes down and so on.

I do see an argument there that in scenario B there doesn’t have to be a dependency between the ticket agents and the vault, there could be a ticket distributor that holds the key of the vault, and the agents line up in front of him to get the tickets. I see the benefit that state is centralized, but at the cost of way more interactions between dependencies. The comparison is the agents come to the distributor once in a while to get a batch of tickets vs the agents come to the distributor every time there is a customer purchase. I see why you’d think that the former is an optimization (cache) of the latter, honestly my gut feeling is to pick minimal dependency over more states.

When an interaction requires context, like a chat bot (that sells ticket? lol), if the bot doesn’t remember context, the user will have to pass the entire conversation history every time, that doesn’t seem right. If the chat bot keeps a log of their chat history and reads it very quickly every time to build up context, should we provide an external storage for the chat bot to store that log just for the sake of keeping the bot stateless? I really am not sure.

Thanks

amnu3387 · December 6, 2017, 4:56pm

What I understand (from my limited experience and without having had enough time to apply to more than a few problems), state can be contained to its particular process, which is actually the interesting part right? It’s not like my monitor needs to know of any logic to handle tickets, probably I can even get away without a monitor, if the ticket handler process has that logic contained in itself, including how to build itself?

Because in other languages you would need to have a sort of global state (instance variables in Ruby, globals in JS, database access, etc), but in Erlang, and so in Elixir, you can for instance have individual gen_servers (that can be implemented in other languages - although without the guarantees the BEAM gives) where each one knows about its own state. You’ll still have bottlenecks (or possible bottlenecks) but at the same time, you don’t need to add complexity on all layers to have access to this state?

I can, for instance, route a request with some sort of ID that I previously used to initiate and register the GenServer process (and I mean multiple GenServers, not just the one central kind), and just cast to it, and have the logic for handling the requests live inside the GenServer itself, as callbacks/handle_in’s. Just giving it the message and saying, reply back when you’re ready. Now in an HTTP request type system, this would need to have some synchronous guarantees, right? But here we can have multiple non-blocking processes handling the requests, and moreover it’s fairly trivial to implement websockets (or any socket communication) so that the GenServer can then give its “reply” when it’s actually ready to. You don’t actually need anything more than an interface to dispatch requests to their appropriate processes (existing or yet-to-be), and you get strong guarantees in regards to order of execution (“mailbox”) and consistency of the state within the process itself (due to the guarantees regarding the order of exec, as long as it is only the process itself that mutates its own state).

I’m not sure though if it maps correctly and beneficially to several, few, any, or none (I think none is out of question because I’ve applied it at least in two different occasions and I think it maps beautifully to those 2), of existing domain problems (also because in many problems you need guarantees and interaction between different parts that can’t be atomised into their singular process), but it sure does offer plenty of ways of thinking about how to set up request-response systems.

For instance, one problem: yesterday I was making parallel uploads in a Rails app. I wanted to prevent more than X uploads. The problem is that when the requests ran in parallel, one more upload than the limit I had set would always end up being accepted. And it’s not so trivial to solve this. In Elixir I could just fire up a GenServer when taking in the requests, registering it with the ID of the user, then casting the requests to process the upload on it. I would be able to easily guarantee that no more than X uploads would be accepted - fairly trivial. Since I have guarantees with “whereis” and “already_started”, a few lines of code would allow me to orchestrate its init (with the number of existing uploads) while being assured that all requests would be handled in light of this number and that each single request would increase this number before the next request was processed.
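A minimal sketch of that upload limiter could look like the following. The module name `UploadLimiter`, the use of `:global` for per-user registration, and the return values are assumptions made for the example. It uses a synchronous call rather than a cast so that the caller learns whether the upload was accepted.

```elixir
defmodule UploadLimiter do
  # Hypothetical sketch: one GenServer per user counts accepted uploads.
  use GenServer

  # Returns :ok if the upload may proceed, {:error, :limit_reached} otherwise.
  # existing_count and limit are only used if the process has to be started.
  def request_upload(user_id, existing_count, limit) do
    user_id
    |> ensure_started(existing_count, limit)
    |> GenServer.call(:request)
  end

  # Concurrent requests may race to start the process; the loser gets
  # {:error, {:already_started, pid}} and simply uses the winner's pid.
  defp ensure_started(user_id, existing_count, limit) do
    name = {:global, {:upload_limiter, user_id}}

    case GenServer.start(__MODULE__, {existing_count, limit}, name: name) do
      {:ok, pid} -> pid
      {:error, {:already_started, pid}} -> pid
    end
  end

  @impl true
  def init({count, limit}), do: {:ok, %{count: count, limit: limit}}

  # Check and increment happen in one callback, so the next request
  # always sees the updated count.
  @impl true
  def handle_call(:request, _from, %{count: count, limit: limit} = state)
      when count < limit do
    {:reply, :ok, %{state | count: count + 1}}
  end

  def handle_call(:request, _from, state) do
    {:reply, {:error, :limit_reached}, state}
  end
end
```

With one existing upload and a limit of three, ten parallel calls to `UploadLimiter.request_upload("u1", 1, 3)` result in exactly two `:ok` replies. The sketch never decrements the count or stops the process; a real version would need to decide when uploads finish and when the process can go away.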

christhekeele · December 6, 2017, 5:01pm

HTTP already has an idiom for state: Sessions. And Elixir has an idiom for process-based registration: Registry. And Plug has the perfect tool to build the intermediary: a custom Plug.Session.Store behaviour impl. Having implemented one before, I can tell you it’d take maybe 20 lines of code to wire them together to get started, would love to see the result!

peerreynders · December 6, 2017, 8:43pm

Autonomy is important but I suspect that you are primarily talking about runtime autonomy - design/maintenance time autonomy can be even more important. I suspect that your view on dependencies needs a slightly more measured approach.

Ticket Customer should absolutely minimize any dependencies on the Ticket Service, i.e. it should be loosely coupled. Meanwhile Ticket Agent, Agent Supervisor, and Ticket Vault are working together towards implementing the responsibilities of the Ticket Service - that is their job, so they need to be interdependent to work toward their common goal - “dispensing tickets in accordance with the rules of the service”. So within the boundary of the Ticket Service those three are subject to high cohesion and high coupling because they need to share certain details about the “dispensing business” that are “nobody’s business” outside of that boundary. As long as the Ticket Customer remains oblivious to these “business details”, Ticket Service can change the “internal business practices” with impunity - e.g. switch implementations from Scenario B to Scenario A or vice versa.

Also I’m not arguing against autonomy over state, as I said before, state is unavoidable but when it appears it is worth scrutinizing whether it is necessary and whether it appeared in the right place.

Ultimately I was responding to this:

one process per user to hold state

You seem to be more concerned about where “state” goes rather than “what you are trying to accomplish”.

State shouldn’t be the primary design concern - are you “getting done”, what needs to be done? - that usually is accomplished by dividing up the responsibilities (not state). A message-based system works by moving data (events) from process to process - that message data and its movement is what is important.

Some processes will have state as a result of their responsibility and on the most general level a process is a message processor first and a state container second (and only if absolutely necessary). A “user” is a concept that may entail many responsibilities - so those responsibilities could well be spread across multiple processes - some of them possibly handling multiple or even all users if that is what is necessary to fulfill that particular responsibility.

It would be a mistake to select a single process as a locus of state and then aggregate all the responsibilities that need access to that state into that process like this:

one process per user to hold state

that process will periodically/asynchronously persist the state to the database just in the case

I see that and I see a process version of Active Record. Mixing responsibilities was a bad idea with objects and still is a bad idea with lightweight processes.

if the bot doesn’t remember context, the user will have to pass the entire conversation history every time, that doesn’t seem right. If the chat bot keeps a log of their chat history and reads it very quickly every time to build up context, should we provide an external storage for the chat bot to store that log just for the sake of keeping the bot stateless?

Keeping the bot “stateless” has advantages and disadvantages. First of all there is no need to pass the “entire conversation” for the purpose of following a chat. A client should be perfectly capable of ordering a list of sequenced chat items as they are broadcast and dogmatic statelessness would make it impossible to join a chat.

So at the very least there must be a serverside concept of a “conversation” that clients can join and receive broadcasts from. Now all the chat items could become part of that “conversation state” but that wouldn’t be broadcast with every new chat item though it may be sent to newcomers as they join a chat late.

But the “conversation” is a separate state from the client states even though the “conversation” relates to the clients it broadcasts to and the clients relate to the “conversations” they are participating in (and the “full” client state may not even exist in the “Elixir space”).

For me this topic suggested a Carte Blanche “all server side state is OK” free-for-all that made no attempt to justify why any type of state needed to exist in the first place.

What you describe is a process with a clearly defined (narrow) responsibility where its state is (private and) essential to the fulfillment of its objective. I would also expect that the process “outsources” any “real work that could fail” in order to protect the integrity of that state, i.e. launches a separate process with just enough information to perform the download.‡

There is nothing wrong with that kind of state. What I’m cautioning against is state-oriented design which borders on “object-thinking”.

(‡ As a design guideline I favour short-lived processes simply to minimize the possibility of corruption of their state. However there will always be long-lived processes with state. Again to minimize corruption of state these processes should do as little as possible. However they shouldn’t simply be containers of state. They should be smart enough to take a request, augment its data with information from the process state and forward the actual work to “somewhere safe”.)
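The guideline in the footnote could be sketched roughly as follows: a long-lived process that only merges its state into the request and hands the risky work to a supervised task. The module name `Dispatcher`, the `settings` map and `do_risky_work/1` are hypothetical and stand in for whatever the real job is.

```elixir
defmodule Dispatcher do
  # Hypothetical sketch: a long-lived, stateful process that does as little
  # as possible itself and forwards the actual work to a short-lived task.
  use GenServer

  def start_link(settings), do: GenServer.start_link(__MODULE__, settings, name: __MODULE__)

  def run(request, timeout \\ 5_000) do
    GenServer.call(__MODULE__, {:run, request}, timeout)
  end

  @impl true
  def init(settings) do
    {:ok, sup} = Task.Supervisor.start_link()
    {:ok, %{settings: settings, sup: sup}}
  end

  @impl true
  def handle_call({:run, request}, from, state) do
    # Augment the request with information from the process state...
    job = Map.merge(state.settings, request)

    # ...and run the work elsewhere. The task is not linked to this process,
    # so a crash in the task cannot take the state down with it.
    {:ok, _pid} =
      Task.Supervisor.start_child(state.sup, fn ->
        GenServer.reply(from, do_risky_work(job))
      end)

    {:noreply, state}
  end

  # Stand-in for real work that might fail.
  defp do_risky_work(%{fail: true}), do: raise("boom")
  defp do_risky_work(job), do: {:done, job}
end
```

If the task crashes, the caller of `Dispatcher.run/2` gets no reply and exits with a timeout, while the `Dispatcher` process and its state stay alive and keep serving later requests.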

CptnKirk · December 6, 2017, 10:34pm

All of this is true and there have been some great comments in this thread already. To add to them…

Some actor based libraries in other ecosystems exist and cover this exact use case. On the JVM you have Akka (https://akka.io/). Akka Cluster + Sharding + Persistence gives you exactly this model.

This stateful model is attractive because you can very easily reason about the state of your system. That is a huge bonus. In your typical “stateless” model, you’ll store your state in a DB and in a cache and access your cache/DB combo via stateless business logic. This makes it easy to drop in more stateless workers. However, you can run into missing writes and other cache consistency issues. Especially given that caches typically write whole objects at once (vs just changed fields) and lack optimistic locking support. You also have the guaranteed overhead of a network hop + full object GET in order to perform business logic, plus another hop and PUT if you need to write back changes (and then you need to write back into the DB). You could try and avoid the cache GET by putting in a smaller cache on your “stateless” workers, but then you have two caches you need to worry about keeping in sync.

The stateful cluster approach merges the business logic and the cache, and cleanly supports field level updates along with event sourcing. You model it exactly as you describe. You get all the benefits that you describe. But there are downsides and gotchas. Let me walk you through some of them…

Split brain clusters are a major problem - When the stateful model is up and running, it works wonderfully. But in the case of network failures, you need to be very careful. This happens when the network link between some cluster nodes goes down, yet the connection between your LB and these nodes remains up. You will end up with two stateful clusters, each managing state independently. This situation needs careful consideration and an automated resolution strategy. This also means you need at least 3 nodes to start a cluster so that a split can be detected.
Process registration can be a problem - Yes I believe that Elixir has a distributed process registry. But be sure and check the fine print. How long does it take to register a new process across 100 nodes? How feasible is it to have millions of tracked processes? If a node is restarted, how long does it take to reload and reregister the million processes that node was tracking? There may be very good, positive, answers to these questions. But you need to ask them.
Process fail-over needs to be thought about - What should the system do in the event of a node crash? Process A knows it needs to route a message to Process B, but Process B has crashed, or is not responding. Now what? What is the latency incurred in these cases? Often it is unacceptably high because it isn’t easy for the system to deterministically detect a failure and relocate a single Process B to another location within a few milliseconds.
Cluster remoting protocols aren’t necessarily optimized - While you might be able to use Distributed Elixir to implement this model, is Distributed Elixir optimized for low latency, high throughput messages? This was a problem for Akka in the past as well. While the model worked, the naive serialization commonly used by these protocols isn’t nearly as performant as their dedicated caching counterparts (or even JSON over HTTP). Care needs to be taken here as well. Are node heartbeats being sent across this same channel? If so, watch out.
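
To make the fail-over point concrete, here is a small sketch of what Process A can observe when Process B is gone or unresponsive. `SafeCall` is a hypothetical helper name; `GenServer.call/3` and its exit reasons are standard. Note that for an unresponsive process the caller only finds out after the full timeout has elapsed, which is exactly the latency cost described above.

```elixir
defmodule SafeCall do
  # Hypothetical helper: call a process and turn "dead" or "too slow" into a
  # value the caller can act on, instead of crashing the caller.
  def call(pid, request, timeout_ms) do
    {:ok, GenServer.call(pid, request, timeout_ms)}
  catch
    # The target exists but did not answer within timeout_ms.
    :exit, {:timeout, _} -> {:error, :timeout}
    # The target is not running, crashed during the call, or its node is down.
    :exit, _reason -> {:error, :down}
  end
end
```

What to do with `{:error, :timeout}` or `{:error, :down}` (retry, restart the process elsewhere, fail the request) is the design decision the list above asks you to make up front.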

From experience, I can say that the promise is real. When it works, it works really well. Just be sure to account for the situations when things aren’t working well. Elixir promotes “letting it crash”. But a crash should not cause seconds of latency while the system recovers from this crash. A code deployment shouldn’t cause massive service disruption. But these are hard problems to solve in stateful clusters, especially ones that try to have an exactly-one-processor model. Ensuring that only a single actor/process owns that state adds complexity, and often adds latency in the failure case.

The best stateful cluster model I’ve seen came out of Basho. The Riak KV store used this stateful model and developed a ton of great technology to manage the solution. I believe that all of the Basho code is now open source, so interested parties may want to look at their cluster libraries to start with. They also took a hashing approach to routing and allow for multiple possible process owners, along with hinted hand-off (and hand-back). But now the process guarantees shift. The single mailbox model isn’t really there anymore. You’re much more eventually consistent and now may need to deal with things like vector clocks, siblings and a whole bunch of other complexity you hadn’t counted on.
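
A deliberately simplified sketch of hash-based routing with multiple possible owners. This is not Riak’s actual algorithm (Riak uses a consistent-hashing ring with vnodes); it uses plain modulo hashing over a fixed node list, and the `Ring` module, its function names and the `up?` callback are assumptions made for illustration.

```elixir
defmodule Ring do
  # Preference list: hash the key to a starting position in the node list,
  # then take the next `n` nodes (wrapping around) as possible owners.
  def owners(key, nodes, n) when is_list(nodes) and nodes != [] and n > 0 do
    start = :erlang.phash2(key, length(nodes))
    {front, back} = Enum.split(nodes, start)
    Enum.take(back ++ front, n)
  end

  # Route to the first possible owner that is currently reachable. Returns
  # nil when none of the owners is up. `up?` is a caller-supplied check.
  def route(key, nodes, n, up?) when is_function(up?, 1) do
    key |> owners(nodes, n) |> Enum.find(up?)
  end
end
```

For example, `Ring.route("cart:42", [:n1, :n2, :n3, :n4], 3, &(&1 != :n2))` picks the first of the three candidate owners that is not `:n2`. Once a fallback owner accepts writes, two nodes may hold different versions of the same key, which is where the eventual consistency and sibling handling mentioned above come in.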

As someone else points out, the Phoenix project would benefit from clustered stateful sessions among other things. I’d assume that as channels and stream processing become a more ubiquitous programming model, getting events from these channel sources to stateful processing entities becomes a standard challenge. A hard problem to solve, but Elixir is set up better than most to tackle it head-on. If Elixir were to provide a high-quality solution to this problem, it could evangelize the benefits over pretty much every other web stack out there.

Until then, consider how risk averse your project is. Until high-quality implementations exist that address some of the problems of stateful actor clusters, you might want to stick with a conventional stateless model. You’ll have all the same problems everyone else has, but you won’t have new ones nobody else has.

qwerescape · December 6, 2017, 10:46pm

this is so awesome, thank you!

CptnKirk · December 6, 2017, 11:42pm

The Basho guys put in a lot of research/effort in this area (as did/do the AWS Dynamo guys who got the ball started).

This is old and Basho is dead (but the technology lives on), but this was and probably still is some good reading. The biggest problem I ever had with Riak’s architecture was their association of vnodes with physical disk processors. The number of vnodes was set at the time the cluster was defined. This caused lower and upper bound pressure as you went to elastically scale up and down outside of your pre-estimated sweet spot. Any future advances should revisit this limitation.

hubertlepicki · December 11, 2017, 7:24am

One note: it is tempting and initially “makes sense” to build your runtime entities in such a system around state entities. For example, whenever you have a “user” entity that has a name, password and email, arguably the most obvious thing to do would be to spawn processes per user. You can adopt DDD’s terminology of “aggregates” to call those, or “grains” from the Orleans framework mentioned above. Then you would model your business processes entirely within those “aggregates” or spawn “sagas” - processes coordinating stuff across multiple processes with message passing.

This is a trap and, in my experience, the biggest problem of modeling systems that keep state in memory. It is way easier to spawn runtime entities that map to business processes. Think “registration” process vs “user” process. You may still want to organize your data around processes, i.e. still have “user” processes, but in that case try limiting their responsibility to serving and persisting the data. No business logic if possible.

If you don’t do that, you will end up with a system that has large entities that are hard to manage and understand (“user” is usually a great candidate for an entity that tends to bloat). And most importantly, you will end up with runtime entities that don’t match business processes, which makes them difficult to coordinate.
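
A rough sketch of the “registration” process vs “user” process split described above. Everything here is hypothetical (`UserStore`, `Registration`, the validation rule); it only illustrates a data-only process next to a short-lived process that carries the business logic.

```elixir
defmodule UserStore do
  # Data-only process: it serves and stores user records. The only check it
  # performs is uniqueness of the key, done atomically inside the process.
  use Agent

  def start_link(_opts \\ []), do: Agent.start_link(fn -> %{} end)

  def get(store, email), do: Agent.get(store, &Map.get(&1, email))

  def insert_new(store, user) do
    Agent.get_and_update(store, fn users ->
      if Map.has_key?(users, user.email) do
        {{:error, :already_registered}, users}
      else
        {:ok, Map.put(users, user.email, user)}
      end
    end)
  end
end

defmodule Registration do
  # One short-lived process per business process: validate, persist, finish.
  def run(store, params) do
    Task.async(fn -> register(store, params) end) |> Task.await()
  end

  defp register(store, %{email: email, name: name}) do
    if String.contains?(email, "@") do
      UserStore.insert_new(store, %{email: email, name: name})
    else
      {:error, :invalid_email}
    end
  end
end
```

Usage: after `{:ok, store} = UserStore.start_link()`, calling `Registration.run(store, %{email: "a@example.com", name: "Ann"})` returns `:ok` the first time and `{:error, :already_registered}` the second time. The registration rules live in the short-lived process, so the “user” side stays small.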