qwerescape
Elixir enables stateful web applications, is it wrong to think like this?
The concurrency model of Elixir is really fascinating; alongside immutability, it is one of my favourite things about Elixir. Recently I’ve been thinking about how Elixir can help me manage state in a distributed web application. I am looking for people to enlighten me and to agree or disagree with my thinking.
In my opinion, stateful applications are a lot easier to reason about than “stateless” applications. I am putting “stateless” in quotes because I haven’t really seen a useful application that doesn’t have any side effects or state. I think the reason people choose stateless is that there is no good technology stack that allows them to do stateful safely. In my past experience with web applications in other languages, there are really 3 ways to manage state:
- Client side state: basically the entire state is passed from front to back in every request. The problem with this is that if I have a web front end and a mobile front end, they will end up overwriting each other. Synchronizing them is difficult, and I’d say any technique can only shrink the window in which the race condition can happen instead of preventing it.
- Server side state: this is very problematic in a clustered environment, because you might end up with multiple states on different servers that all represent the same user. You can kind of solve it with sophisticated routing based on a cookie or request param to make sure the same user (even with different devices) always ends up on the same server. On top of that, you need some thread safe data structure on the server side that can load/save the user state.
- Database state, stateless server/client side: the most used case, unfortunately also the slowest one because in most cases it means a network hop. If combined with a stateless frontend + server side, you can get into race conditions: a husband is trying to check out on the website while his wife is deleting the same cart on the mobile app; a customer tries to add the same product on both web and mobile at the same time, but the business rule is that 1 customer can only buy 1. To solve those issues, we often resort to adding database constraints. Unfortunately, any logic that we put in the database layer is not unit testable or easily understandable.
So all those problems, I feel like I can solve them with Elixir!
In an Elixir app, my mental model is processes interacting with each other. Processes are globally addressable, and they don’t all have to live on the same server. So if I have:
- one process per user to hold state
- that process will periodically/asynchronously persist the state to the database just in case
- any front end request can hit any server in the cluster, but since processes are globally addressable, I can always route the message to the right process
- if a process crashes, the supervisor will restart it with the last persisted state.
- since all the state is in memory, I’d imagine it will be very fast.
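For readers who want to see the shape of this idea, here is a minimal sketch, not a production design. It assumes one `GenServer` per user, registered under a `:global` name so any connected node can reach it. The module name `UserState`, the `:load`, `:save` and `:persist_every` options, and the "one of each item" rule are all hypothetical stand-ins; real database access would replace the two callbacks.

```elixir
defmodule UserState do
  # Hypothetical sketch: one process per user, addressable via a :global name.
  use GenServer

  def start_link(opts) do
    user_id = Keyword.fetch!(opts, :user_id)
    GenServer.start_link(__MODULE__, opts, name: via(user_id))
  end

  def add_item(user_id, item), do: GenServer.call(via(user_id), {:add_item, item})
  def cart(user_id), do: GenServer.call(via(user_id), :cart)

  # :global names are visible on every connected node.
  defp via(user_id), do: {:global, {:user, user_id}}

  @impl true
  def init(opts) do
    user_id = Keyword.fetch!(opts, :user_id)
    # `load` and `save` are assumed callbacks standing in for the database.
    load = Keyword.get(opts, :load, fn _id -> [] end)
    save = Keyword.get(opts, :save, fn _id, _cart -> :ok end)
    every = Keyword.get(opts, :persist_every, 30_000)
    Process.send_after(self(), :persist, every)
    # A (re)started process begins from whatever `load` returns.
    {:ok, %{user_id: user_id, cart: load.(user_id), save: save, every: every}}
  end

  @impl true
  def handle_call({:add_item, item}, _from, state) do
    # Messages are handled one at a time, so web and mobile requests
    # for the same user cannot interleave inside this check.
    if item in state.cart do
      {:reply, {:error, :already_in_cart}, state}
    else
      {:reply, :ok, %{state | cart: [item | state.cart]}}
    end
  end

  def handle_call(:cart, _from, state), do: {:reply, state.cart, state}

  @impl true
  def handle_info(:persist, state) do
    # Periodic write-behind; anything added since the last tick is lost on a crash.
    state.save.(state.user_id, state.cart)
    Process.send_after(self(), :persist, state.every)
    {:noreply, state}
  end
end
```

Started under a supervisor with a child spec such as `{UserState, user_id: 42}`, a crashed process comes back with the state returned by `load`, which is only as fresh as the last `:persist` tick. The sketch says nothing about what happens when the cluster partitions.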
I am not very experienced with Elixir, please share what you think, am I missing something that will prevent this from working well?
Thanks
Most Liked
peerreynders
Autonomy is important but I suspect that you are primarily talking about runtime autonomy - design/maintenance time autonomy can be even more important. I suspect that your view on dependencies needs a slightly more measured approach.
Ticket Customer should absolutely minimize any dependencies on the Ticket Service, i.e. it should be loosely coupled. Meanwhile Ticket Agent, Agent Supervisor, and Ticket Vault are working together towards implementing the responsibilities of the Ticket Service - that is their job, so they need to be interdependent to work toward their common goal - “dispensing tickets in accordance with the rules of the service”. So within the boundary of the Ticket Service those three are subject to high cohesion and high coupling because they need to share certain details about the “dispensing business” that are “nobody’s business” outside of that boundary. As long as the Ticket Customer remains oblivious to these “business details”, Ticket Service can change the “internal business practices” with impunity - e.g. switch implementations from Scenario B to Scenario A or vice versa.
Also I’m not arguing against autonomy over state, as I said before, state is unavoidable but when it appears it is worth scrutinizing whether it is necessary and whether it appeared in the right place.
Ultimately I was responding to this:
“one process per user to hold state”
You seem to be more concerned about where “state” goes rather than “what you are trying to accomplish”.
State shouldn’t be the primary design concern. Are you getting done what needs to be done? That usually is accomplished by dividing up the responsibilities (not state). A message-based system works by moving data (events) from process to process; that message data and its movement is what is important.
Some processes will have state as a result of their responsibility and on the most general level a process is a message processor first and a state container second (and only if absolutely necessary). A “user” is a concept that may entail many responsibilities - so those responsibilities could well be spread across multiple processes - some of them possibly handling multiple or even all users if that is what is necessary to fulfill that particular responsibility.
It would be a mistake to select a single process as a locus of state and then aggregate all the responsibilities that need access to that state into that process like this:
- one process per user to hold state
- that process will periodically/asynchronously persist the state to the database just in the case
I see that and I see a process version of Active Record. Mixing responsibilities was a bad idea with objects and still is a bad idea with lightweight processes.
“if the bot doesn’t remember context, the user will have to pass the entire conversation history every time, that doesn’t seem right. If the chat bot keeps a log of their chat history and reads it very quickly every time to build up context, should we provide an external storage for the chat bot to store that log just for the sake of keeping the bot stateless?”
Keeping the bot “stateless” has advantages and disadvantages. First of all there is no need to pass the “entire conversation” for the purpose of following a chat. A client should be perfectly capable of ordering a list of sequenced chat items as they are broadcast and dogmatic statelessness would make it impossible to join a chat.
So at the very least there must be a serverside concept of a “conversation” that clients can join and receive broadcasts from. Now all the chat items could become part of that “conversation state” but that wouldn’t be broadcast with every new chat item though it may be sent to newcomers as they join a chat late.
But the “conversation” is a separate state from the client states even though the “conversation” relates to the clients it broadcasts to and the clients relate to the “conversations” they are participating in (and the “full” client state may not even exist in the “Elixir space”).
For me this topic suggested a carte blanche “all server side state is OK” free-for-all that made no attempt to justify why any type of state needed to exist in the first place.
What you describe is a process with a clearly defined (narrow) responsibility where its state is (private and) essential to the fulfillment of its objective. I would also expect that the process “outsource” any “real work that could fail” in order to protect the integrity of that state, i.e. launch a separate process with just enough information to perform the download.‡
There is nothing wrong with that kind of state. What I’m cautioning against is state-oriented design which borders on “object-thinking”.
(‡ As a design guideline I favour short-lived processes simply to minimize the possibility of corruption of their state. However there will always be long-lived processes with state. Again, to minimize corruption of state these processes should do as little as possible. However they shouldn’t simply be containers of state. They should be smart enough to take a request, augment its data with information from the process state and forward the actual work to “somewhere safe”.)
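One way to read that footnote in code is sketched below, as an illustration only. The long-lived process holds a small piece of state, augments each request with it, and hands the failure-prone work to a short-lived task under a `Task.Supervisor`. The module name `Downloads`, the `:base_url` and `:worker` options, and the reply message shape are all hypothetical.

```elixir
defmodule Downloads do
  # Hypothetical sketch: the stateful process does as little as possible.
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  # The caller gets the result later as a {:fetched, path, result} message.
  def fetch(server, path), do: GenServer.cast(server, {:fetch, path, self()})

  @impl true
  def init(opts) do
    {:ok, task_sup} = Task.Supervisor.start_link()

    {:ok,
     %{
       base_url: Keyword.fetch!(opts, :base_url),
       # `worker` is an assumed function that performs the risky work.
       worker: Keyword.fetch!(opts, :worker),
       task_sup: task_sup
     }}
  end

  @impl true
  def handle_cast({:fetch, path, reply_to}, state) do
    # Augment the request with process state, then forward the real work.
    url = state.base_url <> path
    worker = state.worker

    Task.Supervisor.start_child(state.task_sup, fn ->
      send(reply_to, {:fetched, path, worker.(url)})
    end)

    {:noreply, state}
  end
end
```

If the worker raises, only that task dies; the `Downloads` process and its state carry on. A caller that needs to know about failures would have to add a timeout or monitor of its own.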
CptnKirk
All of this is true and there have been some great comments in this thread already. To add to them…
Some actor based libraries in other ecosystems exist and cover this exact use case. On the JVM you have Akka (https://akka.io/). Akka Cluster + Sharding + Persistence gives you exactly this model.
This stateful model is attractive because you can very easily reason about the state of your system. That is a huge bonus. In your typical “stateless” model, you’ll store your state in a DB and in a cache and access your cache/DB combo via stateless business logic. This makes it easy to drop in more stateless workers. However, you can run into missing writes and other cache consistency issues, especially given that caches typically write whole objects at once (vs just changed fields) and lack optimistic locking support. You also have the guaranteed overhead of a network hop + full object GET in order to perform business logic, plus another hop and PUT if you need to write back changes (and then you need to write back into the DB). You could try to avoid the cache GET by putting a smaller cache on your “stateless” workers, but then you have two caches you need to worry about keeping in sync.
The stateful cluster approach merges the business logic and the cache, and cleanly supports field level updates along with event sourcing. You model it exactly as you describe. You get all the benefits that you describe. But there are downsides and gotchas. Let me walk you through some of them…
- Split brain clusters are a major problem - When the stateful model is up and running, it works wonderfully. But in the case of network failures, you need to be very careful. A split brain happens when the network link between some cluster nodes goes down, yet the connection between your LB and these nodes remains up. You will end up with two stateful clusters, each managing state independently. This situation needs careful consideration and an automated resolution strategy. It also means you need at least 3 nodes to start a cluster so that a split can be detected.
- Process registration can be a problem - Yes I believe that Elixir has a distributed process registry. But be sure and check the fine print. How long does it take to register a new process across 100 nodes? How feasible is it to have millions of tracked processes? If a node is restarted, how long does it take to reload and reregister the million processes that node was tracking? There may be very good, positive, answers to these questions. But you need to ask them.
- Process fail-over needs to be thought about - What should the system do in the event of a node crash? Process A knows it needs to route a message to Process B, but Process B has crashed, or is not responding. Now what? What is the latency incurred in these cases? Often it is unacceptably high because it isn’t easy for the system to deterministically detect a failure and relocate a single Process B to another location within a few milliseconds.
- Cluster remoting protocols aren’t necessarily optimized - While you might be able to use Distributed Elixir to implement this model, is Distributed Elixir optimized for low latency, high throughput messages? This was a problem for Akka in the past as well. While the model worked, the naive serialization commonly used by these protocols isn’t nearly as performant as their dedicated caching counterparts (or even JSON over HTTP). Care needs to be taken here as well. Are node heartbeats being sent across this same channel? If so, watch out.
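To make the “at least 3 nodes” remark from the split brain item concrete, here is a small sketch of a static majority check. It is only one possible resolution strategy. The module name `Quorum` and the idea of a fixed, configured `cluster_size` are assumptions of this example, not something the runtime provides.

```elixir
defmodule Quorum do
  # Hypothetical helper: `cluster_size` is the configured total number of nodes.
  def majority?(visible_nodes, cluster_size) when cluster_size > 0 do
    visible_nodes * 2 > cluster_size
  end

  # Counts this node plus the nodes it can currently see.
  def have_quorum?(cluster_size) do
    majority?(length(Node.list()) + 1, cluster_size)
  end
end
```

With 2 nodes, a split leaves each side seeing 1 of 2, so neither side has a majority and neither can safely keep owning state. With 3 nodes, the side that still sees 2 of 3 can continue while the isolated node stops serving.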
From experience, I can say that the promise is real. When it works, it works really well. Just be sure to account for the situations when things aren’t working well. Elixir promotes “letting it crash”. But a crash should not cause seconds of latency while the system recovers from it. A code deployment shouldn’t cause massive service disruption. But these are hard problems to solve in stateful clusters, especially ones that try to have an exactly-one-processor model. Ensuring that only a single actor/process owns that state adds complexity, and often time overhead in the failure case.
The best stateful cluster model I’ve seen came out of Basho. The Riak KV store used this stateful model and developed a ton of great technology to manage the solution. I believe that all of the Basho code is now open source, so interested parties may want to look at their cluster libraries to start with. They also took a hashing approach to routing and allow for multiple possible process owners, along with hinted hand-off (and hand-back). But now the process guarantees shift. The single mailbox model isn’t really there anymore. You’re much more eventually consistent and now may need to deal with things like vector clocks, siblings and a whole bunch of other complexity you hadn’t counted on.
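As a rough illustration of hash-based routing with more than one possible owner, the sketch below maps a key to an ordered list of candidate nodes. It uses plain modulo hashing over a sorted node list, which is a deliberate simplification and not how Riak's ring works. The module and function names are hypothetical.

```elixir
defmodule HashRouting do
  # Returns up to `n` candidate owners for `key`, in preference order.
  def owners(key, nodes, n \\ 2) when nodes != [] do
    # Sorting makes every node compute the same answer from the same membership.
    sorted = Enum.sort(nodes)
    start = :erlang.phash2(key, length(sorted))

    sorted
    |> Stream.cycle()
    |> Stream.drop(start)
    |> Enum.take(min(n, length(sorted)))
  end
end
```

Any node can compute `HashRouting.owners({:user, 42}, [node() | Node.list()])` locally, with no registry lookup, and fall back to the second entry if the first is unreachable. The catch with modulo hashing is that a membership change remaps most keys, which is one of the problems a consistent-hashing ring is meant to reduce.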
As someone else points out, the Phoenix project would benefit from clustered stateful sessions among other things. I’d assume that as channels and stream processing become a more ubiquitous programming model, getting events from these channel sources to stateful processing entities becomes a standard challenge. A hard problem to solve, but Elixir is set up better than most to tackle it head-on. If Elixir were to provide a high-quality solution to this problem, it could evangelize the benefits over pretty much every other web stack out there.
Until then, consider how risk averse your project is. Until high-quality implementations exist that address some of the problems of stateful actor clusters, you might want to stick with a conventional stateless model. You’ll have all the same problems everyone else has, but you won’t have new ones nobody else has.