Architecture for updating multiple states in isolation with implicit locking

Sorry for the title, I did not know what to write.

Hello,

I’m trying to build an architecture for a game. In the game, there are
starbases which are basically event processors: they have a state,
receive different events and update their state accordingly.

I do not want to have a process per starbase because there will be many
of them, and they are idle most of the time. For example, they have
production lines: when production of food is started, I calculate the
time the production will end, and I will send an event to the starbase
at that time. Durations can be several minutes, so I have no use for
processes idling that long. Also, states take memory.

However I want all events for a given starbase to happen sequentially,
because I will load the state from database, handle the event, and
write the state.

So, what I plan to do is to have a worker pool and use
erlang:phash2/2 to hash the id of my starbase, sending all events
for a given starbase to the same worker. This should be enough to
prevent concurrent reads/writes of the same state and to process
events sequentially.
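The routing itself would be simple. Roughly, assuming the pool keeps its worker pids in a tuple (names are illustrative):

```elixir
defmodule StarbasePool do
  # Pick the worker for a starbase by hashing its id over the pool size.
  def worker_for(workers, starbase_id) when is_tuple(workers) do
    index = :erlang.phash2(starbase_id, tuple_size(workers))
    elem(workers, index)
  end

  # All events for a given starbase land on the same worker, so they
  # are handled sequentially from that worker's mailbox.
  def send_event(workers, starbase_id, event) do
    send(worker_for(workers, starbase_id), {:event, starbase_id, event})
  end
end
```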

(A small word on distribution: if I have to work on different nodes,
those would represent solar systems, or solar-system clusters, so a
given starbase will always be handled on the same node.)

If you think this is a bad design, I think you can skip the rest of
this post, but please tell me why.

Now, I have to update several starbases at a time, for example to
conclude a trade. Starbases declare what they need to buy and what
they want to sell. I have some code running to figure out matching
orders. When a match is found, I want to run a task on a worker to
load both states, check that the orders still match, and update both
states (mark orders as fulfilled, exchange money, move goods from the
“selling” cargo to the “sold-and-waiting-to-be-carried” cargo space).

The problem is that this task will run on the worker for starbase A,
while starbase B could concurrently be updated by another event that
consumes goods from its cargo.

I am looking for a solution that can handle tasks involving any
number of starbases, not just two.

So this is my current plan:

  1. Hash all the keys (starbase ids) to find the workers. If all keys
    give the same worker, just run the task as a single-key task.
  2. With multiple workers, fetch the pids of the workers, take one pid
    as :master and the other pid(s) as :slave.
  3. Run an inner task on the master worker: pass the slave keys
    (starbase ids) to it; the worker will then wait for an ack message
    carrying those keys and the slave pids from all slaves.
  4. Run an inner task on each slave worker: pass the master pid; the
    worker will send the ack message with its slave key to it, then
    wait for a :finished message. All messages are tagged with the
    same ref to identify the event and allow selective receives.
  5. At this point, the states cannot be changed by other events.
  6. When the master has received all acknowledgements from the slaves,
    run the actual task (i.e. load states, handle the event, write
    states), then send the :finished message to all slave pids.

It seems a bit complicated, no? The main problem is that it makes workers wait doing nothing just to provide locking.
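Roughly, the handshake would look like this inside the workers (a sketch, names illustrative):

```elixir
defmodule MultiKeyTask do
  # Master side: wait for every slave to check in, run the real work,
  # then release them all.
  def run_master(ref, slave_count, task_fun) do
    slave_pids = collect_acks(ref, slave_count, [])
    result = task_fun.() # load states, handle the event, write states
    Enum.each(slave_pids, &send(&1, {:finished, ref}))
    result
  end

  defp collect_acks(_ref, 0, pids), do: pids

  defp collect_acks(ref, remaining, pids) do
    receive do
      {:ack, ^ref, slave_pid} ->
        collect_acks(ref, remaining - 1, [slave_pid | pids])
    end
  end

  # Slave side: announce readiness, then block until the master is
  # done, so no other event for this key can run in the meantime.
  def run_slave(ref, master_pid) do
    send(master_pid, {:ack, ref, self()})

    receive do
      {:finished, ^ref} -> :ok
    end
  end
end
```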

I have another approach with a mutex I wrote that can lock multiple
keys without deadlocks, but it would require locking the keys for all
tasks, even the single-key ones, so a lot more message passing and a
bottleneck on the mutex. I expect to have way more events for a
single state than events involving multiple entities.
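For what it’s worth, the trick that mutex uses to avoid deadlocks is just acquiring the keys in a canonical order. A sketch (Mutex.lock/1 and Mutex.unlock/1 are illustrative stand-ins, not my actual API):

```elixir
defmodule MultiLock do
  # Acquiring keys in sorted order means two tasks can never each hold
  # part of the other's key set, so no deadlock is possible.
  def with_keys(keys, fun) do
    sorted = Enum.sort(keys)
    Enum.each(sorted, &Mutex.lock/1)

    try do
      fun.()
    after
      sorted |> Enum.reverse() |> Enum.each(&Mutex.unlock/1)
    end
  end
end
```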

In Elixir you can have thousands and even millions of processes on a machine; the BEAM is highly optimized and the amount of resources they cost is minimal.

As for state taking up memory, why would this be an issue? Your application needs state somehow; you will have to save it somewhere.

Ahh, cunning solution. I like it :smiley:


So, from what I can understand, your problem is that when you make a transaction between several nodes, that transaction needs to be atomic in the sense that no other operations can take place while the transaction is happening, so you can avoid inconsistent states.

This is a tough problem, but one with many solutions. One way of avoiding state on workers would be to have your state in an ETS table, and then have a mediator that changes and updates state for all processes. This mediator would eventually become a bottleneck, but it would also serve as a point of synchronization in your system. (Regarding bottlenecks, always benchmark first! A mediator will only be a bottleneck if it gets more messages than it can handle!)
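A minimal sketch of such a mediator, with the states in ETS and a single GenServer serializing all writes (names are illustrative):

```elixir
defmodule Mediator do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  # All updates funnel through this single process, so an update over
  # several keys is atomic with respect to every other caller.
  def update(keys, fun), do: GenServer.call(__MODULE__, {:update, keys, fun})

  @impl true
  def init(nil), do: {:ok, :ets.new(:starbase_states, [:named_table])}

  @impl true
  def handle_call({:update, keys, fun}, _from, table) do
    states = Map.new(keys, fn key -> {key, lookup(table, key)} end)
    new_states = fun.(states)
    Enum.each(new_states, fn {key, state} -> :ets.insert(table, {key, state}) end)
    {:reply, :ok, table}
  end

  defp lookup(table, key) do
    case :ets.lookup(table, key) do
      [{^key, state}] -> state
      [] -> nil
    end
  end
end
```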

Another possible solution would be the use of optimistic locks. Locking in DBs is a world of its own, but I find that optimistic locks, besides having great performance, do work in a lot of cases when you have few collisions:

https://docs.jboss.org/jbossas/docs/Server_Configuration_Guide/4/html/TransactionJTA_Overview-Pessimistic_and_optimistic_locking.html
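The idea, sketched against a hypothetical storage API with versioned reads and compare-and-set writes:

```elixir
defmodule OptimisticStore do
  # Read the state with its version, apply the change, and write back
  # only if the version has not moved in the meantime; retry on conflict.
  # Store.read/1 and Store.write_if_version/3 are hypothetical.
  def update(key, fun, attempts \\ 5)

  def update(_key, _fun, 0), do: {:error, :too_many_conflicts}

  def update(key, fun, attempts) do
    {state, version} = Store.read(key)
    new_state = fun.(state)

    case Store.write_if_version(key, new_state, version) do
      :ok -> :ok
      {:error, :version_conflict} -> update(key, fun, attempts - 1)
    end
  end
end
```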

Aside from that, the only remaining solution I can think of is defining a worker communication protocol, which you already have defined and explained in your post.

Anyway, hope it helps somehow!

1 Like

Hi,

In Elixir you can have thousands and even millions of processes […] Your application needs state somehow, you will have to save it somewhere.

Thank you for your answer. I know that I can have more processes than I would ever need, but yes, the main problem with that is memory. It is a game, so state can be pretty complex, and I do not want to have thousands of starbase states idling for nothing. With this architecture I can keep a low memory profile. Actually, I could have processes that hold state for faster reads/writes (instead of database/disk) and shut down after a timeout; whether the state is written to a transient GenServer with a timeout, to an ETS table, or directly to disk (I’m going to start with that) is another question I believe, the main problems being queuing updates and multi-state transactions.
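Such a transient state holder could look roughly like this (Persistence and Starbase are stand-ins for my actual modules):

```elixir
defmodule StarbaseServer do
  use GenServer

  @idle_timeout :timer.minutes(5)

  def start_link(starbase_id), do: GenServer.start_link(__MODULE__, starbase_id)

  @impl true
  def init(starbase_id) do
    # Keep the state in memory for fast access, with an idle timeout.
    {:ok, Persistence.load(starbase_id), @idle_timeout}
  end

  @impl true
  def handle_call({:event, event}, _from, state) do
    new_state = Starbase.handle_event(state, event)
    Persistence.save(new_state)
    {:reply, :ok, new_state, @idle_timeout}
  end

  @impl true
  def handle_info(:timeout, state) do
    # Idle long enough: free the memory, the state is already on disk.
    {:stop, :normal, state}
  end
end
```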

Another thing with processes is error handling. If I want to act on two states and roll back on errors, my state-holding, update-handling processes will need a way to cancel the last update, or at least a savepoint/commit mechanism. Whereas when handling both state updates in a single process, rolling back is just a matter of not saving the new states.

So, from what I can understand, your problem is that when you make a transaction between several nodes, that transaction needs to be atomic in the sense that no other operations can take place while the transaction is happening, so you can avoid inconsistent states.

Not nodes, just states. Everything that I describe will happen on a single node. But yes, it is a transaction/concurrency problem. A mediator would be roughly equivalent to having only one worker. It is simpler, that is true, but I like the idea of having concurrency there. Maybe that is my mistake.

Optimistic locks are interesting and may be applicable in my case; I will do some research on that point.

Thank you very much.

1 Like

IMO worrying about this is premature; there are a lot of other problems you’ll need to solve before this particular one needs optimization.

What you’re describing with ack and finished messages sounds a lot like two-phase commit; there’s a substantial literature describing different approaches to solving that coordination problem.

The theme of your game has me wondering: should these interactions be instantaneous? The speed of light is already an issue in modern distributed systems, and interstellar distances (even with silly-fast FTL communication) would make it even more so.

Another thought: some real-world transactions are coordinated through escrow services. What about setting up something like that? A possible process:

  • party A wants to buy 1000 widgets from party B for a total of $10000
  • party A transfers the $10000 to the escrow service
  • party B transfers the 1000 widgets to the escrow service
  • the escrow service releases the money and the widgets

In a system context like this, you’d typically have the “transfers” above include a timeout, so that if the escrow doesn’t complete, the resources eventually revert to their original owners.

One note: the name escrow service shouldn’t be read as implying there’s one process doing this. It would seem simpler to spin up a new process per escrow interaction, to keep the state machine straightforward.
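A rough sketch of one such per-trade escrow process with the timeout-based reversal (release/refund are placeholders for the actual transfers):

```elixir
defmodule Escrow do
  use GenServer

  @timeout :timer.minutes(10)

  def start_link(trade), do: GenServer.start_link(__MODULE__, trade)

  def deposit(pid, kind, value), do: GenServer.cast(pid, {:deposit, kind, value})

  @impl true
  def init(trade), do: {:ok, %{trade: trade, money: nil, goods: nil}, @timeout}

  @impl true
  def handle_cast({:deposit, kind, value}, escrow) when kind in [:money, :goods] do
    escrow = Map.put(escrow, kind, value)

    if escrow.money && escrow.goods do
      # Both sides delivered: complete the exchange and stop.
      release(escrow)
      {:stop, :normal, escrow}
    else
      # Still waiting for the other party; re-arm the timeout.
      {:noreply, escrow, @timeout}
    end
  end

  @impl true
  def handle_info(:timeout, escrow) do
    # One party never delivered: return whatever was deposited.
    refund(escrow)
    {:stop, :normal, escrow}
  end

  # Placeholders: hand the money and goods to their new owners, or
  # send the deposits back to their original owners.
  defp release(_escrow), do: :ok
  defp refund(_escrow), do: :ok
end
```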

3 Likes

Hey, thank you for your insight.

It could be seen as premature given two points: using a process to model game entities is the go-to architecture; using transient processes to read, update, then write state is more complicated and is an optimization.

I don’t think that those two points are always true.

I think the problem with processes is transient state: with an escrow process I will need a way to model many more changes in state:

  • money has been transferred, goods have not been received
  • goods have been transferred, money not received
  • received goods
  • received money
  • cannot receive goods, receive the money back

Not a big deal, but if I apply this pattern to all interactions in the game it will grow fast. On the other hand, having a process that can handle two states seems simpler (see the sketch after this list):

  • check that both parties can fulfill the trade
    • decrease money, increase goods (what I said in the OP is misleading: the goods transfer is immediate in state, the moving time is simulated)
    • increase money, decrease goods at once
  • or cancel
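Something like this, where rollback is just returning an error instead of the new states (field names are made up):

```elixir
defmodule Trade do
  def trade(buyer, seller, price, goods) do
    # Both parties must still be able to fulfill the trade.
    if buyer.money >= price and seller.cargo >= goods do
      buyer = %{buyer | money: buyer.money - price, cargo: buyer.cargo + goods}
      seller = %{seller | money: seller.money + price, cargo: seller.cargo - goods}
      {:ok, buyer, seller}
    else
      # Nothing has been written yet, so there is nothing to undo.
      {:error, :cannot_fulfill}
    end
  end
end
```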

Plus I need to inform players about those changes, so I have to carry logs around, whereas with a single function handling the state I can just write logs at the end of the change: “this has been done”.

I don’t know if that makes sense to you.

Maybe trade was not a good example. Imagine docking a ship to a starbase, this is more “instantaneous”.

Deciding on this design has a lot of implications across the whole codebase. I seem to be defending my point, though I am still considering using processes to model entities; I do not have a strong opinion on this, sadly.

Thanks again :slight_smile:

Have you considered using mnesia? It has transactions, hence locks for free. You don’t even need to store things in it; you can just use it to acquire locks inside a transaction on a given “key” of an imaginary table. Although I think it would be a good option to store at least some things there (the current resources for each entity? even if after each transaction you dump the changes somewhere else). Unless you’re really constrained in terms of memory, you can probably store millions of rows in it before you have a problem.
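A sketch, assuming a :starbase_locks table has been created at startup (it can stay empty, it only exists to lock against):

```elixir
defmodule StarbaseMnesia do
  # Run `fun` with write locks held on both starbase ids. The locks are
  # held until the transaction ends; mnesia queues or restarts
  # conflicting transactions, so deadlocks are handled for us.
  def with_locks(id_a, id_b, fun) do
    :mnesia.transaction(fn ->
      :mnesia.lock({:record, :starbase_locks, id_a}, :write)
      :mnesia.lock({:record, :starbase_locks, id_b}, :write)
      fun.()
    end)
  end
end
```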

Another option is to use the global module; it can be used to set locks as well. The lock can be any term ({starbase_id, :ore}), along with another term as a reference to the process requesting the “lock” (a worker_id, for example). So when a worker receives the trade request, it would create a lock for the selling entity on the item it’s selling, and another lock for the buying entity on the :money resource. After completion it would release those locks.

This means other trades with any of the involved entities could still happen, as long as they didn’t request a lock on the same resource(s) for the same “starbases”.
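Sketched with :global.set_lock/1 (lock ids are illustrative; sorting them keeps overlapping trades from deadlocking):

```elixir
defmodule TradeLocks do
  # Lock only the resources this trade touches; other events for the
  # same starbases can proceed if they use other resources.
  def with_trade_locks(seller_id, buyer_id, fun) do
    lock_ids = Enum.sort([{seller_id, :goods}, {buyer_id, :money}])

    for id <- lock_ids, do: :global.set_lock({id, self()})

    try do
      fun.()
    after
      for id <- lock_ids, do: :global.del_lock({id, self()})
    end
  end
end
```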

Both options would be usable in a distributed setting as well. Perhaps they’re not very orthodox solutions, but I don’t see anything wrong with either.

I didn’t think of Mnesia :slight_smile:

Sounds like a good idea, though I am not comfortable with it, to say the least! I would have to measure memory usage, but it could fit.

As for global, it is like a mutex, and I already have a working solution with the mutex. The main problem is that most writes require locking a single key, and with a hashed queue and 100 workers I am good to go without locks.

Thank you.

Sorry I can’t be more useful but just as a quick pointer, this latest ElixirRadar article was extremely interesting:

I know you’re not doing Phoenix in this case (or I think so) but Horde and DeltaCrdt are amazing and under-appreciated libraries doing a good chunk of what you intend to do.

1 Like

Actually I do not need distribution, but I’ll read it anyway; it seems very interesting. And yes, I use Phoenix for browser clients!