Thought Experiment: Drawing a Pond Simulation

JEG2 · June 21, 2016, 12:17am

I’ve been playing with several simulations for a while now, to learn more about how Elixir’s processes work. Now I am cleaning up some ideas in preparation for a training at ElixirConf.

I’ve seen multiple things that work well and plenty that don’t. I have a lot of ideas about what I want to show and talk about. However, I would love to see how others would approach the same challenges. Maybe your ideas are even better than mine!

If you’re interested in contributing, I would love to hear how you would design a solution to the following problem. Don’t feel like you have to write code. You can just describe what processes and data structures you would use.

I may have followup questions, depending on how this goes.

Here’s the problem:

Draw a 64 lily pads wide by 128 lily pads high pond with 3,000 turtles and 3,000 frogs distributed randomly across them. No critter should share a pad. Each critter periodically counts surrounding neighbors of the same type. If the same type count is less than 30% of the total number of neighbors (up to eight), the critter will try to move to an open pad in a random direction, some random distance between one and five steps away.

Thanks in advance for any input you provide!

terakilobyte · June 21, 2016, 3:55am

Do they all evaluate the 30% rule at the same time or 1 by 1?

If they are below 30% but one of their species jumps in and raises above that threshold, should they still move or no?

Qqwy · June 21, 2016, 7:22am

All right, let’s give this a go.

64*128 + 3000+3000 = 14192. This is quite a lot, but I think I would still go for the full ‘everything is a process’ approach here, because it makes it the most easy to extend the system in the future (e.g. adding extra logic to a lilypad, adding extra logic to either of the critters, the moving logic, et cetera).

Every Lilypad is a process

Every Lilypad is a process. These are started in two steps by the main process (the ‘world’):

Fill a map with the {x,y} tuples as keys, and their values will be the PIDs of the Lilypond-process at that location. (optionally passing each of them their position at startup).
Connect all Lilypond processes, by telling each of them the PIDs of their horizontal, vertical and diagonal neighbours. When passed these PIDs, a Lilypad will start monitoring them.

When a Lilypad process crashes, this is seen by its neighbours, and it is then removed from their neighbours-list, so no critter can try to travel to a Lilypad that doesn’t exist anymore.

The World is linked with trapping exits to the Lilypads, so if one of them crashes, a new one is started that responds to the same location, which is then added to its neighbours (and the neighbours to itself).

Every Critter is a process.

Every Critter is a process. They are started by the world after the Lilypads have been generated. To do this, the map of (coordinates => pids) is shuffled, and then the first part of that is used as their new homes.

Critters are processes that determine ever so often if they want to move. They use Process.send_after to send themselves a message periodically to see if they are lonely (less than 30% of their eight neighbouring lilypads is taken. I in other words: less than three neighbours)

To check if they are lonely or not, they do the following:

Ask the Lilypad they are standing on for a list of its neighbours
Ask each of these neighbours if it contains something.

If there isn’t enough, we have a list of neighbours that are empty, so we can pick one at random and travel in that direction. After traveling, we repeat this procedure, with the only exception being that the connection we travel in is now fixed (so if we cannot travel further in a certain direction because a Lilypad is already taken, we stop).

‘Travel’ means that the Critter unlinks and deregisters itself from the Lilypad it came from, and links+registers itself to the Lilypad it went to.

Problems with this approach

The only ‘glaring’ problem with this solution is what happens when a Lilypad with a critter on it crashes: Right now, the Critter is not restarted, it just ‘drowns’(disappears) when the Lilypad it stood on ‘sinks’(crashes). This might be prevented by having the Critter<->Lilypad connection use monitors instead, and have a Critter that had to swim because the Lilypad it stood on does no longer exist periodically ping the World process to ask if there is already a new Lilypad for this location.

To simplify all of the ‘check if a Lilypad exists at a certain location’ checks, one could use name registration for the processes. However, with the amount of processes we have here, this might actually slow the system down. So this is something we should measure before we go one way or the other.

edit: as suggested below, this answer also has problems with drawing the resulting state.

Why this approach?

All of the components are easily extendable.
When one Lilypad or Critter dies, the rest of the system is unaffected.
There is no single-process bottleneck. The World process is only used during startup and during ‘where is the new lilypad’ resolution in the case one of them has crashed and needs to be restarted.

JEG2 · June 21, 2016, 3:59pm

I’ve purposely tried not to specify timing, because I feel like that’s a very interesting variable here. I’m open to either approach.

Yes, next time the critter checks they would decide they are now unhappy and seek to move.

JEG2 · June 21, 2016, 4:15pm

This is a neat approach.

It does change the way the simulation works a little bit from the way I have seen it done in the past. I have seen a critter jump out of a group when surrounded (because they can move up to five spaces). Using your approach though, the intervening pads must also be open.

I do think this is a cool aspect of your processes-as-a-datastructure solution.

However, I think there’s one aspect you haven’t yet considered: how do we manage the drawing of the current state of the pond?

bbense · June 21, 2016, 4:41pm

However, I think there’s one aspect you haven’t yet considered: how do we manage the drawing of the current state of the pond?

MOAR PROCESSES…

The View of the pond is essentially a log of all the state changes. The tricky problem is ordering these state changes. One approach would be to use mnesia to create a virtual shared log, another approach would be to simply timestamp each state change and use standard logging services. The recent changes to time in Erlang 18 ( i.e. an absolute incremental timer) would be useful.

The current process model emits events, you need a separate set of processes to consume and visualize the events.

gregvaughn · June 21, 2016, 6:10pm

I could do it in a single process. I’ve done Conway’s Game of Life that way. This would be my approach if I were to interpret the critter movement as being a synchronized “turn”.

However, I suspect you’re looking to use this as an exercise in concurrency, so plan B would be to treat each critter as a separate process. I’d expect to have the lily pad map stored in an ETS table, likely with a helper function to get stats of what’s in neighboring cells. One other separate rendering process can get a snapshot of the ETS data and display it on a timed interval.

JEG2 · June 21, 2016, 7:30pm

Any thoughts on the differences of using ets verses an Agent for this? Just curious.

One problem I’ve seen some, using similar approaches to this, is the need to copy the current state into a rending process. A 64 by 128 pond is over 8,000 pads. Depending on the data structure, this memory copy can take time, enough to slow the rendering refresh rates down.

JEG2 · June 21, 2016, 7:33pm

14,192, actually.

gregvaughn · June 21, 2016, 7:51pm

ETS can allow parallel reads (IIRC). An Agent is going to serialize all reads. I haven’t done much with ETS yet, but this could be a great place to use match specifications. I need to try to find some spare time to try to code some of this. It’s starting to sound like fun

I’m getting all hand-wavey now because my knowledge is not that concrete. ETS tables can have a single owning process that is the only one allowed to write. This could be the one responsible for rendering. Each critter process would have to send a message in order to move. The receiving process could have a cached version of the data optimized for rendering that it updates as well as updating ETS.

I’m just guessing here without perf numbers, but that render-optimized data structure might be best served as an Erlang array because it needs random access updates.

JEG2 · June 21, 2016, 8:14pm

Question for the Processes All The Way Down Team: it seems that the only state we need to store is the current location of the critters and we could choose to keep that in critter processes or lily pad processes, so is there really a good argument for doing both?

JEG2 · June 21, 2016, 8:25pm

Sadly, this doesn’t eliminate memory copies, according to the documentation:

In the current implementation, every object insert and look-up operation results in a copy of the object.

gregvaughn · June 21, 2016, 8:32pm

I’m envisioning it only being written one cell at a time. And only read 8 cells at a time (or perhaps one single value if match specs can be used).

I was suggesting that the rendering process could be the writer/owner of the ETS table precisely so it can have an in-memory cache, optimized for rendering, of the whole lake and never have to read the entirety of it from ETS.

Qqwy · June 21, 2016, 8:49pm

Oh snap. I don know why I wrote down that completely unrelated number ^^'.

Yes, you are right. I did not consider this. This is sort of a problem, unless you decide that the speed at which you want to see the state does not need to be as fast as the simulation itself.

What I think is the best solution, is to have a single global, name-registered ‘observer’ process that receives notifications each time a Lilypad’s state changes. It does not contain references to any of the Lilypads, it only receives information like {x, y, type_of_critter_inside}. This can sort of be seen as a digital equivalent of ‘looking at’ something (i.e. light rays(messages) travel from the things that change state to the entity that observes). This single process will become a bottleneck, but because it is at the edge of our system, the simulation itself is not affected by it.

Of course, if the simulation runs so quick that this outside process is unable to work through the messages quick enough, it will start lagging further and further behind in the past.

We can go even further overboard, introduce vector-clocks for all state-change messages the Lilypads send and create a (non-OTP) process, reading the message queue as LIFO (last-in-first-out) and discarding messages whose vector clock value isn’t higher as the one in the current representation we have.

Introducing these vector clocks probably actually means introducing two vector-clock-like-values: One for how often a Lilypad has been restarted (This needs to be kept track of in the ‘World’ process and passed along to the newly restarted Lilypad each time), and a second for how many state-change-messages this version of the Lilypad has sent. I would argue that timers are a bad solution here, as we do not actually care how fast something has happened, only how often something has happened. Time has many peculiarities (leap seconds, changing clocks, network delays, CPU temperature et cetera) that make it hard to determine what will actually happen in a system.

I think that doing this is taking it too far, but I think this is possible.

Yes, I completely agree with you. Conway’s Game of Life is suited a lot better for a single process because the whole computation of moment n+1 depends on the state of each of the cells of moment n. It’s possible to start a separate process for each new cell and pass it its old state plus its neighbours, but because the next-timestep computation is so simple, this would incur a lot of unnecessary overhead.

In a situation like this Pond simulation where entities act on the current state around them and influence the same state at that time, however, processes seem like a natural fit.

What would definitely be interesting to solve concurrently, though, is a Hashlife algorithm.

Qqwy · June 21, 2016, 8:56pm

If you store the position of on what lilypad it is only in the critter, how would the critter know how many neighbouring critters it has?

Embedding the information of the critters on the Lilypads instead of in their own process is possible. However, the main reason to split them into two, is because of a separation of concerns. If either of these things grows in complexity (say, we want to have special lilypads that only turtles can walk on), it can do so without directly affecting the other. Also, consider what happens when one of the things crashes: If a Lilypad-embedded-with-critter-info crashes, that critter is lost. If a Lilypad crashes whose critter is in a separate process, this process can wait for a new Lilypad to appear at its location.

Conversely, if a Lilypad-embedded-with-critter-info crashes while executing its critter-movement code, the whole Lilypad crashes as well. In the case these are separated, again only the critter crashes and is restarted.

sasajuric · June 21, 2016, 10:21pm

My first hunch is to model the world as a purely functional data structure and maintain it inside a single process. This means there would be some tick operation the result of which is that some critters move (i.e. the world transitions to the next state). A main benefit of this approach is consistency. When everything is in a single process you can get a consistent snapshot of the world and move it to the next state predictably.

That’s not the case with everything-as-a-process. Since processes are separate, you can’t get a snapshot of the world, so you’re always operating on possibly stale knowledge. For example, counting neighbours means asking 8 processes. When you get a last response, all of them might already be stale, since all entities run independently. It’s also unclear when each entity is moving its own state. There’s less guarantees about order and no guarantees about synchronism, which is quite non-deterministic. Perhaps that’s actually desirable when simulating real-life, but it’s certainly harder to reason about.

If you wanted, I think you could add randomness to a single process world. For example, you can have more frequent ticks, moving only some random part of the world in each tick.

A single process world might also help with drawing. After you tick the world, you know all the changes, so you can send them to the drawing process which could probably render them in a single pass.

I’m not convinced about error isolation benefits. It seems to me in everything-as-a-process approach the processes would mostly perform a couple of primitive operations, such as keeping an occupied state, counting neighbours, or selecting a next destination. It seems fairly unlikely these things will crash. I mean, it’s always a possibility, but the processes seem extremely simple at this point. It’s also unclear what should happen when an entity process crashes. Perhaps a drawing process should be notified about this and clear the entity from display. Or maybe a supervisor might restart and resume the entity process, but then you’re restoring the state, so the process might crash again. The fault-tolerance point seem too fine-grained for this case. I’d probably opt to kill the entire world if something goes wrong.

I’m also skeptical about the performance/scalability benefits of multiple processes. Again, it seems these processes would do very little, so the cost of message passing will probably shadow possible gains of utilizing multiple cores.

Processes are great, and they should be used extensively, but there’s always a trade-off. Otherwise, we’d be running every single line of code in a separate process

As you say, there are 14192 entities, which doesn’t seem so large Erlang is not known for raw CPU speed, but I’m positive it could work efficiently with that size, assuming proper data structures and algorithms are chosen. For some larger scale, I’d consider partitioning the world, then on every tick handle each partition separately. This would still keep synchronism (and thus consistency) while introducing some potential for parallelism.

Standard disclaimers apply of course It seems to me the problem is still a moving target, so it’s hard for me to identify with it its nuances. That said, single-process-world would be the road I’d explore.

It’s worth mentioning that Torben Hoffmann did an everything-as-a-world take on Conway’s game of life. You might find the video useful if you decide to take that approach.

Best of luck!

JEG2 · June 21, 2016, 10:43pm

Just to be clear, I totally respect the value of a single process approach. My reasoning for doing this exercise is that a simulation allows you to see the processes moving around. I’m using it as a teaching tool.

gregvaughn · June 21, 2016, 10:57pm

I’ll add one more point here that I had meant to mention earlier in the thread. Francesco Caesarini’s talk from the first ElixirConf was great. It discussed a lot about scalability bottlenecks. He mentioned at one point that you should use processes for each “concurrent activity” in your system. At the time that was vague to me, but I’ve come to understand it better and embrace it myself. If you take this approach, then there is no reason at all to have the cells of the pond be processes. The critters – sure, but not the pond cells.

If you use processes for the wrong sorts of things, they become bottlenecks. I understand the value of “MOAR PROCESSES” as an educational tool, however I caution you from letting students go away with the idea that is the best way to scale a system (without an analysis of what the concurrent activities are, that is).

terakilobyte · June 22, 2016, 3:53am

I’m still pretty new to programming and especially Elixir, but I think I’d start this off with just lily pads as processes with a world process over it.

They can have 4 possible states: empty, occupied by turtle and has neighbors open, occupied by frog and has neighbors, occupied but locked(ratio around it can never change or no open pads within range).

From the world process pulse an update call where each pad determines if the resident will move. When the pads report back, the movement happens as a message from the world process to each pad that will have a state change.

The overall world picture could be kept in a table, allowing asynchronous reads from the pads and used by the world to restart a pad with correct state.

As the program runs pads that will never change could let the world process know and then be skipped on subsequent pulses.

JEG2 · June 22, 2016, 4:00am

Point taken. I’ll be careful.

My current solution has a lot less processes than some are recommending here.