An experimental implementation of actors that do not die

This experiment is similar to other techniques such as the virtual actor model implemented by Orleans.

I have used the term entity to refer to virtual actors. As much as possible I have tried to use the term process when talking about actors as you might understand them from experience with the BEAM.

Example Entity

defmodule MyApp.PingPong do
  use Pachyderm.Entity

  def init(_id), do: nil

  def activate({:ping, client}, nil), do: {[{client, :pong}], :pinged}
  def activate(:pong, nil), do: {[], :ponged}
end
  1. activate returns a tuple consisting of a list of {address, message} pairs to send to other entities AND the new state of the entity (see the sketch after this list)

  2. If an activation fails, the entity continues to exist with its previous state. Because entities are effectively immortal there is no concept of linking or monitoring.

  3. This model chooses consistency first, within a given entity.
    This is not suitable for all parts of a system; I can imagine a structure where a set of these entities forms the data spine of a larger system.
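
Roughly, the contract an entity module satisfies can be pictured like this (a sketch only: the typespecs and callback names are inferred from the example above, not a published Pachyderm.Entity behaviour):

# Inferred contract, for illustration; the real Pachyderm.Entity behaviour may differ.
defmodule EntityContract do
  @type address :: {module(), String.t()}
  @type message :: term()
  @type state :: term()

  # Called the first time an entity is activated; returns its initial state.
  @callback init(id :: String.t()) :: state()

  # Handles one message; returns outbound {address, message} pairs plus the new state.
  @callback activate(message(), state()) :: {[{address(), message()}], state()}
end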

Running entities

There are only two actions that can be done with an entity.

  1. Send it a message
  2. Follow an entity and receive a message every time its state changes

iex> alias Pachyderm.Ecosystems.LocalDisk
# Pachyderm.Ecosystems.LocalDisk
iex> alice = {MyApp.PingPong, "alice"}
# {MyApp.PingPong, "alice"}
iex> bob = {MyApp.PingPong, "bob"}
# {MyApp.PingPong, "bob"}

iex> LocalDisk.follow(alice)
# {:ok, nil}
iex> LocalDisk.follow(bob)
# {:ok, nil}

# Alice pings Bob
iex> LocalDisk.send_sync(bob, {:ping, alice})
# {:ok, :pinged}

iex> flush()
# {{MyApp.PingPong, "bob"}, :pinged}
# {{MyApp.PingPong, "alice"}, :ponged}
# :ok

Status

At the moment this is a proof of concept for a single node, but I hope to extend it shortly.
See README for more info.

At the moment I’m just looking for comments: is this a good idea? Is it valuable in the context of Elixir/Erlang?

6 Likes

This is definitely valuable in the context of the BEAM. Are you aware of erleans, Partisan, LASP and Firenest? (I am on my phone on the way to a plane right now so my apologies for not including links in this post)

I would like to discuss the idea of not providing links and monitors: How do you handle crashes? Does the Entity then just keep its inconsistent state until the end of time rather than being restarted with a consistent one?

2 Likes

I’ve looked into all of them except Firenest.

State is only saved after a full execution. If the process crashes during an execution from S2 to S3, the new state (S3) is never saved, so the state of the actor remains consistent at S2.

This is why all outbound messages are part of the return value: if there is a crash then no side effects occur. In this case a crash can just be considered a lost message. This is a good thing because handling a lost message is always something that has to be dealt with anyway.
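
To illustrate the ordering being relied on here (a sketch of the idea, not Pachyderm's actual internals; the persistence and transport are passed in as functions because they are not shown):

defmodule ActivationSketch do
  # A crash anywhere before save_state leaves the previously saved state
  # untouched and sends nothing, so the failed activation simply never happened.
  def run_activation(module, address, message, current_state, save_state, dispatch) do
    {outbound, new_state} = module.activate(message, current_state)

    # Persist first: only a fully completed execution is ever saved.
    :ok = save_state.(address, new_state)

    # Side effects go out last, once the new state is durable.
    Enum.each(outbound, fn {to, msg} -> dispatch.(to, msg) end)

    new_state
  end
end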

4 Likes

I apologise for my briefness before ^.^.

The one potential problem I see with this approach is when one of the actors tries to update its state and crashes. It will then reset to the version before this latest execution step, but the next time it wants to do something, it will crash again: maybe the state that was saved before was already no longer a ‘valid state’. The ‘let it crash’ concept would be to reset the process to a known ‘valid state’.

Of course, it might be very difficult to manage this in a persistent actor system, but this is exactly why it is important to talk about it, to ensure this abstraction will not be leaky.

So picture the following scenario:

  1. I have a system with a persistent actor running in production.
  2. It turns out I made a mistake somewhere (I missed an edge case), meaning that the process now is in an inconsistent state, and unable to continue since making any further calls to it will result in a crash.
  3. How can I now reset the system to a working state? How do I ‘migrate’ a persistent actor in the face of these problems?

I think this is a very cool library, and as soon as I have some time, I’ll definitely look under the hood, because it is closely related to what I am currently thinking about a lot. I really like the idea of being able to switch distribution mechanisms, moving between CP, AP and easy-to-test.

:+1:!

2 Likes

Pachyderm now available on hex.pm

Very alpha at this point. The only supported persistence mechanism is local disk via DETS.

2 Likes

I’m thinking of state updates as indisputable facts. Because the number of followers (subscribers) can be unknown, once a state is accepted it cannot be retracted. It can be replaced by updating the code and sending new messages. This is the same level of constraint that event-sourced systems have to work with, where an event has to be applicable to an aggregate.
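
As a loose comparison (a toy example, not Pachyderm code), this is the familiar event-sourcing constraint that state must always be rebuildable by folding over the accepted events:

# Every accepted event must stay applicable; rebuilding state is a fold.
defmodule CounterAggregate do
  def apply_event({:incremented, n}, total), do: total + n
  def apply_event(:reset, _total), do: 0

  def replay(events), do: Enum.reduce(events, 0, &apply_event/2)
end

# CounterAggregate.replay([{:incremented, 2}, :reset, {:incremented, 5}])
# => 5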

1 Like

PgBacked consistency

So I now have a working implementation to ensure only one copy of an entity is running in a cluster of nodes, by using a DB to provide locks.

The implementation, when starting a worker, is:

  1. Check out the advisory lock from Postgres associated with the given entity.
  2. Then register the process in :global.

:global is used for discovery, and it is good at this. The only problem with :global is that it can suffer from split brain and allow two copies of the same process to exist. By using the DB as a lock service this problem is eliminated.

If a node cannot take out the lock or find the process it wants in :global, it stops processing the work for that entity. The choice is for consistency over availability.
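
A rough sketch of that start-up sequence (illustration only; the lock-key derivation and error handling are simplified, and the real implementation may differ):

defmodule LockSketch do
  # Acquire the per-entity advisory lock, then register for discovery.
  def start_worker(db_conn, entity_address) do
    # Advisory locks take a bigint key, so hash the {module, id} address.
    lock_key = :erlang.phash2(entity_address)

    case Postgrex.query!(db_conn, "SELECT pg_try_advisory_lock($1::bigint)", [lock_key]) do
      %Postgrex.Result{rows: [[true]]} ->
        # This session now holds the lock; advertise the worker for discovery.
        :yes = :global.register_name(entity_address, self())
        :ok

      %Postgrex.Result{rows: [[false]]} ->
        # Another node already owns this entity, so do not start work here.
        {:error, :already_running}
    end
  end
end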

Eventsourcing

As I eventually want to move away from having a DB as a dependency, I have to consider protocols for replicated state machines; the choices are Paxos derivatives or Raft. Because Raft uses log replication, I am wondering if modelling entities as event sourced by default is a better primitive.

5 Likes

To use Postgres advisory locks that are tied to a session, you cannot use Postgrex’s default connection settings, as it uses a random exponential (:rand_exp) back-off strategy. This means the Postgrex process continues to run when the database connection goes down, at which point all of your acquired session-based advisory locks will have been released by Postgres. I couldn’t find a way of determining with Postgrex when a lock has been released.

The workaround that I’m using for my Elixir EventStore is to start a separate, monitored Postgrex process which is only used for locks. It is configured with the backoff_type: :stop strategy (defined in DBConnection), which means the process will terminate with the connection. It also uses sync_connect: true, which blocks the caller of start_link while the initial database connection attempt is carried out. The configuration is designed to ensure the process is only running when the database connection is available. This Postgrex process is used to acquire all advisory locks and is monitored, so that after the connection fails and the process has stopped you can handle the advisory locks being released in your application.
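
A minimal sketch of that configuration (connection details are placeholders; backoff_type and sync_connect are the DBConnection options described above):

# Dedicated connection used only for advisory locks.
{:ok, lock_conn} =
  Postgrex.start_link(
    hostname: "localhost",
    database: "my_app_db",
    backoff_type: :stop,  # terminate when the connection is lost
    sync_connect: true,   # block start_link on the initial connection attempt
    pool_size: 1
  )

# Monitoring the connection process tells you when the session, and with it
# every session-scoped advisory lock, has gone away.
_ref = Process.monitor(lock_conn)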

It would be useful to provide this advisory locking behaviour in a stand-alone library. What are your thoughts?

Pachyderm is an interesting project which I’m keen to observe for inspiration.

4 Likes

I was thinking of extracting the advisory locking behaviour. It could be exposed as just a :via tuple, and it would be nice if it could match the Registry API. However, I think that combining bits of a distributed system is hard, so it might not be that useful on its own.

1 Like

For my use case I need to support multi-node deployment where the nodes aren’t connected to form a cluster. What I had in mind was an API where you can attempt to acquire an advisory lock; on success a PID is returned for a process which lives while the lock remains in place. You can then use standard Erlang monitors or links to be notified when the lock process terminates, indicating you’ve lost the lock.
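
As usage, something like this (AdvisoryLocks.acquire/2 and its return values are invented names to illustrate the proposed API, not an existing library; lock_conn is the dedicated lock connection from the previous post):

# Hypothetical API: acquiring returns a pid that only lives while the lock is held.
case AdvisoryLocks.acquire(lock_conn, "entity:alice") do
  {:ok, lock_pid} ->
    # Monitor (or link to) the lock process; its :DOWN message signals the lock was lost.
    _ref = Process.monitor(lock_pid)
    :start_work

  {:error, :lock_already_taken} ->
    :do_nothing
end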

I thought about this and decided that it wasn’t a good enough message for when you lose the lock.

Because the :DOWN message from the lock process terminating will stay in the mailbox until a receive is called, work is not automatically stopped on losing the lock.

Yes, that’s a good point. What do you consider a better approach: having the lock tracking process kill the acquiring process immediately?

I don’t think you can have a lock for a process that can behave arbitrarily, not without accepting some possibility of two processes running at the same time. I think that you need to wrap up the writing to the DB in a transaction that also checks the lock. This is why I am working on the Pachyderm project as a whole, certainly until I can convince myself that I know how to compose the pieces if I only promise guarantees for individual components.
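
One common way to get that property (not necessarily what Pachyderm will do) is a fencing-token style check: the entity row records which lock epoch currently owns it, and every write is made conditional on that epoch inside the same transaction. The table, columns and variables below are placeholders:

# Sketch: the UPDATE only takes effect while this worker's epoch still owns the
# row, so a worker that has lost the lock cannot silently persist new state.
Postgrex.transaction(db_conn, fn conn ->
  %Postgrex.Result{num_rows: updated} =
    Postgrex.query!(
      conn,
      "UPDATE entities SET state = $1 WHERE id = $2 AND lock_epoch = $3",
      [new_state, entity_id, my_epoch]
    )

  if updated == 0, do: Postgrex.rollback(conn, :lock_lost)
end)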

This manifesto for Actor Database systems is very close to most of the goals I was looking to achieve, although I gave less priority to the indexing and querying. It is certainly an interesting read.

2 Likes

Have you investigated using disk_log? I’ve been considering it for some append-only accounting logs.

1 Like

I’ve used it for some experiments; however, it is not useful for this project. I want actors that do not die even if the machine they are on is burnt to the ground, and for that I need replication rather than just writing to disk.