ProcessHub - process distribution library

Good catch! Thanks for the PR!
First pull request for the project :tada:

2 Likes

Released v0.2.5-alpha (Release v0.2.5-alpha · alfetahe/process-hub · GitHub)

This release includes bug fixes and documentation improvements.

2 Likes

This is a bit of a naive question, but how does this compare to using the built-in pg module for keeping track of processes in a cluster and doing service discovery that way?

3 Likes

This is a very good question! My initial plan was to use the pg module to track the registered processes.

Using pg would solve some of the issues, the biggest probably being keeping track of registered processes across the cluster. However, I would still need to figure out the underlying mechanisms to distribute processes, react to cluster changes, migrate processes to keep the load balanced, handle process state handovers, manage process replications, deal with network splits, and more.

My initial requirements kept growing, and I realized I needed all of that. At this stage, using pg to track processes was not an option anymore because I needed a way to store much more information regarding the registered processes. I could have used pg along with my custom ETS tables, but I decided to store all the data in the ETS tables and synchronize the table myself.

In fact, ProcessHub relies on another library I built which is based on pg. This same mechanism is used to synchronize the ETS tables between the nodes. Basically, I use the pg module to emit process registration/unregistration events to keep the ETS tables in sync. I use the same library for other pub/sub related jobs as well.
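The event-emission idea can be sketched in a few lines of Elixir. This is an illustrative sketch only (the module, group, and table names here are made up for the example, not ProcessHub internals): each node runs one process that joins a pg group, registration events are broadcast to all group members, and each member applies the event to its local ETS table.

```elixir
defmodule RegistrySync do
  use GenServer

  @group :registry_sync
  @table :proc_table

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    # :pg (OTP 23+) would normally be supervised in the application tree;
    # started here only to keep the sketch self-contained.
    case :pg.start_link() do
      {:ok, _} -> :ok
      {:error, {:already_started, _}} -> :ok
    end

    :ets.new(@table, [:named_table, :set, :protected])
    :ok = :pg.join(@group, self())
    {:ok, nil}
  end

  # Broadcast a registration event to every group member
  # (one RegistrySync process per connected node, including this one).
  def register(child_id, pid) do
    for member <- :pg.get_members(@group), do: send(member, {:registered, child_id, pid})
    :ok
  end

  @impl true
  def handle_info({:registered, child_id, pid}, state) do
    # Each node applies the event to its local ETS copy.
    :ets.insert(@table, {child_id, pid})
    {:noreply, state}
  end
end
```

On a real cluster, `:pg.get_members/1` returns members from every connected node, so one `register/2` call updates every node's table; unregistration, conflict resolution, and netsplit handling are left out for brevity.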

I would recommend others to use the pg module when starting out if the distribution requirements are not that strict because it is a really simple and reliable option. Once the distribution requirements become more stringent, there are other mature libraries that can help, and ProcessHub aims to be one in the future.

A small fact: I wasn't initially planning to build such a library. I am actually writing software that needs this functionality, and at some point, I decided to extract the code into separate libraries. That's how ProcessHub was born. :slight_smile:

2 Likes

Version 0.2.6-alpha released

This version includes bug fixes, new API functions, and minor improvements to the documentation.

Changed

  • Replaced Cachex with our custom implementation to enhance performance.
  • Updated the default values for max_restarts and max_seconds to 100 and 4, respectively.
  • Storage module now accepts the table identifier as the first parameter, allowing it to be used with multiple tables.

Added

  • Introduced new API functions get_pids/2 and get_pid/2 to get the pid(s) by child_id.
  • New guide page for interacting with the process registry.
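As a usage sketch of the new lookup functions (the hub id `:my_hub` and child id `"worker_1"` are placeholders for this example, assuming a running hub with that child registered):

```elixir
# get_pid/2 returns a single pid for the child_id;
# get_pids/2 returns all pids, useful when a replication
# strategy runs copies of the process on several nodes.
pid  = ProcessHub.get_pid(:my_hub, "worker_1")
pids = ProcessHub.get_pids(:my_hub, "worker_1")
```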

Fixed

  • Corrected an issue where local supervisor restarts were not properly updating the global registry.
  • Fixed various typos and errors in the documentation.
1 Like

Thanks for another update!

In my current project I use :syn (GitHub - ostinelli/syn: A scalable global Process Registry and Process Group manager for Erlang and Elixir.) as the process registry behind a pub/sub mechanism, :libcluster (GitHub - bitwalker/libcluster: Automatic cluster formation/healing for Elixir applications) as an automated clustering mechanism, and ProcessHub for process distribution and control. It does seem like there is overlap (there is!), but each does a good job for what I use it for, i.e. I am using selected bits of each one, as they all have their strengths and weaknesses.

I was thinking: is it worth making the process tracking method selectable so you can use/make different implementations if you wish? I.e. yours, pg, syn, custom, etc., as each has trade-offs.

My project is just a hobby project, but it's intended to make a changing cluster of machines look like a single machine with a single memory space, and have the system behave as if it's a single machine, with automated handling of nodes joining and leaving the cluster and auto-migration of processes to keep the system complete and running.

It's a plaything, and speed and memory constraints are not of any concern. I intend to use a lot of processes like objects (similar to GitHub - wojtekmach/oop: OOP in Elixir!) but with all the GenServer stuff hidden behind a macro DSL. You could then submit a program to the cluster without knowing or caring about where it is running.

Well, enough rambling (sorry!); the joys of retirement are that you can play with all these interesting techs!!

Best wishes

Shifters

4 Likes

I was thinking: is it worth making the process tracking method selectable so you can use/make different implementations if you wish? I.e. yours, pg, syn, custom, etc., as each has trade-offs.

While I wouldn't rule out the possibility of making it configurable in the future, given that the project is still evolving, the process tracking and process registry are currently very coupled with other sections of the code, which makes them difficult to make configurable at this moment.
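If the tracking layer were ever decoupled, one way to make it selectable would be an Elixir behaviour that pg-, syn-, ETS-, or custom-backed adapters implement. This is purely a hypothetical sketch of such a contract, not part of ProcessHub's API:

```elixir
defmodule ProcessTracker do
  @moduledoc """
  Hypothetical pluggable process-tracking contract (not part of ProcessHub).
  An adapter backed by :pg, :syn, or ETS would implement these callbacks.
  """

  @callback register(hub_id :: atom(), child_id :: term(), pid()) :: :ok | {:error, term()}
  @callback unregister(hub_id :: atom(), child_id :: term()) :: :ok
  @callback whereis(hub_id :: atom(), child_id :: term()) :: pid() | nil
end
```

The hub would then dispatch to whichever adapter module is configured, at the cost of restricting itself to the lowest common denominator of what each backend can store.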

Well, enough rambling (sorry!); the joys of retirement are that you can play with all these interesting techs!!

I wish I could get there myself too :slight_smile:

1 Like

Good evening, version 0.2.9-alpha is out! :slight_smile:

It includes some bug fixes related to binary child_ids.

3 Likes

Hello,

Any idea when your library will be ready for production use? Is there any other library you would suggest in the meantime?

Can you open-source this? I'd love to take a look!

Hi, I plan to put the library through a testing phase in a real system within 6 months, but there are no guarantees.

The Horde library is very popular, and you might find it useful. Another option I recommend is building your own solution using Erlang's :pg module.

Well, I'm kind of a beginner in Elixir. I come from Node.js, I don't know Erlang, and I have 6 months ahead of me to release an MVP.
I don't think I'll have enough time in that timeframe to learn Erlang and develop my own distributed solution…

That's why I'm looking for a better solution. I could use Horde, but I read that some people had a lot of problems with it. Maybe the easy way out for now is just to save data to a DB or use something like Redis?

I can when (if!) it's finished/usable. I have restarted 3 times over 5 years, changing the tooling each time.

I have been taking a break from it recently (it's more a winter project), so there's not much to actually share at the moment.

The DSL is very much an 'in my head' design rather than something I have coded up yet! I'm still reading the Metaprogramming Elixir book and looking at other coding examples.

It's very much a plaything, so it has no schedule for development, etc.!

Shifters

I donā€™t know your specific requirements, which makes it hard to suggest anything.

Horde has been around for a while and is widely used, so I believe it performs well for its intended purpose.

You can also try this library, though it hasn't been battle-tested yet. It's still in development, but reported bugs will be addressed.

New version 0.3.0-alpha released! :muscle:
This release focuses on performance and stability improvements.

Changelog:

### Breaking changes
- `Hook.post_children_start()` now returns: `[{child_id(), result(), pid(), [node()]}]`
- Fixed error handling on process startup and shutdown, returning more information about failures and partial successes. Example: `ProcessHub.start_children/3` and `ProcessHub.stop_children/3` with `async_wait: true` now return `{:error, list_of_failures(), list_of_partial_successes()}` on failure. This does not affect successful operations.
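A caller awaiting results might branch on the new return shape like this (a sketch; the hub id `:my_hub` and `child_specs` are placeholders):

```elixir
result =
  ProcessHub.start_children(:my_hub, child_specs, async_wait: true)
  |> ProcessHub.await()

case result do
  {:ok, results} ->
    # every child started
    results

  {:error, failures, partial_successes} ->
    # decide whether to retry the failures or stop the partial successes
    IO.puts("#{length(failures)} failed, #{length(partial_successes)} started")
    {:error, failures}
end
```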

### Added
- Added new option `on_failure: :continue | :rollback` to the `ProcessHub.start_children/3`, `ProcessHub.stop_children/3` functions. This option allows the user to specify what should happen if the operation fails.
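With the new option, a caller that wants all-or-nothing semantics could request a rollback (a sketch; the hub id `:my_hub` and `child_specs` are placeholders):

```elixir
ProcessHub.start_children(:my_hub, child_specs,
  async_wait: true,
  # stop any children that did start if one of them fails
  on_failure: :rollback
)
|> ProcessHub.await()
```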

### Fixed
- `ProcessHub.Strategy.Redundancy.Replication` was not properly updating the redundancy_signal value on some occasions due to a race condition.
- Fixed issue with `ProcessHub.Janitor` not purging the cache properly when using Gossip protocol.
- State passing on hotswap migration with graceful shutdown fixed.
- Timeout option was not properly used in some cases.

### Changed
- Improved typespecs across the codebase.
- Improved overall performance of start/stop/hotswap migration operations that involved synchronization and large amounts of message passing, by using bulk operations.

I also conducted performance tests, and the results show that synchronous operations and migrations are now an order of magnitude or more faster.
Many potential issues have also been resolved.

There's still a lot to improve, but here are the results for spawning X distributed processes across a 10-node cluster and stopping them synchronously:

This includes all distribution logic, registry updates, hooks, event dispatching, and result messages for startups and shutdowns etc.

Up until 10k distributed processes in a 10-node cluster, the library does okay, imo: on average 2.6 seconds to start and stop everything synchronously. After 20k processes, the jump is pretty big.

Operating System: Linux
CPU Information: Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz
Number of Available Cores: 8
Available memory: 31.23 GB
Elixir 1.17.1
Erlang 27.0
JIT enabled: true
  • 10 nodes, 10 children
Name                                         ips        average  deviation         median         99th %
start_&_stop_processes-0.2.10-alpha        27.43       36.46 ms    ±67.09%       33.20 ms       84.64 ms
start_&_stop_processes-0.3.0-alpha        290.82        3.44 ms    ±46.01%        3.26 ms        6.51 ms


Memory usage statistics:

Name                                       average  deviation         median         99th %
start_&_stop_processes-0.2.10-alpha        2.31 MB    ±31.30%        2.83 MB        2.83 MB
start_&_stop_processes-0.3.0-alpha         2.08 MB    ±36.49%        2.82 MB        2.82 MB
  • 10 nodes, 100 children
Name                                         ips        average  deviation         median         99th %
start_&_stop_processes-0.2.10-alpha         2.50      400.57 ms    ±62.90%      421.35 ms      863.76 ms
start_&_stop_processes-0.3.0-alpha         57.59       17.36 ms    ±47.88%       18.34 ms       35.52 ms

Memory usage statistics:

Name                                       average  deviation         median         99th %
start_&_stop_processes-0.2.10-alpha       31.32 MB    ±12.02%       32.24 MB       32.24 MB
start_&_stop_processes-0.3.0-alpha        24.02 MB    ±33.93%       32.13 MB       32.13 MB
  • 10 nodes, 1000 children
Name                                         ips        average  deviation         median         99th %
start_&_stop_processes-0.2.10-alpha         0.22         4.49 s    ±27.37%         4.42 s         5.76 s
start_&_stop_processes-0.3.0-alpha          7.00      142.89 ms    ±50.88%      172.62 ms      297.17 ms

Memory usage statistics:

Name                                       average  deviation         median         99th %
start_&_stop_processes-0.2.10-alpha      326.14 MB     ±0.01%      326.14 MB      326.34 MB
start_&_stop_processes-0.3.0-alpha       248.12 MB    ±32.89%      324.90 MB      324.94 MB
  • 10 nodes, 10000 children
Name                                         ips        average  deviation         median         99th %
start_&_stop_processes-0.2.10-alpha       0.0116       1.44 min     ±0.00%       1.44 min       1.44 min
start_&_stop_processes-0.3.0-alpha          0.38         2.60 s    ±40.86%         3.09 s         3.11 s

Memory usage statistics:

Name                                       average  deviation         median         99th %
start_&_stop_processes-0.2.10-alpha        3.48 GB
start_&_stop_processes-0.3.0-alpha         3.04 GB    ±28.03%        3.47 GB        3.47 GB

At this point, testing the old version took too much time, so I only tested the newer version.

  • 10 nodes, 20000 children
Name                                         ips        average  deviation         median         99th %
start_&_stop_processes-0.3.0-alpha         0.193         5.18 s   ±111.63%         5.18 s         9.27 s

Memory usage statistics:

Name                                       average  deviation         median         99th %
start_&_stop_processes-0.3.0-alpha         5.75 GB    ±46.80%        5.75 GB        7.65 GB
  • 10 nodes, 30000 children
Name                                         ips        average  deviation         median         99th %
start_&_stop_processes-0.3.0-alpha        0.0277        36.09 s     ±0.00%        36.09 s        36.09 s

Memory usage statistics:

Name                                  Memory usage
start_&_stop_processes-0.3.0-alpha        11.47 GB
  • 10 nodes, 40000 children
Name                                         ips        average  deviation         median         99th %
start_&_stop_processes-0.3.0-alpha       0.00925       1.80 min     ±0.00%       1.80 min       1.80 min

Memory usage statistics:

Name                                  Memory usage
start_&_stop_processes-0.3.0-alpha        15.95 GB

Future releases will address the memory allocations.

4 Likes

Hi, we are currently running on a single node and due to scaling challenges (good problem to have!) we are looking at ProcessHub to move to a multi-node setup.

Here is the meat of our code that uses DynamicSupervisor and Registry:

  @doc """
  Start a child process or return an existing one by the given key
  - If a process already exists for the key, it is returned
  - If the process doesn't exist for the key, it will be created with config in `{mod, opts}`
  """
  @spec start_child(
          supervisor(),
          term(),
          {module(), keyword()}
        ) :: pid()
  def start_child(supervisor, key, {mod, opts}) do
    # the registry is actually the name of the supervisor
    registry = supervisor

    case lookup_child(registry, key) do
      # Process already exists, just return it
      pid when is_pid(pid) ->
        pid

      nil ->
        {:ok, dynamic_supervisor} = Registry.meta(registry, :dynamic_supervisor)
        via = {:via, Registry, {registry, key}}

        # The child will need to be registered with a via tuple so that we
        # have a stable process name to reference
        opts = Keyword.merge(opts, name: via, restart: :temporary)

        case DynamicSupervisor.start_child(dynamic_supervisor, {mod, opts}) do
          {:ok, pid} -> pid
          {:error, {:already_started, pid}} -> pid
        end
    end
  end

If I wanted to swap to using ProcessHub, what things would I use to replace DynamicSupervisor and Registry?

2 Likes

Hi, something like this should work:

  @spec start_child(
          atom() | binary(),
          {module(), function(), list()}
        ) :: pid() | {:error, term()}
  def start_child(key, {mod, func, args}) do
    hub_id = :my_hub

    case ProcessHub.get_pid(hub_id, key) do
      pid when is_pid(pid) ->
        pid

      nil ->
        child_spec = %{
          id: key,
          start: {mod, func, args},
          restart: :temporary
        }

        start_res =
          ProcessHub.start_child(hub_id, child_spec, async_wait: true)
          |> ProcessHub.await()

        case start_res do
          {:ok, res} ->
            # Should be in this format: {"somekey", ["node1@127.0.0.1": #PID<0.225.0>]}
            # Or get the pid from the registry
            # `ProcessHub.get_pid(hub_id, key)`
            elem(res, 1) |> List.first() |> elem(1)

          {:error, msg} ->
            {:error, msg}
        end
    end
  end

I also changed a few things:

  • No need to pass registry and supervisor references. You may want to pass the hub_id if you would like it to be dynamic or you're running multiple hubs on the same node.
  • We need to construct a Supervisor.child_spec() before sending it to ProcessHub (maybe in a future release {mod, args} will also be supported).
  • We no longer return only the pid; we may also return {:error, msg}, because this operation may involve message passing between nodes, and at that point you never know what might go wrong.

Let me know if anything's unclear.

2 Likes

Thanks! I will give this a try :slight_smile: