Poolex - A library for managing pools of workers

Hi :wave:t2:

Poolex is an Elixir library for managing pools of workers.

In short, this is poolboy rewritten in Elixir.

When I started this project, I had the following goals:

  • To solve the problem of missing documentation for public interfaces and "official" library usage examples.
  • To bring this library back to life. poolboy is not actively maintained. Even if poolboy is written perfectly, there is a chance of incompatible OTP changes in the future, or of new features appearing that we'd like to use.
  • To try rewriting this library in Elixir. It's not a problem to depend on Erlang libraries, but I'd like to use Elixir dependencies when I'm writing in Elixir.
  • To add the ability to use different strategies for getting a worker. I think a developer may have more needs than just choosing LIFO / FIFO, so I added the ability to describe and use custom implementations for operating with worker and caller process queues.

Some project links:

15 Likes

Release 0.7.0

Added FIFO implementation for getting workers from the pool. This is the same mechanism as the :fifo strategy in poolboy.

4 Likes

Release 0.8.0

Since the latest major release, many improvements have been made:

  • Shutting down the pool and its workers is handled more carefully.
  • Performance was improved by eliminating unnecessary ETS tables.
  • Fixed many bugs in handling exits of worker and caller processes.
  • Added missing validation for some initialization options.

Also, there are some breaking changes. I will quote the release note for 0.8.0:

  • Option :timeout renamed to :checkout_timeout.

    • Reason: This option configures only the waiting time for a worker from the pool, not the task's working time. This naming should be clearer at the call site.

      # Before
      Poolex.run(:my_awesome_pool, fn worker -> some_work(worker) end, timeout: 10_000)
      
      # After
      Poolex.run(:my_awesome_pool, fn worker -> some_work(worker) end, checkout_timeout: 10_000)
      
  • Poolex.run/3 returns the tuple {:error, :checkout_timeout} instead of :all_workers_are_busy.

    • Reason: A uniform response format from the function is easier to understand: {:ok, result} or {:error, reason}.
  • The Poolex.caller() type was replaced with a struct defined in Poolex.Caller.t().

    • Reason: We need to save unique caller references.
  • Poolex.run!/3 was removed in favor of Poolex.run/3. The new unified function returns {:ok, result} or {:error, :checkout_timeout} and no longer handles runtime errors.

    • Reason: We should not catch errors in the caller process. The caller process itself must choose how to handle exceptions and exit signals.
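Taken together, a call site under 0.8.0 might look like this (a sketch reusing the hypothetical some_work/1 helper from the examples above):

```elixir
case Poolex.run(:my_awesome_pool, fn worker -> some_work(worker) end,
       checkout_timeout: 10_000) do
  {:ok, result} ->
    result

  {:error, :checkout_timeout} ->
    # No worker became free within 10 seconds; decide in the caller
    # whether to retry, shed load, or raise.
    :overloaded
end
```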

As always, here are some important links:

2 Likes

Release 0.9.0

With this release, metrics were added to observe, analyze, and optimize the production configuration of your pools.

For now, pool size metrics are implemented: the idle/busy worker counts and the current answer to "Is the pool overflowed right now?". The library starts sending these metrics via :telemetry if you pass pool_size_metrics: true on pool initialization:

children = [
  {Poolex,
    pool_id: :worker_pool,
    worker_module: SomeWorker,
    workers_count: 5,
    pool_size_metrics: true}
]
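A consumer of these metrics could then attach a :telemetry handler. This is only a sketch: the event name and measurement shape below are assumptions, so check the Poolex documentation for the exact names your version emits.

```elixir
# Attach a handler that logs pool size metrics.
# The event name [:poolex, :metrics, :pool_size] is an assumption here.
:telemetry.attach(
  "log-pool-size",
  [:poolex, :metrics, :pool_size],
  fn _event, measurements, metadata, _config ->
    IO.inspect({metadata, measurements}, label: "pool size")
  end,
  nil
)
```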

Also, I have added a plugin for the PromEx library. You can check it out here.


2 Likes

Nice library, thought of writing something similar in the past. Will review. Thanks for your work! :heart:

2 Likes

Your kind words mean a lot to me. Thank you!

2 Likes

Release 0.10.0

Two new functions, add_idle_workers! and remove_idle_workers!, allow you to control the poolā€™s size at runtime.

For example, this is how you can add additional workers to an already running pool:

Poolex.add_idle_workers!(:my_lovely_pool, 5)

Dynamic pool size management can help you utilize resources more efficiently, depending on the workload.


Poolex package on Hex

4 Likes

Release 1.0.0

The new monitoring implementation no longer restricts pool_id to an atom. Now you can use any valid GenServer.name() as pool_id.

PromEx plugin also bumped to v1.0.0.

Thanks to @spacebat and @al2o3cr for their discussion, this helped me realize the naming problem :pray:

5 Likes

Hi @general-CbIC, I am facing this issue with poolboy. Do you think using poolex could help?

Hi @darnahsan :wave:

I am not sure about it because I haven't tried poolex with AMQP yet. You can quickly check it since it is easy to migrate from poolboy (see the Migration guide).

If I understand the problem correctly, AMQP connections terminate for some reason but do not leave the pool. It's strange because both libraries (poolboy and poolex) should monitor workers and restart them when they die.

Thanks, will give poolex a try. I think what happens to my AMQP connections is this: heartbeat=0 is negotiated but not actually respected · Issue #112 · rabbitmq/rabbitmq-erlang-client · GitHub. But why poolboy keeps the dead connections is a mystery.

1 Like

Hi, nice library. I've read the code and I have several questions:

  1. Why did you name the option pool_id instead of a more common name for name registration?
  2. poolboy uses high priority for the pool owner GenServer process. Why didn't you implement the same here?
  3. Why do you use an Agent for monitor reference storage? It seems redundant and just decreases the performance of the whole pool (I am preparing a PR with a change to a plain map).
  4. Why do you use the Project.Private.Module naming schema? It is the first time I have seen something like this. It is strange because these Private modules are exposed in the documentation, which suggests they can be used by developers (the opposite of Private).
  5. There's a very bad bug with the monitor_caller function, which spawns a monitoring process for every run call. This goes against the idea of having a pool of processes in the first place (since you end up spawning a process for every call anyway). I'd suggest moving caller monitoring into the handle_call get_idle_worker logic and not removing the monitor until the worker is released. Overall, following the checkout/checkin pattern would solve the problem. Plus, these processes will live forever while the caller is alive. Consider some long-living process (for example, the very common pattern of a GenServer which executes some command periodically) which calls this pool frequently. These monitoring processes would pile up until all memory is exhausted, which is essentially a memory leak and will result in the whole BEAM shutting down.
  6. Poolex does not handle worker start errors, and the whole pool will die with a MatchError if some extra worker fails to start. This makes the pool inapplicable in environments where workers connect to external services (like databases or HTTP services, which are the most common pooling use cases) that can be unavailable or return a 422, for example.
  7. Starting and stopping a supervisor manually is a strange approach. I'd suggest starting a one_for_all supervisor which has Poolex and the DynamicSupervisor as its children, instead of starting a DynamicSupervisor as a direct link from Poolex. This would make application stop faster and remove the need for the trap_exit flag.

Also, I'd like to mention that this approach of BusyWorkers and IdleWorkers modules managing the same structure is uncommon (I personally see it for the first time) but very nice to read, and it makes it really easy to follow the algorithm.


I've found that you work at Aviasales, and there's a chance that you use this library in production. If you want an expert review of your other solutions, codebase, and development practices to find more issues like I did just now, please leave me a message; my rates are low.

1 Like

Hi, @Asd!
(@hst337 ? :thinking:)

First of all, thank you very much for reading the code and describing the problems you found! Writing a project without code review was quite tricky, and I could no longer see the issues you wrote about myself.

Why did you name the option pool_id instead of more common name for name registration?

Initially, I used only atom() as the first parameter since I did not see the need to complicate it. I hadn't thought about using Registry, and it turns out that in my years of working with Elixir, I've never had to use anything other than atoms. When a pool was intended to have a unique atom as its identifier, pool_id was an appropriate name. Most likely, I made a mistake by not renaming this option when adding support for the GenServer.name() type, but I don't see it as very critical. I will add a :name alias, deprecate the old :pool_id key, and then slowly migrate it.

poolboy uses high priority for the pool owner genserver process. Why didnā€™t you implement the same here?

Unfortunately, I didn't understand what you are referring to. Can you please describe it in more detail?

Why do you use an Agent for monitor reference storage? It seems redundant and just decreases the performance of the whole pool (I am preparing a PR with a change to a plain map).

You're right! Thank you! It seems there was a reason for this in one of the previous implementations: when working with monitoring, I did not always have the State available. It looks like that reason is gone, but I forgot to check this and remove the separate storage for monitors.

I'll be waiting for your PR :slight_smile:

There's a very bad bug with the monitor_caller function, which spawns a monitoring process for every run call. This goes against the idea of having a pool of processes in the first place (since you end up spawning a process for every call anyway).

These new processes are very lightweight and not linked to the caller process. In any case, we need to monitor the caller, and I don't yet understand why a process that waits to see if the caller will crash is not suitable for this task.

Overall, following checkout/checkin pattern would solve the problem.

It seems that this is excessive control over the execution logic on the call side. When using a pool, you should just perform some operation on it instead of juggling worker processes. What do you think?

Plus, these processes will live forever while the caller is alive.

I'm afraid that's not right. The monitoring process is always killed after the worker is released.

Pay attention to this line: poolex/lib/poolex.ex at develop Ā· general-CbIC/poolex Ā· GitHub

Please tell me where I'm wrong if I'm missing something. :pray:

Poolex does not handle worker start errors, and the whole pool will die with a MatchError if some extra worker fails to start. This makes the pool inapplicable in environments where workers connect to external services (like databases or HTTP services, which are the most common pooling use cases) that can be unavailable or return a 422, for example.

This is an excellent observation. Thank you very much! I focused on carefully handling errors from an already running worker and forgot about controlling their launch.

Starting and stopping a supervisor manually is a strange approach. I'd suggest starting a one_for_all supervisor which has Poolex and the DynamicSupervisor as its children, instead of starting a DynamicSupervisor as a direct link from Poolex.

I didn't understand :frowning:
Why is this behavior strange, and how is it conceptually different from general supervisory behavior?

2 Likes

I missed this question. As far as I know, Elixir does not have private modules, and there is no way to limit module accessibility. I added the Private namespace to make it clear to the developer that they are doing something wrong.

First, the difference is in application stop. When the supervisor stops the Poolex process, this process has some time to handle messages and then terminate (since it has the trap_exit flag), and only then does the DynamicSupervisor start terminating. Usually, supervisors give children some time to terminate, and when a child takes too long, the supervisor kills it. With the DynamicSupervisor spawned as a direct link, there is no such timeout, and the supervisor of Poolex won't even know that there is a DynamicSupervisor terminating, so the DynamicSupervisor won't have time to terminate its workers. If these workers were HTTP connection acceptors, they would just die, while they could have terminated gracefully with a controllable timeout.

Second, the difference is in fault tolerance. Right now, if Poolex dies, all workers die with it. However, it would be possible to restart just Poolex without restarting the workers if the DynamicSupervisor and Poolex were started under a rest_for_one supervisor. Poolex would then just initialize its state and monitors from which_children of the DynamicSupervisor.

And you can monitor it in the Poolex process without spawning an extra one. This is what the checkout/checkin pattern is about: you store an association between a caller and a worker once a worker is found, and when the caller or worker dies, you release the one that is still alive.

I was talking about high priority for the dispatcher/manager process (the Poolex process in your case) so that it checks out processes faster, since it manages the queue of work on its own and there's no need to keep the messages in the mailbox. I thought that poolboy uses it, but it does not, while a lot of other pools with similar architecture do (like hackney and lhttpc). It would still make sense to use it in Poolex.
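For context, setting that priority in Elixir is a one-liner in the pool process's init (a sketch; build_state/1 is a hypothetical helper, not Poolex's actual code):

```elixir
# In the pool GenServer's init/1: raise the scheduler priority so the
# dispatcher keeps draining checkout requests under heavy load.
def init(opts) do
  Process.flag(:priority, :high)
  {:ok, build_state(opts)}  # build_state/1 is hypothetical here
end
```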

True, I am wrong, I missed this line completely.

Thatā€™s another user, not me.

But a developer can use the Private.DebugInfo and Private.Metrics modules, right?


And I also found one more bug. Consider this scenario:
A caller gets a worker and sends it a long-running job; then the caller dies, and Poolex just returns this worker to the idle queue. In this case, the next caller can receive a worker which is still executing the long-running job, and that caller won't be able to execute anything with it. So I have a feeling that it might make sense to just restart the worker when the caller dies, instead of releasing it.
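That suggestion could be sketched like this in the pool's :DOWN handling for callers. All helper names here (pop_busy_worker_for/2, stop_worker/2, start_worker/1, add_idle_worker/2) are hypothetical, not Poolex's actual internals:

```elixir
# Sketch: when a monitored caller dies, restart its worker rather than
# returning a possibly-still-busy worker to the idle queue.
def handle_info({:DOWN, _ref, :process, caller_pid, _reason}, state) do
  case pop_busy_worker_for(state, caller_pid) do
    {worker_pid, state} ->
      # The worker may still be running the dead caller's job.
      stop_worker(state.supervisor, worker_pid)
      {:ok, fresh_worker} = start_worker(state)
      {:noreply, add_idle_worker(state, fresh_worker)}

    nil ->
      {:noreply, state}
  end
end
```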

1 Like

LOL, I miss @hst337.

The "official" way to mark a module private is to add @moduledoc false; that way it doesn't show up in the documentation (I can find a citation if needed). Looking through your Private namespace, none of the moduledocs are very involved and could just be comments. Though I don't think it's a big deal either way, just pointing it out. Earmark uses a similar approach with Earmark.Internal.
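For reference, the convention looks like this (the module name below is illustrative):

```elixir
defmodule Poolex.Private.SomeInternals do
  @moduledoc false
  # Excluded from generated HexDocs, but still callable at runtime.
  # `@moduledoc false` is the community convention for "internal".

  def helper, do: :ok
end
```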

2 Likes

If I understand correctly, you propose changing the supervision tree as shown in the figure below. It seems that if the Poolex process receives an EXIT message, it starts executing the terminate() callback after a short time. The first step in this callback is to shut down the DynamicSupervisor: poolex/lib/poolex.ex at develop · general-CbIC/poolex · GitHub.

On the Application Supervisor side, both trees are represented through a single entry process: Poolex or Poolex Supervisor. I think that killing the entry process will result in the same termination logic in both trees.

I completely agree with this. However, it is unclear where you could keep the state with information about which workers are busy and which are not.


1 Like

I need to think about it.

Oh, cool! I will check this out :slight_smile:

Yep. And I can't do anything about it. I can only say that it is for internal library usage. I love Elixir, but unfortunately, it doesn't have a feature to hide library interfaces.

Thank you very much! I will add it to the issues so I don't forget :slight_smile:


However, I want to document private modules and functions for "future me" and other Poolex contributors. And hiding them from HexDocs is not the whole story, because developers still get access to each module's public interface even if it isn't documented.

1 Like

you store an association between a caller and a worker once

:wave:

I learned recently that it might not be necessary to store the associations in the monitoring process, since we can provide tags to :erlang.monitor/3 which will be included in the :DOWN message. And since the caller gives the ref and resource back on checkin, we don't need lookups there either.

def checkout(pool, callback, timeout) do
  {ref, resource} = GenServer.call(pool, :out, timeout)

  try do
    callback.(resource)
  after
    GenServer.cast(pool, {:in, ref, resource})
  end
end

# assume no queue
def handle_call(:out, {pid, _tag}, state) do
  # ... queue handling and stuff ...
  [resource | resources] = state.resources
  # monitor tags require OTP 24+; the tag replaces :DOWN in the message
  ref = :erlang.monitor(:process, pid, tag: {:DOWN, resource})
  {:reply, {ref, resource}, %{state | resources: resources}}
end

def handle_cast({:in, ref, resource}, state) do
  # ...
  Process.demonitor(ref, [:flush])
  {:noreply, %{state | resources: [resource | state.resources]}}
end

def handle_info({{:DOWN, resource}, _ref, :process, _pid, _reason}, state) do
  # ...
  {:noreply, %{state | resources: [resource | state.resources]}}
end

Disclaimer: I didn't read the whole thread or the Poolex code, but thought this little bit about monitors might be useful :slight_smile: Sorry if it's completely irrelevant to the discussion!

2 Likes

And maybe cleanups like GenServer.cast(pool_id, {:cancel_waiting, caller_reference}) can be replaced with process aliases (available since OTP 24 / Elixir 1.15), and the caller queue can maybe be replaced with just the message queue! Reading the docs a bit more carefully, it might not work. I'll need to try it out in a project tomorrow.
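The alias idea could look roughly like this (a sketch assuming OTP 24+ / Elixir 1.15+; the message shapes are illustrative, not Poolex's actual protocol):

```elixir
# The caller checks out via an alias; deactivating the alias on timeout
# makes any late reply from the pool vanish automatically, so no
# explicit :cancel_waiting message is needed.
def checkout(pool_pid, timeout) do
  ref = Process.alias()
  send(pool_pid, {:checkout, ref})

  receive do
    {^ref, worker} -> {:ok, worker}
  after
    timeout ->
      Process.unalias(ref)
      {:error, :checkout_timeout}
  end
end
```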

Either way, it'd be really cool if Poolex used all the modern features of the Erlang VM! Then it would be another good reason to use it over Poolboy :slight_smile:

1 Like