GenServer blocking in async task

nikody · June 25, 2020, 1:49pm

I have something similar to the following code:

    genserver_name = "cool_name"

    task = Task.async(fn -> MyModule.start() end)

    :timer.sleep(500)

    GenServer.cast(genserver_name, {:save, first_payload})
    GenServer.cast(genserver_name, {:save, second_payload})
    GenServer.cast(genserver_name, {:save, third_payload})

    response = Task.await(task)

MyModule.start/0 is a normal function (MyModule is not a GenServer) that starts a GenServer with the name of "cool_name" (the name iss always different, but I don’t think this is important for the problem).

Additionally, after starting the GenServer, MyModule.start/0 is blocked and waits for the GenServer to receive messages from other processes. It checks the GenServer state every second with a GenServer.call call and if it has the expected data, it unblocks and continues with processing it.

It normally receives the messages that update its state as expected in a few minutes, but I wanted to write a test for it, so I used the aforementioned code.

What I find strange is that if I remove the :timer.sleep/1 in the code above, the Task.await/1 times out and the GenServer.cast calls don’t get called at all and the GenServer state is always empty, the MyModule.start/0 function checks it every second with a GenServer.call and it is always empty. If I add the :timer.sleep/1, it is empty, for instance, only on the first check, then the casts run, the state gets populated, and everything is fine, the Task.await/1 does not raise an exception.

I wonder what is the reason for this blocking without the :timer.sleep/1 and why the addition of the timer solves it. I’m guessing it has something to do with the GenServer.call that checks the state every second, but I don’t think I can find the explanation with my current knowledge - it is called once every second and immediately returns the state (which in this case is just an empty map).

I am running this in an ExUnit test, but I think this is also not related to this specific issue.

jtsummers · June 25, 2020, 4:32pm

Because you’re spawning a task that then spawns the server you’re hitting a race condition. The named server doesn’t exist yet. Rather than using a timer (which either wastes time or cuts it too close and has a chance of failure), you can try GenServer.whereis. If it returns nil then the named server hasn’t been started yet. You could also have the spawned task send a message back to this process signaling that it has started the server, then use a receive in this process to wait before casting. That last one is what I would probably do.

nikody · June 25, 2020, 4:34pm

Ah, how didn’t I think of that! Thank you!

jtsummers · June 25, 2020, 4:58pm

No problem. I recommend drawing out problems like this. Here is my approximation of what’s happening as an image:

SoWkIImgAStDuIe0qfd9cGM9UIKAkOcPUQW2-NcP9Vb5A6B5gKLbgKLSN2352hfsS3M9oIMPPOabgJ01c88AY0x2VAW-e114164aZ9-I2XD1CSC6xWfa9N0XXzIy5A1F0G00

The blocks in each column represent the processes becoming active (their lifetimes). The casts may reach the server without the delay or a blocking receive, but most likely this is how it’ll play out.

In case you’re curious, here’s the plantuml source for it:

@startuml
participant main
participant monitor
participant server

main -> monitor
activate monitor
main ->x server
main ->x server
main ->x server
monitor -> server
activate server
deactivate server
main <-> monitor
deactivate monitor
@enduml

dimitarvp · June 25, 2020, 5:04pm

Which tool do you use for these diagrams?

jtsummers · June 25, 2020, 5:08pm

PlantUML. There’s a nice live preview in VS Code with it via the PlantUML extension by jebbs. I also make these sorts of diagrams (when I want more freeform charts) with Graphviz and use the Graphviz extension by João Pinto for live previews of those.

ityonemo · June 25, 2020, 8:59pm

One question that left me scratching my head: How were you able to get a GenServer named with a binary? Typically names are only allowed to be atoms, {:global, term}, or {:via, module, term}.

I feel like there’s something unidiomatic going on under the hood which is causing these issues. If you don’t mind a few suggestions:

Unless you have a really really really good reason to, you should typically be launching a GenServer from a function with the start or start_link of its own module. You might not do it if you have some sort of abstraction generator (I really don’t want to say “factory”, because, yuck).
Whether or not it’s coming from its own module, Conventionally a start or start_link function should probably output something that looks like {:ok, pid}, which you can then use to do your GenServer.cast/calls
Generally my rule of thumb is “you shouldn’t be using GenServer.cast unless you really know what you’re doing”. as you see here casts will silently fail; if call times out or drops an error, that might be saying something about your plans. When you deploy to prod, a failing call will be quite noisy and so you can track those and fix your code (or not, if it’s rare/innocuous/expected enough not be worth the effort)
GenServers are fundamentally an async- sort of thing, so wrapping a spawning a GenServer inside a Task.async in a test seems like a red flag; the test should probably be launching the GenServer itself. The pattern gives you a lot of power; start and start_link block on the completion of init, and should contain bootstrapping bits that live in the calling process; and init contains the bootstrapping bits that live in the child process. If you need to unblock the caller early before bootstrapping is done, and some bootstrapping things on the child process’ time, you can issue a :continue directive. There’s also tools like Connection library for other special cases.

Not to be too prescriptive, but It seems like you should be able to refactor your modules to conform to these parameters.

lud · June 26, 2020, 10:44am

Also genserver_name is not given to MyModule.start() in the example.

nikody · June 26, 2020, 11:31am

@lud @ityonemo Thank you for your responses!

@lud yes, you are right, but it knows the correct name. In the actual use case, this module receives various parameters and generates the GenServer name based on them. I just wanted to simplify the example and didn’t specify this, I’m sorry about that. Then, the other processes that update the GenServer state (which are actually controller actions), also receive a parameter that lets them know the name of the GenServer.

@ityonemo your observations are correct. In simplifying the code for my question, I made a few errors. We use {:global, "different_id"} for the GenServer name, and the GenServer itself is started with ServerName.start_link({:global, "different_id"} and in start_link/1 it has GenServer.start_link(__MODULE__, %{}, [name: name]). I hope this answers the first half of your response.

About the call and cast - I think you have a point. It would be safer with call, there is no concrete reason to use cast I think, although so far we have had no problems with it. I did not originally write this functionality, but I am trying to refactor it at the moment, and this suggestion is good, thank you!

As for the test itself, well, the case itself is a bit strange. I want to the test the module that starts the GenServer, not the GenServer itself. So in the test I call the module and it starts the GenServer, but as I pass the parameters to the module, I know the GenServer name.

Normally, this module waits for HTTP requests to an endpoint in the same application - the GenServer.cast calls are made in the controller once per request. Once all requests have been received, the GenServer’s state is complete and the other module unblocks and continues processing these responses. With these casts in the test I try to simulate that. Actually, it might be better if I mock the requests themselves and call the controller in the test, but I don’t find it necessary as I’m not aiming for an integration test, just a test for one specific module.

I know the blocking part while waiting for the GenServer state to get update from elsewhere might seem strange and has a downside, I will probably refactor this too at some point, but it will requite a lot of changes to other parts of the application too, that is why for the time being I want to test the current functionality.

In addition, @jtsummers, thank you fro the diagram and the suggestions!

ityonemo · June 26, 2020, 3:01pm

Cool! Just a few more things. Beware that {:global, id} is a bit dangerous in that if you ever go to distributed beam, startup of the process will require a full distributed transaction lock (with a not very fast consensus protocol, aka, not even Paxos/Raft IIRC) to make sure it’s “truly global”. That may not be what you want, and so Registry is the preferred way to have named GenServers in Elixir.

It seems like what you’re trying to refactor is classic “overuse of GenServers” by someone who thought of them as “objects” from an OOP world. Classic warning here in the docs: https://hexdocs.pm/elixir/GenServer.html#module-when-not-to-use-a-genserver. Generally my feeling is if you have a transient GenServer it’s probably not right (unless it’s a Connection).

I might structure it this way:

task = Task.Supervised.async(fn -> MyModule.do_the_things() end)
send(task.pid, {:save, first_payload})  # <- should probably also be refactored to MyModule.save(first_payload)
send(task.pid, {:save, second_payload})
send(task.pid, {:save, third_payload})
Task.await(task, a_reasonable_timeout)

where SomeModule.do_the_things() looks like:

def do_the_things() do
  first_payload = receive do {:save, first_payload} end
  ...
  second_payload = receive do {:save, second_payload} end
end

Although it’s not using call, you get process linkage safety, because Task.Supervised.async/1 explicitly links two processes, where if the spawned function dies the caller dies and vice versa.

nikody · June 26, 2020, 3:14pm

Thank you for the suggestions again! I haven’t used Task.Supervised yet, I’ll look into it. One possible complication might come from the fact that payloads in my case are saved in the state of the aforementioned GenServer from different processes. For instance, if I have 3 payloads, they are received by HTTP requests to the application and are saved individually from a controller that calls the GenServer. I believe that as it stands now the GenServer is used because by naming it with a global name it can be called both from MyModule and from the controller when it receives the payloads. It’s a strange case.

But I think it is possible to refactor the process by using a Task if it can be referenced from another process. This would probably mean that its pid should be save somewhere that is accessible from any process, which I guess could be an ETS table.

ityonemo · June 26, 2020, 9:31pm

If you need a canonical name, use registry it’s exactly what you propose (ets table). And you can certainly register a task with registry.