What would be best practices for setting up a GenServer in this scenario?

Hi, I would like to discuss a couple of scenarios for setting up and initializing a GenServer.

The thing I am trying to set up looks like this.

  • I want to initialize a GenServer under a supervision tree that caches data from another source (in this case, it retrieves data with a network call and saves it in the GenServer’s ETS table).

There are two scenarios that I would like to discuss for this setup:

  1. In this setup (let me call it setup_without_initialization), I would put a GenServer under the supervision tree, and in its init callback I would only create the ETS table without populating it. Another GenServer call, made later, would then populate it. I would also maintain a state in the GenServer that says whether the data is ready to be read, not initialized yet, or could not be fetched.

  2. In this setup (let me call it setup_with_initialization), I would put a GenServer under the supervision tree and populate it inside the init callback by fetching the data and storing it in ETS.

Of the two setups, which one would the community recommend?

Personally I think setup_with_initialization is the right way to do it.
Reasons:

  1. It saves us from possible race conditions. In the first scenario the GenServer has to wait for another call to load the real data, and in a truly concurrent system we can’t guarantee that the call that reads data from ETS always comes after the call that writes data to ETS.
  2. Callers no longer need to handle the extra cases (for example data_not_fetch, fail_to_fetch and more).
  3. If the GenServer fails to initialize, the supervisor will respawn it, and once it’s initialized it is ready to accept requests. (We might have to adjust the number of retries in the supervisor config so that the supervisor does not bring down the whole app.)
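The "number of retries" in reason 3 corresponds to the restart-intensity options of a supervisor. A minimal sketch, assuming a hypothetical `CacheTree.Supervisor` module and illustrative limits that are not taken from the thread:

```elixir
defmodule CacheTree.Supervisor do
  # Hypothetical supervisor; the module name and the limits are illustrative.
  use Supervisor

  def start_link(children) do
    Supervisor.start_link(__MODULE__, children, name: __MODULE__)
  end

  @impl true
  def init(children) do
    # Allow up to 10 restarts within 60 seconds before this supervisor gives up.
    # The defaults are 3 restarts within 5 seconds.
    Supervisor.init(children, strategy: :one_for_one, max_restarts: 10, max_seconds: 60)
  end
end
```

These limits apply to restarts after a crash. If a child fails inside init while the supervisor itself is starting for the first time, the supervisor fails to start instead of retrying.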

Please let me know your thoughts on which one you would recommend.
I would like to know what the community recommends.

There are a few flaws in your conclusions:

  • There are no race conditions within the GenServer. So if ETS doesn’t yield a conclusive answer, you can always fall back to a synchronous call to the GenServer.
  • You cannot get rid of the problems a network connection imposes on you, and you likely won’t want to use the supervision tree to handle its issues. Restarting cannot fix an unavailable network resource, and you already know about that, so it should not be a concern for a supervisor.
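A minimal sketch of the fallback described in the first point: callers read ETS directly, and on a miss they make a synchronous call to the owning GenServer, which handles one message at a time. The module name `FallbackCache`, the table name, and the one-argument `fetch_fun` are assumptions made for illustration.

```elixir
defmodule FallbackCache do
  # Hypothetical cache: ETS for the fast path, GenServer.call/2 as the fallback.
  use GenServer

  @table :fallback_cache_sketch

  def start_link(fetch_fun) when is_function(fetch_fun, 1) do
    GenServer.start_link(__MODULE__, fetch_fun, name: __MODULE__)
  end

  # Runs in the caller: read ETS first, ask the owner process only on a miss.
  def get(key) do
    case :ets.lookup(@table, key) do
      [{^key, value}] -> {:ok, value}
      [] -> GenServer.call(__MODULE__, {:fetch, key})
    end
  end

  @impl true
  def init(fetch_fun) do
    :ets.new(@table, [:named_table, :set, :protected, read_concurrency: true])
    {:ok, fetch_fun}
  end

  @impl true
  def handle_call({:fetch, key}, _from, fetch_fun) do
    # Re-check ETS: an earlier queued call may already have filled this entry.
    reply =
      case :ets.lookup(@table, key) do
        [{^key, value}] ->
          {:ok, value}

        [] ->
          case fetch_fun.(key) do
            {:ok, value} ->
              :ets.insert(@table, {key, value})
              {:ok, value}

            {:error, reason} ->
              # The caller still has to handle "no data".
              {:error, reason}
          end
      end

    {:reply, reply, fetch_fun}
  end
end
```

Because all misses are serialized through the GenServer, two callers missing the same key cannot both trigger a fetch. The error branch is still visible to callers, which matches the second point.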

Yes, there are no race conditions within the server.

But in this context, the request to read the data could arrive before the request that sets it.

Sure, but then you can queue a request to the GenServer asking for the data to be fetched. Or is the data not fetched by the GenServer itself?

Also, you’ll need to handle the “empty result” case anyway. What if your app starts up and you cannot connect to this network resource?

Yes, but how can you guarantee that the first call that comes to this GenServer
is fetch_and_set_value_of_ets?

After all, calls and casts can come from anywhere and from any process…

Do you mean this:

Let the requests come in. We save them in a queue or something,
and then, when the data is set, we go through the queue and reply?

Exactly. Your system should not be brought down because of a failed network call. Certain sub-systems may hold processing. But your application can’t assume everything will always work as it should.

This is a foundational idea in Erlang / Elixir. Handling errors and faults is a first class citizen.

This is why we see this pattern or many other variations of the same thing in such systems.

case do_something() do
  {:ok, :all_good} -> do_another()
  {:error, :oops} -> hold()
end

If some process makes a call to the gen_server and it doesn’t have the correct data, it should say so, and the caller should handle the situation accordingly.


I wouldn’t build it that way. If the GenServer is responsible for the data, it should be able to retrieve said data. The first call would make it do that, and later ones can just be fed the same response until all calls are fulfilled (the ones queued until ETS was populated).
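A minimal sketch of that idea, assuming a hypothetical `QueueingCache` module and a zero-argument `fetch_fun` that returns `{:ok, value}` or `{:error, reason}`. The first call triggers the fetch, callers are parked by returning `{:noreply, ...}`, and they are answered later with `GenServer.reply/2`.

```elixir
defmodule QueueingCache do
  # Hypothetical cache that fetches on first demand and parks callers meanwhile.
  use GenServer

  def start_link(fetch_fun) when is_function(fetch_fun, 0) do
    GenServer.start_link(__MODULE__, fetch_fun, name: __MODULE__)
  end

  def get, do: GenServer.call(__MODULE__, :get)

  @impl true
  def init(fetch_fun) do
    {:ok, %{fetch_fun: fetch_fun, data: :none, waiting: []}}
  end

  @impl true
  # Data is already there: answer immediately.
  def handle_call(:get, _from, %{data: {:ok, value}} = state) do
    {:reply, {:ok, value}, state}
  end

  # No data yet: the first parked caller triggers the fetch, nobody is answered yet.
  def handle_call(:get, from, %{waiting: waiting} = state) do
    if waiting == [], do: send(self(), :fetch)
    {:noreply, %{state | waiting: [from | waiting]}}
  end

  @impl true
  def handle_info(:fetch, state) do
    result = state.fetch_fun.()

    # Feed the same response to every parked caller, oldest first.
    Enum.each(Enum.reverse(state.waiting), &GenServer.reply(&1, result))

    # Keep only successful results, so a later call retries after a failure.
    data = if match?({:ok, _}, result), do: result, else: :none
    {:noreply, %{state | data: data, waiting: []}}
  end
end
```

In this sketch the fetch runs inside the GenServer, so the process is busy while it runs, and parked callers are still subject to the `GenServer.call/2` timeout (5 seconds by default).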


@zacksiri @LostKobrakai

In the second case that I am presenting, the whole system is never brought down.

This GenServer is never brought down either, because it is supervised (with a high number of retries per duration).

But by following the first case, we pass responsibility to the caller: if it calls before the state is set, it has to try fetching again.

As for the case that @LostKobrakai suggested, I am assuming my understanding is correct:

We let the calls come in and do not reply until the data is set, by having the GenServer maintain a queue of pending requests. Once it is ready, the GenServer replies to everything in that queue.

@blisscs in that case you are causing your system to block, which may put your system into a deadlock, since many other parts may depend on the caller of that gen_server.

I disagree with building a system that blocks because it is waiting for a network resource.

I believe systems should be asynchronous and if something can’t happen in that moment because something failed then that thing doesn’t happen. Instead you should design the system to be able to pause / resume when said resource is available.

You should make your system be able to handle failure, and recover itself.

If you want to protect your system from failed network calls, you have to look at what kind of data it is. If it’s something that rarely changes and you want maximum availability, store it in some kind of persistence layer: when the system comes online it attempts the call to the source, and if that fails it reads the data from the persistence layer instead.

It really depends on how you design your system. But having a call wait until the result comes can put your system into a deadlock.

Yes, but receiving a call before it’s ready is also an issue here.

It’s like reading from a cache without warming up the cache.

So make it ready with minimal effort: don’t put things in init that can cause your GenServer to fail. If you put your GenServer into the Application tree, it will be guaranteed to start.

So your GenServer will always be available when the app boots, and there will never be a call where the GenServer isn’t available. If you force your GenServer to make a successful network call, THAT will increase the chance for your system to fail. The GenServer should boot up, make the failable network call under a Task.Supervisor or something using async_nolink, and just let it be.

Also, if the cache is not ready, then it’s not ready, and it should tell the caller “I’m not ready”. The caller should be able to handle that. What it should definitely not do is block, get stuck, and become unavailable to even tell the caller that it’s “not ready”.
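A minimal sketch of this approach, assuming a hypothetical `WarmingCache` module, a `Task.Supervisor` that is already running under the name passed in as `:task_supervisor`, and a zero-argument `:fetch_fun` standing in for the network call. The init callback only starts the task, and callers get `{:error, :not_ready}` until a result arrives.

```elixir
defmodule WarmingCache do
  # Hypothetical cache that never blocks: it answers "not ready" until data arrives.
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  def get, do: GenServer.call(__MODULE__, :get)

  @impl true
  def init(opts) do
    task_sup = Keyword.fetch!(opts, :task_supervisor)
    fetch_fun = Keyword.fetch!(opts, :fetch_fun)

    # The task is monitored but not linked, so a crash in it cannot take this process down.
    task = Task.Supervisor.async_nolink(task_sup, fetch_fun)
    {:ok, %{ref: task.ref, data: :not_ready}}
  end

  @impl true
  def handle_call(:get, _from, %{data: {:ok, value}} = state), do: {:reply, {:ok, value}, state}
  def handle_call(:get, _from, state), do: {:reply, {:error, state.data}, state}

  @impl true
  # The task finished and sent its return value.
  def handle_info({ref, result}, %{ref: ref} = state) do
    Process.demonitor(ref, [:flush])

    data =
      case result do
        {:ok, value} -> {:ok, value}
        _other -> :fetch_failed
      end

    {:noreply, %{state | ref: nil, data: data}}
  end

  # The task crashed before returning anything.
  def handle_info({:DOWN, ref, :process, _pid, _reason}, %{ref: ref} = state) do
    {:noreply, %{state | ref: nil, data: :fetch_failed}}
  end

  def handle_info(_msg, state), do: {:noreply, state}
end
```

The sketch does not retry after a failure. A retry could be scheduled from the two failure branches, for example with `Process.send_after/3`.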


I have already mentioned that, but I’ll say it again: A network resource being unavailable is not something a process restart can fix and therefore is a bad fit for being handled by crashing and depending on the supervision tree. You should properly handle that failure case, which does stop you from the churn of constantly crashing and restarting a process, which does not help anybody. Your API around your cache needs to be able to handle all the possible cases, which might include “we just started up, the network resource is unavailable, we don’t have data yet”.

This is totally unrelated to whether you warm up your cache in the GenServer’s init callback or retrieve the data on first demand.


@LostKobrakai Agreed!

Thank you @zacksiri and @LostKobrakai for your thoughts.

I really appreciate it.

To summarise the discussion:

  • A GenServer’s initialization should always be free of failures.
  • When a GenServer is used as a cache, the caller should also be responsible for handling the case where the cache has not been populated.


I’ve got another question here. When fetching and storing data in ETS, should we send another message (call or cast) to this GenServer so that it does the fetching and updating of the cache itself?
Or should we fetch the data somewhere else and then pass it in with a call or cast?

I ask because a network call could make this GenServer fail again.


Another scenario I can think of: if we go with the case I am presenting,
then in the init callback, instead of making the network call directly, we spawn a process to make the network call for us. That process will either time out, reply with the result, or die. If it times out or dies, we set the GenServer state to cant_fetch_cache. If we get a response, we store the data in ETS.

I think doing this would eliminate my concern about the GenServer receiving a call or cast when it is not ready to answer, and would eliminate the data_not_fetch state.

That’s not a good generalization. It surely can fail, but it should only do so, if a restart of the genserver or any upstream supervisor might fix the issues and/or you want the app to be stopped if it can’t be resolved.

You can always generalize the case of “no data yet” and “couldn’t retrieve data” no matter when you actually query for data. To the caller the result is most likely the same, because the net result is “no data”.

The only difference between retrieving data in init or “on demand” is how long the first caller needs to wait for any result. If the data is not retrieved in response to the first call but e.g. by a third party, that’s the same as retrieving in init, having no result, and retrying later. It therefore falls more under the first option than the second.

You should not mix responsibility between the Process that is responsible for updating the cache and the cache itself.

I believe it’s better to have another process responsible for ensuring the cache is up-to-date and also handle cases where there are failures in the network.

Also, I would put the network call inside a Task under a Task.Supervisor and do an async_nolink on it. If it comes back successfully, great, it updates the cache. If not, it leaves the cache alone.

The process that manages the updating should not die because of a failed network call.

     ^ 1.)
     | async call (Task.Supervisor.async_nolink)
     |
-------------- 2.)update ---------             -----------
| Updater    |---------->| Cache |<-----------| Consumer |
--------------           ---------             -----------

Consumer will get {:ok, result} or {:error, :not_found} or {:ok, nil}

however you choose to handle it. 

The Cache and Updater will start successfully at application start, since they don’t do anything that can make them fail. So they will always be able to respond to cast/call.
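The diagram could be sketched roughly as follows. The module names `Sketch.Cache` and `Sketch.Updater`, the table name, the `refresh/0` trigger, and the shape of `fetch_fun` (returning `{:ok, list_of_key_value_pairs}`) are all assumptions for illustration.

```elixir
defmodule Sketch.Cache do
  # Hypothetical cache: owns the ETS table and does nothing that can fail at start.
  use GenServer

  @table :sketch_cache

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  # Consumers read ETS directly and always get an answer.
  def get(key) do
    case :ets.lookup(@table, key) do
      [{^key, value}] -> {:ok, value}
      [] -> {:error, :not_found}
    end
  end

  def put(key, value), do: GenServer.call(__MODULE__, {:put, key, value})

  @impl true
  def init(:ok) do
    :ets.new(@table, [:named_table, :protected, read_concurrency: true])
    {:ok, nil}
  end

  @impl true
  def handle_call({:put, key, value}, _from, state) do
    :ets.insert(@table, {key, value})
    {:reply, :ok, state}
  end
end

defmodule Sketch.Updater do
  # Hypothetical updater: the only process that knows about the network call.
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  def refresh, do: GenServer.cast(__MODULE__, :refresh)

  @impl true
  def init(opts) do
    {:ok, %{task_sup: Keyword.fetch!(opts, :task_supervisor), fetch_fun: Keyword.fetch!(opts, :fetch_fun)}}
  end

  @impl true
  def handle_cast(:refresh, state) do
    # 1.) run the failable call in an unlinked task
    Task.Supervisor.async_nolink(state.task_sup, state.fetch_fun)
    {:noreply, state}
  end

  @impl true
  # 2.) on success, update the cache
  def handle_info({ref, {:ok, entries}}, state) when is_reference(ref) do
    Process.demonitor(ref, [:flush])
    Enum.each(entries, fn {key, value} -> Sketch.Cache.put(key, value) end)
    {:noreply, state}
  end

  # Error results and task crashes leave the cache alone.
  def handle_info(_other, state), do: {:noreply, state}
end
```

Under a supervisor, the children could be listed as a `Task.Supervisor`, then `Sketch.Cache`, then `Sketch.Updater`, so the Updater never starts before the things it depends on.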

@LostKobrakai @zacksiri

After researching this further, I found a similar discussion about having an expensive init (the original discussion doesn’t talk about the init function failing). That discussion also covers the race condition a GenServer has when it is called before it is initialized.

The discussion can be read here: https://groups.google.com/forum/m/#!topic/elixir-lang-core/fLdVQDZcFo0

The discussion mentions post_init, delay_init and :gen_statem as proposals that could be implemented for an expensive init.

Even though that discussion reaches no final solution, I think the main cause of the problem there is conceptually the same as what I am presenting here.
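For context on the post_init idea: GenServer later gained a `handle_continue/2` callback (Elixir 1.7 on OTP 21), which lets init return right away and run follow-up work before any queued message is handled. A minimal sketch, assuming a hypothetical `ContinueCache` module and a zero-argument `fetch_fun`:

```elixir
defmodule ContinueCache do
  # Hypothetical cache that warms up right after init, before serving any call.
  use GenServer

  def start_link(fetch_fun) when is_function(fetch_fun, 0) do
    GenServer.start_link(__MODULE__, fetch_fun, name: __MODULE__)
  end

  def get, do: GenServer.call(__MODULE__, :get)

  @impl true
  def init(fetch_fun) do
    # Return immediately so the supervisor can move on to the next child.
    {:ok, %{fetch_fun: fetch_fun, data: :not_ready}, {:continue, :warm_up}}
  end

  @impl true
  def handle_continue(:warm_up, state) do
    # Runs before the first message in the mailbox is processed.
    data =
      case state.fetch_fun.() do
        {:ok, value} -> {:ok, value}
        {:error, _reason} -> :fetch_failed
      end

    {:noreply, %{state | data: data}}
  end

  @impl true
  def handle_call(:get, _from, %{data: {:ok, value}} = state), do: {:reply, {:ok, value}, state}
  def handle_call(:get, _from, state), do: {:reply, {:error, state.data}, state}
end
```

This removes the ordering problem (no call is handled before the warm-up has run), but callers still wait while the fetch is in progress and still have to handle the failure case.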

Which is why your init callback should not be expensive. Also, the Application module ensures all the processes in the Application supervision tree are started before the app is considered “started”. There is a reason why there are functions like “ensure_all_started” in the Application module.

I feel the concern of the thread is unfounded. If a GenServer takes long to start, it will eat into the boot-up time. It might be worth reading about the Application module in Elixir.

https://hexdocs.pm/elixir/Application.html#ensure_all_started/2

Also, there may be a reason why that thread has had no reply since 2016.

I believe if you have strong understanding of how supervision trees work these concerns will be gone.

I’ve built many kinds of applications, and from experience it has taught me what to do and what not to do in an init callback. When to use a supervisor, when to use a dynamic supervisor. Etc…


I think if it is necessary to have that data inside ETS before other processes run, you have kind of created a mini database as a dependency.

So you can compare that to how a typical web application handles SQL services (PGSql, MySql, etc.).

  • We make sure SQL is up and running before we start the web app. I think it is very rare for people to ask “What if the web application starts before SQL starts?”. We just make sure that we handle the starting order properly. SQL must be available before the web application runs.
  • When the SQL service is down, we tell Nginx or the load balancer to put a “Server down” page in front of the web application.

So I would say that the simplest way to handle this is to guarantee the startup order in the application start function.

Start ETS -> Fetch data to ETS -> Service Ready -> Start consumers

And then you make sure that if the Cache service dies, all consumers also become unreachable. It is just like how we normally put up a “Server down” page when the SQL server that backs the web application is dead.

If you can guarantee the startup order, and guarantee that when this service is down or ETS is not ready it takes all the consumers down, then the race condition is simply not possible. This makes things much simpler.

How to handle output to the user? Well, you need another layer on top of the consumers to provide output. Think of how we can put a “Server down” page on top of Nginx or a load balancer. You will need something like that, a layer on top of the consumers.

This complicates things a little bit, but IMO it is much easier to deal with this extra layer that monitors the readiness of an internal service than to deal with a race condition.
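One way to express that ordering is a supervisor with the `:rest_for_one` strategy: children start one after another in list order, and when a child terminates, it and every child started after it are restarted. A minimal sketch with hypothetical `Ordered.*` modules, a hypothetical table name, and a zero-argument `fetch_fun` returning `{:ok, list_of_key_value_pairs}`:

```elixir
defmodule Ordered.Cache do
  # Hypothetical cache that is only "ready" once the data is in ETS.
  use GenServer

  @table :ordered_cache_sketch

  def start_link(fetch_fun), do: GenServer.start_link(__MODULE__, fetch_fun, name: __MODULE__)

  def get(key) do
    case :ets.lookup(@table, key) do
      [{^key, value}] -> {:ok, value}
      [] -> {:error, :not_found}
    end
  end

  @impl true
  def init(fetch_fun) do
    # Start ETS -> fetch data into ETS -> only then report ready to the supervisor.
    :ets.new(@table, [:named_table, :protected, read_concurrency: true])

    case fetch_fun.() do
      {:ok, entries} ->
        :ets.insert(@table, entries)
        {:ok, nil}

      {:error, reason} ->
        {:stop, reason}
    end
  end
end

defmodule Ordered.Consumer do
  # Hypothetical consumer: it is started only after Ordered.Cache.init/1 has returned.
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  def read(key), do: GenServer.call(__MODULE__, {:read, key})

  @impl true
  def init(:ok), do: {:ok, nil}

  @impl true
  def handle_call({:read, key}, _from, state), do: {:reply, Ordered.Cache.get(key), state}
end

defmodule Ordered.Supervisor do
  use Supervisor

  def start_link(fetch_fun), do: Supervisor.start_link(__MODULE__, fetch_fun, name: __MODULE__)

  @impl true
  def init(fetch_fun) do
    # List order is start order; a cache crash also restarts the consumer.
    children = [{Ordered.Cache, fetch_fun}, Ordered.Consumer]
    Supervisor.init(children, strategy: :rest_for_one)
  end
end
```

The trade-off raised earlier in the thread still applies: if the fetch fails while the tree is booting, the supervisor fails to start, so the layer above it has to deal with that.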
