Hi, I would like to discuss a couple of ways to set up and initialize a GenServer.
Here is what I am trying to set up:
I want to start a GenServer under a supervision tree that caches data from another source (in this case, it retrieves the data with a network call and saves it in the GenServer’s ETS table).
There are two scenarios I would like to discuss:
In the first setup (let me call it setup_without_initialization), I would put a GenServer under the supervision tree and, in its init callback, only create the ETS table without populating it. A separate GenServer call, made later, would then populate it. The GenServer would also keep a status in its state that says whether the data is ready to be fetched, not initialized yet, or could not be fetched.
In the second setup (let me call it setup_with_initialization), I would put a GenServer under the supervision tree and initialize it inside the init callback by fetching the data and storing it in ETS.
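For reference, setup_with_initialization could look roughly like the sketch below. It is only a minimal sketch: `EagerCache` and `fetch_fun` (a zero-arity function standing in for the network call) are hypothetical names, and the ETS table is simply named after the module.

```elixir
defmodule EagerCache do
  use GenServer

  # `fetch_fun` is a hypothetical zero-arity function standing in for the
  # network call; it is assumed to return {:ok, [{key, value}]} or {:error, reason}.
  def start_link(fetch_fun) do
    GenServer.start_link(__MODULE__, fetch_fun, name: __MODULE__)
  end

  # Reads go straight to ETS from the caller's process.
  def get(key) do
    case :ets.lookup(__MODULE__, key) do
      [{^key, value}] -> {:ok, value}
      [] -> :error
    end
  end

  @impl true
  def init(fetch_fun) do
    table = :ets.new(__MODULE__, [:named_table, :protected, read_concurrency: true])

    # start_link/1, and therefore the supervisor starting this child,
    # waits until this network call has finished.
    case fetch_fun.() do
      {:ok, entries} ->
        :ets.insert(table, entries)
        {:ok, %{table: table}}

      {:error, reason} ->
        # start_link/1 returns {:error, reason} and the process exits.
        {:stop, reason}
    end
  end
end
```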
Of the two setups, which one is right, or which one does the community recommend?
Personally I think setup_with_initialization is the right way to do it, because:
It saves us from possible race conditions. In the first scenario, the GenServer has to wait for another call before it holds the real data, and in a truly concurrent system we can’t guarantee that a call reading from ETS always arrives after the call that writes the data to ETS.
Callers no longer need to handle the extra cases (for example data_not_fetch, fail_to_fetch and so on).
If the GenServer fails to initialize, the supervisor will respawn it, and once it is initialized it is ready to accept requests. (We might have to adjust the number of restarts in the supervisor configuration so that the supervisor does not take the whole app down.)
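On the restart configuration: a minimal sketch of the supervisor options involved, with a hypothetical `CacheWorker` child. `max_restarts` and `max_seconds` are the actual Supervisor options (defaults 3 and 5). One caveat: they only govern restarts of a child that has already started. If a child’s init/1 fails while the supervisor itself is starting, the supervisor does not retry; it shuts down the children it already started and fails to start, which at boot means the application fails to start.

```elixir
defmodule CacheWorker do
  # Hypothetical child, only here so the supervisor has something to start.
  use GenServer

  def start_link(arg), do: GenServer.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(arg), do: {:ok, arg}
end

defmodule CacheSupervisor do
  use Supervisor

  def start_link(opts), do: Supervisor.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    children = [{CacheWorker, :ok}]

    # Tolerate up to 10 restarts within 60 seconds before this supervisor gives
    # up and exits itself (the defaults are 3 restarts within 5 seconds).
    Supervisor.init(children, strategy: :one_for_one, max_restarts: 10, max_seconds: 60)
  end
end
```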
Please let me know which one you would recommend.
I would like to know what the community recommends.
There are no race conditions within the GenServer itself, so if ETS doesn’t give a conclusive answer you can always fall back to a synchronous call to the GenServer.
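A minimal sketch of that read path, with hypothetical names (`ReadThroughCache`, `put_all/1`): callers read ETS directly, and only on a miss do they make a synchronous call. The GenServer processes one message at a time, so its answer to that call is consistent with the load status it tracks.

```elixir
defmodule ReadThroughCache do
  use GenServer

  @table :read_through_cache

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  # Fast path: read ETS from the caller's process.
  # Slow path: on a miss, ask the GenServer, which serializes all requests,
  # so its reply reflects a consistent view of the cache status.
  def get(key) do
    case :ets.lookup(@table, key) do
      [{^key, value}] -> {:ok, value}
      [] -> GenServer.call(__MODULE__, {:get, key})
    end
  end

  # Hypothetical way for a loader to publish data through the GenServer.
  def put_all(entries), do: GenServer.call(__MODULE__, {:put_all, entries})

  @impl true
  def init(:ok) do
    :ets.new(@table, [:named_table, :protected, read_concurrency: true])
    {:ok, %{status: :not_loaded}}
  end

  @impl true
  def handle_call({:get, key}, _from, state) do
    reply =
      case {:ets.lookup(@table, key), state.status} do
        {[{^key, value}], _} -> {:ok, value}
        {[], :loaded} -> {:error, :not_found}
        {[], status} -> {:error, status}
      end

    {:reply, reply, state}
  end

  def handle_call({:put_all, entries}, _from, state) do
    :ets.insert(@table, entries)
    {:reply, :ok, %{state | status: :loaded}}
  end
end
```

A caller that gets `{:error, :not_loaded}` knows the data simply isn’t there yet, instead of having to guess from an empty ETS lookup.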
You cannot get rid of the problems a network connection imposes on you, and you likely won’t want to use the supervision tree to handle its issues. Restarting cannot fix an unavailable network resource, and since you already know about that failure mode, it should not be a concern for a supervisor.
I wouldn’t build it that way. If the GenServer is responsible for the data, it should be able to retrieve that data itself. The first call would make it do so, and later calls (the ones queued until ETS was populated) can simply be given the same response until all of them are answered.
@blisscs in that case you are making your system block, which may lead to a deadlock, since many other parts may depend on the process calling that gen_server.
I disagree with building a system that blocks while waiting for a network resource.
I believe systems should be asynchronous: if something can’t happen at that moment because something failed, then it doesn’t happen. Instead, you should design the system so it can pause and resume when that resource becomes available.
You should make your system able to handle failure and recover on its own.
If you want to protect your system from failed network calls, look at what kind of data it is. If it rarely changes and you want maximum availability, store it in some kind of persistence layer: when the system comes online it attempts the call to the source, and if that fails it reads the data from the persistence layer instead.
It really depends on how you design your system, but having a caller wait until the result comes in can deadlock your system.
So make it ready with minimal effort, and don’t put things in init that can cause your GenServer to fail. If you put your GenServer into the application’s supervision tree, it is guaranteed to have started once the application has booted.
So your GenServer will always be available once the app has booted, and there will never be a call to it while it isn’t running. If you force your GenServer to make a successful network call at startup, THAT increases the chance of your system failing. The GenServer should boot up, make the fallible network call in a Task.Supervisor task (or something similar) using async_nolink, and just let it be.
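A minimal sketch of that, assuming a `Task.Supervisor` is already running under the name passed as `:task_supervisor`, and with `:fetch` as a hypothetical zero-arity function standing in for the network call. init/1 only creates the table and returns; the fetch is started from handle_continue/2 (one option among several) via Task.Supervisor.async_nolink/2, so a crashing fetch arrives as a :DOWN message instead of taking the cache down with it.

```elixir
defmodule AsyncCache do
  use GenServer

  @table :async_cache

  # `opts` must contain :task_supervisor (the name of a running Task.Supervisor)
  # and :fetch (a hypothetical zero-arity function standing in for the network
  # call, assumed to return {:ok, entries} or {:error, reason}).
  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  def status, do: GenServer.call(__MODULE__, :status)

  @impl true
  def init(opts) do
    :ets.new(@table, [:named_table, :protected, read_concurrency: true])
    # Nothing here can fail; the fetch starts right after init/1 returns.
    {:ok, %{status: :loading, task: nil, opts: opts}, {:continue, :fetch}}
  end

  @impl true
  def handle_continue(:fetch, state) do
    # async_nolink: if the task crashes we get a :DOWN message, not an exit signal.
    task = Task.Supervisor.async_nolink(state.opts[:task_supervisor], state.opts[:fetch])
    {:noreply, %{state | task: task}}
  end

  @impl true
  def handle_call(:status, _from, state), do: {:reply, state.status, state}

  # The task finished and returned a value.
  @impl true
  def handle_info({ref, result}, %{task: %Task{ref: ref}} = state) do
    # We have the result, so drop the monitor and flush its :DOWN message.
    Process.demonitor(ref, [:flush])

    status =
      case result do
        {:ok, entries} ->
          :ets.insert(@table, entries)
          :loaded

        {:error, _reason} ->
          :fetch_failed
      end

    {:noreply, %{state | status: status, task: nil}}
  end

  # The task crashed before returning.
  def handle_info({:DOWN, ref, :process, _pid, _reason}, %{task: %Task{ref: ref}} = state) do
    {:noreply, %{state | status: :fetch_failed, task: nil}}
  end

  # Ignore anything else instead of crashing on an unexpected message.
  def handle_info(_msg, state), do: {:noreply, state}
end
```

Until the task reports back, status/0 returns :loading, so callers get an immediate answer in every state.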
Also, if the cache is not ready, then it’s not ready: it should tell the caller “I’m not ready”, and the caller should be able to handle that. What it should definitely not do is block, get stuck, and become unable to even tell the caller that it’s “not ready”.
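On the caller side, “not ready” is then just another return value to match on. A hypothetical consumer, assuming the cache’s read function returns `{:ok, value}`, `{:error, :not_ready}` or `{:error, :not_found}` (passed in as `cache_get` only to keep the sketch standalone):

```elixir
defmodule CacheConsumer do
  # Hypothetical consumer. `cache_get` stands in for whatever read function the
  # cache exposes; it is assumed to return {:ok, value}, {:error, :not_ready}
  # or {:error, :not_found}.
  def describe(cache_get, key) do
    case cache_get.(key) do
      {:ok, value} ->
        {:ok, "#{key} = #{inspect(value)}"}

      # "Not ready" is an expected answer, not an exception: degrade gracefully
      # (show a placeholder, retry later) instead of crashing or blocking.
      {:error, :not_ready} ->
        {:retry_later, "data is still loading"}

      {:error, :not_found} ->
        {:error, "no entry for #{inspect(key)}"}
    end
  end
end
```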
I have already mentioned this, but I’ll say it again: a network resource being unavailable is not something a process restart can fix, and therefore it is a bad fit for being handled by crashing and relying on the supervision tree. You should handle that failure case properly, which also spares you the churn of constantly crashing and restarting a process, which doesn’t help anybody. The API around your cache needs to handle all the possible cases, which might include “we just started up, the network resource is unavailable, and we don’t have data yet”.
This is completely independent of whether you warm up your cache in the GenServer’s init callback or retrieve the data on first demand.
A GenServer’s initialization should always be free of failures.
Given a scenario where a GenServer is used as a cache, the caller should also be responsible for handling the case where the cache has not been fetched.
I have another question here. When fetching and storing data in ETS, should we send another call or cast message to this GenServer to do the fetching and updating of the cache?
Or should we fetch the data somewhere else and then pass it in with a call or cast message, given that a network call could again make this GenServer fail?
Another approach I can think of, if we go with the case I am presenting:
In the init callback, instead of making the network call directly, we spawn a process to make it for us. That process will either time out, reply with the result, or die. If it times out or dies, we set the GenServer state to cant_fetch_cache; if we get a response, we store the data in ETS.
I think doing this would address my concern about the GenServer receiving a call or cast when it is not ready to answer, and would eliminate the data_not_fetch state.
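The spawned-process idea could be sketched as below, with `TimedLoader` and `fetch` as hypothetical names. The fetch runs in a monitored (not linked) process; a reply, an error tuple, a crash and a timeout all end in either :loaded or :cant_fetch_cache, so init/1 never fails. Note that init/1, and therefore the supervisor’s startup, still blocks for up to `timeout_ms`, which is the trade-off discussed in the replies.

```elixir
defmodule TimedLoader do
  use GenServer

  @table :timed_loader

  # `fetch` is a hypothetical zero-arity function standing in for the network
  # call, assumed to return {:ok, entries} or {:error, reason}.
  def start_link({fetch, timeout_ms}),
    do: GenServer.start_link(__MODULE__, {fetch, timeout_ms}, name: __MODULE__)

  def status, do: GenServer.call(__MODULE__, :status)

  @impl true
  def init({fetch, timeout_ms}) do
    :ets.new(@table, [:named_table, :protected, read_concurrency: true])

    # init/1 never fails, but it still blocks for up to timeout_ms.
    status =
      case fetch_in_process(fetch, timeout_ms) do
        {:ok, entries} ->
          :ets.insert(@table, entries)
          :loaded

        :cant_fetch_cache ->
          :cant_fetch_cache
      end

    {:ok, %{status: status}}
  end

  @impl true
  def handle_call(:status, _from, state), do: {:reply, state.status, state}

  # Runs fetch in a monitored (not linked) process, so a crash there cannot take
  # this GenServer down.
  defp fetch_in_process(fetch, timeout_ms) do
    parent = self()
    {pid, ref} = spawn_monitor(fn -> send(parent, {self(), fetch.()}) end)

    receive do
      {^pid, {:ok, entries}} ->
        Process.demonitor(ref, [:flush])
        {:ok, entries}

      {^pid, _error} ->
        Process.demonitor(ref, [:flush])
        :cant_fetch_cache

      {:DOWN, ^ref, :process, ^pid, _reason} ->
        :cant_fetch_cache
    after
      timeout_ms ->
        Process.exit(pid, :kill)

        # Wait for :DOWN, so a reply sent just before the kill is already in
        # the mailbox, then drop that late reply.
        receive do
          {:DOWN, ^ref, :process, ^pid, _reason} -> :ok
        end

        receive do
          {^pid, _late} -> :ok
        after
          0 -> :ok
        end

        :cant_fetch_cache
    end
  end
end
```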
That’s not a good generalization. Initialization can certainly fail, but it should only do so if restarting the GenServer or an upstream supervisor might fix the problem, and/or if you want the app to stop when the problem can’t be resolved.
You can always treat “no data yet” and “couldn’t retrieve data” as a single case, no matter when you actually query for the data. To the caller the result is most likely the same, because the net result is “no data”.
The only difference between retrieving data in init and “on demand” is how long the first caller has to wait for any result. If the data is not retrieved in response to the first call but, for example, by a third party, that’s the same as retrieving it in init, getting no result, and retrying later. It therefore falls under the first option rather than the second.
After researching this more, I found a similar discussion about having an expensive init (that discussion doesn’t talk about init failing). It also covers the race conditions a GenServer runs into when it is called before it has been initialized.
Which is why your init callback should not be expensive. Also, the Application module ensures that all the processes in the application’s supervision tree are started before the app is considered “started”; there is a reason why there are functions like ensure_all_started in the Application module.
I feel the concern in this thread is unfounded. If a GenServer takes a long time to start, it will eat into the boot time. It might be worth reading about the Application module in Elixir.
I think if that data must be in ETS before other processes run, you have essentially created a mini database as a dependency.
So you can compare it to how a typical web application handles SQL services (PostgreSQL, MySQL, etc.).
We make sure SQL is up and running before we start the web app. I think it is very rare for people to ask “What if the web application starts before SQL?”. We just make sure we handle the start order properly: SQL must be available before the web application runs.
When the SQL service is down, we tell Nginx or the load balancer to put a “Server down” page in front of the web application.
So I would say the simplest way to handle this is to guarantee the startup order in the application’s start function.
Start ETS -> Fetch data into ETS -> Service ready -> Start consumers
Then make sure that if the cache service dies, all consumers also become unreachable, just like we normally put up a “Server down” page when the SQL server backing a web application is dead.
If you can guarantee the startup order, and guarantee that when this service is down or ETS is not ready the consumers go down with it, then a race condition is simply not possible. This makes things much simpler.
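The startup order above maps onto a supervisor’s child list: children are started in order, each only after the previous one’s init/1 has returned, and the `:rest_for_one` strategy restarts every child listed after a crashed one. A minimal sketch, with hypothetical module names and a hard-coded insert standing in for the network fetch:

```elixir
defmodule MyApp.Cache do
  use GenServer

  @table :my_app_cache

  def start_link(_), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok) do
    :ets.new(@table, [:named_table, :protected, read_concurrency: true])
    # Hypothetical stand-in for the network fetch.
    :ets.insert(@table, [{:greeting, "hello"}])
    {:ok, %{}}
  end
end

defmodule MyApp.Consumer do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok) do
    # Safe: the supervisor only starts this child after MyApp.Cache.init/1 returned.
    [{:greeting, greeting}] = :ets.lookup(:my_app_cache, :greeting)
    {:ok, %{greeting: greeting}}
  end
end

defmodule MyApp.Supervisor do
  use Supervisor

  def start_link(_), do: Supervisor.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok) do
    # Cache first, consumers after it. With :rest_for_one, a crash of the cache
    # also restarts every child listed after it.
    Supervisor.init([MyApp.Cache, MyApp.Consumer], strategy: :rest_for_one)
  end
end
```

With this layout a failed fetch inside the cache’s init/1 at boot stops the whole supervisor, which is the “take the consumers down” behaviour described above; the user-facing layer then has to report that state.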
How do you handle output to the user? You need another layer on top of the consumers that provides the output, much like putting a “Server down” page on top of Nginx or the load balancer.
This complicates things a little, but IMO it is much easier to deal with this extra layer that monitors the readiness of the internal service than to deal with race conditions.