I haven’t been able to do any Elixir for a while since I had a baby, and we haven’t had any new Elixir projects at work. So, I want to create something small as a hobby project. I’ve seen a lot of buzz here about Sasa’s blog post on whether to spawn new processes or not, which led me to this thread.
What I’m thinking of doing is creating an app which checks if my sites are up and working. So, I can add information about a site: what URL to ping, content to check for, how often to check, etc. All good so far.
My initial thinking was to just spawn a process (GenServer) for each site that I add, which sends itself a message with Process.send_after or similar to know when it’s time to check that particular site. Thinking about it now, though, I could just create one process that does this for all sites. Would that be appropriate here?
With a single process there is no need to manage a bunch of processes (using Registry?), which makes things a bit easier. However, if the process crashes, it would affect all the sites, which is undesirable. Also, if a website check takes a long time, it would affect other sites whose schedules clash with it (I have no need for the checks to be exactly on time, but it would be nice if they didn’t affect each other).
Any thoughts? (All this buzz about not spawning unnecessary processes and avoiding databases makes me worried ;))
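For reference, here is a minimal sketch of the per-site GenServer idea described above. The module name, the shape of the `site` map, and the `check_site/1` stub are all placeholders, not a real implementation:

```elixir
defmodule SiteChecker do
  use GenServer

  # `site` is assumed to be a map like %{url: "https://example.com", interval_ms: 60_000}.
  def start_link(site), do: GenServer.start_link(__MODULE__, site)

  @impl true
  def init(site) do
    # Schedule the first check as soon as the process starts.
    schedule_check(site.interval_ms)
    {:ok, site}
  end

  @impl true
  def handle_info(:check, site) do
    check_site(site)
    # Re-arm the timer so the process keeps checking itself.
    schedule_check(site.interval_ms)
    {:noreply, site}
  end

  defp schedule_check(interval_ms),
    do: Process.send_after(self(), :check, interval_ms)

  defp check_site(site) do
    # Stub: a real implementation would issue an HTTP request
    # and verify the expected content is present.
    IO.puts("checking #{site.url}")
  end
end
```

Each site gets its own timer and its own crash isolation, at the cost of having to supervise N such processes.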
IMHO, your use case is exactly when you should use processes. You have things that:

- Can be run concurrently
- Have no interdependence
In the numerical computing world this is the “embarrassingly parallel” problem. If you have one of these, Elixir processes are a huge win. Unfortunately, most concurrency problems don’t fit in that box.
Where it gets tricky is when there is data dependence across processes. I don’t think there is any universal rule that can be applied in that case, but too many processes can be just as bad as too few. A lot of people coming into Elixir from other environments have misconceptions about Elixir processes and are hesitant to use them. This has created a surge of “processes for everything” posts/examples to help people adapt.
BEAM processes are cheap, but they aren’t free. There is definitely a line where processes-for-everything makes the code harder to work with, not easier. Finding where that line is for your problem is the hard part. The BEAM and the Actor model solve a lot of the simple technical problems with concurrency; they don’t solve the hard architectural ones.
In your described case, I would spin up a GenServer for each site for the reasons you list. Your GenServer can read the last known state from the database, check the site, and update the DB if the state changes, caching the state in the process. So, you only need to update the database and take action when the state changes. This presumes you are doing some kind of notification on state changes.
If the process fails for some reason, your GenServer will automatically get restarted and only needs to do one DB query and one site poll.
If you were using only one GenServer for all sites, then you would have x database hits and x site polls upon restart.
This should scale to tens of thousands of sites on a single server, IMHO.
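The check-and-cache flow above could be sketched like this. The `Storage` module is an in-memory stand-in for a real database, and `poll/1` is a stub for the actual HTTP check; all names here are hypothetical:

```elixir
defmodule Storage do
  # In-memory stand-in for a real database, just for this sketch.
  def start_link(), do: Agent.start_link(fn -> %{} end, name: __MODULE__)
  def last_status(site_id), do: Agent.get(__MODULE__, &Map.get(&1, site_id, :unknown))
  def save_status(site_id, status), do: Agent.update(__MODULE__, &Map.put(&1, site_id, status))
end

defmodule SiteMonitor do
  use GenServer

  def start_link(site_id), do: GenServer.start_link(__MODULE__, site_id)

  @impl true
  def init(site_id) do
    # One DB read on (re)start to recover the last known status.
    last_status = Storage.last_status(site_id)
    send(self(), :check)
    {:ok, %{site_id: site_id, status: last_status}}
  end

  @impl true
  def handle_info(:check, state) do
    new_status = poll(state.site_id)

    # Only hit the DB (and notify) when the status actually changes.
    if new_status != state.status do
      Storage.save_status(state.site_id, new_status)
      notify(state.site_id, new_status)
    end

    Process.send_after(self(), :check, 60_000)
    {:noreply, %{state | status: new_status}}
  end

  defp poll(_site_id), do: :up  # stub for the real HTTP check
  defp notify(site_id, status), do: IO.puts("#{site_id} is now #{status}")
end
```

On a restart the process does exactly one `last_status` read and one poll, as described.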
Only “a bit”, so I suggest not even thinking about the “easy” aspect. As I tried to argue in that blog post, use processes if they bring some runtime benefits.
This is a great analysis, and I believe the conclusion would be to use multiple processes.
Clearly they will bring many tangible benefits. If you had to deal with a larger number of sites, you might even want to consider adding a pool, which would limit the number of sites you’re checking at the same time. So you might end up with N + M processes, N being the number of sites and M the size of the pool (probably much smaller than N). And of course, a few supervisor processes would be needed as well.
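One simple way to sketch the pooling idea is `Task.async_stream`, where `max_concurrency` plays the role of M. The module name and the `check/1` stub are placeholders:

```elixir
defmodule PooledChecks do
  # Runs a check per site, but never more than `pool_size` at once.
  def run(sites, pool_size) do
    sites
    |> Task.async_stream(&check/1, max_concurrency: pool_size, timeout: 10_000)
    |> Enum.map(fn {:ok, result} -> result end)
  end

  # Stub standing in for the real HTTP check.
  defp check(site), do: {site, :up}
end
```

A dedicated pool library (or a hand-rolled pool of worker processes) would do the same job for long-lived checkers; `Task.async_stream` just shows the concurrency cap in the fewest lines.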
So in the post, I wasn’t advising being conservative with processes. I usually say in my talks that larger systems will easily reach hundreds of thousands, maybe even millions, of processes. There are many benefits to be gained if you split the total work over a large number of processes. So if you see potential for concurrency, definitely go for it.
However, using processes to simulate objects, or to separate logical concerns in a sequential problem, will not get you anywhere. It will bring some problems, and you’ll get none of the benefits of processes.
I also recently posted a bit about this topic here.
Also, for this particular problem, you could consider a cron library such as quantum. You install your jobs with the quantum GenServer, and they are then executed at regular intervals, with each job running in a separate task.
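For reference, a quantum setup might look roughly like this (assuming a quantum v3-style API; `MyApp` and `MyApp.Checks.ping_all/0` are placeholder names):

```elixir
# lib/my_app/scheduler.ex
defmodule MyApp.Scheduler do
  use Quantum, otp_app: :my_app
end

# config/config.exs — run the (hypothetical) check function every five minutes.
config :my_app, MyApp.Scheduler,
  jobs: [
    {"*/5 * * * *", {MyApp.Checks, :ping_all, []}}
  ]
```

The scheduler also needs to be added to the application's supervision tree as a child (e.g. `MyApp.Scheduler`).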
Sounds like a good plan from a learning perspective. It’s a pretty straightforward pattern to implement in Elixir and provides lots of learning potential: process timing APIs, handle_info vs handle_cast in GenServer, etc.