I’m new to Elixir and fairly new to the whole architecture of separate applications communicating with each other.
I’m working on a crawler for some webpages; I need to scan the whole web and store the results in a database.
This sounds like an excellent use case for GenStage, or perhaps even Flow, which is a streamlined and simplified way to use GenStage in your project.
Basically, they provide this “chain of producers and consumers, with parallelization” pattern in a library.
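To make that concrete, here is a minimal sketch of what a crawl pipeline could look like with Flow. It assumes you have added the flow hex package as a dependency; Crawler.Pipeline, fetch_page/1, and store_result/1 are hypothetical names standing in for your own HTTP client and database layer, not real library APIs.

```elixir
defmodule Crawler.Pipeline do
  # Sketch of a parallel fetch-and-store pipeline over a list of URLs.
  def run(urls) do
    urls
    |> Flow.from_enumerable(max_demand: 10)
    |> Flow.map(&fetch_page/1)      # fetch pages in parallel stages
    |> Flow.each(&store_result/1)   # persist each result as it arrives
    |> Flow.run()
  end

  # Placeholders: swap in a real HTTP client and DB writes here.
  defp fetch_page(url), do: {url, :fake_body}
  defp store_result({_url, _body}), do: :ok
end
```

Flow handles starting the producer/consumer stages and the demand-based backpressure between them, so you only describe the transformation steps.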
If you do want to roll the whole thing yourself, or have already ruled out GenStage for one reason or another, you can of course manage this yourself too. One way would be to have a DynamicSupervisor that spawns a supervisor per crawl job. That spawned supervisor would then start the set of workers required for the given crawl job, perhaps using start_child directly, which returns the pid of the child. The crawl supervisor could then collect those pids and pass the relevant ones around to the other children.
For instance, it could start the results manager using start_child, get its pid, and then when it starts the fetchers, pass in the pid of the results manager it already started as a parameter. The fetchers would then keep that pid in their state and use it for message passing. Rinse and repeat for every step in your chain.
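A rough sketch of that wiring, with the pid of the results manager handed to each fetcher at start time. All the module names here (Crawler.ResultsManager, Crawler.Fetcher, Crawler.Job) are made up for illustration, and the actual fetching is stubbed out:

```elixir
defmodule Crawler.ResultsManager do
  use GenServer
  # Collects results from fetchers in its state.
  def start_link(_), do: GenServer.start_link(__MODULE__, [])
  @impl true
  def init(_), do: {:ok, []}
  @impl true
  def handle_cast({:result, r}, acc), do: {:noreply, [r | acc]}
  @impl true
  def handle_call(:results, _from, acc), do: {:reply, acc, acc}
end

defmodule Crawler.Fetcher do
  use GenServer
  # Keeps the manager pid in its state and reports results to it.
  def start_link({url, manager}),
    do: GenServer.start_link(__MODULE__, {url, manager})

  @impl true
  def init({url, manager}) do
    send(self(), :fetch)
    {:ok, %{url: url, manager: manager}}
  end

  @impl true
  def handle_info(:fetch, %{url: url, manager: manager} = state) do
    # Placeholder for a real HTTP fetch.
    GenServer.cast(manager, {:result, {url, :fetched}})
    {:noreply, state}
  end
end

defmodule Crawler.Job do
  # Starts one crawl job: the results manager first, then one fetcher
  # per URL, passing the manager's pid into each fetcher.
  def start(urls) do
    {:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)
    {:ok, manager} = DynamicSupervisor.start_child(sup, {Crawler.ResultsManager, []})

    for url <- urls do
      {:ok, _} = DynamicSupervisor.start_child(sup, {Crawler.Fetcher, {url, manager}})
    end

    {:ok, sup, manager}
  end
end
```

In a real application you would start the per-job supervisor under an application-level DynamicSupervisor rather than with a bare start_link, but the pid-passing pattern is the same.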
The fragility inherent there is that if one of the processes crashes and restarts, its pid changes, and that will cause a ripple effect of brokenness: every process still holding the old pid is now pointing at nothing. So you would have to repair those links yourself.
A probably more robust approach: use Registry (or syn for something more flexible) to register the pids for a given crawl job, and then the processes can look each other up by name instead of holding raw pids.
But if you haven’t already looked at GenStage, I would definitely start there. It is the easy path.