Context in flow of multiple processes

Hello,

I’m new to Elixir and fairly new to the whole architecture of separate applications communicating with each other.
I’m working on a crawler for web pages: I need to scan the whole web and store the results in a database.

I have this architecture right now:

  Crawler.start()
     |- UrlManager.start_link()
     |- QueueManager.start_link()
     |- Fetcher.start_link() # 2x
     `- ResultsManager.start_link()

   Crawler.crawl("https://www.somedomain.com")
          | 
          v
   +--------------+       +--------------------------+
   | Url Manager  |  ---> | Queue Manager / Producer |
   +--------------+       +--------------------------+
         ^                            |
         |                            v
         |                   +--------------------+
          `------------------| Fetcher / Consumer |
                             +--------------------+
                                    |
                                    v
                           +-----------------------+
                           | Results Manager / DB  |
                           +-----------------------+

Currently one process calls another process directly, assuming the target is registered under its module name.
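For example, something like this (simplified; enqueue/1 is a hypothetical API and the real modules carry more state):

  defmodule QueueManager do
    use GenServer

    # Registered under the module name, so only one instance per node
    def start_link(opts \\ []) do
      GenServer.start_link(__MODULE__, opts, name: __MODULE__)
    end

    # Callers address the process by module name instead of by pid
    def enqueue(url), do: GenServer.cast(__MODULE__, {:enqueue, url})

    @impl true
    def init(_opts), do: {:ok, :queue.new()}

    @impl true
    def handle_cast({:enqueue, url}, queue), do: {:noreply, :queue.in(url, queue)}
  end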

Now the question is: what is a better approach if I need to crawl multiple domains?
I wanted to create instances of all the apps and wire them together like this:

  domain = "https://www.somedomain.com"
  {:ok, results_manager} = ResultsManager.start_link()
  {:ok, fetcher} = Fetcher.start_link(results_manager)
  {:ok, queue_manager} = QueueManager.start_link(fetcher)
  {:ok, url_manager} = UrlManager.start_link(queue_manager, domain)

The problem is that there is a cyclic dependency between Url Manager, Queue Manager, and Fetcher, so this approach probably isn’t a good one.

Is there any best practice for building a similar architecture in Elixir? How do you scope a flow through multiple apps?

Sorry for the newbie question …

This sounds like an excellent use case for GenStage, or perhaps even Flow, which is a streamlined, simplified way to use GenStage in your project.

Basically, they provide this “chain of producers and consumers, with parallelization” pattern in a library.
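As a very rough sketch of what your queue/fetcher pair could look like as stages (simplified: a real producer would buffer unmet demand and dispatch when new URLs arrive, and fetch/1 here is a stand-in for the actual HTTP request):

  defmodule QueueManager do
    use GenStage

    def start_link(opts), do: GenStage.start_link(__MODULE__, opts, name: __MODULE__)

    @impl true
    def init(_opts), do: {:producer, :queue.new()}

    # New URLs from the url manager land in the internal queue
    @impl true
    def handle_cast({:enqueue, url}, queue), do: {:noreply, [], :queue.in(url, queue)}

    # Hand out up to `demand` queued URLs when consumers ask for work
    @impl true
    def handle_demand(demand, queue) do
      {urls, rest} = take(queue, demand, [])
      {:noreply, urls, rest}
    end

    defp take(queue, 0, acc), do: {Enum.reverse(acc), queue}

    defp take(queue, n, acc) do
      case :queue.out(queue) do
        {{:value, url}, rest} -> take(rest, n - 1, [url | acc])
        {:empty, queue} -> {Enum.reverse(acc), queue}
      end
    end
  end

  defmodule Fetcher do
    use GenStage

    def start_link(opts), do: GenStage.start_link(__MODULE__, opts)

    # Subscribing in init gives you the back-pressured chain for free;
    # start two of these for your "2x" fetchers
    @impl true
    def init(_opts), do: {:consumer, :ok, subscribe_to: [{QueueManager, max_demand: 10}]}

    @impl true
    def handle_events(urls, _from, state) do
      Enum.each(urls, &fetch/1)
      {:noreply, [], state}
    end

    defp fetch(_url), do: :ok
  end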

If you do want to roll the whole thing yourself, or have already ruled out GenStage for one reason or another, you can of course manage this manually too. One way would be to have a DynamicSupervisor that spawns a supervisor per crawl job. That spawned supervisor would then spawn the set of workers required to perform the crawl job, perhaps using start_child directly, which returns the pid of the child. The crawl supervisor could then collect those pids and pass the relevant ones to the other children.

For instance, it could start the results manager using start_child, get its pid, and then when it starts the fetchers, pass in the pid of the results manager it already started as a parameter. The fetchers would then keep that pid in their state and use it for message passing. Rinse and repeat for every step in your chain.
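A sketch of that wiring, reusing the start_link signatures from your snippet (CrawlJob and the child-spec ids are names I made up):

  defmodule CrawlJob do
    # One supervisor per crawl job; spawn it from a DynamicSupervisor, e.g.
    #   DynamicSupervisor.start_child(Crawler.JobsSup,
    #     %{id: CrawlJob, start: {CrawlJob, :start, [domain]}, type: :supervisor})
    def start(domain) do
      {:ok, sup} = Supervisor.start_link([], strategy: :one_for_one)

      # Start each stage in dependency order, threading the pids forward
      {:ok, results} =
        Supervisor.start_child(sup, %{id: ResultsManager, start: {ResultsManager, :start_link, []}})

      {:ok, fetcher} =
        Supervisor.start_child(sup, %{id: Fetcher, start: {Fetcher, :start_link, [results]}})

      {:ok, queue} =
        Supervisor.start_child(sup, %{id: QueueManager, start: {QueueManager, :start_link, [fetcher]}})

      {:ok, _url_manager} =
        Supervisor.start_child(sup, %{id: UrlManager, start: {UrlManager, :start_link, [queue, domain]}})

      {:ok, sup}
    end
  end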

The inherent fragility there is that if one of the processes crashes and restarts, its pid changes, and that causes a ripple effect of brokenness. So you would have to repair those links.

A probably more robust option: use Registry (or syn for something more flexible) to register the pids for a given crawl job, and then the processes can look each other up through the registry.
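A sketch of that lookup side (Crawler.Registry, the {job_id, role} key shape, and the message formats are all just illustrative choices):

  # In the application's supervision tree:
  #   {Registry, keys: :unique, name: Crawler.Registry}

  defmodule Fetcher do
    use GenServer

    def start_link(job_id) do
      # :via registration: a restarted process re-claims the same key,
      # so others always find the current pid
      name = {:via, Registry, {Crawler.Registry, {job_id, :fetcher}}}
      GenServer.start_link(__MODULE__, job_id, name: name)
    end

    @impl true
    def init(job_id), do: {:ok, %{job_id: job_id}}

    @impl true
    def handle_cast({:fetch, url}, %{job_id: job_id} = state) do
      body = fetch(url)

      # Address the results manager by registry key rather than a stored
      # pid, so a restart on its side doesn't break this link
      GenServer.cast(
        {:via, Registry, {Crawler.Registry, {job_id, :results_manager}}},
        {:store, url, body}
      )

      {:noreply, state}
    end

    # Stand-in for the real HTTP request
    defp fetch(_url), do: ""
  end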

But if you haven’t already looked at GenStage, I would definitely start there. It is the easy path. :)

Hello @aseigo, thanks for the pointers. I’ve reworked it with Registry, so there is now an external repository of pids and I managed to remove the circular-dependency problem.

I’ve also used GenStage already, and thank you for the pointer to Flow; it looks interesting and I’ll possibly use it in a next iteration.