Your opinions on a web crawler architecture please

Thanks for the links.

As for Python/Scrapy vs Elixir… the Elixir project is going to be a pure learning project, so I’m not too worried about the tradeoffs.

Regarding the “specific clients for each site” you mentioned: the only site that needs special parsing rules etc. is the listing page. Each entry found on the listing page is parsed in a generic way, so I don’t need different clients for the entry websites.

I’ve got one question regarding supervisors/GenServers:

Assuming that I’ve got a module for handling the listing sites and a module for handling individual entry sites, how could I trigger a new “parsing job” for an entry once the listing-site crawler has found one?

I was thinking about creating a supervisor whose job it is to start the initial crawl of the listing site. The supervisor also stores a queue of all entries found so far.

What I then need is some way to pass a message from the listing crawler back to the supervisor once an entry is found.
Once the supervisor receives such a message, it stores the entry in the queue and launches the parsing of that entry, roughly as in the sketch below.
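
To make it concrete, here’s a rough sketch of what I’m imagining, modelled as a GenServer rather than a proper Supervisor since it holds the queue state (`ListingCrawler.crawl/2` and `EntryParser.parse/1` are hypothetical placeholders):

```elixir
defmodule Crawler.Coordinator do
  use GenServer

  # Sketch only: holds the queue of entries found so far and launches
  # a parsing task for each new entry reported by the listing crawler.

  def start_link(listing_url) do
    GenServer.start_link(__MODULE__, listing_url, name: __MODULE__)
  end

  @impl true
  def init(listing_url) do
    # Defer the initial crawl until after init returns.
    send(self(), {:crawl_listing, listing_url})
    {:ok, %{queue: :queue.new()}}
  end

  @impl true
  def handle_info({:crawl_listing, url}, state) do
    coordinator = self()
    # Hypothetical: the crawler reports entries back to this process.
    Task.start(fn -> ListingCrawler.crawl(url, coordinator) end)
    {:noreply, state}
  end

  @impl true
  def handle_info({:entry_found, entry}, state) do
    # Store the entry and launch its parsing job.
    Task.start(fn -> EntryParser.parse(entry) end)
    {:noreply, %{state | queue: :queue.in(entry, state.queue)}}
  end
end
```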

So, is it possible in Elixir to communicate back from a worker to the calling process? I think I could pass the pid of the supervisor to the listing crawler when starting it, but this doesn’t seem right.
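
For illustration, the listing-crawler side of that pid-passing idea would look roughly like this (`fetch_entries/1` is just a stub):

```elixir
defmodule ListingCrawler do
  # Hypothetical sketch: the coordinator pid is passed in when the
  # crawl starts, and every found entry is reported back as a message.
  def crawl(url, coordinator) do
    url
    |> fetch_entries()
    |> Enum.each(fn entry -> send(coordinator, {:entry_found, entry}) end)
  end

  # Stub: a real implementation would fetch and parse the listing page.
  defp fetch_entries(_url), do: []
end
```

Or, since the coordinator in the sketch above is registered under a name, the crawler could send to that name instead of a raw pid; maybe that’s the cleaner option?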
