Simple crawler using GenStage (questions about the structure)

ravernkoh · September 28, 2017, 3:46pm

Hi! I intend to build a simple web crawler in Elixir, and since a web crawler is really just a pipeline of url-queue -> fetcher -> parser (with some filters and rate limiters in between) I have decided to use GenStage for this. I have already built the url queue. Now, I’m thinking of how to build the fetcher.

Obviously, the fetcher will need to perform HTTP requests in parallel, and since there is a limit to how many I should run at a time (bandwidth, right? Correct me if I’m wrong about this), I should use some sort of pooling system. I have looked at ConsumerSupervisor in the GenStage docs, but it only covers the last part of a pipeline, the consumer.

How should I go about implementing this when the fetcher stage is a producer_consumer (so I can’t use ConsumerSupervisor)?

mischov · September 28, 2017, 4:47pm

For some ideas you might check out Crawlie, which I believe uses GenStage.

Qqwy · September 28, 2017, 10:18pm

If your crawler is going to grow more sophisticated, I’d suggest looking into Event Sourcing: by persisting the results of each stage independently, you could later decide to make the parser smarter (for example, extracting more info) without having to re-crawl all previously crawled resources.

As for implementing the fetcher stage itself, I think Crawlie’s UrlManager indeed gives some good suggestions. Great tip, @mischov!
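
For what it's worth, here is a minimal sketch of one way to cap parallel requests inside a producer_consumer. It is not taken from Crawlie; it simply runs each batch of URLs through `Task.async_stream/3`. The module name `Crawler.Fetcher`, the `:producer`, `:fetch_fun` and `:max_concurrency` options, and the 10-second timeout are all assumptions made up for illustration, and `fetch_fun` stands in for whatever HTTP client call you end up using.

```elixir
# Assumes Elixir >= 1.12 (for Mix.install) and the gen_stage package.
Mix.install([{:gen_stage, "~> 1.0"}])

defmodule Crawler.Fetcher do
  # Hypothetical producer_consumer stage: receives URLs from upstream
  # and emits {url, result} tuples downstream.
  use GenStage

  def start_link(opts) do
    GenStage.start_link(__MODULE__, opts)
  end

  @impl true
  def init(opts) do
    state = %{
      # fetch_fun is a one-argument function (url -> response); hypothetical option.
      fetch_fun: Keyword.fetch!(opts, :fetch_fun),
      max_concurrency: Keyword.get(opts, :max_concurrency, 10)
    }

    # max_demand caps how many URLs this stage asks the queue for at once.
    subscription = {Keyword.fetch!(opts, :producer), max_demand: state.max_concurrency}
    {:producer_consumer, state, subscribe_to: [subscription]}
  end

  @impl true
  def handle_events(urls, _from, state) do
    results =
      urls
      # At most max_concurrency fetches run at the same time; results keep input order.
      |> Task.async_stream(state.fetch_fun,
        max_concurrency: state.max_concurrency,
        timeout: 10_000,
        on_timeout: :kill_task
      )
      |> Enum.zip(urls)
      |> Enum.map(fn
        {{:ok, response}, url} -> {url, {:ok, response}}
        # A fetch that exceeds the timeout is killed and reported as an error.
        {{:exit, reason}, url} -> {url, {:error, reason}}
      end)

    {:noreply, results, state}
  end
end
```

The trade-off is that `handle_events/3` blocks until the whole batch has finished, so one slow URL holds up the rest of that batch (up to the timeout). Also note that the tasks are linked to the stage, so a `fetch_fun` that raises will take the stage down with it unless you rescue inside the function. If that becomes a problem, a pool of several fetcher stages subscribed to the same queue might be a better fit.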