Hi! I intend to build a simple web crawler in Elixir. Since a web crawler is really just a pipeline of url-queue -> fetcher -> parser (with some filters and rate limiters in between), I have decided to use GenStage for this. I have already built the URL queue, and now I'm thinking about how to build the fetcher.
Obviously, the fetcher will need to perform HTTP requests in parallel, and since there is a limit to how many I should run at a time (because of bandwidth, right? Correct me if I'm wrong about this), I should use some sort of pooling system. I have looked at ConsumerSupervisor in the GenStage docs, but it only covers the last part of the pipeline, the consumer.
How should I go about implementing this when the fetcher stage is a producer_consumer (so I can't use ConsumerSupervisor)?