Gollum - Robots.txt parser and cache

Hi everyone,

I was building a crawler (for fun; school gets boring) when I realised I needed a robust robots.txt parser and cache. I looked around and found that the existing Elixir robots.txt parsers are all unmaintained. So I built Gollum.

It’s my first Hex.pm package (actually, it’s my first package anywhere), so I was hoping for some feedback on anything: the code, the README, the docs, literally anything. Criticism is welcome.

Thanks :slight_smile:!

Additional stuff:
The thing I’m most proud of with Gollum is the concurrent fetching feature. Gollum uses a central cache (a GenServer) to store all the robots.txt data. Let’s say 2 requests come in at the same time to fetch the robots.txt files for google.com and facebook.com. They will be fetched concurrently in their own processes, completely independent of one another. The cache also buffers the processes that are waiting for the data. So if 5 requests for google.com’s data come in one after the other, they will all wait on the same GET request instead of making 5 separate GET requests.
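In case it helps picture the buffering, here's a minimal, hypothetical sketch of that coalescing pattern in plain OTP. This is not Gollum's actual code; the module name `RobotsCache` and the `fetch_robots/1` placeholder are made up for illustration. The idea is that the first caller for a host kicks off an async fetch, and every later caller for that same host is parked and replied to when the single fetch finishes.

```elixir
defmodule RobotsCache do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, %{}, opts)

  # Blocks until the robots.txt data for `host` is available.
  def get(server, host), do: GenServer.call(server, {:get, host}, 30_000)

  @impl true
  def init(_), do: {:ok, %{cache: %{}, pending: %{}}}

  @impl true
  def handle_call({:get, host}, from, %{cache: cache, pending: pending} = state) do
    cond do
      # Already cached: reply immediately.
      Map.has_key?(cache, host) ->
        {:reply, Map.fetch!(cache, host), state}

      # A fetch for this host is already in flight: park the caller.
      Map.has_key?(pending, host) ->
        {:noreply, %{state | pending: Map.update!(pending, host, &[from | &1])}}

      # First request for this host: fetch concurrently in its own process.
      true ->
        server = self()

        Task.start(fn ->
          send(server, {:fetched, host, fetch_robots(host)})
        end)

        {:noreply, %{state | pending: Map.put(pending, host, [from])}}
    end
  end

  @impl true
  def handle_info({:fetched, host, data}, %{cache: cache, pending: pending} = state) do
    # Reply to every caller that was waiting on this host's fetch.
    pending |> Map.get(host, []) |> Enum.each(&GenServer.reply(&1, data))

    {:noreply,
     %{state | cache: Map.put(cache, host, data), pending: Map.delete(pending, host)}}
  end

  # Placeholder: a real implementation would HTTP GET /robots.txt and parse it.
  defp fetch_robots(host), do: {:ok, "robots.txt for #{host}"}
end
```

With this shape, five concurrent `RobotsCache.get(server, "google.com")` calls end up waiting on the one in-flight fetch, which is the behaviour described above.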

7 Likes

Heh, that could be quite useful for the scrapers around here. ^.^

3 Likes