Hi everyone,
I was building a crawler (for fun; school gets boring) when I realised I needed a robust robots.txt parser and cache. I looked around and found that the existing Elixir robots.txt parsers are all unmaintained, so I built Gollum.
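For a quick taste, usage looks roughly like this (paraphrasing the README from memory, so check the docs for the exact names and return values):

```elixir
# Ask whether a given user agent may crawl a URL; Gollum fetches and
# caches the relevant robots.txt behind the scenes.
Gollum.crawlable?("MyBot", "https://google.com/search")
# => :crawlable (or :uncrawlable / :undetermined)
```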
It’s my first Hex.pm package (actually, it’s my first package anywhere), so I was hoping for some feedback on anything: the code, the README, the docs, literally anything. Criticism is welcome.
Thanks!
Additional stuff:
The thing I’m most proud of in Gollum is the concurrent fetching. Gollum uses a central cache (a GenServer) to store all the robots.txt data. Let’s say 2 requests come in at the same time to fetch the robots.txt files from google.com and facebook.com. They will be fetched concurrently, each in its own process, completely independent of one another. The cache also keeps a buffer of the processes that are waiting for data, so if 5 requests come in one after another for the data from google.com, they will all wait on the same GET request instead of triggering 5 separate GET requests.
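To make the buffering idea concrete, here’s a minimal GenServer sketch of the pattern (not the actual Gollum code, and `fetch_robots/1` just stands in for the real HTTP request):

```elixir
defmodule RobotsCache do
  @moduledoc """
  Illustrative sketch of the fetch-deduplication pattern; this is
  not Gollum's actual source. One GenServer holds the cache plus a
  map of in-flight fetches, and callers asking for a host whose
  fetch is already running are parked and replied to together.
  """
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, %{}, opts)

  # Blocks the caller until the robots.txt data for `host` is available.
  def get(pid, host), do: GenServer.call(pid, {:get, host}, 30_000)

  @impl true
  def init(_), do: {:ok, %{cache: %{}, waiting: %{}}}

  @impl true
  def handle_call({:get, host}, from, state) do
    cond do
      # Cache hit: reply immediately.
      Map.has_key?(state.cache, host) ->
        {:reply, {:ok, state.cache[host]}, state}

      # A fetch for this host is already in flight: park the caller.
      Map.has_key?(state.waiting, host) ->
        {:noreply, update_in(state.waiting[host], &[from | &1])}

      # Cache miss: start one fetch in its own process, park the caller.
      true ->
        server = self()

        Task.start(fn ->
          send(server, {:fetched, host, fetch_robots(host)})
        end)

        {:noreply, put_in(state.waiting[host], [from])}
    end
  end

  @impl true
  def handle_info({:fetched, host, data}, state) do
    # Reply to every caller that was waiting on this host.
    {froms, waiting} = Map.pop(state.waiting, host, [])
    Enum.each(froms, &GenServer.reply(&1, {:ok, data}))
    {:noreply, %{state | cache: Map.put(state.cache, host, data), waiting: waiting}}
  end

  # Placeholder; a real version would GET "https://#{host}/robots.txt" and parse it.
  defp fetch_robots(_host), do: %{rules: []}
end
```

The trick is returning `{:noreply, ...}` from `handle_call/3` and answering the parked callers later with `GenServer.reply/2`, so the server never blocks on a fetch and requests for different hosts proceed in parallel.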