I was building a crawler for fun (school gets boring) when I realised I needed a robust robots.txt parser and cache. I looked around and found that the existing Elixir robots.txt parsers are all unmaintained, so I built Gollum.
It’s my first Hex.pm package (actually my first package anywhere), so I was hoping for feedback on anything: the code, the README, the docs, literally anything. Criticism is welcome.
The thing I’m most proud of in Gollum is the concurrent fetching.
Gollum uses a central cache (a GenServer) to store all the robots.txt data. Say two requests come in at the same time for the robots.txt files of two different hosts, e.g. facebook.com and google.com: each is fetched concurrently in its own process, completely independent of the other. The cache also builds up a buffer of the processes waiting on each host’s data, so if 5 requests for google.com come in one after the other, they all wait on the same GET request instead of making 5 separate GET requests.
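To make the idea concrete, here's a minimal sketch of that caching/deduplication pattern. This is not Gollum's actual code or API: the module name `RobotsCache`, the `fetch_fun` callback (standing in for the real HTTP GET), and all function names are hypothetical. The key trick is that `handle_call` returns `{:noreply, ...}` while a fetch is in flight, buffering callers in a `pending` map, and `GenServer.reply/2` answers all of them once the single fetch completes.

```elixir
defmodule RobotsCache do
  use GenServer

  # Hypothetical sketch of a deduplicating robots.txt cache.
  # `fetch_fun` is an assumption: a 1-arity function doing the real GET.

  def start_link(fetch_fun) do
    GenServer.start_link(__MODULE__, fetch_fun, name: __MODULE__)
  end

  # Callers block here until the robots.txt body for `host` is available.
  def get(host), do: GenServer.call(__MODULE__, {:get, host}, 10_000)

  @impl true
  def init(fetch_fun), do: {:ok, %{fetch_fun: fetch_fun, cache: %{}, pending: %{}}}

  @impl true
  def handle_call({:get, host}, from, state) do
    cond do
      # Already cached: reply immediately.
      Map.has_key?(state.cache, host) ->
        {:reply, state.cache[host], state}

      # A fetch for this host is already in flight: buffer the caller.
      Map.has_key?(state.pending, host) ->
        {:noreply, update_in(state.pending[host], &[from | &1])}

      # First request for this host: spawn one fetch, start the buffer.
      true ->
        server = self()
        fun = state.fetch_fun
        Task.start(fn -> send(server, {:fetched, host, fun.(host)}) end)
        {:noreply, put_in(state.pending[host], [from])}
    end
  end

  @impl true
  def handle_info({:fetched, host, body}, state) do
    # One GET finished: answer every process that was waiting on it.
    Enum.each(state.pending[host], &GenServer.reply(&1, body))

    {:noreply,
     %{state | cache: Map.put(state.cache, host, body),
               pending: Map.delete(state.pending, host)}}
  end
end
```

With a slow `fetch_fun` that counts its own invocations, spawning five concurrent `RobotsCache.get("google.com")` calls yields five identical results but only one recorded fetch, since calls serialize through the GenServer mailbox and all but the first land in the pending buffer.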