ReqCrawl - Req plugins to support common crawling functions

Hi everyone,

I recently made a library of Req plugins for some web-crawl-adjacent stuff I’m working on and figured I would share it.

The library is called ReqCrawl and is a place for Req plugins related to common crawling functions.

Right now it supports parsing robots.txt and sitemaps, but there’s potential to add more plugins for other features.

Plugins

ReqCrawl.Robots

A Req plugin to parse robots.txt files

You can attach this plugin to any %Req.Request{} you use for a crawler, and it will only run against
URLs whose path is /robots.txt.

It outputs a map with the following fields:

  • :errors - A list of any errors encountered during parsing
  • :sitemaps - A list of the sitemaps
  • :rules - A map of the rules, keyed by User-Agent, where each value is a map with the following fields:
    • :allow - A list of allowed paths
    • :disallow - A list of the disallowed paths
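
Roughly, usage looks like the sketch below. Treat the attach/1 call and the :crawl_robots private key as illustrative assumptions based on the usual Req plugin conventions rather than exact names:

    # Sketch: attach/1 and the :crawl_robots key are assumed names
    req =
      Req.new(base_url: "https://example.com")
      |> ReqCrawl.Robots.attach()

    resp = Req.get!(req, url: "/robots.txt")

    # The parsed map described above, e.g.:
    # %{
    #   errors: [],
    #   sitemaps: ["https://example.com/sitemap.xml"],
    #   rules: %{"*" => %{allow: ["/"], disallow: ["/private/"]}}
    # }
    resp.private[:crawl_robots]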

ReqCrawl.Sitemap

Gathers all URLs from a Sitemap or SitemapIndex according to the specification described
at https://sitemaps.org/protocol.html

Supports the following formats:

  • .xml (for sitemap and sitemapindex)
  • .txt (for sitemap)

Outputs a 2-tuple of {type, urls}, where type is either :sitemap or :sitemapindex and urls is a list
of URL strings extracted from the body.

The output is stored in the %Req.Response{} private map under the :crawl_sitemap key.
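
For example (a sketch only; the attach/1 call is assumed from the usual Req plugin convention, and the example URL is made up):

    # Sketch: attach/1 is an assumed entry point; :crawl_sitemap is the
    # private key described above
    req = Req.new() |> ReqCrawl.Sitemap.attach()

    resp = Req.get!(req, url: "https://example.com/sitemap.xml")

    case resp.private[:crawl_sitemap] do
      {:sitemap, urls} ->
        # A plain sitemap: the URLs are page links, ready to enqueue for crawling
        urls

      {:sitemapindex, sitemap_urls} ->
        # A sitemap index: the URLs point at further sitemap files to fetch
        sitemap_urls
    end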
