Hi everyone,
I recently made a library of Req plugins for some web-crawl-adjacent stuff I’m working on and figured I would share it.
The library is called ReqCrawl and is a place for Req plugins related to common crawling functions. Right now it supports parsing robots.txt files and sitemaps, but there's potential to add more plugins for other features.
Plugins
ReqCrawl.Robots
A Req plugin to parse robots.txt files. You can attach this plugin to any %Req.Request you use for a crawler, and it will only run against URLs with a path of /robots.txt.
It outputs a map with the following fields:

* `:errors` - A list of any errors encountered during parsing
* `:sitemaps` - A list of the sitemaps
* `:rules` - A map of the rules, with User-Agents as the keys and maps with the following fields as the values:
  * `:allow` - A list of the allowed paths
  * `:disallow` - A list of the disallowed paths
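To make the shape concrete, here is an illustrative sketch of a map matching the fields described above (the values are made up for the example, not taken from the library's docs):

```elixir
# Made-up example of the described output shape for ReqCrawl.Robots:
# :errors, :sitemaps, and :rules keyed by User-Agent.
parsed = %{
  errors: [],
  sitemaps: ["https://example.com/sitemap.xml"],
  rules: %{
    "*" => %{allow: ["/"], disallow: ["/admin"]}
  }
}

# Looking up the rules for a given User-Agent:
%{allow: allow, disallow: disallow} = parsed.rules["*"]
IO.inspect(disallow) # => ["/admin"]
```

A crawler could pattern-match on `parsed.rules` like this before fetching a path, falling back to the `"*"` entry when no agent-specific rules exist.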
ReqCrawl.Sitemap
Gathers all URLs from a Sitemap or SitemapIndex according to the specification described at https://sitemaps.org/protocol.html.

Supports the following formats:

* `.xml` (for `sitemap` and `sitemapindex`)
* `.txt` (for `sitemap`)
Outputs a 2-tuple of `{type, urls}`, where `type` is one of `:sitemap` or `:sitemapindex` and `urls` is a list of URL strings extracted from the body. The output is stored in the `%Req.Response{}` private field under the `:crawl_sitemap` key.
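As a sketch of consuming that output (a plain map stands in for the `%Req.Response{}` struct here, and the URLs are made up):

```elixir
# Illustrative only: a plain map mimicking a response whose :private map
# holds the plugin's result under the :crawl_sitemap key.
response = %{
  private: %{
    crawl_sitemap: {:sitemap, ["https://example.com/a", "https://example.com/b"]}
  }
}

# Destructure the described 2-tuple: type is :sitemap or :sitemapindex,
# urls is the list of extracted URL strings.
{type, urls} = response.private[:crawl_sitemap]
```

When `type` is `:sitemapindex`, each entry in `urls` points at another sitemap, so a crawler would typically feed those back through the same request pipeline.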