I have worked a lot with Common Crawl data in the past, and over time I extracted more and more of that functionality into this library. It provides both high-level convenience functions and lower-level building blocks.
If you are new to Common Crawl, the convenience functions are an easy way to start:
CommonCrawl.get_latest_for_url
gets the latest crawl for a given URL.
iex> CommonCrawl.get_latest_for_url("https://example.com")
{:ok, %{response: _, headers: _, warc: _}}
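In a real application you would typically pattern match on the result. A minimal sketch — the exact contents of the keys and the `{:error, reason}` failure shape are my assumptions, so check the docs for details:

```elixir
case CommonCrawl.get_latest_for_url("https://example.com") do
  {:ok, %{response: response, headers: headers, warc: warc}} ->
    # Work with the archived response, its headers,
    # and the WARC metadata (key contents assumed here).
    IO.inspect(warc)

  {:error, reason} ->
    # Assumed failure shape; see the docs for the exact error tuple.
    IO.inspect(reason)
end
```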
CommonCrawl.Index.stream
streams all metadata from the index of a given crawl for further processing.
iex> crawl_id = "CC-MAIN-2024-51"
iex> CommonCrawl.Index.stream(crawl_id) |> Enum.take(2)
[
  {"0,100,22,165)/", 20241209080420,
   %{
     "filename" => "crawl-data/CC-MAIN-2024-51/segments/1733066461338.94/crawldiagnostics/CC-MAIN-20241209055102-20241209085102-00443.warc.gz",
     "length" => "686",
     "mime" => "text/html",
     "mime-detected" => "text/html",
     "offset" => "887",
     "redirect" => "https://157.245.55.71/",
     "status" => "301",
     "url" => "http://165.22.100.0/"
   }}, ...]
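Since this is a regular stream, you can compose it with `Stream` and `Enum` as usual. For example, collecting the URLs of a few successful HTML responses — a sketch that assumes only the three-element tuple shape and metadata keys shown above:

```elixir
# Crawl ID taken from the example above.
"CC-MAIN-2024-51"
|> CommonCrawl.Index.stream()
|> Stream.filter(fn {_key, _timestamp, meta} ->
  # Keep only 200 responses detected as HTML.
  meta["status"] == "200" and meta["mime-detected"] == "text/html"
end)
|> Stream.map(fn {_key, _timestamp, meta} -> meta["url"] end)
|> Enum.take(10)
```

Because streams are lazy, only as much of the index is fetched as is needed to produce the ten matches.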
Everything is covered in the documentation. Have fun!
Hex: https://hex.pm/packages/common_crawl
GitHub: https://github.com/preciz/common_crawl
If some requests don’t seem to work, a VPN might help. I have no information on why the servers have recently become so eager to block requests.