CommonCrawl - work with Common Crawl data

I have worked a lot with Common Crawl data in the past.
Over that time I extracted more and more functionality into this library.
It offers both high-level convenience functions and lower-level building blocks.

If you are new to Common Crawl, the convenience functions are the easiest way to get started:

CommonCrawl.get_latest_for_url gets the latest crawl for a given URL.

iex> CommonCrawl.get_latest_for_url("https://example.com")
{:ok, %{response: _, headers: _, warc: _}}

CommonCrawl.Index.stream streams all metadata from the index for further processing.

iex> crawl_id = "CC-MAIN-2024-51"
iex> CommonCrawl.Index.stream(crawl_id) |> Enum.take(2)
[
  {"0,100,22,165)/", 20241209080420,
   %{
     "filename" => "crawl-data/CC-MAIN-2024-51/segments/1733066461338.94/crawldiagnostics/CC-MAIN-20241209055102-20241209085102-00443.warc.gz",
     "length" => "686",
     "mime" => "text/html",
     "mime-detected" => "text/html",
     "offset" => "887",
     "redirect" => "https://157.245.55.71/",
     "status" => "301",
     "url" => "http://165.22.100.0/"
   }}, ...]
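
As a rough sketch of what "further processing" could look like: each index entry carries the WARC file name plus the byte offset and length of the record, which is enough to build an HTTP Range request for just that record. The module below is a hypothetical helper and not part of the library; it only assumes the `{key, timestamp, metadata}` tuple shape shown above and that the files are served from https://data.commoncrawl.org/.

```elixir
defmodule WarcRangeSketch do
  # Hypothetical helper module for illustration; not part of common_crawl.

  # Assumed public download host for Common Crawl files.
  @base_url "https://data.commoncrawl.org/"

  # Takes one index entry in the {key, timestamp, metadata} shape and returns
  # the file URL and the HTTP Range header value covering that single record.
  def locate({_key, _timestamp, %{"filename" => filename, "offset" => offset, "length" => len}}) do
    first = String.to_integer(offset)
    # HTTP byte ranges are inclusive, so the last byte is offset + length - 1.
    last = first + String.to_integer(len) - 1
    %{url: @base_url <> filename, range: "bytes=#{first}-#{last}"}
  end

  # Lazily keeps only the entries whose "status" field equals the given string.
  def with_status(entries, status) do
    Stream.filter(entries, fn {_key, _timestamp, meta} -> meta["status"] == status end)
  end
end

# The first entry from the output above, used as sample input.
entry =
  {"0,100,22,165)/", 20241209080420,
   %{
     "filename" => "crawl-data/CC-MAIN-2024-51/segments/1733066461338.94/crawldiagnostics/CC-MAIN-20241209055102-20241209085102-00443.warc.gz",
     "length" => "686",
     "offset" => "887",
     "status" => "301",
     "url" => "http://165.22.100.0/"
   }}

located = WarcRangeSketch.locate(entry)
IO.puts(located.range)
# prints "bytes=887-1572"
```

Because `with_status/2` returns a lazy stream, it can sit directly behind `CommonCrawl.Index.stream(crawl_id)` without loading the whole index into memory.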

The full API is covered in the documentation. Have fun!
Hex: https://hex.pm/packages/common_crawl
GitHub: https://github.com/preciz/common_crawl

If even ordinary requests don’t seem to work, a VPN might help. I don’t know why Common Crawl has recently become so eager to block requests.
