Many people crawl the web, but few really want to talk about it. I feel there is not enough knowledge sharing on this topic, so I want to share my experiences from the past decade of crawling at scale.
We will look at how I used Elixir to orchestrate a pool of distributed, dynamic headless crawler nodes and go over the things I got wrong, how I resolved them, and more.
Even if you have no interest in crawling the web, you may find this useful: over the years I’ve learned that crawling the web in a resilient manner has a lot in common with large-scale data integration with third-party APIs.
General familiarity with OTP, GenStage, distributed systems, headless browser APIs, and Amazon Web Services is a plus.