Hello,
I'm working on a project that needs to scrape web pages from time to time, and some of them require a “visit from a browser” to prevent scraping. For example, many websites use Cloudflare protection:
```elixir
iex(19)> Req.get("https://www.cell.com/chem/fulltext/S2451-9294(24)00494-7")
{:ok,
 %Req.Response{
   status: 403,
   ...
```
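(As an aside, before reaching for a full browser it can be worth checking whether the site only inspects request headers. The sketch below resends the request with a browser-like User-Agent via Req's `headers` option; the User-Agent string is just an example, and this will not get past Cloudflare's actual JavaScript challenge, which still returns 403.)

```elixir
# Retry with browser-like headers. This only helps when the block is a
# simple header check; a real JS challenge will still respond with 403.
headers = [
  {"user-agent",
   "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"},
  {"accept", "text/html,application/xhtml+xml"}
]

{:ok, resp} =
  Req.get("https://www.cell.com/chem/fulltext/S2451-9294(24)00494-7", headers: headers)

resp.status
```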
What’s the best way to solve this task with Elixir in 2024?
I can think of three possible ways:
- Use Wallaby (https://github.com/elixir-wallaby/wallaby)
- Use Crawly (https://github.com/elixir-crawly/crawly), a high-level web crawling & scraping framework for Elixir, which seems to be a bit of an overkill for my use case
- Use chrome-remote-interface (https://github.com/andrewvy/chrome-remote-interface), an Elixir client for the Chrome Debugger Protocol (doesn't seem to be maintained anymore)
It looks like Wallaby is my best bet, but I'm curious whether people have any comments or other suggestions here?
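For concreteness, here is roughly what the Wallaby route would look like. This is an untested sketch, not a verified solution: it assumes `chromedriver` is installed locally, that Wallaby is configured with the Chrome driver, and `Scraper.fetch_rendered_html/1` is a hypothetical helper name of my own.

```elixir
# mix.exs:    {:wallaby, "~> 0.30", runtime: false}
# config.exs: config :wallaby, driver: Wallaby.Chrome

defmodule Scraper do
  # Hypothetical helper: start a headless Chrome session, load the page
  # (letting any in-browser challenge run), and return the rendered HTML.
  def fetch_rendered_html(url) do
    {:ok, _} = Application.ensure_all_started(:wallaby)
    {:ok, session} = Wallaby.start_session()

    html =
      session
      |> Wallaby.Browser.visit(url)
      |> Wallaby.Browser.page_source()

    Wallaby.end_session(session)
    html
  end
end
```

Whether this actually clears Cloudflare's check presumably depends on the site; headless Chrome is itself sometimes fingerprinted and blocked.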