I’ve been doing some large-scale crawling/scraping with Elixir for almost a year now, and I can say it’s been a perfect fit. See some info here - Web scraping tools or look up other threads of mine for more details. I’m doing 100% headless-browser crawling via ports in a distributed system, due to the scalability requirements I have.
We’ve never had performance issues with Floki parsing or anything around that aspect. It’s one of the lightest parts of the system for me, honestly. It’s nice to have the option to move to the more performant libraries, but I wouldn’t sweat that if you’re just getting started.
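If you haven’t used Floki yet, here’s a minimal sketch of what that parsing step looks like (the HTML and the version constraint are just illustrative):

```elixir
# Minimal Floki sketch: parse an HTML document and extract link hrefs.
# Assumes the :floki hex package; the version below is illustrative.
Mix.install([{:floki, "~> 0.36"}])

html = """
<html><body>
  <a href="/one">First</a>
  <a href="/two">Second</a>
</body></html>
"""

# Floki.parse_document/1 returns {:ok, parsed_tree} on success.
{:ok, doc} = Floki.parse_document(html)

# Select all <a> elements, then pull out their href attributes.
links = doc |> Floki.find("a") |> Floki.attribute("href")
IO.inspect(links)
```

Even on large pages this kind of parse-and-select work is cheap compared to the fetch itself, which is why it’s never been a bottleneck for me.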
At minimum, I do 60k crawls an hour now. That will be expanded to something close to 300k/hour minimum within the next couple of months. My top-end volume hasn’t been hit yet, but it’s not unreasonable for me to do 4-5 times my normal rate if I pop up some spot-instance nodes. It all depends on how many nodes I add to my system.