Scraping a JS-heavy website

Zesky665 · July 17, 2018, 10:12pm

I want to develop a web scraper that can read HTML generated client-side by JS. From what I’ve read, I’m not going to be able to do this with just a regular HTML parser like Floki. Which libraries can I use to get the HTML of a webpage after it has been generated client-side?

dimitarvp · July 17, 2018, 10:30pm

See this thread. It has a lot of useful info.

You might also take a look at:

GitHub - fredwu/crawler: A high performance web crawler / scraper in Elixir.
GitHub - nietaki/crawlie: A simple Elixir library for writing decently-performing crawlers with minimum effort.

mischov · July 18, 2018, 12:41am

The HTML-parser portion of the scraper doesn’t need to be any different to handle HTML generated by JS; after all, it’s still just HTML. Meeseeks or Floki will work fine.

What does need to be different is how you fetch the HTML. A plain HTTP client like HTTPoison only gives you the markup the server sends and never runs the page’s scripts, so you need something that drives something browser-ish and lets the JS evaluate before you read the HTML. Hound and Wallaby are common suggestions.
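
To make the distinction concrete, here is a minimal sketch. It is in Python with only the standard library rather than Elixir, so it runs without a browser or any dependencies. Both HTML strings are made up for illustration: `RAW_HTML` stands in for what an HTTP client would receive, and `RENDERED_HTML` stands in for the page source a browser-driving tool would hand back once the script has run. The parsing code is the same for both; only the input differs.

```python
from html.parser import HTMLParser

# Hypothetical page as the server sends it: the list is empty until JS runs.
RAW_HTML = """
<html><body>
  <ul id="items"></ul>
  <script>/* fills #items on the client */</script>
</body></html>
"""

# Hypothetical serialization of the same page after a browser executed the script.
RENDERED_HTML = """
<html><body>
  <ul id="items"><li>alpha</li><li>beta</li></ul>
  <script>/* fills #items on the client */</script>
</body></html>
"""


class ListItemCollector(HTMLParser):
    """Collects the text content of every <li> element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_li = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_li = False

    def handle_data(self, data):
        # Only keep text that appears inside an <li>.
        if self._in_li:
            self.items[-1] += data


def extract_items(html):
    """Run the same parser over whatever HTML string it is given."""
    parser = ListItemCollector()
    parser.feed(html)
    parser.close()
    return [item.strip() for item in parser.items]


print(extract_items(RAW_HTML))       # []
print(extract_items(RENDERED_HTML))  # ['alpha', 'beta']
```

The parser finds nothing in the raw response because the content simply isn’t there yet, which is why the fix belongs in the fetching step and not in the parser.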