Web scraping tools

@_russellb - I’ve been driving NightmareJS from Elixir over ports (using Porcelain). I’ve been scraping at a fairly large scale with Elixir since last summer. There are more details in some previous threads of mine if you search.
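For reference, the port side can be as simple as shelling out to node and capturing stdout. This is just a minimal sketch, assuming a hypothetical `scrape.js` Nightmare script that renders a URL and prints the final HTML:

```elixir
defmodule Scraper.Browser do
  # Minimal sketch: run a (hypothetical) Nightmare script through
  # Porcelain and treat whatever it prints to stdout as the result.
  def render(url) do
    case Porcelain.exec("node", ["scrape.js", url], out: :string) do
      %Porcelain.Result{status: 0, out: html} -> {:ok, html}
      %Porcelain.Result{status: status} -> {:error, {:exit_status, status}}
    end
  end
end
```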

The only downside of Nightmare, or any JS-capable solution, is the CPU/memory it needs. If you only need to fetch plain HTML, HTTPoison (or swap in HTTPotion, raw hackney, etc.) works fine in my experience; there’s a minimal sketch below. Unfortunately, I don’t have the HTML-only luxury, as I’m dealing with single-page apps where JS is required.

As for the heavy resource load of headless browsing: I have a distributed Elixir/Erlang setup with dedicated scraping nodes that I can bring up dynamically and dispatch work to from a master node. Everything runs headless on Linux boxes (some in our own datacenter and some on AWS spot instances).
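On the HTML-only path, a fetch can be as small as this (module name and options are my own choices, nothing canonical):

```elixir
defmodule Scraper.HTTP do
  # Minimal sketch of the HTML-only path with HTTPoison.
  def fetch(url) do
    case HTTPoison.get(url, [], follow_redirect: true, recv_timeout: 15_000) do
      {:ok, %HTTPoison.Response{status_code: 200, body: body}} -> {:ok, body}
      {:ok, %HTTPoison.Response{status_code: code}} -> {:error, {:status, code}}
      {:error, %HTTPoison.Error{reason: reason}} -> {:error, reason}
    end
  end
end
```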

For parsing the results, I use Floki, like most people do. Speed has never been an issue for me here, so I haven’t bothered with the other parsers out there.
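For example, pulling links out of a page is only a few lines (newer Floki versions want an explicit parse step first; the CSS selector here is made up):

```elixir
# Parse the raw HTML once, then query it with CSS selectors.
{:ok, document} = Floki.parse_document(html)

document
|> Floki.find("a.result-link")
|> Enum.map(fn anchor ->
  %{
    href: anchor |> Floki.attribute("href") |> List.first(),
    text: anchor |> Floki.text() |> String.trim()
  }
end)
```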

FWIW, we came from a lot of legacy Ruby/Mechanize code. The Elixir stuff is much, much easier to maintain and far more stable. Nightmare does sometimes act up and fail to release resources, but I handle that with Elixir-based cleanup tasks that run every few minutes and kill any zombie processes that get missed. Porcelain helps a lot with the port logic, but it isn’t 100%. I also keep all crawling off my main logic node, so I can afford to lose a crawl and just bounce a node without sweating it. I’d recommend something similar if you need large-scale throughput: I simply assume a job was lost if I don’t get a reply back from a crawling node within five minutes, and send it to another node (sketched below).
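The dispatch-with-timeout logic is nothing fancy. Here’s a rough sketch of the idea using `Task.Supervisor` across nodes; the supervisor and worker names are hypothetical, not my actual code:

```elixir
defmodule Scraper.Dispatcher do
  # Give each crawling node five minutes to reply before assuming
  # the job is lost and handing it to the next node.
  @reply_timeout :timer.minutes(5)

  def dispatch(url, [node | fallback_nodes]) do
    # Run the crawl under a Task.Supervisor registered on the remote
    # node (Scraper.Worker.crawl/1 is a hypothetical worker function).
    task =
      Task.Supervisor.async_nolink(
        {Scraper.TaskSupervisor, node},
        Scraper.Worker,
        :crawl,
        [url]
      )

    case Task.yield(task, @reply_timeout) || Task.shutdown(task) do
      {:ok, result} ->
        {:ok, result}

      _timeout_or_crash ->
        # No reply in time (or the task died): try the next node.
        dispatch(url, fallback_nodes)
    end
  end

  def dispatch(_url, []), do: {:error, :all_nodes_failed}
end
```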

Good luck!
