Is Elixir well suited to crawling and web automation compared to other languages and frameworks?

Data (and big data) is one of the biggest topics in technology right now, and collecting it is a big part of the business, as well as the revenue model, of a lot of companies.

With that in mind, and trying to bring Elixir into the mainstream, the question is: is Elixir good for crawling and web automation?

Crawling

The most widely used, easiest, and most feature-rich library is Scrapy (in Python), and there are a lot of companies using it.

To do crawling/scraping well, a language needs to meet some requirements, such as:

  1. Great HTTP libraries and HTML parsers
  2. Powerful text processing capabilities
  3. Non blocking I/O
  4. Robust pre-existing crawling frameworks

To compete here and try to achieve crawling that meets these requirements, we have some early-stage libraries like Floki, scrape, and others…

We all know that Elixir is not the best at processing large amounts of text/data; some of these libraries use Rust to handle that part…
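
For reference, fetching a page and pulling out its links with HTTPoison + Floki looks roughly like this (a minimal sketch, assuming both are added as deps; the URL is just a placeholder, and `parse_document/1` is available in recent Floki versions):

```elixir
defmodule SimpleScraper do
  # Fetch a page over HTTP and return all link hrefs found in it.
  def links(url) do
    with {:ok, %HTTPoison.Response{status_code: 200, body: body}} <- HTTPoison.get(url),
         {:ok, document} <- Floki.parse_document(body) do
      document
      |> Floki.find("a")
      |> Floki.attribute("href")
    end
  end
end

# SimpleScraper.links("https://example.com")
```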

Web Automation

For this we have a library called Hound. I haven’t tested it, but I would like to hear more from its users.
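
For what it’s worth, driving a browser with Hound looks roughly like this (a sketch using its helper functions; it assumes a WebDriver server such as ChromeDriver or PhantomJS is running and configured, and the module and function names here are made up):

```elixir
defmodule BrowserCheck do
  use Hound.Helpers

  # Drives a real browser session through Hound's WebDriver helpers.
  def title_of(url) do
    Hound.start_session()
    navigate_to(url)
    title = page_title()
    Hound.end_session()
    title
  end
end
```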

Is anybody using Elixir for crawling or web browser automation?

There is a similar thread here

Maybe it helps…

Thanks!

It’s good to get some feedback from people who use tools like these in other languages too.

Big Data and Web Scraping are very orthogonal topics. For many use cases, Elixir would be a very good fit.

@mbenatti

I’ve been doing some large-scale crawling/scraping with Elixir for almost a year now. I can say it’s been a perfect fit. See some info here - Web scraping tools - or look up other threads of mine for more details. I’m doing 100% headless browser crawling via ports in a distributed system due to the scalability requirements I have.
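
For anyone curious what the ports approach looks like in general terms (this is not his actual setup, just a hedged sketch; `headless_crawl.js` is a hypothetical Node script that prints the rendered HTML of the given URL to stdout):

```elixir
defmodule HeadlessCrawler do
  # Spawns an external headless-browser script through a port and
  # collects everything it writes to stdout until it exits.
  def crawl(url) do
    port =
      Port.open(
        {:spawn_executable, System.find_executable("node")},
        [:binary, :exit_status, args: ["headless_crawl.js", url]]
      )

    collect(port, "")
  end

  defp collect(port, acc) do
    receive do
      {^port, {:data, chunk}} -> collect(port, acc <> chunk)
      {^port, {:exit_status, 0}} -> {:ok, acc}
      {^port, {:exit_status, status}} -> {:error, status}
    end
  end
end
```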

We’ve never had performance issues with Floki parsing or anything around that aspect. It’s one of the lightest parts of the system for me, honestly. I think it’s nice to have the option to go to the more performant libraries, but I wouldn’t sweat that now if you’re just getting started.

At minimum, I do 60k crawls an hour now. That will be expanded to something close to 300k/hour minimum within the next couple of months. My top-end volume has not been hit yet, but it’s not unreasonable for me to do 4-5 times my normal rate if I spin up some spot instance nodes. It all depends on how many nodes I add to my system.

Good luck!

Thank you for sharing your experience.

Is the use of NightmareJS because Elixir/Floki can’t handle some cases?

He mentions in the linked thread that he uses Floki to parse and NightmareJS to crawl, because he can’t use HTTPoison (the pages aren’t plain HTML content only).

We’ve been using Floki to parse HTML for millions of web pages in the last few months.
Floki has been the fastest part!

Verk, for Sidekiq-compatible workers, also gives us the concurrency we want.
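
Roughly, a Verk worker is just a module with a perform function, and jobs go onto a Redis queue in the same format Sidekiq uses (a sketch with made-up module and argument names, not our actual workers):

```elixir
defmodule ParsePageWorker do
  # Verk calls perform/1 with the args the job was enqueued with.
  def perform(url) do
    # fetch the page, parse it with Floki, store the result...
    IO.puts("crawling #{url}")
  end
end

# Enqueue a job for the worker above:
Verk.enqueue(%Verk.Job{queue: :default, class: "ParsePageWorker", args: ["https://example.com"]})
```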

Respecting robots.txt was a bit of a challenge.
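
In case it helps anyone else, even without a dedicated library the basic check can be hand-rolled (a naive sketch that only reads Disallow: lines and ignores user-agent groups, Allow: rules, and wildcards):

```elixir
defmodule Robots do
  # Fetches robots.txt and returns the raw Disallow path prefixes.
  def disallowed_paths(base_url) do
    case HTTPoison.get(base_url <> "/robots.txt") do
      {:ok, %HTTPoison.Response{status_code: 200, body: body}} ->
        body
        |> String.split("\n")
        |> Enum.filter(&String.starts_with?(String.downcase(&1), "disallow:"))
        |> Enum.map(fn line ->
          line |> String.split(":", parts: 2) |> List.last() |> String.trim()
        end)

      _ ->
        []
    end
  end

  # A path is allowed if no Disallow prefix matches it.
  def allowed?(base_url, path) do
    not Enum.any?(disallowed_paths(base_url), fn prefix ->
      prefix != "" and String.starts_with?(path, prefix)
    end)
  end
end
```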

And our PG database is getting a little slow. Might have to store web pages elsewhere soon.

Otherwise it’s been a very interesting journey. We learned Elixir with this project.
