Tips on building a web scraper

I am building a web scraper that started out as a tool for just one person. As word got around, I found out that there is actually some demand for this service. I started building it to learn how to integrate OTP into a Phoenix app, and so far so good.

The scraper logs in to the target SPA as a user and monitors items on an auction board. When an item drops that fits the user's parameters, it alerts the user. Right now I am using Hound with headless PhantomJS so that the JS is rendered, but that was just a choice I made when I was planning to use this for one person.
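A minimal sketch of the "fits the user's parameters" check described above, written as pure functions so it can be tested without a browser. The module name `AuctionFilter`, the item fields (`:name`, `:price`), and the parameter fields (`:max_price`, `:keywords`) are all hypothetical; the real board will have its own shape.

```elixir
defmodule AuctionFilter do
  @moduledoc "Hypothetical sketch: decides whether a scraped item fits a user's parameters."

  # `item` and `params` are plain maps. All field names here are assumptions.
  # An item matches only if every parameter the user supplied is satisfied.
  def matches?(item, params) do
    price_ok?(item, params) and keyword_ok?(item, params)
  end

  # Returns only the items from a freshly scraped batch that the user cares about.
  def new_matches(items, params), do: Enum.filter(items, &matches?(&1, params))

  # A missing :max_price (or an item without a price) places no price constraint.
  defp price_ok?(%{price: price}, %{max_price: max}) when is_number(max), do: price <= max
  defp price_ok?(_item, _params), do: true

  # A missing or empty :keywords list places no keyword constraint.
  defp keyword_ok?(%{name: name}, %{keywords: [_ | _] = keywords}) do
    downcased = String.downcase(name)
    Enum.any?(keywords, fn kw -> String.contains?(downcased, String.downcase(kw)) end)
  end

  defp keyword_ok?(_item, _params), do: true
end
```

For example, `AuctionFilter.matches?(%{name: "Golden Sword", price: 90}, %{max_price: 100, keywords: ["sword"]})` returns `true`, while the same item with `%{max_price: 50}` returns `false`. Keeping the check separate from the Hound session means the alerting logic does not change if the browser driver does.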

What are some common pitfalls I should watch out for when making requests on behalf of several different users? I am making sure that each user sends the same User-Agent on every request, but how would I go about setting a proxy for each connection? I ask because I currently have one user that logs in and monitors the trading board, pulling in each new "item" that gets put up, but I would eventually like the bot to handle the "purchase" as well. For that, my application would need to log in as the other user and make the purchase. For that part, I currently have the user log in and my application stores their session cookie rather than their actual credentials for the site.
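One way to keep the per-user pieces mentioned above (User-Agent, proxy, stored cookie) together is a small struct per user, sketched below. Everything here is an assumption for illustration: the `ScraperProfile` module, its fields, and the `"host:port"` proxy format are hypothetical. The proxy map follows the WebDriver `proxy` capability shape (`proxyType`, `httpProxy`, `sslProxy`), and the option names in `session_opts/1` are meant to line up with what Hound's `start_session` accepts (`user_agent`, `additional_capabilities`); whether PhantomJS honors a per-session proxy should be verified against the driver in use.

```elixir
defmodule ScraperProfile do
  @moduledoc "Hypothetical sketch: per-user browser settings kept in one place."

  # :proxy is an assumed "host:port" string; :cookie is the stored session cookie.
  @enforce_keys [:user_id, :user_agent]
  defstruct [:user_id, :user_agent, :proxy, :cookie]

  # Builds a WebDriver-style proxy capability, or an empty map when no proxy is set.
  def capabilities(%__MODULE__{proxy: nil}), do: %{}

  def capabilities(%__MODULE__{proxy: proxy}) do
    %{proxy: %{proxyType: "manual", httpProxy: proxy, sslProxy: proxy}}
  end

  # Keyword list intended to be passed to the session start call, so each
  # user always gets the same User-Agent and the same proxy.
  def session_opts(%__MODULE__{} = profile) do
    [user_agent: profile.user_agent, additional_capabilities: capabilities(profile)]
  end
end
```

With a profile such as `%ScraperProfile{user_id: 1, user_agent: "UA/1.0", proxy: "10.0.0.5:8080"}`, `ScraperProfile.session_opts/1` yields one options list per user, which could be held in the state of a per-user GenServer so that monitoring and purchasing for a given user always go out with the same settings.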

What is the legality of using a bot to log in to another site? Is there a proper way to scrape a site that requires authorization? If anyone has experience scraping sites at a large scale, I would love to talk with them outside of the forums and possibly get some more advice.

As a first pointer, there was a really nice talk at ElixirConf 2019 by Adam Mokan on exactly that topic:


Coincidentally, I sent him a PM yesterday because of some of his responses on the forum about scraping. I didn't realize he had given a talk. Thanks!
