Multiple Hound Sessions, but with different IP addresses?

I am creating a web scraper/bot. Because the site that I am scraping is very JS-heavy, I chose Hound with PhantomJS as the driver.

Right now, if I have 5 people who want to use the bot, they log in and provide the username and password for their account on the site they want to scrape. Each account spawns a new process and uses Hound to bot the site.

My question is: is there any way to spawn each process with a different IP address? I think you can configure PhantomJS to run through a proxy, but is there a way to give each browser instance its own proxy?

From my experience with Watir it should be possible. I'm running 84 different PhantomJS processes on one computer without any cross-contamination between them, and that's in Ruby; I wonder how easy it would be with Elixir.
The only reason it's 84 is that on Amazon Linux the 85th instance couldn't get a port, and memory was low anyway, so I didn't try to fix it or even check what the problem was.

I originally started writing this in Ruby with Watir, but switched to Elixir because each browser represents a user botting the site, so I made each one a GenServer process and used OTP to make it fault-tolerant, something I had trouble with when I was using Rails.
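Concretely, the shape I have in mind is roughly this (just a sketch; `Bot.Session` and the `:run` message are placeholder names I'm making up, though `Hound.start_session/0` really does tie the session to the calling process):

```elixir
defmodule Bot.Session do
  use GenServer

  # One GenServer per user: a crash only takes down that user's session,
  # and a supervisor can restart it.
  def start_link(credentials) do
    GenServer.start_link(__MODULE__, credentials)
  end

  def init(credentials) do
    Hound.start_session()  # the Hound session belongs to this process
    {:ok, credentials}
  end

  def handle_cast(:run, credentials) do
    # log in with `credentials` and drive the site via Hound here
    {:noreply, credentials}
  end
end
```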

If this site takes off, we estimate the maximum number of users we'll sign up is 8-9k, with only about 1k actively using the bot concurrently at peak. That still means 1k Hound sessions being established. I'm no sysadmin: when you use a different port, are you running the traffic through a VPN or proxy of some sort, or does the port change part of the IP address? Because if it did, I wouldn't have a problem running one or two hundred sessions off each port to help lessen the traffic from my DO server.
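For context, here's my rough mental model of ports (please correct me if it's wrong): two listeners on the same box share one address and differ only in port, so the remote site would still see a single IP. An illustrative check:

```elixir
# Ports don't change the address itself: both listeners below share one IP.
{:ok, listener_a} = :gen_tcp.listen(0, [])  # port 0 = pick any free port
{:ok, listener_b} = :gen_tcp.listen(0, [])

{:ok, {ip, port_a}} = :inet.sockname(listener_a)
{:ok, {^ip, port_b}} = :inet.sockname(listener_b)  # same IP either way

IO.inspect(port_a != port_b)  # different ports, one address
```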

I have it running on an Ubuntu server on Digital Ocean, and I know the site I'm botting is aware of DO and its IP addresses.

My problem with PhantomJS was that after the 85th instance it couldn't get a new port, so it wouldn't start; you should test that before moving on. My scalability was limited by that plus memory: PhantomJS just loves memory, even for simple sites, so a 4 GB server couldn't hold more.
I used Sinatra to spread the requests across multiple servers, keeping the state in JSON and passing it back to our ELB on failures; each server could deal with any request.
To tell the truth, I got this idea from here and wouldn't have used it if I didn't know Elixir, but I had to get everything up and running in about a month. If I'd had three months, I would have done it in Elixir.
In Watir, proxy settings are defined per PhantomJS instance through the driver, so I'm guessing you'll find them in the same place in Hound.
Also, if you need loads of IPs through proxies, you should check out luminati.io; their data-center offer is not bad.
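I haven't used Hound myself, but to sketch what I mean: the `--webdriver`, `--proxy`, and `--proxy-type` flags are real PhantomJS command-line options, so you can give every instance its own port and proxy at launch. (`BotFarm.Browser` and `start_instance/2` are names I'm making up.)

```elixir
# Sketch only: one PhantomJS per user, each on its own WebDriver port
# and routed through its own proxy.
defmodule BotFarm.Browser do
  def start_instance(wd_port, proxy) do
    Port.open(
      {:spawn_executable, System.find_executable("phantomjs")},
      args: [
        "--webdriver=#{wd_port}",  # e.g. 8910, 8911, ...
        "--proxy=#{proxy}",        # e.g. "203.0.113.5:8080"
        "--proxy-type=http"
      ]
    )
  end
end

# Hound then talks to one of those ports via its app config (config.exs):
#   config :hound, driver: "phantomjs", port: 8910
```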

Thanks for the proxy site! I was about to start searching for one.

With PhantomJS, would you be able to run multiple sessions per port? For example, were you limited to 84 sessions each running through a proxy, or could you have 84 open ports that each handle multiple sessions (all with the IP of that port's proxy)?

If you anticipated running 1k sessions concurrently on a JS-heavy site, would you still choose PhantomJS, or is that too much for it? So far I've only tested with 5-10 sessions at once.

I never tried to run more than one session per instance; just thinking about all the concurrency problems that might happen would get my pants dirty, especially when running so many instances.
Come to think of it, I already have 1k concurrent PhantomJS sessions (across 12 computers), and I've tested it and everything works great, even in prod (AWS t2.medium).
I would test headless Chrome and Firefox before making a decision; there are too many unknown variables for me to give any meaningful advice.
Do some testing to understand your constraints: memory, available instances per platform, etc.
What I did was restart each PhantomJS instance after using it to make sure memory stayed clean; you might not need to do that.
I'd also recommend opening your browsers in advance. When I tried opening each one on demand, I lost 10 seconds in overall time; it took the server about a second to open each session, and that part wasn't truly concurrent.
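I did the pre-opening in Ruby, but in Elixir terms it would be something like this (just a sketch; `SessionPool` and `start_browser/0` are made-up names, with a placeholder instead of a real browser handle):

```elixir
# Pre-warmed pool: pay the ~1s browser startup cost at boot, not per request.
defmodule SessionPool do
  use GenServer

  def start_link(size) do
    GenServer.start_link(__MODULE__, size, name: __MODULE__)
  end

  def checkout, do: GenServer.call(__MODULE__, :checkout)
  def checkin(browser), do: GenServer.cast(__MODULE__, {:checkin, browser})

  def init(size) do
    # start_browser/0 stands in for however you actually launch PhantomJS
    browsers = for _ <- 1..size, do: start_browser()
    {:ok, browsers}
  end

  # An empty pool would crash the call; real code would block or grow the pool.
  def handle_call(:checkout, _from, [browser | rest]), do: {:reply, browser, rest}
  def handle_cast({:checkin, browser}, rest), do: {:noreply, [browser | rest]}

  defp start_browser, do: make_ref()  # placeholder for a real browser handle
end
```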

PS: I'm not sure what would happen with multiple sessions per instance, as I never tried it; I always had to either deal with concurrency or stick to one instance, and always in Ruby, which is not the best-case scenario. I wonder how Hound would deal with it, and whether each session uses a different port too.