Basically, I am creating a web scraper that will be sent (or will pull from a server) a list of URLs that it needs to scrape. Each of these links will be part of a certain “site”, e.g. Twitter, Reddit, etc.
What I currently have is a Super Supervisor that starts each site supervisor (e.g. Twitter, Reddit, etc.). Each site supervisor then starts a queue and a pool supervisor. The queue is a GenServer that gets sent the links it needs to add to the queue. The pool supervisor creates the worker pool (using poolboy).
At the moment, the Queue server has a handle_call callback that does this:
    def handle_call({:process}, _from, state) do
      case state do
        [] ->
          :timer.sleep(1000)
        _ ->
          tasks = Enum.map(state, fn(link) ->
            Task.async(fn ->
              :poolboy.transaction(
                :site_pool,
                fn(worker) -> Sherlock.SiteWorker.start_task(worker, link) end,
                :infinity)
            end)
          end)
          Enum.each(tasks, &Task.await/1)
      end
      {:reply, :ok, []}
    end
The worker then uses HTTPoison to fetch the website and Floki to parse it and get the data from it.
Firstly, is this a good way to go about it? Or is there a better approach?
In the end, my goal is for the queue to keep checking its state and, whenever it has a new link, ask for a free worker to process that link.
I am struggling with how I go about introducing this “constant checking” by the queue.
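To make the goal concrete, here is a minimal sketch of a queue that reacts to events instead of polling: it dispatches when a link is added and again when a job finishes, so no "constant checking" loop is needed. It is only an illustration under assumptions: the module name QueueSketch, the :max_workers and :scrape_fun options, and the {:done, link} message are all hypothetical, and a plain counter of in-flight tasks stands in for poolboy so the example runs without dependencies.

```elixir
defmodule QueueSketch do
  use GenServer

  # Hypothetical API: opts are :scrape_fun (required) and :max_workers.
  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  # Adding a link is a cast, so the caller never blocks on scraping.
  def add(queue, link), do: GenServer.cast(queue, {:add, link})

  @impl true
  def init(opts) do
    state = %{
      pending: :queue.new(),
      running: 0,
      max: Keyword.get(opts, :max_workers, 5),
      scrape_fun: Keyword.fetch!(opts, :scrape_fun)
    }

    {:ok, state}
  end

  @impl true
  def handle_cast({:add, link}, state) do
    state = %{state | pending: :queue.in(link, state.pending)}
    {:noreply, dispatch(state)}
  end

  # A finished job frees a slot, so try to start the next pending link.
  @impl true
  def handle_info({:done, _link}, state) do
    {:noreply, dispatch(%{state | running: state.running - 1})}
  end

  # All slots busy: leave the remaining links in the queue.
  defp dispatch(%{running: running, max: max} = state) when running >= max, do: state

  defp dispatch(state) do
    case :queue.out(state.pending) do
      {:empty, _} ->
        state

      {{:value, link}, rest} ->
        queue = self()
        fun = state.scrape_fun

        # Run the job outside the GenServer so the queue stays responsive.
        # The `after` clause reports back even if the job raises.
        Task.start(fn ->
          try do
            fun.(link)
          after
            send(queue, {:done, link})
          end
        end)

        dispatch(%{state | pending: rest, running: state.running + 1})
    end
  end
end
```

With something like this, `{:ok, q} = QueueSketch.start_link(scrape_fun: &IO.inspect/1, max_workers: 2)` followed by `QueueSketch.add(q, "some link")` would start the job as soon as a slot is free. In the real setup, the function passed as :scrape_fun could wrap the existing :poolboy.transaction call.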
I also keep getting timeouts with my current handle_call approach (e.g. if I add 20 links and ask it to investigate, I get a timeout error after 5 seconds for some of the links). I'm not too sure why, though: is it an HTTPoison timeout or a poolboy timeout? Here is an example of the error.
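One possible source of the 5-second limit, independent of HTTPoison and poolboy: Task.await/2 uses a default timeout of 5000 ms, and so does GenServer.call/3, so a handle_call that awaits slow tasks can hit either one. The sketch below only demonstrates the Task.await behaviour; TimeoutSketch and slow_job/1 are hypothetical names, and a 100 ms sleep stands in for a slow HTTP request.

```elixir
defmodule TimeoutSketch do
  # Stand-in for a slow scrape; the sleep replaces a real HTTP request.
  def slow_job(link) do
    Process.sleep(100)
    {:ok, link}
  end

  # Awaiting with a timeout shorter than the job makes Task.await exit.
  # The default 5000 ms timeout behaves the same way for a job slower than 5 s.
  def await_short(link) do
    task = Task.async(fn -> slow_job(link) end)

    try do
      Task.await(task, 10)
    catch
      :exit, {:timeout, _} -> :timed_out
    end
  end

  # Passing an explicit timeout (here :infinity) removes the 5000 ms limit.
  def await_all(links) do
    links
    |> Enum.map(fn link -> Task.async(fn -> slow_job(link) end) end)
    |> Enum.map(&Task.await(&1, :infinity))
  end
end
```

If this is the cause, the same reasoning would apply to whoever calls GenServer.call(queue, {:process}): that call also needs a longer timeout, or the work has to move out of the handle_call so the reply is sent immediately.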
Some help would be greatly appreciated. I am even willing to pay for someone to jump on Skype or whatever and explain what I am doing wrong or suggest a better design.