Struggling to get a worker queue/process queue working properly. Not sure if it is implemented right, design-wise

Basically, I am creating a web scraper that will be sent (or will pull from a server) a list of URLs that it needs to scrape. Each of these links will be part of a certain “site”, e.g. Twitter, Reddit, etc.

What I currently have is a super supervisor that starts each site supervisor (e.g. Twitter, Reddit, etc.). The site supervisor then starts a queue and a pool supervisor. The queue is a GenServer that gets sent the links it needs to add to the queue. The pool supervisor creates the pool of worker servers (using poolboy).

At the moment, I have a callback on the queue server that does this:

  def handle_call({:process}, _from, state) do
    case state do
      [] ->
        :timer.sleep(1000)

      _ ->
        tasks =
          Enum.map(state, fn link ->
            Task.async(fn ->
              :poolboy.transaction(
                :site_pool,
                fn worker -> Sherlock.SiteWorker.start_task(worker, link) end,
                :infinity
              )
            end)
          end)

        Enum.each(tasks, &Task.await/1)
    end

    {:reply, :ok, []}
  end

The worker then uses HTTPoison to fetch the website and Floki to parse it and get the data from it.

Firstly, is this a good way to go about it? Or is there a better approach?

In the end my goal is to have it so the queue will keep checking its state and if it has a new link in the queue, then it will ask for a free worker to process that link.

I am struggling with how I go about introducing this “constant checking” by the queue.

I also seem to keep getting timeouts with this method (e.g. if I add 20 links and ask it to investigate, I get a timeout error after 5 seconds for some of the links). I'm not too sure why, though: is it an HTTPoison timeout or a poolboy timeout? Here is an example of the error.

Some help would be greatly appreciated. I'm even willing to pay for someone to jump on Skype or whatever and explain to me what I am doing wrong or what a better design decision would be.


You have problems with timeouts. The first is that your handle_call blocks until every task it started has finished. This is bad for several reasons: while it is blocked, the server cannot handle any other message, including requests to add new links. GenServer.call has a default timeout of 5 seconds, and when it expires the calling process exits with a timeout error. You have the same problem with Task.await, which also has a default 5-second timeout and also exits the caller when it expires.
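To make the Task.await timeout concrete, here is a minimal sketch using only the standard library. The 200 ms sleep and the 50 ms timeout are made-up values chosen so the effect shows up quickly; they stand in for a slow scrape and the 5-second default.

```elixir
# A task that takes longer than the timeout we give it below.
slow =
  Task.async(fn ->
    Process.sleep(200)
    :done
  end)

# Task.await/2 defaults to 5_000 ms. Here we pass 50 ms so it expires quickly.
# On expiry the *caller* exits; we catch the exit only to inspect it.
timed_out =
  try do
    Task.await(slow, 50)
  catch
    :exit, {:timeout, _details} -> :timed_out
  end

# Passing an explicit, longer timeout (or :infinity) avoids the exit.
fast = Task.async(fn -> :done end)
finished = Task.await(fast, :infinity)

IO.inspect({timed_out, finished})
```

GenServer.call/3 takes its timeout as a third argument in the same way, so raising it there would hide the symptom, but the server would still be blocked for the whole time.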

Generally speaking it is never very good to block servers in this way if you can avoid it.
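One possible way to avoid blocking, and to get the "constant checking" you asked about, is to have the queue send itself a periodic message and hand each link off to a separate process. This is only a sketch under assumptions: the module name QueueSketch, the 100 ms tick, and the handler function are all made up for illustration. In your application the handler would presumably wrap your :poolboy.transaction call.

```elixir
defmodule QueueSketch do
  use GenServer

  # Hypothetical polling interval; tune to taste.
  @tick_ms 100

  # `handler` is a one-argument function that processes a single link.
  def start_link(handler), do: GenServer.start_link(__MODULE__, handler)

  # A cast returns immediately, so callers never wait on the queue.
  def enqueue(pid, link), do: GenServer.cast(pid, {:enqueue, link})

  @impl true
  def init(handler) do
    schedule_tick()
    {:ok, %{links: [], handler: handler}}
  end

  @impl true
  def handle_cast({:enqueue, link}, state) do
    {:noreply, %{state | links: [link | state.links]}}
  end

  @impl true
  def handle_info(:tick, %{links: links, handler: handler} = state) do
    # Start one unlinked task per queued link (oldest first) and return
    # right away, so the server stays free to accept new links.
    for link <- Enum.reverse(links) do
      Task.start(fn -> handler.(link) end)
    end

    schedule_tick()
    {:noreply, %{state | links: []}}
  end

  # The "constant checking": the server messages itself after @tick_ms.
  defp schedule_tick, do: Process.send_after(self(), :tick, @tick_ms)
end

# Usage: the handler here just reports back to the calling process.
parent = self()
{:ok, queue} = QueueSketch.start_link(fn link -> send(parent, {:scraped, link}) end)
QueueSketch.enqueue(queue, "https://example.com")

scraped =
  receive do
    {:scraped, link} -> link
  after
    1_000 -> :nothing_received
  end
```

Because the tasks are started with Task.start/1, nothing awaits them, so no await timeout is involved; the pool size is what limits how many links are scraped at once.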
