What's the best way to repeatedly and concurrently hit a website for scraping purposes?

I'm new to Elixir and have been thinking about ways to repeatedly hit a website as fast as possible while keeping count of the number of requests.

I read this: How can I schedule code to run every few hours in Elixir or Phoenix framework? - Stack Overflow

and also saw it on: GenServer — Elixir v1.16.0

So I did something like this:

defmodule Scraper.Scrape do
  use GenServer

  def start_link do
    GenServer.start_link(__MODULE__, %{count: 1})
  end

  def init(state) do
    schedule_work() # Schedule work to be performed at some point
    {:ok, state}
  end

  def handle_info(:work, state) do
    # Do the work you desire here
    IO.puts "HANDLING INFO for #{inspect(self())}"
    scrape(state)
    schedule_work() # Reschedule once more
    state = Map.update!(state, :count, &(&1 + 1)) # bump the request counter
    {:noreply, state}
  end

  defp schedule_work() do
    # Fires again after 1 ms, i.e. essentially as fast as possible
    Process.send_after(self(), :work, 1)
  end

  defp scrape(state) do
    url = "http://www.ebay.com"
    count = state[:count]

    IO.puts "HTTP get ##{count} for #{inspect(self())}"

    case HTTPoison.get(url) do
      {:ok, %HTTPoison.Response{status_code: 200, body: _body}} ->
        IO.puts "SUCCESS"

      {:ok, %HTTPoison.Response{status_code: 404}} ->
        IO.puts "Not found :("

      {:error, %HTTPoison.Error{reason: reason}} ->
        IO.puts "HTTPoison.Error = #{inspect(reason)}"
    end
  end
end

It's synchronous now, but is there a way to do this asynchronously while repeatedly hitting the same URL?

I know I can start multiple processes of this GenServer by calling:

{:ok, pid1} = Scraper.Scrape.start_link()
{:ok, pid2} = Scraper.Scrape.start_link()
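
For example, a handful of them could be started in a loop (a quick sketch, with no supervision):

pids =
  for _ <- 1..10 do
    {:ok, pid} = Scraper.Scrape.start_link()
    pid
  end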

I've been thinking about how to use Task.async for this after reading the links below, but I haven't come up with a good solution.

EATBenchmark, an Elixir Project

http://michal.muskala.eu/2015/08/06/parallel-downloads-in-elixir.html

Any help?

Thx!

You can do it with Task

https://hexdocs.pm/elixir/Task.html
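
For instance, here's a bare Task.async/Task.await round trip against the same URL (a minimal sketch; the repeat count and the 30-second timeout are arbitrary):

urls = List.duplicate("http://www.ebay.com", 10)

# Start all requests concurrently...
tasks = Enum.map(urls, fn url -> Task.async(fn -> HTTPoison.get(url) end) end)

# ...then collect the results, with a 30s timeout per task
results = Enum.map(tasks, &Task.await(&1, 30_000))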

I use this with poolboy to avoid DDoSing the external server.

e.g.:

mix.exs

[
  {:poolboy, "~> 1.5"},
  {:httpoison, "~> 0.11"},
  {:floki, "~> 0.14"}
]

Sample code

@genserver_call_timeout 1_000_000
@task_async_timeout 1_000_000

tasks =
  Enum.map(list, fn {link, filename} = _tuple ->
    Task.async(fn ->
      :poolboy.transaction(
        :worker,
        &GenServer.call(&1, {:download, link, filename}, @genserver_call_timeout),
        @task_async_timeout
      )
    end)
  end)

result = Enum.map(tasks, fn task -> Task.await(task, @task_async_timeout) end)

You need to make your own worker… and you need to complete the code…
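
To sketch what that worker might look like (assuming the pool is registered as :worker as in the snippet above; the module name Scraper.Worker and the pool sizes are my own placeholders):

defmodule Scraper.Worker do
  use GenServer

  # poolboy starts each worker through start_link/1
  def start_link(_args), do: GenServer.start_link(__MODULE__, nil)

  def init(state), do: {:ok, state}

  # Answers the {:download, link, filename} call made inside
  # :poolboy.transaction/3 above
  def handle_call({:download, link, filename}, _from, state) do
    case HTTPoison.get(link) do
      {:ok, %HTTPoison.Response{status_code: 200, body: body}} ->
        File.write!(filename, body)
        {:reply, :ok, state}

      other ->
        {:reply, {:error, other}, state}
    end
  end
end

# Pool spec, e.g. in the application's supervision tree
pool_opts = [
  name: {:local, :worker},
  worker_module: Scraper.Worker,
  size: 5,
  max_overflow: 2
]

children = [:poolboy.child_spec(:worker, pool_opts, [])]
Supervisor.start_link(children, strategy: :one_for_one)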


Hi koklegorille,

Thx! I thought about doing it this way too, but a list is finite, so I would have to loop through it repeatedly, probably using recursion.

I'm trying to see if I can put this code inside the GenServer, save the task in the state, and then on the next iteration run Task.await(task) by pulling that task back out of the state.

task =
  Task.Supervisor.async_nolink(Scraper.TaskSupervisor, fn ->
    scrape_async(state)
  end)

{_, state} = Map.get_and_update(state, :task, fn x -> {x, task} end)

Not sure if this will work but going to try it out and see what happens.
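
To flesh that out, a minimal sketch of the handle_info clause (assuming Scraper.TaskSupervisor is running under the application supervisor and scrape_async/1 does the actual request, as in the snippet above; awaiting the previous task before starting the next one is just one possible strategy):

def handle_info(:work, state) do
  # Await the task started on the previous tick, if any, so at most
  # one request per GenServer is in flight at a time
  if task = state[:task] do
    Task.await(task, :infinity)
  end

  # Start the next request without blocking this process
  task =
    Task.Supervisor.async_nolink(Scraper.TaskSupervisor, fn ->
      scrape_async(state)
    end)

  schedule_work()
  {:noreply, state |> Map.put(:task, task) |> Map.update(:count, 1, &(&1 + 1))}
end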

You are not scraping, you are DDoSing eBay in your example :slight_smile:
