Help with picking the right tools to perform a lot of concurrent HTTP requests

Hello,

I’m building a somewhat unique tool and I’m looking for guidance on how best to architect it.
Basically I need to crawl 100 web pages, collect all links (a tags) and follow each of them to find its final destination (following redirects, since some of them use short URLs, etc.) so I can check the destination domain. Each page can have ~200 links, so I potentially need to follow 20k links.
I am using Elixir 1.10.2 and Phoenix 1.5.1.

Here is what I had in mind, please let me know if there is a better/more efficient way to achieve this.

  1. Use a LiveView module as a GenServer, so that I can show progress live in the browser
  2. Use Task.async to fetch the 100 pages concurrently
  3. Catch the responses with a handle_info function. Use Floki to parse each page and find all a tags (links). Use Enum.each on the links and make the HTTP request for each of them in another Task.async. I would use Tesla with Hackney, as Tesla has a middleware to follow redirects. I would do a HEAD request, as I don’t need the body.
  4. Catch the responses with another handle_info function and check the domain. If it matches the domain I’m looking for, update the socket to show the link, otherwise just return the socket unmodified

Here is what I had in mind code-wise (not implemented, just laying out the basic idea).

defmodule Example.PageLive do
  use Example, :live_view

  def mount(_params, _session, socket) do
    {:ok, socket}
  end

  def handle_event("submit", _, socket) do
    # the initial 100 pages
    pages = ["http://example.com"]

    # Task.async each of them
    Enum.each(pages, fn page ->
      Task.async(fn ->
        # load the page and collect the href of every <a> tag
        links =
          client()
          |> Tesla.get!(page)
          |> Map.fetch!(:body)
          |> Floki.parse_document!()
          |> Floki.find("a")
          |> Floki.attribute("href")

        # the task's return value arrives in handle_info as {ref, result}
        {:page_loaded, links}
      end)
    end)

    # could show status
    {:noreply, socket}
  end

  def handle_info({ref, {:page_loaded, links}}, socket) when is_reference(ref) do
    # we only want the result, so drop the task's :DOWN message
    Process.demonitor(ref, [:flush])

    # Task.async each of the links
    Enum.each(links, fn link ->
      Task.async(fn ->
        # HEAD request; the FollowRedirects middleware leaves the final URL in env.url
        destination = Tesla.head!(client(), link).url

        # the task's return value arrives in handle_info as {ref, result}
        {:link_loaded, destination}
      end)
    end)

    # could show status
    {:noreply, socket}
  end

  def handle_info({ref, {:link_loaded, destination}}, socket) when is_reference(ref) do
    Process.demonitor(ref, [:flush])

    if URI.parse(destination).host == "example.com" do
      # do stuff if destination is the domain we're looking for
      {:noreply, socket}
    else
      # ignore otherwise
      {:noreply, socket}
    end
  end

  # Tesla client that follows redirects and uses the Hackney adapter
  defp client do
    Tesla.client([Tesla.Middleware.FollowRedirects], Tesla.Adapter.Hackney)
  end
end

Is this a sensible approach? Do you think Tesla and Hackney are good choices? I don’t have much experience with GenServers and the like. What sort of server would I need, specs-wise, to run all of this? I also read a little bit about hackney pools: do I need to do anything there? And will using a LiveView as my GenServer cause any performance issues?

And last question: Is Elixir a good choice for doing something like this?

Thanks heaps in advance!

Your general approach seems fine. I’d swap out Task.async for something better at dealing with the potential of overwhelming the system, the network, or the endpoints you find, like GenStage. I’d also be cautious about starting an exponentially growing number of Tasks concurrently without any limit. The BEAM is great at handling the processes, but timeouts might spiral out of control. I’d rather look at one queue (or a set of queues) to schedule work onto.
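As an illustration of bounding concurrency without reaching for GenStage yet, Task.async_stream with max_concurrency caps how many requests are in flight at once. A rough sketch (the URLs and the 50/15 s limits are placeholders, and the Tesla client setup is only assumed to match what was described above):

client = Tesla.client([Tesla.Middleware.FollowRedirects], Tesla.Adapter.Hackney)
links = ["https://example.com/a", "https://example.com/b"]

links
|> Task.async_stream(
  fn link ->
    # HEAD request that follows redirects; the returned env carries the final URL
    Tesla.head!(client, link).url
  end,
  max_concurrency: 50,
  timeout: 15_000,
  on_timeout: :kill_task
)
|> Enum.each(fn
  {:ok, destination} -> IO.inspect(destination, label: "resolved")
  {:exit, reason} -> IO.inspect(reason, label: "failed")
end)

GenStage (or Broadway) gives you back-pressure and finer control, but the idea is the same: keep a bounded number of requests in flight instead of one unbounded Task per link.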

If you find performance problems then I’d start looking into alternative HTTP clients.

Crawly might be a ready-made solution for you, or you can take inspiration from how they solved it.

The Tasks would be dangling without supervision.

If you don’t want to throttle the number of parallel requests to a single domain, then you should probably have one Agent that tracks progress: it can just return a tuple like {processed, total}. You need the total to be a dynamically updated counter because you can’t know all the links you’ll have to follow at the start of the work (you said each page’s contents will be scanned for more links), so as each initial page is successfully scraped you just increase the total in your Agent (or use a pair of atomic counters instead). Then have your LiveView report that progress.
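A minimal sketch of such a progress Agent (the module and function names are just illustrative):

defmodule Example.Progress do
  use Agent

  # state is {processed, total}
  def start_link(_opts), do: Agent.start_link(fn -> {0, 0} end, name: __MODULE__)

  # call when a page has been scraped and `n` new links were discovered
  def add_links(n), do: Agent.update(__MODULE__, fn {done, total} -> {done, total + n} end)

  # call when a single link has been resolved
  def link_done, do: Agent.update(__MODULE__, fn {done, total} -> {done + 1, total} end)

  # the LiveView can read this to render progress
  def progress, do: Agent.get(__MODULE__, & &1)
end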

As for the downloading / scraping tasks (not Elixir Tasks in this case), I’d go for a DynamicSupervisor since you don’t need throttling for the moment. You can just call DynamicSupervisor.start_child(...) as each link that needs work is discovered.
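A sketch of what that could look like, assuming a DynamicSupervisor named Example.CrawlSupervisor is started under your application supervisor, and with resolve_link/1 only standing in for the HEAD-request logic discussed above:

# in application.ex, among your children:
#   {DynamicSupervisor, strategy: :one_for_one, name: Example.CrawlSupervisor}

defmodule Example.Crawler do
  # start one supervised Task per discovered link and report the result back
  def resolve_later(link, reply_to) do
    DynamicSupervisor.start_child(Example.CrawlSupervisor, {Task, fn ->
      send(reply_to, {:link_loaded, resolve_link(link)})
    end})
  end

  # HEAD request following redirects, returning the final URL
  defp resolve_link(link) do
    client = Tesla.client([Tesla.Middleware.FollowRedirects], Tesla.Adapter.Hackney)
    Tesla.head!(client, link).url
  end
end

Task ships with a child_spec (restart: :temporary), so {Task, fun} can be started directly under the DynamicSupervisor.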