Hello,
I’m building a somewhat unique tool and I’m looking for guidance on how to best architect it.
Basically, I need to crawl 100 web pages, collect all links (a tags), and follow each one to its final destination (following redirects, since some of them use short URLs etc.) so I can check the destination domain. Each page can have ~200 links, so I potentially need to follow 20k links.
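To make the "check the destination domain" part concrete, here is a minimal sketch using only the standard library. The module and function names (DomainCheck, matches?/2) are made up for illustration, and it assumes an exact host match is what I want (no subdomain handling).

```elixir
defmodule DomainCheck do
  # Hypothetical helper: true when the URL's host equals the given domain.
  # Comparison is case-insensitive; URLs without a host never match.
  def matches?(url, domain) when is_binary(url) and is_binary(domain) do
    case URI.parse(url) do
      %URI{host: host} when is_binary(host) ->
        String.downcase(host) == String.downcase(domain)

      _ ->
        false
    end
  end
end
```

For example, DomainCheck.matches?("http://example.com/some/path", "example.com") would be true, while a mailto: link or a relative path has no host and would not match.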
I am using Elixir 1.10.2 and Phoenix 1.5.1.
Here is what I had in mind, please let me know if there is a better/more efficient way to achieve this.
- Use a LiveView module as the GenServer, so I can show progress live in the browser
- Use Task.async to fetch the 100 pages concurrently
- Catch the responses with a handle_info function. Use Floki to parse each page and find all a tags (links). Use Enum.each on the links and make the HTTP request for each of them in another Task.async. I would use Tesla with Hackney, as Tesla has a middleware to follow redirects. I would do a HEAD request, as I don’t need the body.
- Catch the responses with another handle_info function and check for the domain. If it matches the domain, update the socket to show the link, otherwise just return the unmodified socket
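To spell out what I mean by "follow redirects" in the list above, here is a rough standalone sketch of the loop I expect the middleware to run for me. It is not Tesla's implementation: RedirectSketch, resolve/3 and the head_fun callback are hypothetical, and it assumes header names arrive lower-cased.

```elixir
defmodule RedirectSketch do
  @max_redirects 5

  # head_fun is a hypothetical callback standing in for the real HEAD request:
  #   url -> {:ok, status, headers} | {:error, reason}
  def resolve(url, head_fun, remaining \\ @max_redirects)

  # Give up once the redirect budget is used up (protects against loops).
  def resolve(_url, _head_fun, remaining) when remaining < 0 do
    {:error, :too_many_redirects}
  end

  def resolve(url, head_fun, remaining) do
    case head_fun.(url) do
      {:ok, status, headers} when status in [301, 302, 303, 307, 308] ->
        case List.keyfind(headers, "location", 0) do
          {_name, location} ->
            # Location may be relative, so resolve it against the current URL.
            next = url |> URI.merge(location) |> URI.to_string()
            resolve(next, head_fun, remaining - 1)

          nil ->
            # A redirect status without a Location header: stop here.
            {:ok, url}
        end

      {:ok, _status, _headers} ->
        {:ok, url}

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```

The result is either {:ok, final_url} or {:error, reason}, so the domain check only has to look at the final URL.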
Here is what I had in mind code-wise (not implemented, just laying out the basic idea).
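defmodule Example.PageLive do
  use Example, :live_view

  def mount(_params, _session, socket) do
    {:ok, socket}
  end

  def handle_event("submit", _, socket) do
    # the initial 100 pages
    pages = ["http://example.com"]

    # Task.async each of them; the result arrives in handle_info as {ref, result}
    Enum.each(pages, fn page ->
      Task.async(fn ->
        # load the links (Tesla returns {:ok, env} or {:error, reason})
        links =
          case Tesla.get(page) do
            {:ok, %Tesla.Env{body: body}} -> find_links(body)
            {:error, _reason} -> []
          end

        # send the response to handle_info
        {:page_loaded, links}
      end)
    end)

    # could show status
    {:noreply, socket}
  end

  def handle_info({_ref, {:page_loaded, links}}, socket) do
    # the redirect middleware has to be added to a client explicitly
    client = Tesla.client([Tesla.Middleware.FollowRedirects])

    # Task.async each of them
    Enum.each(links, fn link ->
      Task.async(fn ->
        # find destination with Tesla.head and the follow_redirects middleware;
        # I'm assuming env.url holds the last URL requested (needs verifying)
        destination =
          case Tesla.head(client, link) do
            {:ok, %Tesla.Env{url: url}} -> url
            {:error, _reason} -> nil
          end

        # send the response to handle_info
        {:link_loaded, destination}
      end)
    end)

    # could show status
    {:noreply, socket}
  end

  # the request failed, nothing to check
  def handle_info({_ref, {:link_loaded, nil}}, socket), do: {:noreply, socket}

  def handle_info({_ref, {:link_loaded, destination}}, socket) do
    if URI.parse(destination).host == "example.com" do
      # do stuff if destination is the domain we're looking for
      {:noreply, socket}
    else
      # ignore otherwise
      {:noreply, socket}
    end
  end

  # Task.async also monitors the task, so a :DOWN message arrives when it exits
  def handle_info({:DOWN, _ref, :process, _pid, _reason}, socket) do
    {:noreply, socket}
  end

  # collect the href of every a tag (relative hrefs are not resolved here)
  defp find_links(html) do
    case Floki.parse_document(html) do
      {:ok, document} -> Floki.attribute(document, "a", "href")
      _error -> []
    end
  end
end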
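One thing I'm unsure about is starting up to 20k tasks at once. As a point of comparison, here is a minimal sketch of the same fan-out with a concurrency cap, using Task.async_stream from the standard library. CrawlSketch, check_links/3 and check_fun are hypothetical names; check_fun stands in for the real HTTP call and is assumed to return {:ok, destination} or {:error, reason}. The cap of 50 and the 15 second timeout are guesses, not measured values.

```elixir
defmodule CrawlSketch do
  # Runs check_fun over all links with at most max_concurrency tasks in flight.
  # Returns the destinations of the links that resolved, in input order.
  def check_links(links, check_fun, max_concurrency \\ 50) do
    links
    |> Task.async_stream(check_fun,
      max_concurrency: max_concurrency,
      timeout: 15_000,
      # a slow link yields {:exit, :timeout} instead of crashing the caller
      on_timeout: :kill_task
    )
    |> Enum.flat_map(fn
      {:ok, {:ok, destination}} -> [destination]
      # failed requests and timeouts are dropped
      _other -> []
    end)
  end
end
```

This call blocks until every link has been processed, so inside a LiveView it would probably have to run in its own task that sends progress messages back, rather than directly in handle_event.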
Is this a sensible approach? Do you think that Tesla and Hackney are good choices? I don’t have much experience with GenServers and the like. What sort of server would I need, spec-wise, to be able to run all of this? I also read a little bit about hackney pools; do I need to configure anything there? And will using a LiveView as my GenServer cause any performance issues?
And last question: Is Elixir a good choice for doing something like this?
Thanks heaps in advance!