Crawler Data

I usually use HTTPoison with Floki. There has been some discussion about that combination before.

You can also replace Floki with another HTML parser if you prefer.

For any interaction with a database (Postgres by default) you can use Ecto.
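
For the example below to work, you would need a Repo and a schema along these lines. This is just a minimal sketch: the :crawler OTP app name and the "articles" table name are assumptions on my part, only Crawler.Repo and Article come from the example itself.

defmodule Crawler.Repo do
  # a standard Ecto repo backed by Postgres; :crawler is an assumed app name
  use Ecto.Repo, otp_app: :crawler, adapter: Ecto.Adapters.Postgres
end

defmodule Article do
  use Ecto.Schema

  # "articles" is an assumed table name; :contents matches the field
  # used in the crawl example below
  schema "articles" do
    field :contents, :string
    timestamps()
  end
end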

So my approach would be (roughly) like this:

defmodule Crawler do
  def crawl!(url) do
    # match on a successful response; HTTPoison.Response uses the :status_code field
    %HTTPoison.Response{body: body, status_code: 200} = HTTPoison.get!(url)
    html = Floki.parse_document!(body)
    # extract the text of every <article> element, or whatever you are interested in
    contents = html |> Floki.find("article") |> Floki.text()
    # see the Ecto docs to understand what Repo does
    Crawler.Repo.insert!(%Article{contents: contents})
    # collect the href of every anchor on the page
    urls =
      html
      |> Floki.find("a")
      |> Floki.attribute("href")
    # spawn more tasks to crawl other pages, or keep crawling in the current process
    urls
  end
end
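
To fan out from there, you could start a task per discovered URL. A rough sketch, assuming crawl!/1 returns the list of URLs as above and ignoring deduplication and error handling:

urls
|> Enum.each(fn url ->
  # fire-and-forget: each page is crawled in its own process
  Task.start(fn -> Crawler.crawl!(url) end)
end)

In a real crawler you would also want to track which URLs have already been visited so you do not fetch the same page twice.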