Crawler Data


How can I crawl a website's content and save it to a database?

Any favorite library?


I usually use httpoison with floki. There has been some discussion about it.

You can replace floki with

For any interaction with a database (Postgres by default) you can use ecto.

So my approach would be (roughly) like this:

defmodule Crawler do
  def crawl!(url) do
    # note: HTTPoison.Response keeps the HTTP status in the :status_code field
    %HTTPoison.Response{body: body, status_code: 200} = HTTPoison.get!(url)
    html = Floki.parse(body)
    contents = Floki.find(html, "article") # or whatever you are interested in
    # see the ecto docs to understand what Repo does
    Crawler.Repo.insert!(%Article{contents: contents})
    urls =
      html
      |> Floki.find("a")
      |> Enum.flat_map(fn anchor -> Floki.attribute(anchor, "href") end)
    # spawn more tasks to crawl other pages, or keep crawling in the current process
    urls
  end
end

Thank you for answering. I wondered how I can use meeseeks or html5ever; it seems like it's built with another language (Rust).
And how do I spawn the processes, or crawl every link on the page?


You can have a pool (poolboy, tutorial) of scraping processes (maybe GenServers), each of which would accept a URL to scrape (in handle_call), save the contents to a database, and return the URLs that it has found (in {:reply, urls, state}). Those URLs would then be fed back into the pool.
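A minimal sketch of such a worker, assuming a `Crawler.crawl!/1` helper like the one above that fetches a page, saves it, and returns the URLs it found (the module and function names are illustrative, not from any particular library):

```elixir
defmodule Crawler.Worker do
  use GenServer

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, nil, opts)
  end

  # Ask a worker (e.g. one checked out of a poolboy pool) to scrape one URL.
  def scrape(pid, url) do
    GenServer.call(pid, {:scrape, url})
  end

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:scrape, url}, _from, state) do
    # fetch the page, persist the contents, and collect outgoing links
    urls = Crawler.crawl!(url)
    # the caller feeds these URLs back into the pool
    {:reply, urls, state}
  end
end
```

The caller (whatever drives the pool) would check out a worker, call `Crawler.Worker.scrape/2`, and enqueue the returned URLs for other workers.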

I wondered how I can use meeseeks or html5ever; it seems like it's built with another language (Rust)

I guess you would have to have a rust compiler installed. See


Couldn’t have said it better.


You might look at Crawler or Crawlie for examples of more complete crawling solutions.


It has an elixir front-end. The dependency itself manages the rust part (all you need is rust/cargo installed).

I quite like meeseeks: it has a fantastic query API, it is blazing fast, and it works on XML too. :slight_smile:
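For illustration, a quick Meeseeks query could look like this (a sketch based on its CSS-selector API; the HTML snippet is made up):

```elixir
import Meeseeks.CSS

html = ~s(<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>)

# extract every href on the page
hrefs =
  html
  |> Meeseeks.parse()
  |> Meeseeks.all(css("a"))
  |> Enum.map(&Meeseeks.attr(&1, "href"))

# hrefs should be ["/a", "/b"]
```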

@suryaniwati: Let me summarize everything:

Similar topic

We have already talked about tools for collecting (scraping) data from web pages here:


The easiest way to validate and save data into a database is to use the Ecto library:
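For example, a minimal Ecto schema with a changeset for validation might look like this (the `Crawler.Article` schema, its fields, and the `Crawler.Repo` module are illustrative names, not from the thread):

```elixir
defmodule Crawler.Article do
  use Ecto.Schema
  import Ecto.Changeset

  schema "articles" do
    field :url, :string
    field :contents, :string
    timestamps()
  end

  # cast incoming params and reject rows missing required data
  def changeset(article, params \\ %{}) do
    article
    |> cast(params, [:url, :contents])
    |> validate_required([:url, :contents])
  end
end

# Validate and save in one step:
# %Crawler.Article{}
# |> Crawler.Article.changeset(%{url: url, contents: contents})
# |> Crawler.Repo.insert()
```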

Rust dependencies

And for your last question:

It's as simple as installing stable Rust (using the asdf tool). The rest is as easy as adding the Elixir dependencies and compiling the project. :smile:
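For instance, the dependencies could look something like this in `mix.exs` (the version requirements are illustrative; check hex.pm for the current ones):

```elixir
defp deps do
  [
    {:httpoison, "~> 1.0"},
    {:floki, "~> 0.20"},
    {:meeseeks, "~> 0.7"},   # pulls in a Rust NIF, hence the Rust toolchain
    {:ecto, "~> 2.2"},
    {:postgrex, "~> 0.13"}
  ]
end
```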

Full setup

  1. Ensure you have the dependencies needed to compile sources (depends on your OS/distro)
  2. Install asdf
  3. Install Erlang plug-in for asdf
  4. Install Erlang
  5. Set Erlang version
  6. Install Elixir plug-in for asdf
  7. Install Elixir
  8. Set Elixir version
  9. Add Rust plug-in for asdf
  10. Install Rust
  11. Set Rust version
  12. Add PostgreSQL plug-in for asdf
  13. Install PostgreSQL database
  14. Set PostgreSQL version

Example script

# Contents of file: ""


# asdf
git clone https://github.com/asdf-vm/asdf.git ~/.asdf --branch v0.4.0
echo -e '\n. $HOME/.asdf/asdf.sh' >> ~/.bashrc
echo -e '\n. $HOME/.asdf/completions/asdf.bash' >> ~/.bashrc
source ~/.bashrc

# Erlang
asdf plugin-add erlang
asdf install erlang 20.2.2
asdf global erlang 20.2.2

# Elixir
asdf plugin-add elixir
asdf install elixir 1.6.1
asdf global elixir 1.6.1

# Rust
asdf plugin-add rust
asdf install rust stable
asdf global rust stable

# PostgreSQL
asdf plugin-add postgres
asdf install postgres 10.1
asdf global postgres 10.1


Feel free to modify the script to install different versions of the packages, or a different database.

Other helpful resources

Finally, this article should be helpful for setting up an Ecto 2-based project:


I really need to give asdf a try sometime… ^.^;


It's not a “tool from god”, but it's very easy to set up and it has very simple commands, so I think it's the best answer for all members (including beginners).


Plus it supports so many different languages. I was so tired of having a separate version manager for each language; it's super nice to have just one.