Crawler Data

Hi

How can I crawl a website's content and save it to a database?

Any favorite library?

Thanks,
Ani

I usually use httpoison with floki. There has been some discussion about it

You can replace floki with

For any interaction with a database (postgres by default) you can use ecto.

So my approach would be (roughly) like this:

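defmodule Crawler do
  # Fetches one page, stores its article text, and returns the links found on it.
  def crawl!(url) do
    # HTTPoison.Response exposes the HTTP status as :status_code
    %HTTPoison.Response{body: body, status_code: 200} = HTTPoison.get!(url)
    # Floki.parse_document!/1 replaces Floki.parse/1 in recent Floki versions
    html = Floki.parse_document!(body)
    # "article" is just an example selector; Floki.text/1 assumes contents is a string field
    contents = html |> Floki.find("article") |> Floki.text()
    # see ecto docs to understand what Repo does
    Crawler.Repo.insert!(%Article{contents: contents})
    # Floki.attribute/3 returns a flat list of href values
    urls = Floki.attribute(html, "a", "href")
    # spawn more tasks to crawl other pages, or keep crawling in the current process
    urls
  end
end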
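
One thing to watch with the collected href values: they are often relative ("/about", "post-1") or point somewhere you do not want to crawl (mailto links, other hosts). Here is a minimal sketch of cleaning them up with the standard library's URI module before crawling further. The module name LinkNormalizer and the same-host rule are my own assumptions for illustration, not something from the libraries above.

```elixir
defmodule LinkNormalizer do
  # Hypothetical helper: turns raw href values into absolute, same-host URLs.
  # base_url is the address of the page the hrefs were found on.
  def normalize(hrefs, base_url) do
    base = URI.parse(base_url)

    hrefs
    |> Enum.map(&absolute(base, &1))
    |> Enum.reject(&is_nil/1)
    # Keep only http(s) links on the same host as the starting page.
    |> Enum.filter(fn uri -> uri.scheme in ["http", "https"] and uri.host == base.host end)
    # Drop the fragment so "/about#team" and "/about" count as the same page.
    |> Enum.map(fn uri -> URI.to_string(%{uri | fragment: nil}) end)
    |> Enum.uniq()
  end

  # URI.merge/2 resolves a relative reference against the base URL.
  # If it raises, the href is treated as unusable.
  defp absolute(base, href) do
    URI.merge(base, href)
  rescue
    _ -> nil
  end
end
```

You would call it as LinkNormalizer.normalize(urls, url) on the list returned by crawl!/1, so that only absolute, deduplicated links are passed on.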

Thank you for answering. I was wondering how I can use meeseeks or html5ever; they seem to be built with another language (Rust).
And how do I spawn processes to crawl the links found on a page?

Thanks

You can have a pool (poolboy, tutorial) of scraping processes (for example, GenServers), each of which accepts a url to scrape (in handle_call), saves the contents to a database, and returns the urls it has found (in {:reply, urls, state}). Those urls are then fed back into the pool.
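
To make that a bit more concrete, here is a rough sketch using only the standard library. It is not a poolboy setup: a single GenServer stands in for the pool, and the names ScrapeWorker, Frontier and scrape_fun are made up for this example. scrape_fun is assumed to be whatever does the real work (fetch the page, save it, return the urls it found).

```elixir
defmodule ScrapeWorker do
  use GenServer

  # scrape_fun is a hypothetical 1-arity function: it fetches a url, stores
  # the contents, and returns the list of urls found on that page.
  def start_link(scrape_fun) when is_function(scrape_fun, 1) do
    GenServer.start_link(__MODULE__, scrape_fun)
  end

  # Synchronous call with a 30 second timeout.
  def scrape(worker, url), do: GenServer.call(worker, {:scrape, url}, 30_000)

  @impl true
  def init(scrape_fun), do: {:ok, scrape_fun}

  @impl true
  def handle_call({:scrape, url}, _from, scrape_fun) do
    urls = scrape_fun.(url)
    {:reply, urls, scrape_fun}
  end
end

defmodule Frontier do
  # Feeds discovered urls back to the worker until nothing new is left.
  # Returns the set of visited urls.
  def run(worker, start_url), do: loop(worker, [start_url], MapSet.new())

  defp loop(_worker, [], visited), do: visited

  defp loop(worker, [url | rest], visited) do
    if MapSet.member?(visited, url) do
      # Already scraped: skip it so cycles between pages do not loop forever.
      loop(worker, rest, visited)
    else
      found = ScrapeWorker.scrape(worker, url)
      loop(worker, rest ++ found, MapSet.put(visited, url))
    end
  end
end

# A fake site (a plain map) stands in for real HTTP requests here.
site = %{"/" => ["/a", "/b"], "/a" => ["/b", "/"], "/b" => []}
{:ok, worker} = ScrapeWorker.start_link(fn url -> Map.get(site, url, []) end)
visited = Frontier.run(worker, "/")
```

With a real pool you would check a worker out for each url (poolboy offers :poolboy.transaction/2 for that) instead of always calling the same process, but the feedback loop and the visited set stay the same.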

I was wondering how I can use meeseeks or html5ever; they seem to be built with another language (Rust)

I guess you would need a Rust compiler installed. See https://github.com/mischov/meeseeks#dependencies


Couldn’t have said it better.

@suryaniwati

You might look at Crawler or Crawlie for examples of more complete crawling solutions.


It has an Elixir front end. The dependency itself manages the Rust part (all you need is Rust/cargo installed).

I quite like meeseeks: it has a fantastic query API, it is blazing fast, and it works on XML too.

@suryaniwati: Let me summarize everything:

Similar topic

We have already talked about tools for collecting (scraping) data from web pages here:

Database

The easiest way to validate data and save it to a database is the Ecto library:

Rust dependencies

And for your last question:

It's as simple as installing Rust stable (using the asdf tool). The rest is as easy as adding the Elixir dependencies and compiling the project.

Full setup

  1. Ensure you have the dependencies needed to compile from source (depends on your OS/distro)
  2. Install asdf
  3. Install Erlang plug-in for asdf
  4. Install Erlang
  5. Set Erlang version
  6. Install Elixir plug-in for asdf
  7. Install Elixir
  8. Set Elixir version
  9. Add Rust plug-in for asdf
  10. Install Rust
  11. Set Rust version
  12. Add PostgreSQL plug-in for asdf
  13. Install PostgreSQL database
  14. Set PostgreSQL version

Example script

#!/bin/bash
# Contents of file: "setup_environment.sh"

# asdf
git clone https://github.com/asdf-vm/asdf.git ~/.asdf --branch v0.4.0
echo -e '\n. $HOME/.asdf/asdf.sh' >> ~/.bashrc
echo -e '\n. $HOME/.asdf/completions/asdf.bash' >> ~/.bashrc
# Load asdf into this script directly; many ~/.bashrc files exit early in non-interactive shells
. "$HOME/.asdf/asdf.sh"

# Erlang
asdf plugin-add erlang https://github.com/asdf-vm/asdf-erlang.git
asdf install erlang 20.2.2
asdf global erlang 20.2.2

# Elixir
asdf plugin-add elixir https://github.com/asdf-vm/asdf-elixir.git
asdf install elixir 1.6.1
asdf global elixir 1.6.1

# Rust
asdf plugin-add rust https://github.com/code-lever/asdf-rust.git
asdf install rust stable
asdf global rust stable

# PostgreSQL
asdf plugin-add postgres https://github.com/smashedtoatoms/asdf-postgres.git
asdf install postgres 10.1
asdf global postgres 10.1

Notes

  1. Feel free to modify the script to install different package versions or a different database.

Other helpful resources

Finally, this article should be helpful for setting up an Ecto 2 based project:


I really need to give asdf a try sometime… ^.^;


It's not a “tool from god”, but it's very easy to set up and it has very simple commands, so I think it's the best answer for all members (including beginners).


Plus it supports so many different languages. I was so tired of having separate version managers for each language; it's super nice to have just one.
