Hi
How can I crawl a website content and save the contents to database?
Any favorite library?
Thanks,
Ani
I usually use HTTPoison with Floki. There has been some discussion about it here before.
You can replace Floki with Meeseeks or html5ever.
For any interaction with a database (Postgres by default) you can use Ecto.
So my approach would be (roughly) like this:
defmodule Crawler do
  def crawl!(url) do
    # note: HTTPoison's response struct uses `status_code`, not `status`
    %HTTPoison.Response{body: body, status_code: 200} = HTTPoison.get!(url)

    html = Floki.parse(body)
    contents = Floki.find(html, "article") # or whatever you are interested in

    # `contents` is Floki's parsed tree; convert it with Floki.raw_html/1 or
    # Floki.text/1 before persisting it as a string.
    # See the Ecto docs to understand what Repo does.
    Crawler.Repo.insert!(%Article{contents: Floki.raw_html(contents)})

    urls =
      html
      |> Floki.find("a")
      # Floki.attribute/2 returns a list, so flat_map gives one flat list of hrefs
      |> Enum.flat_map(fn anchor -> Floki.attribute(anchor, "href") end)

    # spawn more tasks to crawl other pages, or keep crawling in the current process
    urls
  end
end
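If you go the "spawn more tasks" route, one option (my assumption, not the only way) is to run the returned urls through Task.async_stream/3, e.g. with another function inside the same Crawler module:

  # minimal sketch: crawl the discovered urls concurrently;
  # max_concurrency and how deep you recurse are up to you
  def crawl_all!(urls) do
    urls
    |> Task.async_stream(&crawl!/1, max_concurrency: 10, timeout: :infinity)
    |> Enum.to_list()
  end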
Thank you for answering. I wondered how I can use meeseeks or html5ever, since it seems like they are built with another language (Rust).
And how do I spawn the processes or crawl every link on the page?
Thanks
You can have a pool (poolboy, tutorial) of scraping processes (they could be GenServers), each of which would accept a url to scrape (in handle_call), save the contents to a database, and return the urls that it has found (in {:reply, urls, state}). Then those urls would be fed back into the pool.
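A rough sketch of what one such worker could look like, reusing the HTTPoison/Floki calls from earlier (the module name Crawler.Worker, the Article schema and the pool name :crawler_pool are assumptions for illustration):

defmodule Crawler.Worker do
  use GenServer

  # poolboy starts workers via start_link/1
  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, :ok, opts)
  end

  def init(:ok), do: {:ok, %{}}

  # accepts a url to scrape, saves the contents, and replies with the urls it found
  def handle_call({:scrape, url}, _from, state) do
    %HTTPoison.Response{body: body, status_code: 200} = HTTPoison.get!(url)
    html = Floki.parse(body)

    contents = html |> Floki.find("article") |> Floki.raw_html()
    Crawler.Repo.insert!(%Article{contents: contents})

    urls =
      html
      |> Floki.find("a")
      |> Enum.flat_map(&Floki.attribute(&1, "href"))

    {:reply, urls, state}
  end
end

Checked out of a poolboy pool, it would be called with something like

:poolboy.transaction(:crawler_pool, fn pid ->
  GenServer.call(pid, {:scrape, url})
end)

and the returned urls pushed back into whatever queue feeds the pool.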
I wondered how I can use meeseeks or html5ever, since it seems like they are built with another language (Rust)
I guess you would have to have a rust compiler installed. See https://github.com/mischov/meeseeks#dependencies
Couldn’t have said it better.
You might look at Crawler or Crawlie for examples of more complete crawling solutions.
It has an Elixir front-end. The dependency itself manages the Rust part (all you need is rust/cargo installed).
I quite like Meeseeks: it has such a fantastic query API, it is blazing fast, and it works on XML too.
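To give a feel for that query API, a query against a fetched page looks roughly like this (the selector and the `body` variable are just placeholders):

import Meeseeks.CSS

document = Meeseeks.parse(body)

# collect the text and href of every link inside articles
for link <- Meeseeks.all(document, css("article a")) do
  %{text: Meeseeks.text(link), href: Meeseeks.attr(link, "href")}
end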
@suryaniwati: Let me summarize everything:
We have already talked about tools for collecting (scraping) data from web pages here:
The easiest way to validate and save data into a database is the Ecto library:
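Roughly, that means defining a schema with a changeset for validation and inserting through the Repo (the schema name and fields below are just an example, not from the posts above):

defmodule Crawler.Article do
  use Ecto.Schema
  import Ecto.Changeset

  schema "articles" do
    field :url, :string
    field :contents, :string
    timestamps()
  end

  def changeset(article, attrs) do
    article
    |> cast(attrs, [:url, :contents])
    |> validate_required([:url, :contents])
  end
end

# validate and save:
# %Crawler.Article{}
# |> Crawler.Article.changeset(%{url: url, contents: contents})
# |> Crawler.Repo.insert()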
And for your last question:
It’s as simple as installing Rust stable (using the asdf tool). The rest is as easy as adding the Elixir dependencies and compiling the project.
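For the dependencies part, the deps function in mix.exs would look something like this (the version constraints are placeholders; pick whichever of Floki/Meeseeks you prefer):

# in mix.exs
defp deps do
  [
    {:httpoison, ">= 0.0.0"},
    {:floki, ">= 0.0.0"},     # or {:meeseeks, ">= 0.0.0"}
    {:ecto, ">= 0.0.0"},
    {:postgrex, ">= 0.0.0"},
    {:poolboy, ">= 0.0.0"}
  ]
end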
# Contents of file: "setup_environment.sh"
#!/bin/bash
# asdf
git clone https://github.com/asdf-vm/asdf.git ~/.asdf --branch v0.4.0
echo -e '\n. $HOME/.asdf/asdf.sh' >> ~/.bashrc
echo -e '\n. $HOME/.asdf/completions/asdf.bash' >> ~/.bashrc
source ~/.bashrc
# Erlang
asdf plugin-add erlang https://github.com/asdf-vm/asdf-erlang.git
asdf install erlang 20.2.2
asdf global erlang 20.2.2
# Elixir
asdf plugin-add elixir https://github.com/asdf-vm/asdf-elixir.git
asdf install elixir 1.6.1
asdf global elixir 1.6.1
# Rust
asdf plugin-add rust https://github.com/code-lever/asdf-rust.git
asdf install rust stable
asdf global rust stable
# PostgreSQL
asdf plugin-add postgres https://github.com/smashedtoatoms/asdf-postgres.git
asdf install postgres 10.1
asdf global postgres 10.1
Finally, this article should be helpful for setting up an Ecto 2 based project:
I really need to give asdf a try sometime… ^.^;
It’s not a “tool from god”, but it’s very easy to set up and it has very simple commands, so I think it’s the best answer for all members (including beginners).
Plus it supports so many different languages. I was so tired of having a separate version manager for each language; it’s super nice to have just one.