Crawler Data


How can I crawl a website's content and save it to a database?

Any favorite library?


I usually use httpoison with floki. There has been some discussion about it.

You can replace floki with

For any interaction with a database (Postgres by default) you can use ecto.

So my approach would be (roughly) like this:

defmodule Crawler do
  def crawl!(url) do
    # note: HTTPoison.Response keeps the HTTP status in the :status_code field
    %HTTPoison.Response{body: body, status_code: 200} = HTTPoison.get!(url)
    html = Floki.parse(body)
    contents = Floki.find(html, "article") # or whatever you are interested in
    # see the ecto docs to understand what Repo does
    Crawler.Repo.insert!(%Article{contents: contents})
    urls =
      html
      |> Floki.find("a")
      |> Enum.flat_map(fn anchor -> Floki.attribute(anchor, "href") end)
    # spawn more tasks to crawl other pages, or keep crawling in the current process
    urls
  end
end

Thank you for answering. I wondered how I can use meeseeks or html5ever; it seems like it's built with another language (Rust).
And how do I spawn the processes, or crawl every link on the page?


You can have a pool (poolboy, tutorial) of scraping processes (maybe GenServers), each of which would accept a URL to scrape (in handle_call), save the contents to a database, and return the URLs that it has found (in {:reply, urls, state}). Those URLs would then be fed back into the pool.
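A minimal sketch of such a worker, assuming a `Crawler.crawl!/1` helper like the one above that fetches a page, saves it, and returns the URLs it found (the module and function names are illustrative, not from any particular library):

```elixir
defmodule Crawler.Worker do
  use GenServer

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, nil, opts)
  end

  # Ask a worker (e.g. one checked out of a poolboy pool) to scrape one URL.
  def scrape(pid, url) do
    GenServer.call(pid, {:scrape, url})
  end

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:scrape, url}, _from, state) do
    # fetch the page, persist the contents, and collect outgoing links
    urls = Crawler.crawl!(url)
    # the caller feeds these URLs back into the pool
    {:reply, urls, state}
  end
end
```

The caller (whatever drives the pool) would check out a worker, call `Crawler.Worker.scrape/2`, and enqueue the returned URLs for other workers.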

I wondered how I can use meeseeks or html5ever; it seems like it's built with another language (Rust)

I guess you would have to have a rust compiler installed. See


Couldn’t have said it better.


You might look at Crawler or Crawlie for examples of more complete crawling solutions.


It has an elixir front-end. The dependency itself manages the rust part (all you need is rust/cargo installed).

I quite like meeseeks: it has a fantastic query API, it is blazing fast, and it works on XML too. :slight_smile:
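For illustration, a quick Meeseeks query could look like this (a sketch based on its CSS-selector API; the HTML snippet is made up):

```elixir
import Meeseeks.CSS

html = ~s(<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>)

# extract every href on the page
hrefs =
  html
  |> Meeseeks.parse()
  |> Meeseeks.all(css("a"))
  |> Enum.map(&Meeseeks.attr(&1, "href"))

# hrefs should be ["/a", "/b"]
```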

@suryaniwati: Let me summarize everything:

Similar topic

We have already talked about tools for collecting (scraping) data from web pages here:


The easiest way to validate and save data into a database is to use the Ecto library:
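For example, a minimal Ecto schema with a changeset for validation might look like this (the `Crawler.Article` schema, its fields, and the `Crawler.Repo` module are illustrative names, not from the thread):

```elixir
defmodule Crawler.Article do
  use Ecto.Schema
  import Ecto.Changeset

  schema "articles" do
    field :url, :string
    field :contents, :string
    timestamps()
  end

  # cast incoming params and reject rows missing required data
  def changeset(article, params \\ %{}) do
    article
    |> cast(params, [:url, :contents])
    |> validate_required([:url, :contents])
  end
end

# Validate and save in one step:
# %Crawler.Article{}
# |> Crawler.Article.changeset(%{url: url, contents: contents})
# |> Crawler.Repo.insert()
```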

Rust dependencies

And for your last question:

It's as simple as installing stable Rust (using the asdf tool). The rest is as easy as adding the Elixir dependencies and compiling the project. :smile:
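For instance, the dependencies could look something like this in `mix.exs` (the version requirements are illustrative; check hex.pm for the current ones):

```elixir
defp deps do
  [
    {:httpoison, "~> 1.0"},
    {:floki, "~> 0.20"},
    {:meeseeks, "~> 0.7"},   # pulls in a Rust NIF, hence the Rust toolchain
    {:ecto, "~> 2.2"},
    {:postgrex, "~> 0.13"}
  ]
end
```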

Full setup

  1. Ensure you have the dependencies needed to compile sources (depends on your OS/distro)
  2. Install asdf
  3. Install Erlang plug-in for asdf
  4. Install Erlang
  5. Set Erlang version
  6. Install Elixir plug-in for asdf
  7. Install Elixir
  8. Set Elixir version
  9. Add Rust plug-in for asdf
  10. Install Rust
  11. Set Rust version
  12. Add PostgreSQL plug-in for asdf
  13. Install PostgreSQL database
  14. Set PostgreSQL version

Example script

# Contents of file: ""


# asdf
git clone https://github.com/asdf-vm/asdf.git ~/.asdf --branch v0.4.0
echo -e '\n. $HOME/.asdf/asdf.sh' >> ~/.bashrc
echo -e '\n. $HOME/.asdf/completions/asdf.bash' >> ~/.bashrc
source ~/.bashrc

# Erlang
asdf plugin-add erlang
asdf install erlang 20.2.2
asdf global erlang 20.2.2

# Elixir
asdf plugin-add elixir
asdf install elixir 1.6.1
asdf global elixir 1.6.1

# Rust
asdf plugin-add rust
asdf install rust stable
asdf global rust stable

# PostgreSQL
asdf plugin-add postgres
asdf install postgres 10.1
asdf global postgres 10.1


Feel free to modify the script to install different versions of the packages, or a different database.

Other helpful resources

Finally, this article should be helpful for setting up an Ecto 2-based project:


I really need to give asdf a try sometime… ^.^;


It's not a “tool from god”, but it's very easy to set up and it has very simple commands, so I think it's the best answer for all members (including beginners).


Plus it supports so many different languages. I was so tired of having a separate version manager for each language; it's super nice to have just one.