Your opinions on a web crawler architecture please

Hi,

I’ve never done any Elixir programming – just reading a bunch of stuff about it.

However, Elixir always draws my attention when I’m working on concurrent stuff in other languages; I mean… the ability to spawn thousands of processes? That just sounds cool. And, of course, the pipe operator… :wink:

Anyway, I’m working on quite an interesting project in Python, and I think Elixir would be a great fit for it…

… and I’d like your opinion on a “proper” architecture.

Basically, it’s a crawler that crawls certain sites, extracting a list of entries… then crawls the website of each entry and stores the result in a database.

At the moment (in the Python solution), I crawl the listing site, store each entry in a list, loop through all the entries and crawl them one by one.

Here are my thoughts on structuring the whole application in Elixir:

First, I start the crawling process of the listing site. Each entry found is “transferred” to another process which crawls this entry for certain things.

I love the idea of spinning up a separate crawling process for each entry found because they are not related to each other. Is this feasible?
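
From what I’ve read so far, spawning one process per entry might look roughly like this (untested; the supervisor name and EntryCrawler.crawl/1 are made up):

# assumes a {Task.Supervisor, name: Crawler.EntrySupervisor} somewhere in the supervision tree
Enum.each(entries, fn entry ->
  Task.Supervisor.start_child(Crawler.EntrySupervisor, fn ->
    EntryCrawler.crawl(entry)
  end)
end)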

I plan to create an umbrella project and want to keep the crawling of the listing site, the crawling of the individual entry sites, and the persistence separated as much as possible.

So the crawling of the listing site would be an OTP application which sends a message for each entry found, with the entry as the payload.

“Someone” listens for that message (the supervisor?) and spawns a process to crawl the entry when such a message arrives.

As I plan to create another OTP application for the crawling of the entries, is “spawning a process” the right term here? In fact it’s a whole application and not just a function that needs to get started.

Finally, I send another message when an entry has been crawled, and the result is then stored via Ecto by my third application.
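
My rough mental picture of the message flow between the three apps, written as (certainly naive) Elixir with made-up module and message names:

# listing-crawler app: for every entry found on the listing page
send(EntryDispatcher, {:entry_found, entry})

# entry-crawler app: a GenServer registered under the name EntryDispatcher
def handle_info({:entry_found, entry}, state) do
  # probably better handed off to a task so the dispatcher isn’t blocked
  result = EntryCrawler.crawl(entry)
  send(Persistence, {:entry_crawled, result})
  {:noreply, state}
end

# persistence app: a GenServer registered under the name Persistence
def handle_info({:entry_crawled, result}, state) do
  MyApp.Repo.insert!(Entry.changeset(%Entry{}, result))
  {:noreply, state}
end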

As I said in the beginning, I’ve never programmed in Elixir so my structure could be ridiculous…

Any thoughts on this?

Thanks!

  • R
1 Like

Here are a couple Elixir crawling frameworks that might give you an idea of how others have approached the problem: Crawlie, Crawler.

Depending on how structured the crawl is (for instance, if you know each of the entry sites ahead of time) you could create specific clients for each site, perhaps even stateful, GenServer-ish clients, but I’m not sure if I’d want to create different OTP applications for each site.
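
A stateful, site-specific client doesn’t need to be much more than this (rough sketch; SiteA.fetch_listing/1 and the state shape are made up):

defmodule SiteA.Client do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  def entries, do: GenServer.call(__MODULE__, :entries)

  # keep per-site state, e.g. cookies, pagination cursor, last-crawl timestamp
  def init(opts), do: {:ok, %{base_url: opts[:base_url], last_crawl: nil}}

  def handle_call(:entries, _from, state) do
    entries = SiteA.fetch_listing(state.base_url)   # site-specific parsing lives here
    {:reply, entries, %{state | last_crawl: DateTime.utc_now()}}
  end
end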

Finally, it’s worth considering whether or not Elixir offers anything better than Scrapy for large scraping projects, since you’re already in Python.

4 Likes

Thanks for the links.

As for Python/Scrapy vs Elixir… the Elixir project is going to be a pure learning project, so I’m not that worried about the tradeoffs.

Regarding the “specific clients for each site” you mentioned: the only site that needs special parsing rules etc. is the listing page. Each entry found on the listing page is parsed in a generic way, so I don’t need different clients for the entry websites.

I’ve got one question regarding supervisors/GenServers:

Assuming that I’ve got a module for handling the listing sites and a module for handling individual entry sites, how could I trigger a new “parsing job” for an entry once the listing site crawler finds one?

I was thinking about creating a supervisor whose job it is to start the initial crawling of the listing site. The supervisor also stores a queue of all entries found so far.

What I then need is some way to pass a message from the listing crawler back to the supervisor once an entry is found.
Once the supervisor receives such a message, it stores the entry in the queue and launches the parsing of the entry.

So, is it possible in Elixir to communicate back from a worker to the calling process? I think I could pass the pid of the supervisor to the listing crawler when starting it, but this doesn’t seem right.
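
To make that concrete, this is roughly what I have in mind (I’ve never written Elixir, so treat it as pseudo-code with made-up names):

# the coordinating process starts the listing crawler and passes its own pid along
{:ok, _pid} = ListingCrawler.start_link(report_to: self())

# inside the listing crawler, for every entry found
send(report_to, {:entry_found, entry})

# back in the coordinating process
receive do
  {:entry_found, entry} ->
    # enqueue the entry and kick off its parsing
    start_entry_crawler(entry)
end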

1 Like

Task is a lightweight OTP process abstraction for one-off asynchronous work that would fit this use case.

task = Task.async(fn -> do_some_scrapy() end)
result = Task.await(task)
1 Like

Spawning a task for each URL may not be the best way to handle this; you may end up DoSing the website if you don’t do it in a controlled fashion. A few things to consider:

  1. Put the links to be crawled in a database.
  2. Have a pool of workers whose size can be configured, e.g. 10 workers (see the sketch after this list).
  3. If fetching or parsing of a URL fails, push it to a log or some place where you can see it. Also, this shouldn't crash the whole app.
  4. Your app should be able to pick up where it left off if it crashes or is killed.
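
One way to get 2 and 3 without a dedicated pooling library is Task.Supervisor.async_stream_nolink; a rough sketch (fetch_and_parse/1, store/1 and the supervisor name are made up):

require Logger

# assumes a Task.Supervisor started elsewhere as {Task.Supervisor, name: Crawler.TaskSupervisor}
Crawler.TaskSupervisor
|> Task.Supervisor.async_stream_nolink(urls, &fetch_and_parse/1,
  max_concurrency: 10,      # point 2: the worker "pool" size
  timeout: 30_000,
  on_timeout: :kill_task    # a hung url gets killed instead of blowing up the run
)
|> Enum.each(fn
  {:ok, result} -> store(result)                                       # placeholder
  {:exit, reason} -> Logger.error("crawl failed: #{inspect(reason)}")  # point 3: log it, don't crash
end)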
3 Likes

If you don’t want to pass in the supervisor’s pid, register a name for the supervisor.

Process.register(supervisor_pid, :supervisor_name)
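
Then the listing crawler can message it by name instead of by pid (MyCoordinator and the message shape are made up):

# anywhere in the listing crawler, once an entry is found
send(:supervisor_name, {:entry_found, entry})

# or, if the coordinating process is a GenServer, name it at start and cast to it
GenServer.start_link(MyCoordinator, [], name: :supervisor_name)
GenServer.cast(:supervisor_name, {:entry_found, entry})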

I use Task with poolboy when crawling multiple URLs.
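
Roughly like this, assuming a poolboy pool named :crawler_pool whose CrawlerWorker GenServers handle a {:crawl, url} call (names made up):

urls
|> Enum.map(fn url ->
  Task.async(fn ->
    # checks a worker out of the pool, runs the call, checks it back in
    :poolboy.transaction(:crawler_pool, fn worker ->
      GenServer.call(worker, {:crawl, url}, 30_000)
    end)
  end)
end)
|> Enum.map(&Task.await(&1, 60_000))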

1 Like

Or perhaps GenStage. Limit it to a maximum number of active processes at a time and feed in a list of URLs (which can of course feed back into the stream to keep adding more input).
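
A minimal demand-driven version might look like this (rough sketch; the URL list and crawl/1 are placeholders, and back-feeding new URLs would need a smarter producer than a plain list):

defmodule UrlProducer do
  use GenStage

  def start_link(urls), do: GenStage.start_link(__MODULE__, urls, name: __MODULE__)

  def init(urls), do: {:producer, urls}

  # hand out only as many urls as the consumers ask for
  def handle_demand(demand, urls) when demand > 0 do
    {batch, rest} = Enum.split(urls, demand)
    {:noreply, batch, rest}
  end
end

defmodule CrawlConsumer do
  use GenStage

  def start_link(_), do: GenStage.start_link(__MODULE__, :ok)

  # max_demand caps how many urls this consumer has in flight
  def init(:ok), do: {:consumer, :ok, subscribe_to: [{UrlProducer, max_demand: 5}]}

  def handle_events(urls, _from, state) do
    Enum.each(urls, &crawl/1)   # crawl/1 is a placeholder
    {:noreply, [], state}
  end

  defp crawl(url), do: IO.inspect(url, label: "crawling")
end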

1 Like

I am just trying to move my crawling pipeline to GenStage :slight_smile:

Not done yet, but it’s way better than the way I do it now… Under heavy load, my pipeline breaks because the consumers cannot handle the pressure.

1 Like

The steps you mention are definitely the things I want to accomplish.

None of the sites I want to crawl are related to one another, so firing up a process (or a worker from a pool) per entry isn’t a problem, I think.
On top of that, I’m going to limit my requests to 1 every 10 seconds per site… just to be nice. :ok_hand:
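
The naive version of that throttle is probably just this per site (fetch/parse/store are made-up names):

# one process per site works through that site's urls sequentially
Enum.each(site_urls, fn url ->
  url |> fetch() |> parse() |> store()
  Process.sleep(10_000)   # at most one request every 10 seconds against this site
end)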

1 Like