Selective scraper and persisting changes over time

revati · October 24, 2018, 8:10pm

Hello everyone,

I have an idea for scraping some HTML/JSON/XML sources. I want to fetch the contents of a root URL and then, depending on the source type, maybe extract some more URLs and scrape those as well.

In the end, I would like to scrape those resources regularly, so that I would have diffs showing how one specific source (containing the contents of one or more URLs) has changed over time.

If I select one source in the UI, I open a list of all revisions of that source; opening one revision, I would see that particular state. (Something like gists on GitHub, which can contain multiple files.)

I would also be able to see the diff history: how one revision differs from another.

Each scrape session would collect the raw response, and diffing would be done on demand.
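To make "diffing on demand" concrete, here is a minimal sketch using Python's standard `difflib`. It assumes two raw responses are already loaded as strings; the function name `diff_revisions` and the revision labels are made up for illustration.

```python
import difflib


def diff_revisions(old: str, new: str,
                   old_label: str = "rev-1", new_label: str = "rev-2") -> str:
    """Return a unified diff between two stored raw responses.

    Nothing is precomputed: the diff is built only when someone asks for it.
    """
    lines = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=old_label,
        tofile=new_label,
    )
    return "".join(lines)
```

Two identical revisions give an empty string, which the UI could show as "no changes".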

One possible solution would be to make a git repo where the scraper stores contents in {source name}/{date}/{all sources here}.html and makes commits. Something like an automated git repo. Then all diffing and revisions would already be handled by git itself.
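A rough sketch of what that automated repo could look like, driving the `git` command line from Python. It assumes the repo is already initialised and has a committer identity configured; `snapshot_dir`, `commit_scrape` and the file names in `pages` are hypothetical names, not an existing tool.

```python
from __future__ import annotations

import subprocess
from datetime import datetime, timezone
from pathlib import Path


def snapshot_dir(repo: Path, source_name: str, when: datetime) -> Path:
    """Build {source name}/{date} inside the repo, following the layout above."""
    return repo / source_name / when.strftime("%Y-%m-%d")


def commit_scrape(repo: Path, source_name: str, pages: dict[str, str]) -> bool:
    """Write all scraped pages and record them in a single commit.

    `pages` maps a hypothetical file name (e.g. "root.html") to raw contents.
    Returns False when nothing changed, so no empty commit is attempted.
    """
    now = datetime.now(timezone.utc)
    target = snapshot_dir(repo, source_name, now)
    target.mkdir(parents=True, exist_ok=True)
    for file_name, contents in pages.items():
        (target / file_name).write_text(contents, encoding="utf-8")

    # Stage only this source's directory so other sources are left alone.
    subprocess.run(["git", "add", "--", source_name], cwd=repo, check=True)

    # `git diff --cached --quiet` exits with 0 when nothing is staged.
    staged = subprocess.run(
        ["git", "diff", "--cached", "--quiet", "--", source_name], cwd=repo
    )
    if staged.returncode == 0:
        return False

    subprocess.run(
        ["git", "commit", "-m", f"Scrape {source_name} at {now.isoformat()}"],
        cwd=repo,
        check=True,
    )
    return True
```

With one directory per date, an hourly scrape overwrites that day's files, so git only records a new commit when the contents really changed.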

I have estimated for now that there could be something like a few thousand sources, each scraped at its own interval (mostly daily, some hourly). So there would be tons of commits daily, and that git repo would get pretty big fast (I don’t know enough about git; maybe that isn’t a problem). A potential issue: concurrent scrapes could end up in the wrong commits. How to implement transactions?
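One simple way to get transaction-like behaviour, assuming all scrapers run as threads in a single process, would be to serialise "write files + commit" behind one lock per repository. The sketch below only simulates the commit step; `run_exclusively` and `simulate` are hypothetical names.

```python
import threading
from typing import Callable, List

# One lock per repository. Assumption: all scrapers live in the same process;
# separate processes would need a file lock or a queue instead.
repo_lock = threading.Lock()


def run_exclusively(job: Callable[[], None]) -> None:
    """Run `job` (write files, then git commit) while holding the repo lock."""
    with repo_lock:
        job()


def simulate(sources: List[str]) -> List[str]:
    """Stand-in for real scrapes: log the begin and end of each 'commit'."""
    log: List[str] = []

    def make_job(name: str) -> Callable[[], None]:
        def job() -> None:
            log.append(f"begin {name}")
            log.append(f"end {name}")
        return job

    threads = [
        threading.Thread(target=run_exclusively, args=(make_job(name),))
        for name in sources
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return log
```

The fetching itself can still run concurrently; only the short write-and-commit step is serialised, so files from two scrapes never land in the same commit.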

I am more used to a relational DB for storing the data:

sources
id | url | type

scrapes (odd name)
id | source_id | revision | timestamp

contents
id | scrape_id | contents (long text???)
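As a sketch, the three tables above could look like this in SQLite (via Python's built-in `sqlite3`). The `url` column on `contents` and the `record_scrape` helper are my assumptions, not part of the table outline; the column is there so that one scrape can hold several URL contents.

```python
import sqlite3
from typing import Dict

SCHEMA = """
CREATE TABLE sources (
    id   INTEGER PRIMARY KEY,
    url  TEXT NOT NULL UNIQUE,
    type TEXT NOT NULL                -- e.g. 'html', 'json' or 'xml'
);
CREATE TABLE scrapes (
    id        INTEGER PRIMARY KEY,
    source_id INTEGER NOT NULL REFERENCES sources(id),
    revision  INTEGER NOT NULL,
    timestamp TEXT NOT NULL,
    UNIQUE (source_id, revision)
);
CREATE TABLE contents (
    id        INTEGER PRIMARY KEY,
    scrape_id INTEGER NOT NULL REFERENCES scrapes(id),
    url       TEXT NOT NULL,          -- assumption: which URL this body came from
    contents  TEXT NOT NULL
);
"""


def record_scrape(conn: sqlite3.Connection, source_id: int,
                  timestamp: str, pages: Dict[str, str]) -> int:
    """Store one scrape and all its page bodies atomically; return the revision."""
    with conn:  # commits on success, rolls back if anything raises
        (revision,) = conn.execute(
            "SELECT COALESCE(MAX(revision), 0) + 1 FROM scrapes WHERE source_id = ?",
            (source_id,),
        ).fetchone()
        cursor = conn.execute(
            "INSERT INTO scrapes (source_id, revision, timestamp) VALUES (?, ?, ?)",
            (source_id, revision, timestamp),
        )
        conn.executemany(
            "INSERT INTO contents (scrape_id, url, contents) VALUES (?, ?, ?)",
            [(cursor.lastrowid, url, body) for url, body in pages.items()],
        )
    return revision
```

Here the database transaction gives the "all files of one scrape or none" guarantee that would otherwise have to be built around git.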

Ideally, it would be CQRS-ish, or at least event sourced, in which case:

%SourceAdded{id, url, type}
%SourceScraped{id, source_id, revision, timestamp, [url => contents]} // List of contents, as there can be many sources of data (the root one is where the url equals the source url)
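For illustration, the same two events written as Python dataclasses, plus a small fold that rebuilds the revision list per source from the event log. The field names follow the events above; `revisions_by_source` is a hypothetical read-model helper.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class SourceAdded:
    id: int
    url: str
    type: str


@dataclass(frozen=True)
class SourceScraped:
    id: int
    source_id: int
    revision: int
    timestamp: str
    contents: Dict[str, str]  # url => raw contents; the root entry uses the source url


Event = Union[SourceAdded, SourceScraped]


def revisions_by_source(events: List[Event]) -> Dict[int, List[int]]:
    """Replay the event log into a {source id: [revisions]} read model."""
    history: Dict[int, List[int]] = {}
    for event in events:
        if isinstance(event, SourceAdded):
            history.setdefault(event.id, [])
        elif isinstance(event, SourceScraped):
            history.setdefault(event.source_id, []).append(event.revision)
    return history
```

The list of revisions shown in the UI would then be just one projection of the log; diffs could be another one, computed on demand.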

Later there will be a heavy-duty scraper, where facts will be extracted from those sources, but that is the next level.

P.S. This is probably just me rambling and trying to formalize for myself what exactly I need.

mjaric · October 24, 2018, 9:33pm

Why not use git as storage? You can even make it distributed storage, so nodes can merge remote branches and sync over time.

revati · October 25, 2018, 5:57am

Yeah, that was one of my ideas for how to handle it. I will have to look into how it could work. I haven’t used git for such dynamics.