Following up on his talk at ElixirConf EU Virtual, our colleague Oleg Tarasenko will join us for a webinar to dive deeper into Crawly, the web scraping framework he created in Elixir.
In this webinar he will discuss what web scraping is, why it is valuable and how Crawly makes it easy.
The webinar will demonstrate a real example using the Elixir Radar job board.
Register at https://www2.erlang-solutions.com/crawlywebinar2
I would surely welcome some more intermediate/advanced guides on web scraping. Almost all blogs/tutorials on this topic consist of: 1. install lib 2. basic XPath selectors 3. save to CSV.
Things I am wondering about:
Persistence strategies - do we save the HTML to object storage, then scrape it and save the data we need to a database?
Recurrent scraping - how to scrape the same pages over a period of time? Strategies for good logging for error detection when a page has changed? How do we handle incremental updates on a field or web page?
Spider structuring - do you write a more general spider that can work for general fields across many web sites, and have more custom spiders to get “special” data from each page, or do we write a custom spider for each page?
Spider orchestration - how do we monitor and schedule these x number of spiders? How do we avoid DDoSing a site and getting banned?
Probably more stuff that I don’t even know that I don’t know about.
If anyone has any available resources, please share.
Hey Joddm,
Thanks for the reply. I will pass this on to Oleg from our team who is hosting the webinar. He will likely have some valuable information on the above.
@joddm noted!
I think you’re raising fair points. I am still writing the presentation and the plot for the webinar, and will try to address your concerns!
We will build it interactively, so it will be possible to have a real-time conversation! In any case, it would be nice to exchange experiences.
Thanks, just joined!
Can you confirm the date and time for the event? The confirmation email says it’s:
Wed, Jul 1, 2020 5:30 AM - 6:30 AM BST
But the web says:
17:30 BST on July 8.
I hope it’s the second one
Also, will this be recorded?
Sorry for the mix-up.
This is tomorrow, July 1st.
The date on the website was temporarily out of date.
The webinar will be recorded and all registrants will receive a copy via email.