UPDATE1 - See end of post
I need to scrape content from several sites into a common format, so that I can build a unified database of licensed physicians. Links to the various state licensing boards can be found here:
There are a few states with APIs, but many will require screen scraping.
Screen-scraping projects are my favorite way to learn a new language, because the work closely resembles common real-world tasks: lots of HTTP work, parsing, data processing, error handling, persistence, and async work.
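To give a flavor of the HTTP side, here's roughly what the fetch step of a scraper looks like. This is a sketch, not code from the repo; it assumes HTTPoison as the HTTP client, and the `Fetcher` module name is made up:

```elixir
# Hypothetical sketch of the fetch side of a scraper, assuming the
# HTTPoison package; error handling is kept deliberately explicit.
defmodule Fetcher do
  def fetch(url) do
    case HTTPoison.get(url, [], follow_redirect: true) do
      {:ok, %HTTPoison.Response{status_code: 200, body: body}} ->
        {:ok, body}

      # Non-200 responses (404s, 500s from flaky state sites) surface
      # as tagged errors so a retry layer can decide what to do.
      {:ok, %HTTPoison.Response{status_code: code}} ->
        {:error, {:http_status, code}}

      # Transport-level failures (timeouts, DNS, refused connections).
      {:error, %HTTPoison.Error{reason: reason}} ->
        {:error, reason}
    end
  end
end
```

Keeping the status-code and transport-error cases as separate tagged tuples makes later retry and reporting logic straightforward.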
IMO, this is more useful than doing another Game of Life clone or brain-teaser koans, which don't require learning about the package ecosystem, external API calls, and so on.
What's more, you can learn from what I've figured out so far (I've been using Elixir for about six months now) by contributing to this framework.
This is great, because while the project uses some advanced concepts like Ecto 2.1 and GenStage, you don't have to understand them all in order to work on it. I've abstracted that complexity away, so you can pick it up at your own pace. What you will get into quickly is functional concepts, HTML/JSON parsing, and ExUnit.
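Concretely, most of the day-to-day work is small pure functions with ExUnit tests around them. Here's a hedged sketch of that workflow; the module and field names are mine, not the project's:

```elixir
# Illustrative only: a tiny pure parsing function plus its ExUnit test,
# the kind of unit this project is mostly made of.
defmodule NameParser do
  # Split "LAST, FIRST" into parts, a common format on licensing sites.
  def parse(full_name) do
    [last, rest] =
      full_name
      |> String.split(",", parts: 2)
      |> Enum.map(&String.trim/1)

    %{last_name: last, first_name: rest}
  end
end

ExUnit.start()

defmodule NameParserTest do
  use ExUnit.Case

  test "splits a comma-delimited name" do
    assert NameParser.parse("DEVILLE, JEFF") ==
             %{last_name: "DEVILLE", first_name: "JEFF"}
  end
end
```

Because functions like this take a string in and return a map out, the tests need no HTTP, no database, and no setup, which is what makes the project approachable for newcomers.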
I have been writing software for quite a while in Ruby, Python, Go, Node, C#, and Elixir. This is me; feel free to send me an invite: https://linkedin.com/in/jeffdeville. But be aware: my name is NOT José Valim. Many of these concepts are new to me as well, so there's every likelihood that you'll find areas for improvement. That'd be awesome!
This can be a small project where you implement just one scraper, or a larger one where you tackle several. If things work out, you can help me work on the search engine this data feeds into, payment processing and member services in a Phoenix app, or white-labeling the site for clients.
Here's the rub: this is a side project that will probably never pay me a ton of money, which means I don't have a ton to pay you. This is primarily a great way to learn useful skills while contributing to a project tuned for new learners, and to make some beer money. I'm thinking ~300/site on average, though we'd have to look at each candidate site to verify.
UPDATE 1: I didn't expect such a response! There have been several entirely reasonable questions about the details of the scraping work, so here's some more detail:
Each of these sites follows basically the same formula: a search-results page whose links we can hard-code as starting points, and then a detail page for each physician. Straightforward. The challenge is that the HTML structure is pretty wretched in many cases, so figuring out how to extract the data on the page can be a bit of a pain. For example: https://hrlb.oregon.gov/OBNM/licenseelookup/detail.asp?num=1002587&searchby=INDEXNAME&searchfor=A&stateselect=NONE This HTML isn't even valid, the encoding isn't handled automatically by Floki, and there aren't many useful CSS classes to help, either.
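When there are no usable CSS classes, falling back to the table structure itself often works. Here's a hedged sketch with Floki that assumes the detail page is laid out as two-column label/value rows; the selectors are illustrative and the real page will need inspection (and its charset may need converting, e.g. with a package like Codepagex, before parsing):

```elixir
# Sketch of extracting label/value pairs from class-free table markup.
# Assumes a recent Floki and a two-column <tr> layout; illustrative only.
defmodule DetailPage do
  def fields(html) when is_binary(html) do
    {:ok, doc} = Floki.parse_document(html)

    doc
    |> Floki.find("table tr")
    |> Enum.map(fn row -> Floki.find(row, "td") end)
    # Keep only rows that look like label/value pairs.
    |> Enum.filter(fn cells -> length(cells) == 2 end)
    |> Enum.map(fn [label, value] -> {text(label), text(value)} end)
    |> Map.new()
  end

  defp text(node), do: [node] |> Floki.text() |> String.trim()
end
```

The payoff of returning a plain map keyed by the on-page labels is that a per-site mapping from those labels to the common data model becomes a separate, easily testable step.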
Your destination data model is a single struct with some minimal validation: https://github.com/jeffdeville/nhverify/blob/master/lib/licensee.ex
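If you want a mental model before opening the repo, the shape is roughly a flat struct plus a validation function. This is an illustrative sketch with made-up field names; the real definition lives in `lib/licensee.ex` at the link above:

```elixir
# Illustrative only: a struct of roughly this shape with minimal
# validation. Field names here are guesses, not the repo's actual ones.
defmodule Licensee do
  defstruct [:first_name, :last_name, :license_number, :state, :status]

  # Reject records missing the fields every scraper must supply.
  def validate(%__MODULE__{} = licensee) do
    if licensee.license_number in [nil, ""] or licensee.state in [nil, ""] do
      {:error, :missing_required_fields}
    else
      {:ok, licensee}
    end
  end
end
```

Each scraper's only job is to produce these structs; everything downstream (Ecto persistence, the GenStage pipeline) is shared.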
What I still need to do is add some decent docs and configure Travis or Circle CI to auto-run the specs. I'll get to that as soon as possible!