New to Elixir? Want to make a few bucks learning with some simple scraping projects?

UPDATE1 - See end of post

The Project

I need to scrape the content from several sites all into a common format, so that I can create a unified database of licensed physicians. Links to the various state licensing boards can be found here:
http://www.ndverify.com/license.htm

There are a few states with APIs, but many will require screen scraping.

The Pitch

Screen scraping projects are my favorite way to learn a new language, because the work is very similar to common real-world tasks: lots of HTTP work, parsing, data processing, error handling, persistence, and async work.

IMO, this is more useful than doing another Game of Life clone, or brain teaser koans which don’t require learning about the package ecosystem, external API calls, etc.

What’s more, you can learn from what I’ve figured out so far (I’ve been using Elixir for about 6 months now), by contributing to this framework.
https://github.com/jeffdeville/nhverify

This is great, because while this project uses some advanced tools like Ecto 2.1 and GenStage, you don’t have to understand them all in order to work on it. I’ve abstracted that complexity away, so you can pick it up at your own pace. What you will get into quickly is functional concepts, HTML/JSON parsing, and ExUnit.

Caveat

I have been writing software for quite a while in Ruby, Python, Go, Node, C#, and Elixir. This is me: https://linkedin.com/in/jeffdeville (feel free to send me an invite). But be aware: my name is NOT José Valim. Many of these concepts are new to me as well, so there’s every likelihood that you’ll find areas for improvement. That’d be awesome!

Future

This can be a small project, where you just implement 1 scraper, or a larger one, where you tackle several. If things work out, you can help me work on the search engine this data feeds into, payment processing and member services in a Phoenix app, or white-labeling the site for clients.

Money

Here’s the rub. This is a side project that really won’t ever pay me a ton of money, which means I don’t have a ton to pay you. This is primarily a great way to learn useful skills while contributing to a great project tuned for new learners, and making some beer money. I’m thinking ~300/site on average, though we’d have to look at each candidate to verify.

Update 1

I didn’t expect such a response! There have been several entirely reasonable questions about the details of the scraping work. So here’s some more detail:

Each of these sites follows basically the same formula: a search results page, whose links we can hard-code as starting points, and then a details page for each physician. Straightforward. The challenge is that the HTML structure is pretty wretched in many cases, so figuring out how to extract the data on the page can be a bit of a pain. For example: https://hrlb.oregon.gov/OBNM/licenseelookup/detail.asp?num=1002587&searchby=INDEXNAME&searchfor=A&stateselect=NONE This HTML isn’t even valid, and the encoding isn’t handled automatically by Floki. There aren’t many useful CSS classes to help either.
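To give a feel for the encoding problem, here is a minimal sketch of normalizing a response body to UTF-8 before it reaches the HTML parser. It is not taken from the repo. The module name `EncodingSketch` is made up for illustration, and it assumes the page bytes are ISO-8859-1 (Latin-1), which I have not verified for the Oregon site. It only uses the standard library.

```elixir
defmodule EncodingSketch do
  # Hypothetical helper, not part of nhverify.
  # Assumption: any body that is not valid UTF-8 is Latin-1 encoded.
  def to_utf8(body) when is_binary(body) do
    if String.valid?(body) do
      # Already valid UTF-8, so leave it alone.
      body
    else
      # Reinterpret each byte as a Latin-1 code point and re-encode as UTF-8.
      :unicode.characters_to_binary(body, :latin1, :utf8)
    end
  end
end

# "Jos" followed by the single Latin-1 byte for "é" (0xE9)
EncodingSketch.to_utf8(<<74, 111, 115, 233>>)
```

The `String.valid?/1` check is only a heuristic: some Latin-1 byte sequences happen to be valid UTF-8 too, so a real scraper would probably want to trust the declared charset of the page instead, if there is one.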

Your destination data model is a single struct with some minimal validation: https://github.com/jeffdeville/nhverify/blob/master/lib/licensee.ex
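If you have not seen a struct with validation before, the general shape is roughly the following. This is only a sketch, not the contents of licensee.ex: the module name `LicenseeSketch`, the field names, and the "required fields must not be blank" rule are all assumptions made up for illustration.

```elixir
defmodule LicenseeSketch do
  # Hypothetical fields; the real struct lives in lib/licensee.ex.
  @required [:name, :license_number, :state]
  @enforce_keys @required
  defstruct [:name, :license_number, :state, :status]

  # Builds the struct from a map of scraped attributes.
  # Returns {:ok, struct} or {:error, {:missing_fields, fields}}.
  # Keys that are not fields of the struct make struct!/2 raise a KeyError.
  def new(attrs) when is_map(attrs) do
    missing = Enum.filter(@required, fn key -> blank?(Map.get(attrs, key)) end)

    case missing do
      [] -> {:ok, struct!(__MODULE__, attrs)}
      _ -> {:error, {:missing_fields, missing}}
    end
  end

  # nil and whitespace-only strings count as missing.
  defp blank?(nil), do: true
  defp blank?(value) when is_binary(value), do: String.trim(value) == ""
  defp blank?(_other), do: false
end

LicenseeSketch.new(%{name: "Jane Doe", license_number: "1002587", state: "OR"})
LicenseeSketch.new(%{name: "Jane Doe", license_number: "1002587", state: " "})
```

The first call returns an `{:ok, licensee}` tuple and the second returns `{:error, {:missing_fields, [:state]}}`, so a scraper can pattern match on the result and log the bad rows instead of crashing.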

What I need to do is add some decent docs and configure Travis or CircleCI to auto-run the tests. I’ll get to that as soon as possible!

Interesting. :slight_smile: I wrote/tried to write a simple image scraper for a free image site when starting Haskell (failed) and Go (succeeded, in a non-polished way). I’ll keep your project in mind as I start to learn Elixir in case I end up thinking I may be able to help.

Thanks David! If it helps, here’s how I’ve scraped Oregon:

Code:

https://github.com/jeffdeville/nhverify/blob/master/lib/producers/oregon.ex

Tests:

https://github.com/jeffdeville/nhverify/blob/master/test/producers/oregon_test.exs

Target Site:

https://hrlb.oregon.gov/OBNM/licenseelookup/searchdir.asp?searchby=lastname&stateselect=none&searchfor=A&Submit=Search

Good luck in your travels, David. Hope to hear from you if the fit looks right.

Jeff

@jeffdeville I am pretty interested in that, and I have always liked scrapers. I already have some experience with scraping in Elixir.

Can’t give you a hard commitment because I am loyal to my employer and the job is tough and taxing, but when I get a free hour here and there, I like to poke at non-trivial problems in Elixir, just like you.

Give me a whisper. We can get something off the ground but no guarantees on speed.

Sounds interesting. I have a good amount of spare time at the moment, so I would be glad to work on something like this.

I’m really interested.

I started looking at Elixir in May, but I have given it most of my attention since August this year, so I’m almost new. I have learned lots of languages, because I was searching for a language like Elixir to match my future plans, so once I understand what to do, I just do it.

I only need a few hours or days (for example, to write some simple tests) to learn the APIs of the dependencies and your code.

What exactly do you mean by “beer money”? What currency and what beer? haha
My country’s currency is worth about 4-5 times less than the EUR and USD.
Currently I need less than 50 USD per month for my roommates (and I have already paid for the full year). Of course I don’t require 50 USD per month, but with it I could focus primarily on this.

Currently I’m working on porting an example program called Vexil from Ruby (and adding some features to the port) for a new Elixir book. From the middle of December (the developer who checks my work is busy right now) I will continue making Elixir examples for atom-beautify, as (original => expected) pairs, so developers can add Elixir language support to this Atom package.

“Beer money” would be really helpful, because I want to learn Elixir and add some Elixir-related work to my CV before I start looking for real jobs (Elixir or Ruby). I am not participating in any projects that give me some income, and money (even a little) always makes us happy.
:smile:

P.S.
The blog linked from your GitHub page is not responding.
Your server (probably nginx) doesn’t handle the www. prefix in the URL, and the /blog location returns a 404 error page from your site.
:icon_sad:

P.S.2
I took a quick look at your code and I have some suggestions for it. I can fork it and open some PRs in the next few days.

I am interested as well. I have been looking at trying out Elixir, to learn a new language and give myself some challenges. Please let us know what you have decided.

Sorry all, this post was from a year ago. Eiji did most of the scraping work. I’ll keep you all in mind if I can find some other work to do though!

Sure, thank you.
