DataQuacker - a library for parsing non-sandboxed data like CSV files

fiodorbaczynski · August 13, 2019, 11:06am

Hi all,

I often find parsing CSV files tedious. There are usually complex rules for validating and transforming the data, matching columns by index or header value, and so on. Sometimes one row in the CSV file has to be mapped to multiple elements in the output list, or some rows or fields have to be skipped entirely based on their values. On top of that, the resulting structure may have to be a nested map (in the case of associations).

That’s why I created DataQuacker, an Elixir library with a simple DSL for describing the output structure, how it is mapped from the source, and the validation and transformation rules for each field or row.

You can find the library on GitHub and the docs on HexDocs.

To get a glimpse of the DSL, take a look at this relatively simple schema example from the docs:

defmodule StudentsSchema do
  use DataQuacker.Schema

  schema :students do
    field :first_name do
      source("first name")
    end

    field :last_name do
      source("last name")
    end

    field :age do
      transform(fn age ->
        case Integer.parse(age) do
          {age_int, _} -> {:ok, age_int}
          :error -> {:error, "Invalid value #{age} given"}
        end
      end)

      source("age")
    end

    field :favourite_subject do
      validate(fn subj -> subj in ["Maths", "Physics", "Programming"] end)

      source("favourite subject")
    end
  end
end
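
To make the return values of the two callbacks in the schema concrete, here is a minimal standalone sketch that pulls them out into named functions. It runs without DataQuacker installed. The module name `StudentRules` and the function names are hypothetical and are not part of the library.

```elixir
# Hypothetical module for illustration only; not part of DataQuacker.
defmodule StudentRules do
  @subjects ["Maths", "Physics", "Programming"]

  # Same logic as the :age transformer in the schema:
  # {:ok, value} on success, {:error, message} on failure.
  def parse_age(age) do
    case Integer.parse(age) do
      {age_int, _rest} -> {:ok, age_int}
      :error -> {:error, "Invalid value #{age} given"}
    end
  end

  # Same logic as the :favourite_subject validator: returns a boolean.
  def valid_subject?(subj), do: subj in @subjects
end

IO.inspect(StudentRules.parse_age("17"))            # {:ok, 17}
IO.inspect(StudentRules.parse_age("17 years"))      # {:ok, 17}
IO.inspect(StudentRules.parse_age("n/a"))           # {:error, "Invalid value n/a given"}
IO.inspect(StudentRules.valid_subject?("Physics"))  # true
IO.inspect(StudentRules.valid_subject?("History"))  # false
```

Note that `Integer.parse/1` accepts trailing text, so "17 years" is transformed to 17. If that is not wanted, matching on `{age_int, ""}` instead would reject such values.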

There are many more features: arbitrarily nested fields, matching columns with regexes or custom functions, skipping rows, outputting multiple rows from one source row, injecting support data into validators, transformers, and the like, using metadata to produce error messages with detailed information about where an error occurred, and so on.

All of those are described in the docs along with examples and detailed explanations.

Another nice thing about the DSL is that it will give the user helpful errors at compile time if something is not right.

For now, only CSV files and Elixir data are supported as sources out of the box, but anyone can write an adapter for the source they want to use, e.g. Google Sheets. Writing one is easy; to learn more, take a look at the Adapter behaviour in the docs.

Any feedback is greatly appreciated. Please tell me what you think, especially if you find any bugs, missing functionality or unclear / incomplete documentation.