DataSchema - declarative schemas for data transformations

Data schemas are declarative descriptions of how to create a struct from some input data. You can set up different schemas to handle different kinds of input data. By default we assume the incoming data is a map, but you can configure schemas to work with arbitrary input data, including XML and JSON.

Data is selected from the input data and passed to a casting function before being set as a value under a key on the struct you want to build.

Check out the docs, guides and README for more detailed information on how it works, but below is a flavour of what you can do.

A simple struct

First, let’s assume that your input data is a map with string keys. DataSchemas really shine when working with APIs because we can quickly convert an API response into trusted Elixir data:

input = %{
  "content" => "This is a blog post",
  "comments" => [%{"text" => "This is a comment"},%{"text" => "This is another comment"}],
  "draft" => %{"content" => "This is a draft blog post"},
  "date" => "2021-11-11",
  "time" => "14:00:00",
  "metadata" => %{ "rating" => 0}
}

Now let’s define a schema to create a BlogPost struct from the above input data:

defmodule BlogPost do
  import DataSchema, only: [data_schema: 1]

  data_schema([
    field: {:content, "content", &BlogPost.to_okay_string/1}
  ])

  def to_okay_string(value) do
    {:ok, to_string(value)}
  end
end

The above is equivalent to:

defmodule StringType do
  @behaviour DataSchema.CastBehaviour

  @impl true
  def cast(value) do
    {:ok, to_string(value)}
  end
end

defmodule BlogPost do
  import DataSchema, only: [data_schema: 1]

  data_schema([
    field: {:content, "content", StringType}
  ])
end

Now that you have defined your schema you can simply call DataSchema.to_struct/2:

DataSchema.to_struct(input, BlogPost)
# => {:ok, %BlogPost{content: "This is a blog post"}}

A more complex example

You can define a few kinds of fields; see the docs for more info, but here is a more complex example introducing more field types:

defmodule DraftPost do
  import DataSchema, only: [data_schema: 1]
  data_schema(field: {:content, "content", StringType})
end

defmodule Comment do
  import DataSchema, only: [data_schema: 1]
  data_schema(field: {:text, "text", StringType})
end

defmodule BlogPost do
  import DataSchema, only: [data_schema: 1]

  @mapping [
    field: {:date, "date", &Date.from_iso8601/1},
    field: {:time, "time", &Time.from_iso8601/1}
  ]
  data_schema(
    field: {:content, "content", StringType},
    has_many: {:comments, "comments", Comment},
    has_one: {:draft, "draft", DraftPost},
    list_of: {:list_of, "comments", &{:ok, &1["text"]}},
    aggregate: {:post_datetime, @mapping, &BlogPost.to_datetime/1}
  )

  def to_datetime(%{date: date, time: time}) do
    NaiveDateTime.new(date, time)
  end
end

DataSchema.to_struct(input, BlogPost)
# The above returns:
{:ok, %BlogPost{
  list_of: ["This is a comment", "This is another comment"],
  comments: [
    %Comment{text: "This is a comment"},
    %Comment{text: "This is another comment"}
  ],
  content: "This is a blog post",
  draft: %DraftPost{content: "This is a draft blog post"},
  post_datetime: ~N[2021-11-11 14:00:00]
}}

Different Input Data - aka Are these not just embedded_schemas from Ecto?

The examples so far have shown functionality that is very similar to what you can get from Ecto’s embedded schemas and data casting capabilities. However, in DataSchema we can also provide different data accessors. This allows us to define schemas that can be cast from different kinds of input data, for example…

XML Schemas

Let’s imagine that we have some XML that we wish to turn into a struct. What would it take to enable that? First, a new XPath data accessor:

defmodule XpathAccessor do
  @behaviour DataSchema.DataAccessBehaviour
  import SweetXml, only: [sigil_x: 2]

  @impl true
  def field(data, path) do
    SweetXml.xpath(data, ~x"#{path}"s)
  end

  @impl true
  def list_of(data, path) do
    SweetXml.xpath(data, ~x"#{path}"l)
  end

  @impl true
  def has_one(data, path) do
    SweetXml.xpath(data, ~x"#{path}")
  end

  @impl true
  def has_many(data, path) do
    SweetXml.xpath(data, ~x"#{path}"l)
  end
end

Let’s define our schemas like so:

defmodule DraftPost do
  import DataSchema, only: [data_schema: 1]

  @data_accessor XpathAccessor
  data_schema([
    field: {:content, "./Content/text()", StringType}
  ])
end

defmodule Comment do
  import DataSchema, only: [data_schema: 1]

  @data_accessor XpathAccessor
  data_schema([
    field: {:text, "./text()", StringType}
  ])
end

defmodule BlogPost do
  import DataSchema, only: [data_schema: 1]

  @data_accessor XpathAccessor
  @datetime_fields [
    field: {:date, "/Blog/@date", &Date.from_iso8601/1},
    field: {:time, "/Blog/@time", &Time.from_iso8601/1}
  ]
  data_schema([
    field: {:content, "/Blog/Content/text()", StringType},
    has_many: {:comments, "//Comment", Comment},
    has_one: {:draft, "/Blog/Draft", DraftPost},
    aggregate: {:post_datetime, @datetime_fields, &NaiveDateTime.new(&1.date, &1.time)}
  ])
end

And now we can transform as above:

source_data = """
<Blog date="2021-11-11" time="14:00:00">
  <Content>This is a blog post</Content>
  <Comments>
    <Comment>This is a comment</Comment>
    <Comment>This is another comment</Comment>
  </Comments>
  <Draft>
    <Content>This is a draft blog post</Content>
  </Draft>
</Blog>
"""

DataSchema.to_struct(source_data, BlogPost)

# This will output:

{:ok, %BlogPost{
   comments: [
     %Comment{text: "This is a comment"},
     %Comment{text: "This is another comment"}
   ],
   content: "This is a blog post",
   draft: %DraftPost{content: "This is a draft blog post"},
   post_datetime: ~N[2021-11-11 14:00:00]
 }}

Data Accessor - An Access example

Let’s look back at our map version.

input = %{
  "content" => "This is a blog post",
  "comments" => [%{"text" => "This is a comment"},%{"text" => "This is another comment"}],
  "draft" => %{"content" => "This is a draft blog post"},
  "date" => "2021-11-11",
  "time" => "14:00:00",
  "metadata" => %{ "rating" => 0}
}

We could define a data accessor that looks like this:

defmodule AccessDataAccessor do
  @behaviour DataSchema.DataAccessBehaviour

  @impl true
  def field(data, path) do
    get_in(data, path)
  end

  @impl true
  def list_of(data, path) do
    get_in(data, path)
  end

  @impl true
  def has_one(data, path) do
    get_in(data, path)
  end

  @impl true
  def has_many(data, path) do
    get_in(data, path)
  end
end

Now we can define our schema:

defmodule Blog do
  import DataSchema, only: [data_schema: 1]

  @data_accessor AccessDataAccessor
  data_schema([
    list_of: {:comments, ["comments", Access.all(), "text"], &{:ok, to_string(&1)}}
  ])
end

And create a struct from the same input map as before:

DataSchema.to_struct(input, Blog)
# Returns:
{:ok, %Blog{comments: ["This is a comment", "This is another comment"]}}

This is still an early version. There are some planned features to come before a v1, but it is certainly usable as is.

Update:

Livebooks added to the repo.

I’m also now realising I don’t think I ever actually linked to the repo:

This looks interesting.

  • Can I run validations on my data?
  • Can I pass these structures to a Phoenix form as a changeset-compatible struct?

Great questions!

Validations

Can I run validations on my data?

Right now the focus is on parsing over validation. What I mean by that is, instead of doing something like this:

input = %{"name" => ""}

input
|> DataSchema.to_struct(User)
|> validate_name_not_blank()

Or even:

input = %{"name" => ""}

input
|> validate_name_not_blank()
|> DataSchema.to_struct(User)

we can define our casting function to return an :error if it receives an empty string:

defmodule NonBlankString do
  @behaviour DataSchema.CastBehaviour

  @impl true
  def cast(""), do: {:error, "Field was blank!"}
  def cast(value), do: {:ok, to_string(value)}
end

defmodule User do
  import DataSchema, only: [data_schema: 1]

  data_schema([
    field: {:name, "name", NonBlankString}
  ])
end

My current take on validations is that they are for when you can’t design away the need for them (via making illegal states unrepresentable). So the idea is that the schema defines what is valid.

HOWEVER - as you can see in the above examples, you could define your own functions before/after struct creation if you felt the need.

It’s possible that some validations can’t be expressed per field, in which case we could add some in the future.
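
In the meantime, a minimal sketch of the "after struct creation" approach mentioned above (check_adult/1 is a hypothetical cross-field check you would write yourself):

with {:ok, %User{} = user} <- DataSchema.to_struct(input, User),
     :ok <- check_adult(user) do
  {:ok, user}
end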

Phoenix Forms

There is nothing special added yet for Phoenix forms, but off the top of my head there are a few ways you could approach it. One way is to use a schemaless changeset in the form:

  def index(conn, _params) do
    types = %{name: :string}
    user = %User{}
    changeset = Ecto.Changeset.change({user, types}, %{})
    render(conn, "index.html", changeset: changeset)
  end

# With a form like this
<%= form_for @changeset, Routes.user_path(@conn, :create), fn f -> %>
  <label>
    Name: <%= text_input f, :name %>
  </label>
  <%= submit "Submit" %>
<% end %>

Then when you post the form:

def create(conn, %{"user" => user_input}) do
  case DataSchema.to_struct(user_input, User) do
    {:error, error} -> ...
    {:ok, struct} -> ...
  end
end

We could possibly make this easier by supplying a function, something like DataSchema.schemaless_changeset_from_schema(User):

  def index(conn, _params) do
    changeset = DataSchema.schemaless_changeset_from_schema(User)
    render(conn, "index.html", changeset: changeset)
  end

You’d also have to do the work of converting the error to a changeset error, which we could probably write some helper functions for, but it might be as easy as:

def create(conn, %{"user" => user_input}) do
  case DataSchema.to_struct(user_input, User) do
    {:error, %{errors: [{field, message}]}} ->
      changeset =
        {%User{}, %{name: :string}}
        |> Ecto.Changeset.cast(user_input, [:name])
        |> Ecto.Changeset.add_error(field, message)

      render(conn, changeset: changeset)

    {:ok, struct} ->
      render(...)
  end
end

My feeling is that Ecto might be more natural here, but I’m open to the use case.

Looks nice! Though to add to the discussion, here are some alternative libraries also in this space:

Thanks for sharing.

Like I say, DataSchema could be used to help with Phoenix forms, but that isn’t where it shines, because params in Phoenix forms are always maps with string keys.

A really good use case for DataSchema is talking to APIs. If the API is XML then we get Ecto-like features for parsing that XML.

with {:ok, %{status_code: 200, body: body}} <- HTTPoison.post(url, request_body) do
  DataSchema.to_struct(body, MySchema)
end

Just putting this out there, but… What I REALLY want is something similar to Ecto for JSON/XML…

Meaning I find myself needing to consume some JSON structure, e.g.:

{
    ....,
    "firewall": {
        "dnat": {
            "nat-in": [
                {
                    "dest": "loc:192.168.111.4",
                    "dport": "7",
                    "proto": "tcp",
                    "source": "net"
                },
                {
                    "dest": "loc:192.168.111.4:80",
                    "dport": "8080",
                    "origdest": "&ppp0",
                    "proto": "tcp",
                    "source": "net"
                }
            ]
        }
    },
    ....,
}

So this is a map of maps of maps, which contains an array of maps.

Now this snippet is part of a much larger JSON structure which has config for other stuff, i.e. there are other keys at the top level with their own trees underneath.

Now I want to parse chunks of this into Elixir structures, check that they’re valid before starting, present those to the user as some kind of Phoenix/LiveView form, accept back the updated params and validate them (so that I can do instant errors on screen). Finally, I want to be able to diff what changed from the original and re-apply it to the current JSON structure.

Whilst I’ve been a little over-specific about my own use case, I don’t think this is so different from a use case you likely have in mind: consume some API endpoint, present the details to the user, allow them to edit stuff, send the changes back to the API endpoint?

Things which might not be obvious from the above:

  • I need to validate fields in combination with each other, so certain params may only be valid if the :action is something specific
  • It’s REALLY boring mapping a map of maps of lists of maps into Ecto format… Ecto can only cope with representing database tables, so your map of maps ends up needing to become a list structure where you copy the keys in and out (think how you would represent it in an SQL database). It would be SOOO much easier if Ecto could understand something like a map structure in its “has_many” fields (yes, you can do custom data types like {:map, :string}, but then you lose the ability to use schemas and changesets on those fields)
  • I need to also validate the keys of the maps. They need to be sanitised and controlled for length, etc (as they may map to UI elements, etc)
  • I need some level of round trip ability
  • I need to “diff” the changes, so I can apply only the changes back upstream. I want to be granular if simultaneous changes were made to separate parts of the document. However, ideally I want to be able to re-run my “is it valid” check after re-applying those changes, as two edits might not clash individually but there might be dependencies between the key values, e.g. we might have a section for the IP range of the local network and another section for the DHCP parameters, with a schema validation between the two, as we enforce that one is within the same range as the other.

I solve this at the moment using Ecto changesets. The shape of a changeset can cross chunks of the whole JSON document if needed, to enforce cross-schema changes, i.e. a changeset “plucks out” a bunch of fields from the JSON input and kind of flattens them into the structures allowed within Ecto (lists). Then we can run our nested validations, etc. Unfortunately this then needs another function to reverse the process, as it’s not necessarily purely mechanical to reverse the original extraction. It’s also painful to represent maps of maps, as these need flattening into lists with an id column to represent the map key names (and then reversing this later).

What I desire is something like a JSON parser coupled to a generic structure validator, which in turn can be used in Phoenix forms with functioning error handling (the Phoenix error function is something you define, so it can work with any library which produces a validation output including some per-field error term).

Does this sound like a direction you are heading in?

It’s tricky to know for sure without getting my head around your use case more, but it feels like you can get a fair bit of what you want. I’d recommend having a play!

I would say that when validations come from a combination of fields, you can still wrap this up into a casting function. Let’s take a simple example: imagine you have to be over 18 to be an adult:

input = %{age: 10, type: "adult"}

defmodule User do
  import DataSchema

  @type_fields [
    field: {:age, :age, SchemaInteger},
    field: {:type, :type, SchemaString}
  ]
  data_schema([
    aggregate: {:type, @type_fields, &User.type/1}
  ])

  def type(%{age: age, type: "adult"}) when age < 18, do: :error
  def type(%{age: _, type: "adult"}), do: {:ok, :adult}
end

New version released: v0.2.3

https://hexdocs.pm/data_schema/DataSchema.html

0.2.3

Bug fix

Ensures we call Code.ensure_loaded? before checking whether a function is exported. This was causing problems when running tests.
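
For context, a minimal sketch of the pattern being described (not the library's actual internals): function_exported?/3 does not load a module, so under lazy module loading - as in the test environment - the check can wrongly return false unless the module is loaded first.

# Sketch only: make sure the module is loaded before asking if it exports cast/1.
Code.ensure_loaded?(module) and function_exported?(module, :cast, 1)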

0.2.2

Bug fix

We were not creating the nested errors correctly for has_many and has_one; now we do. We were also removing nils from :list_of fields when they were allowed; now we don’t.

0.2.1

Bug fix

Previously we could not use a :list_of field on an inline :aggregate field. This fixes that.

New Version Released!

Version 0.2.4:

Features

This release adds runtime schemas: schemas that are defined at runtime and allow for casting to existing structs or to a bare map instead of a struct. This makes it really easy to integrate with Ecto, for example to save an XML response into a DB.

See the livebook for more details: data_schema/runtime_schemas.livemd at main · Adzz/data_schema · GitHub

Here is a small example of what is possible:

defmodule User do
  use Ecto.Schema

  schema "users" do
    field :name, :string
    field :age, :integer
  end

  def update_details_from_xml(user_id, xml) do
    schema = [
      field: {:name, "/Response/User/@name", &{:ok, &1}},
      field: {:age, "/Response/User/@age", &parse_int/1}
    ]

    with {:ok, changes} <- DataSchema.to_struct(xml, %{}, schema, XpathAccessor),
         %User{} = user <- Repo.get(User, user_id),
         %{valid?: true} = changeset <- Ecto.Changeset.change(user, changes) do
      Repo.update(changeset)
    end
  end

  defp parse_int(string) do
    case Integer.parse(string) do
      {int, _} -> {:ok, int}
      _error -> :error
    end
  end
end

xml = """
<Response>
  <User name="Jeff" age="12" />
</Response>
"""
User.update_details_from_xml("123", xml)

New Version(s) Released!

0.2.9

Improvement

Allows for using an MFA tuple ({module, function, arguments}) as a casting function in a data schema. The value extracted from the input data will be set as the first argument in the arguments list.
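
For example, a minimal sketch (the Casts module and the schema here are made up for illustration; like any other cast, the function still needs to return an ok/error tuple):

defmodule Casts do
  # Called as Casts.truncate(value, 10) - the extracted value is prepended
  # to the arguments list.
  def truncate(value, max) do
    {:ok, String.slice(to_string(value), 0, max)}
  end
end

defmodule Headline do
  import DataSchema, only: [data_schema: 1]

  data_schema([
    field: {:title, "title", {Casts, :truncate, [10]}}
  ])
end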

0.2.8

Improvement

Improve the error message when the cast function does not return an okay tuple.

0.2.7

Improvement

Bump ex_doc to get newer looking docs.

0.2.6

Bug fix

Fix schema validations and error message.

We were not allowing valid schema syntax for an inline schema (i.e. a runtime schema that is provided at compile time), and our error message was wrong.

Now we correctly allow:

has_many: {:dep, "./Dep", {%{}, @place_schema}},
has_one: {:arrival, "./Arrival", {%{}, @place_schema}},

And similar.
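
For context, a hedged sketch of how one of those inline schemas might look in full (the module, fields and paths here are illustrative): the {%{}, @place_schema} tuple says "cast the nested node into a bare map using this inline field list".

defmodule Leg do
  import DataSchema, only: [data_schema: 1]

  @data_accessor XpathAccessor
  @place_schema [
    field: {:name, "./Name/text()", &{:ok, &1}}
  ]
  data_schema(
    has_one: {:arrival, "./Arrival", {%{}, @place_schema}},
    has_many: {:departures, "./Departure", {%{}, @place_schema}}
  )
end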

New Version Released

0.3.0

Improvement

We now handle errors returned from cast functions by returning a %DataSchema.Errors{} for them, which will effectively point to the field that errored. Previously we were only doing that for non-null errors for some reason! to_struct now never returns a bare :error, meaning you can immediately see what caused the error. See the example below:

defmodule Author do
  import DataSchema, only: [data_schema: 1]
  data_schema(
    field: {:name, "name", fn _ -> :error end}
  )
end

defmodule Comment do
  import DataSchema, only: [data_schema: 1]
  data_schema(
    has_one: {:author, "author", Author}
  )
end

defmodule BlagPost do
  import DataSchema, only: [data_schema: 1]

  data_schema(
    field: {:content, "content", &{:ok, to_string(&1)}},
    has_many: {:comments, "comments", Comment}
  )
end

input = %{
  "content" => "This is a blog post",
  "comments" => [
    %{"author" => %{"name" => "Ted"} }, 
    %{"author" => %{"name" => "Danson"} }
  ]
}
DataSchema.to_struct(input, BlagPost)

Will return:

{:error,
 %DataSchema.Errors{
   errors: [
     comments: %DataSchema.Errors{
       errors: [
         author: %DataSchema.Errors{errors: [name: "There was an error!"]}
       ]
     }
   ]
 }}

Why do you use the key :errors? It looks redundant, adds noise and doesn’t look useful, but I may be failing to understand its purpose.

It was “errors” because I was trying to leave room for the possibility of collecting all errors in the future.

Right now in DataSchema, if a casting function returns nil when it shouldn’t, or if it errors, we halt the transformation completely and return an error.

A different approach would be to continue with the schema but “collect” all of the errors along the way. This approach is similar to what Ecto does. In that world you may have multiple errors at each level, hence the need for :errors. But I agree it seems overcomplicated for the library as is.

I could maybe change it and return a DataSchema.Errors when (/if) we “collect” errors, and a DataSchema.Error when we “fail fast”, so to speak…

New Release! 0.3.1

This release adds three new public functions:

  • DataSchema.to_runtime_schema/1
  • DataSchema.flatten_errors/1
  • DataSchema.to_error_tuple/1

These allow for schema reflection and make working with errors easier.
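
As a hedged sketch of the idea (the exact return shapes are documented; the output below is illustrative only), flattening the nested errors from the 0.3.0 example above might look like:

{:error, errors} = DataSchema.to_struct(input, BlagPost)

DataSchema.flatten_errors(errors)
# e.g. [{[:comments, :author, :name], "There was an error!"}]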

@Exadra37 the error functions may be of interest to you.

New Version 0.4.3

Enhancements

Two new options:

  • :empty_values
  • :default

They work in combination to allow you to specify things like [] as an “empty” value for a :list_of field, for example.

You can then also specify a :default which will be used if the field is optional and resolves to an empty value.

For example:

defmodule Sandwich do
  require DataSchema

  DataSchema.data_schema([
    field: {
      :created_at,
      "inserted_at",
      &Date.from_iso8601/1,
      optional?: true, empty_values: [nil], default: &DateTime.utc_now/0
    }
  ])
end

:empty_values defaults to [nil].

We also changed it so that, for all fields, the casting function is only called if the field is not empty, meaning you don’t need to guard against nil in your cast fns.
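
As a small illustration of that last point (the schema and field are made up): with the default empty_values of [nil], the cast below is never called with nil - an empty field skips the cast and resolves to the :default instead.

defmodule Snack do
  require DataSchema

  DataSchema.data_schema([
    # String.upcase/1 would raise on nil, but it never sees nil here.
    field: {:filling, "filling", &{:ok, String.upcase(&1)}, optional?: true, default: fn -> "CHEESE" end}
  ])
end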

New Version 0.5.0

This version provides much richer information when a casting function unexpectedly raises.

Often we write cast functions as modules that implement a cast/1 function. This means that if they raise, the stacktrace unhelpfully points to that module and tells you nothing about which field in the schema blew up.

This release captures any unexpected raise in a cast function and re-raises it as a DataSchema.CastFunctionError with information about which field blew up.

This has proven very helpful for larger schemas.

The DataSchema.CastFunctionError wraps the exception that it catches, meaning you can still pattern match on it if you wish to handle some exceptions yourself, for example:

try do
  DataSchema.to_struct(my_input, MySchema)
rescue
  error in DataSchema.CastFunctionError ->
    case error do
      %DataSchema.CastFunctionError{wrapped_error: %RuntimeError{}} ->
        Logger.error("Runtime Error!")
        ...
      _other ->
        reraise error, __STACKTRACE__
    end
end

Example error message:

     ** (DataSchema.CastFunctionError)

     Unexpected error when casting value "my_input_value"
     for field :comments in this part of the schema:

     list_of: {:comments, "comments", StringType},

     Full path to field was:

                Field  :comments in MySchema
     Under Field  :metadata in MyParentSchema

     The casting function raised the following error:

     ** (RuntimeError) An error occurred!