XmlSchema - Parse and generate XML with a schema DSL

danj · October 20, 2023, 7:46pm

New! Parse (and generate) XML with a DSL that is built on top of Ecto.Schema. Makes handling XML easy if you want structs out of XML input.

On hex: XmlSchema

hexdocs: XmlSchema

An example:

defmodule Simple do
  use XmlSchema, xml_name: "a"
  xml do
    xml_tag :x, :string
    xml_tag :y, :boolean
    xml_one :z, Z do
      xml_tag :a, :string
      xml_tag :b, :string
    end
    xml_many :j, J do
      xml_tag :q, :string
    end
    xml_tag :g, {:array, :string}
  end
end

This example is illustrated in the module doc

Eiji · October 21, 2023, 1:23am

For context, let’s use your example XML from the documentation:

Example file

xml = """
<?xml encoding="utf-8" ?>
<a someattr="blue" otherattr="red">
  <x>hill</x>
  <y>false</y>
  <z>
    <a>tree</a>
    <b>bush</b>
  </z>
  <j>
    <q>cat</q>
  </j>
  <j>
    <q>dog</q>
  </j>
  <g>hippo</g>
  <g>elephant</g>
  <g>rhino</g>
</a>
"""

First of all, your code is inspired by Ecto, but it uses different naming, and far too much logic lives in one file.

Updated example schema definition

defmodule Example do
  use YourLibName.Schema

  schema "a" do
    field :x, :string
    field :y, :boolean

    embeds_one :z, Z do
      field :a, :string
      field :b, :string
    end

    embeds_many :j, J do
      field :q, :string
    end

    field :g, :string
  end
end

While I understand that using an existing Ecto schema may be impossible in some cases, I still recommend supporting such schemas, so that developers can reuse existing schemas and define their own only when needed.

Unfortunately, _attributes is a valid tag name, so even if it’s an edge case we should still support it. Therefore it’s much easier to deal with attrs and contents fully separately, i.e. we should use a map with those 2 keys.
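A minimal sketch of the collision, assuming a layout in which attributes are stored under a reserved "_attributes" key next to the child tags. Both layouts and all values here are made up for illustration; this is not taken from the library's internals.

```elixir
# Mixed layout (assumed): attributes share a map with the child tags,
# so a real <_attributes> tag overwrites the attribute map.
mixed =
  %{"_attributes" => %{"someattr" => "blue"}}
  |> Map.put("x", "hill")
  |> Map.put("_attributes", "text of a real <_attributes> tag")

# Separate layout: attributes and children never share a namespace.
separate = %{
  attrs: %{"someattr" => "blue"},
  children: [x: "hill", _attributes: "text of a real <_attributes> tag"]
}

# In the mixed layout the attribute map has been replaced by the tag text.
IO.inspect(mixed["_attributes"])
# In the separate layout the attribute is still reachable.
IO.inspect(separate.attrs["someattr"])
```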

Aggregation is really helpful, but not always desired. Regardless of what your defaults are (if any), I would recommend adding an option to disable or enable it. Without aggregation, however, we cannot generate maps, because the same key may repeat. Since we have keyword lists that’s really not a big deal, and they also preserve order, which is really important in a few cases, especially when we want to re-encode the XML document.
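To illustrate why keyword lists fit here, a small sketch follows. AggregateSketch and aggregate_adjacent/1 are hypothetical names, not part of any library; the helper only shows one possible way to collapse adjacent siblings.

```elixir
defmodule AggregateSketch do
  # Hypothetical helper: collapses each run of adjacent entries that share a
  # key into one {key, [values]} entry; lone entries are left untouched.
  def aggregate_adjacent(children) do
    children
    |> Enum.chunk_by(fn {key, _value} -> key end)
    |> Enum.map(fn
      [single] -> single
      [{key, _} | _] = run -> {key, Enum.map(run, fn {_key, value} -> value end)}
    end)
  end
end

children = [x: "hill", g: "hippo", g: "elephant", g: "rhino"]

# A keyword list keeps document order and repeated keys.
IO.inspect(Keyword.get_values(children, :g))
# A map keeps only one value per key, so repeated tags are lost.
IO.inspect(Map.new(children))
# Aggregation turns the run of :g entries into a single list value.
IO.inspect(AggregateSketch.aggregate_adjacent(children))
```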

The same goes for whitespace characters. I even gave a real-world example for the floki library in the issue “Floki removes blank text nodes without option to avoid this” (#75).
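As a rough sketch of such an option, assuming text nodes are stored in the children list under a :_ key (an assumption of this sketch, as are the module and function names):

```elixir
defmodule WhitespaceSketch do
  # Hypothetical helper: drops text nodes (entries under the :_ key) that
  # contain only whitespace; every other entry is kept in its original order.
  def skip_empty_text_nodes(children) do
    Enum.reject(children, fn
      {:_, text} when is_binary(text) -> String.trim(text) == ""
      _other -> false
    end)
  end
end

children = [_: "\n  ", x: "hill", _: "\n  ", y: false, _: " note ", _: "\n"]

# Whitespace-only text nodes disappear, meaningful text (" note ") stays.
IO.inspect(WhitespaceSketch.skip_empty_text_nodes(children))
```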

Here are some examples I have prepared:

YourLibName.decode!(xml, aggregate_adjacent_siblings: false, skip_empty_text_nodes: false)

%Example{
  __meta__: %LibName.Schema.Metadata{schema: Example, source: "inline"},
  attrs: %{"otherattr" => "red", "someattr" => "blue"},
  children: [
    _: "\n  ",
    x: "hill",
    _: "\n  ",
    y: false,
    _: "\n  ",
    z: %Example.Z{
      __meta__: %LibName.Schema.Metadata{schema: Example.Z, source: "inline"},
      attrs: [],
      children: [_: "\n    ", a: "tree", _: "\n    ", b: "bush", _: "\n  "]
    },
    _: "\n  ",
    j: [
      _: "\n    ",
      %Example.J{
        __meta__: %LibName.Schema.Metadata{schema: Example.J, source: "inline"},
        attrs: [],
        children: [q: "cat"]
      },
      _: "\n    ",
      %Example.J{
        __meta__: %LibName.Schema.Metadata{schema: Example.J, source: "inline"},
        attrs: [],
        children: [q: "dog"]
      },
      _: "\n  "
    ],
    _: "\n  ",
    g: "hippo",
    _: "\n  ",
    g: "elephant",
    _: "\n  ",
    g: "rhino",
    _: "\n"
  ]
}

YourLibName.decode!(xml, aggregate_adjacent_siblings: false, skip_empty_text_nodes: true)

%Example{
  __meta__: %LibName.Schema.Metadata{schema: Example, source: "inline"},
  attrs: %{"otherattr" => "red", "someattr" => "blue"},
  children: [
    x: "hill",
    y: false,
    z: %Example.Z{
      __meta__: %LibName.Schema.Metadata{schema: Example.Z, source: "inline"},
      attrs: [],
      children: [a: "tree", b: "bush"]
    },
    j: [
      %Example.J{
        __meta__: %LibName.Schema.Metadata{schema: Example.J, source: "inline"},
        attrs: [],
        children: [q: "cat"]
      },
      %Example.J{
        __meta__: %LibName.Schema.Metadata{schema: Example.J, source: "inline"},
        attrs: [],
        children: [q: "dog"]
      },
    ],
    g: "hippo",
    g: "elephant",
    g: "rhino"
  ]
}

YourLibName.decode!(xml, aggregate_adjacent_siblings: true, skip_empty_text_nodes: false)

%Example{
  __meta__: %LibName.Schema.Metadata{schema: Example, source: "inline"},
  attrs: %{"otherattr" => "red", "someattr" => "blue"},
  children: [
    _: "\n  ",
    x: "hill",
    _: "\n  ",
    y: false,
    _: "\n  ",
    z: %Example.Z{
      __meta__: %LibName.Schema.Metadata{schema: Example.Z, source: "inline"},
      attrs: [],
      children: [_: "\n    ", a: "tree", _: "\n    ", b: "bush", _: "\n  "]
    },
    _: "\n  ",
    j: [
      _: "\n    ",
      %Example.J{
        __meta__: %LibName.Schema.Metadata{schema: Example.J, source: "inline"},
        attrs: [],
        children: [q: "cat"]
      },
      _: "\n    ",
      %Example.J{
        __meta__: %LibName.Schema.Metadata{schema: Example.J, source: "inline"},
        attrs: [],
        children: [q: "dog"]
      },
      _: "\n  "
    ],
    _: "\n  ",
    g: "hippo",
    _: "\n  ",
    g: "elephant",
    _: "\n  ",
    g: "rhino",
    _: "\n"
  ]
}

YourLibName.decode!(xml, aggregate_adjacent_siblings: true, skip_empty_text_nodes: true)

%Example{
  __meta__: %LibName.Schema.Metadata{schema: Example, source: "inline"},
  attrs: %{"otherattr" => "red", "someattr" => "blue"},
  children: [
    x: "hill",
    y: false,
    z: %Example.Z{
      __meta__: %LibName.Schema.Metadata{schema: Example.Z, source: "inline"},
      attrs: [],
      children: [a: "tree", b: "bush"]
    },
    j: [
      %Example.J{
        __meta__: %LibName.Schema.Metadata{schema: Example.J, source: "inline"},
        attrs: [],
        children: [q: "cat"]
      },
      %Example.J{
        __meta__: %LibName.Schema.Metadata{schema: Example.J, source: "inline"},
        attrs: [],
        children: [q: "dog"]
      },
    ],
    g: ["hippo", "elephant", "rhino"]
  ]
}

Those are the simplest examples, but there are other things to cover:

What to do when the XML document has tags we don’t declare in the schema (for example, because we don’t need their data)? You could add a strict_document_structure boolean option, so that, say, documents updated for a newer version of some standard (like a newer XML API) can still be parsed when it is disabled.
How to properly deal with attributes? What if we expect some URL that is supposed to be in an attribute? Should strict_document_structure also apply to attributes? You definitely need some attr DSL.
There is no information about comments. Since we can encode XML back, we most probably want the document written out again without anything missing.
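One way such a strict_document_structure option could behave is sketched below. The module, the function and the return shapes are hypothetical; only the option name comes from the suggestion above.

```elixir
defmodule StrictSketch do
  # Hypothetical check: `children` is a keyword list of parsed child tags and
  # `declared` is the list of tag names known to the schema.
  def check_tags(children, declared, opts \\ []) do
    strict? = Keyword.get(opts, :strict_document_structure, false)

    unknown =
      children
      |> Keyword.keys()
      |> Enum.uniq()
      |> Enum.reject(&(&1 in declared))

    cond do
      unknown == [] -> {:ok, children}
      # Strict mode: undeclared tags are an error.
      strict? -> {:error, {:unknown_tags, unknown}}
      # Lenient mode: undeclared tags are silently dropped.
      true -> {:ok, Keyword.drop(children, unknown)}
    end
  end
end

children = [x: "hill", extra: "from a newer API version"]

IO.inspect(StrictSketch.check_tags(children, [:x]))
IO.inspect(StrictSketch.check_tags(children, [:x], strict_document_structure: true))
```

The same check could be applied to attribute names if attributes get their own declarations.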

Finally some ideas/questions about your code:

The links in the documentation don’t work. ex_doc falls back to the default branch, which is main. They should point to a specific version (such as a git tag).
I have no idea about Erlang’s XML parsers. It’s obvious why you didn’t write your own, but why did you choose erlsom over Erlang’s xmerl?

github.com/danj3/xml_schema/blob/fa9d3669d1a1247e20a1c936f3b89cf32637006f/lib/xml_schema.ex#L367-L370

def module_tail_to_string(module) do
  [name | _] = Module.split(module) |> Enum.reverse()
  name
end

The above code can be written much more simply: module |> Module.split() |> List.last()
Every public function should be documented. Many developers may give up at this point; some may try to check the links, but then we’re back at the first point.
mix format is your friend. If you are still lonely, credo is another one. Even if you want to do everything yourself, credo has its own style guide.
The support directory name is not bad, but fixtures is more common for your case. The former is generic, and when developers see it, the first thing that comes to mind is Phoenix stuff. fixtures is more explicit.
File.read calls are not the best choice if you can do the read at compile time.

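# Inside a module body: read each fixture once, at compile time.
for name <- ~w[first second third] do
  path = Path.join([__DIR__, "fixtures", name <> ".xml"])
  # Recompile this module whenever the fixture file changes.
  @external_resource path
  xml = File.read!(path)
  def get_xml(unquote(name)), do: unquote(xml)
end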

You can keep both your fixtures and the XML in the same directory, or even in the same file. That’s even better than an extra File.read!/1 call:

defmodule MyAppTest.MyFixture do
  # DSL comes here

  def get_xml do
    """
    <?xml version="1.0" encoding="UTF-8" ?>
    <!-- Employee Information-->
    """
  end
end

You can extend the idea above and add a function with the expected data, i.e. the output of parsing the XML document. Then 99% of your tests look like:

defmodule MyAppTest do
  use ExUnit.Case

  alias MyAppTest.Fixtures

  for fixture <- [Fixtures.First, Fixtures.Second, Fixtures.Third] do
    test "parses #{inspect(unquote(fixture))}" do
      fixture = unquote(fixture)
      xml = fixture.get_xml()
      assert parse(xml) == fixture.get_expected_data()
    end
  end

  # the rest are
  # edge cases
  # error handling
  # and so on …
end

Look for inspiration. What I’ve written above doesn’t come from my own head. Both metaprogramming and naming are well covered in the floki, jason, ecto and elixir documentation.

danj · January 12, 2024, 2:09am

Released to hex, Version 1.3, improved docs, some attribute handling fixes, better generation support for custom types and arrays, more tests and generation of document examples from tests.

danj · January 29, 2024, 3:03am

Update to 1.3.0 with improved docs and some refactoring. XML can be easy! But it isn’t; this only makes it easier.

danj · September 21, 2024, 5:22pm

xml_schema 2.0.2 released! Use xml as bi-directional structs (extending ecto schema).

In this update:

handle new ecto type internals
change tag transform, fixes atom exhaustion when working with permutating xml tags
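For readers unfamiliar with atom exhaustion: atoms are never garbage collected, so converting arbitrary tag names to atoms can eventually fill the atom table. The sketch below shows one common way to avoid that. It is not the library's actual change, and TagSketch.safe_tag/1 is a hypothetical name.

```elixir
defmodule TagSketch do
  # Hypothetical helper, not the library's code: only reuse atoms that already
  # exist (for example, field names declared in a schema). Unknown tag names
  # stay strings, so arbitrary input cannot grow the atom table.
  def safe_tag(name) when is_binary(name) do
    String.to_existing_atom(name)
  rescue
    ArgumentError -> name
  end
end

# :ok already exists, so the atom is returned.
IO.inspect(TagSketch.safe_tag("ok"))
# No such atom exists, so the original string is returned unchanged.
IO.inspect(TagSketch.safe_tag("tag_never_declared_anywhere_9f2c"))
```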

xml schema 2.0.2 on hex

FeeJai · October 28, 2024, 5:19pm

I just tried that and had a couple of issues with XML namespaces. Is there a way to add namespace support?

danj · October 28, 2024, 5:51pm

Namespaces are supported, although not documented. For any tag with a namespace, an attribute of _ns will be set with the namespace. If this doesn’t work for you please detail your situation.

FeeJai · October 28, 2024, 9:02pm

I have found the _ns attribute, but this fails on saving to xml. My case is even more complex, though.

I am working on xml files used by tax authorities. They come with xsd files for validation. A good example is this format defined by the OECD (user guide and xsds in the link): OECD temporary archive

The file I am trying to generate should then look like this one: oecd_cbcr_example.xml · GitHub

This requires two different namespace prefixes in the tags, to stay compliant with the standard (“cbc” and “cbcstf” in my example). Any way to do that?

danj · October 30, 2024, 11:27pm

The generation side doesn’t use the _ns attributes to add namespaces on output at this point; the information in the attribute is simply ignored. I have some thoughts about how to produce the namespaced XML you’re looking for, but I don’t currently have the time to put toward it.

Created issue namespace output currently unsupported · Issue #1 · danj3/xml_schema · GitHub

FeeJai · November 1, 2024, 9:44pm

Thank you. I might find some time to read through your code this weekend.