Data serialization formats

I’d like to start a discussion of data serialization formats, in the context of Elixir. The rest of this note is a combination of personal opinions and links to useful resources; feel free to jump in with your own clues, pointers, reactions, war stories, etc. (ducking…)

edn

edn (extensible data notation) is a subset of Clojure, extracted by Rich Hickey. edn has a rich set of built-in data types, most of which are a good match for Elixir. In addition, it has a mechanism for extending this set with custom data types.

However, because edn is closely tied to Clojure, Transit (see below) may be a better choice for interoperability. For details, see edn’s GitHub page and eden’s Hex page.

JSON

JSON is a subset of JavaScript, extracted by Douglas Crockford. Although JSON shines at interoperability and standardization, its data types are very limited and JavaScript-specific: objects, arrays, strings, numbers, booleans, and null. So, for example, I wouldn’t recommend it for cases where one needs to retain and/or transmit more specific data types.
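To make the data type limitation concrete, here is a minimal sketch of a JSON round trip. It uses Python only because its standard library ships a JSON codec; the sample data is invented for illustration. JSON has no tuple or atom type either, so Elixir terms face the same kind of loss.

```python
import json

# Hypothetical sample data: a tuple value and an integer map key.
original = {"point": (1, 2), 1: "integer key"}

# Encode to JSON text and decode it again.
round_tripped = json.loads(json.dumps(original))
print(round_tripped)

# The tuple comes back as a list, and the integer key comes back as a string.
tuple_survived = isinstance(round_tripped["point"], tuple)
int_key_survived = 1 in round_tripped
```

Both flags end up false: the structure survives, but the specific types do not.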

JSON is also poorly suited for generating human-readable documents. It’s possible to include comments by using data elements, but this is a hack. And, although it’s quite possible to format JSON nicely, many programmers don’t make the effort. So, a lot of JSON “in the wild” is difficult for humans to read.
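A small sketch of both points, again in Python and with made-up keys: the `_comment` field is just a naming convention (an assumption here, not part of any standard), and the only difference between the readable and unreadable output is an optional argument that is easy to skip.

```python
import json

# The "comment" is ordinary data under a conventional (hypothetical) key name.
config = {"_comment": "retry count for flaky uploads", "retries": 3}

# Same data, two renderings: dense single line vs. indented.
compact = json.dumps(config, separators=(",", ":"))
pretty = json.dumps(config, indent=2)
print(compact)
print(pretty)

# A real comment is a syntax error for a strict JSON parser.
try:
    json.loads('{"retries": 3 /* not allowed */}')
    real_comment_accepted = True
except json.JSONDecodeError:
    real_comment_accepted = False
```

Note that every consumer still receives the `_comment` entry as data and has to know to ignore it, which is why this counts as a hack.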

JSON-LD (JavaScript Object Notation for Linked Data) is a JSON-based method of encoding linked data. Thus, it can take the place of RDF-encoding formats such as N-Triples, RDF/XML, and Turtle.

TOML

TOML is an acronym for “Tom’s Obvious, Minimal Language”, referring to its creator, Tom Preston-Werner. Although TOML has very limited data types, it excels at generating human-readable documents.

Because the top of each “section” (i.e., sub-tree) can be encoded as a path, TOML works well for encoding deeply-nested hierarchical structures:

[a.b.c.d]
  e = 42

Transit

Transit is conceptually similar to edn, in that it is an extensible format with strong data type capabilities. However, it is considerably less tied to the Clojure language. Also, its “wire format” uses JSON or MessagePack. For details, see transit_elixir’s Hex page.
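
The general idea of carrying richer types over a JSON wire format can be sketched as follows. This is a toy, not the Transit encoding itself: the `~t` prefix is only loosely modelled on Transit's tagged strings, the helper names are hypothetical, and the real format has escaping rules and many more tags (see its spec).

```python
import json
from datetime import datetime, timezone

TAG = "~t"  # assumption: toy tag marking an ISO 8601 timestamp string


def encode(value):
    """Replace datetimes with tagged strings so plain JSON can carry them."""
    if isinstance(value, datetime):
        return TAG + value.isoformat()
    if isinstance(value, dict):
        return {key: encode(item) for key, item in value.items()}
    if isinstance(value, list):
        return [encode(item) for item in value]
    return value


def decode(value):
    """Turn tagged strings back into datetimes; leave everything else alone."""
    # Caveat: an ordinary string that happens to start with "~t" would be
    # misread here. A real format needs an escape mechanism for that case.
    if isinstance(value, str) and value.startswith(TAG):
        return datetime.fromisoformat(value[len(TAG):])
    if isinstance(value, dict):
        return {key: decode(item) for key, item in value.items()}
    if isinstance(value, list):
        return [decode(item) for item in value]
    return value


message = {"event": "login", "at": datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)}
wire = json.dumps(encode(message))   # plain JSON text on the wire
restored = decode(json.loads(wire))  # datetime recovered on the other side
```

Any JSON tool can still read `wire`, but only a reader that knows the tag convention gets the timestamp back as a timestamp.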

YAML

YAML (“YAML Ain’t Markup Language”) is generally well suited to writing by humans, although the need for multiple levels of indentation can become an issue for deeply nested trees. Also, the syntax definition is rather large, so reading some YAML documents can be difficult. Finally, because YAML “in the wild” isn’t well standardized, interoperability can be an issue.

9 Likes

I had to serialize Elixir data structures before pushing them to a Kafka topic, so I used :erlang.term_to_binary/1. The performance was good, and I was easily able to deserialize the data on the consumer side (also running Elixir). This function is also useful when working with C NIFs.

I noticed that I can push raw Elixir data structures to RabbitMQ without any serialization at all, probably because it is written in Erlang.

When dealing with different technologies I’ve found JSON the easiest to work with because of the wide use and support it has.

3 Likes

What about ASN.1? It is part of OTP, and it is a fully fledged, standardised implementation for binary encoding/decoding. If you’ve never used it, protobufs is basically a reimplementation of a small set of features from ASN.1.

While very few people use this protocol, I think it has very big potential in systems where data consistency matters.

The only issue currently is that using it from Elixir is very hard; it needs a wrapper with updated documentation.

3 Likes

I use MessagePack for internal communication over HTTP because it is smaller than JSON and that sort of thing floats my boat.

1 Like

Joe Armstrong had a lot to say about serialization formats, although he mostly couched it in terms of wire protocols, etc. In his talk, The How and Why of Fitting Things Together, he talks about why we need a way to ensure that the bits going down the wire obey a “contract”.

At about this point in the talk, he talks about the need for formal protocol definitions and briefly mentions UBF, his (proposed) “Universal Binary Format”. When I went off to look for information on UBF, I found the following URLs:

The “UNMAINTAINED” tag is a bit troubling, but maybe it doesn’t reflect actual adoption. So, is anyone actually using UBF? If so, could you provide some useful feedback on it? Inquiring gnomes need to mine…

-r

1 Like