BERT-ng (temporary name, to be changed) - new version of BERT-RPC spec (WIP)

etf
bert
#1

Some of you may know that there was (is?) something called BERT-RPC, which in short is a simplified version of the External Term Format (the format produced by :erlang.term_to_binary/1). I am trying to revive this standard a little with my small project, BERT-ng, which is meant to be a simplified and standardised subset of ETF that can be implemented in other languages.

Let me know what you think and what kind of improvements you would like to see in it. I will also try to provide some implementations in different languages for the sake of interoperability.

Oh, and by the way, I am looking for a better name; if you have any suggestions, I am open to them.

#2

This is great! Let me start by saying that we are planning to send Erlang Term Format from channels in Phoenix v1.5, as that will use less CPU and save bandwidth. So work on this is definitely welcome as well as an efficient decoder in JS (the client will continue sending JSON though).

With that said and since you are planning to drop RPC anyway, I would call it SETF - Sequential Erlang Term Format. It is sequential because it does not include PIDs and others.

With that said, my take is:

  • Should we support atoms?

Yes.

  • Should we support byte lists (aka Erlang strings)?

They should just be a list of integers, i.e. they should not get any special treatment. (By the way, Erlang strings are lists of chars.)

  • Should we allow defining improper lists?

I am personally indifferent.

#3

My problem with atoms is that decoding ETF with the safe flag set disables all atom decoding, and this is the approach that I have taken here. Additionally, atoms can be encoded either via the atom cache or in inline form; I wanted to remove atoms to avoid this ambiguity.

I agree. The question is there because ETF has a special form for lists of bytes, and I do not know whether we should support this syntax as well.

#4

That’s awesome news!

Do you plan to optionally support Erlang Term Format? Some projects do not care how much work the client needs to do, but bandwidth is still a big problem for people who do not have a fiber connection. I’m in a small city in Poland and I could get fiber internet only after relocating. In my old flat there is still no fiber connection.

Yes, but developers in other languages could be confused, as the name is sometimes different (for example, symbols), so I think that atoms/symbols should both be mentioned there.

I agree with @josevalim, because I don’t like overcomplicated solutions.

I’m not sure, because I did not read about similar things when I learned the basics of other languages (which did not surprise me, as for some reason I mostly look at OOP languages). If it has value for even 1% of them, then I would add it; if not, then again we do not need the extra complication.

I’m sure that Erlang supports it. If other languages based on Erlang also support it, then I would definitely say yes.

#5

I am in a big city in Poland and I totally understand your pain, as I have no fiber either (recently there were pamphlets saying that it will be possible soon, though), and I live 1 km away from the HQ of one of the biggest IT companies in the country.

The point was not to drop support for [1, 2, 3] altogether, but to avoid the “special string syntax” (STRING_EXT) for such lists and instead force implementations to always use the “full list syntax” (aka LIST_EXT).
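To make the difference concrete, here is a minimal Python sketch of the two wire forms of [1, 2, 3]. It assumes the standard ETF tag values (version byte 131, SMALL_INTEGER_EXT = 97, NIL_EXT = 106, STRING_EXT = 107, LIST_EXT = 108); the function names are made up for illustration and are not part of any spec.

```python
import struct

VERSION = 131
SMALL_INTEGER_EXT = 97
NIL_EXT = 106
STRING_EXT = 107
LIST_EXT = 108


def encode_byte_list_compact(values):
    """STRING_EXT form: tag, 2-byte big-endian length, then the raw bytes."""
    return bytes([VERSION, STRING_EXT]) + struct.pack(">H", len(values)) + bytes(values)


def encode_byte_list_full(values):
    """LIST_EXT form: tag, 4-byte length, each element tagged, NIL_EXT tail."""
    out = bytes([VERSION, LIST_EXT]) + struct.pack(">I", len(values))
    for v in values:
        if not 0 <= v <= 255:
            raise ValueError("this sketch only handles integers in 0..255")
        out += bytes([SMALL_INTEGER_EXT, v])
    return out + bytes([NIL_EXT])


print(list(encode_byte_list_compact([1, 2, 3])))
print(list(encode_byte_list_full([1, 2, 3])))
```

The compact form is 7 bytes and the full form is 13 bytes here, so always using LIST_EXT trades some size for having a single list encoding to implement. The sketch ignores the empty list, which ETF encodes as a bare NIL_EXT.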

All of them support it; in Elixir it is as simple as [a | b] where not is_list(b). My problem is that something like that is supported only in dynamically typed functional languages (Erlang and Lisps), and I have never encountered a non-functional language that supports it. That is why I disallowed improper lists and force all lists to be proper.
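On the decoding side, "all lists must be proper" could look roughly like the Python sketch below. It is only an illustration, assuming the standard ETF tags and handling nothing but small integers and lists; decode_term is a hypothetical name, and the version byte is left out for brevity.

```python
import struct

SMALL_INTEGER_EXT = 97
NIL_EXT = 106
LIST_EXT = 108


def decode_term(buf, pos):
    """Decode one term starting at pos; return (value, next_pos)."""
    tag = buf[pos]
    if tag == SMALL_INTEGER_EXT:
        return buf[pos + 1], pos + 2
    if tag == NIL_EXT:
        return [], pos + 1
    if tag == LIST_EXT:
        (count,) = struct.unpack_from(">I", buf, pos + 1)
        pos += 5
        items = []
        for _ in range(count):
            item, pos = decode_term(buf, pos)
            items.append(item)
        # A proper list always ends with NIL_EXT; anything else is the
        # tail of an improper list, which this sketch refuses.
        if buf[pos] != NIL_EXT:
            raise ValueError("improper list rejected")
        return items, pos + 1
    raise ValueError(f"unsupported tag {tag}")


proper = bytes([108, 0, 0, 0, 2, 97, 1, 97, 2, 106])  # [1, 2]
improper = bytes([108, 0, 0, 0, 1, 97, 1, 97, 2])     # [1 | 2]
print(decode_term(proper, 0)[0])
```

Calling decode_term(improper, 0) raises ValueError, because the byte after the single element is a small integer rather than NIL_EXT.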

#6

Yeah, I got that from the start and I still share José’s opinion. As long as we can pass an Erlang string as a list of integers, we don’t need to overcomplicate the standard and add an extra case for them. Some people might think a special type would help others understand what the server means, but that is something that belongs in @spec, and optionally in @doc, rather than in a special STRING_EXT part of the standard.

That’s really simple … On socket join, save a boolean flag from params to the socket assigns, like: %{"improper_list_optimization" => true}, so every channel can call a helper function which ensures that improper lists are passed or not based on that flag. Maybe it could even be included in Phoenix itself, in the encoding process …

#7

As long as you document that unknown atoms may be refused, I think you should be fine? And no atoms cache, for simplicity.

Ah, I see. Then yes, we need to handle it if you want to leverage term_to_binary out of the box. :frowning:

Not sure. But note that all of this can already be achieved today too with custom serializers.

#8

I still do not think that this is a good idea. Instead, I would prefer to encode all atoms as strings, as this will simplify implementation in other languages; for example, JavaScript does not have an atom equivalent (symbols are immutable and each call creates a new one, so Symbol('foo') != Symbol('foo'), as each of them is created separately and there is no way to get an “existing symbol” with a given name).

#9

Hmm? Safe just means it won’t create ‘new’ atoms but it still accepts existing loaded atoms, which is good for ‘keys’ and such that are matched on. :slight_smile:
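A decoder in another language could mimic that behaviour with an allowlist. The Python sketch below is only an illustration: a Python client has no atom table, so the KNOWN_ATOMS set is a hypothetical stand-in for "already existing atoms", and decode_atom is a made-up name. The tag values are the standard ETF ones.

```python
import struct

ATOM_EXT = 100             # 2-byte length, Latin-1 name (legacy)
SMALL_ATOM_EXT = 115       # 1-byte length, Latin-1 name (legacy)
ATOM_UTF8_EXT = 118        # 2-byte length, UTF-8 name
SMALL_ATOM_UTF8_EXT = 119  # 1-byte length, UTF-8 name

# Hypothetical allowlist standing in for the atoms the application knows.
KNOWN_ATOMS = {"ok", "error", "true", "false", "nil"}


def decode_atom(buf, pos, known=KNOWN_ATOMS):
    """Decode an inline atom at pos; refuse names outside the allowlist."""
    tag = buf[pos]
    if tag in (ATOM_EXT, ATOM_UTF8_EXT):
        (size,) = struct.unpack_from(">H", buf, pos + 1)
        start = pos + 3
    elif tag in (SMALL_ATOM_EXT, SMALL_ATOM_UTF8_EXT):
        size = buf[pos + 1]
        start = pos + 2
    else:
        raise ValueError(f"not an atom tag: {tag}")
    encoding = "latin-1" if tag in (ATOM_EXT, SMALL_ATOM_EXT) else "utf-8"
    name = buf[start:start + size].decode(encoding)
    if name not in known:
        raise ValueError(f"unknown atom refused: {name!r}")
    return name, start + size


print(decode_atom(bytes([SMALL_ATOM_UTF8_EXT, 2]) + b"ok", 0))
```

Known keys still decode, while an unexpected name is rejected instead of growing some table without bound.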

Modern JavaScript actually does have an atom-ish Symbol type, which could be a useful binding between the two in supported browsers (all evergreen browsers nowadays)…

Sentinel/end-of-list values that also carry data are, though not common, not unheard of, so supporting improper lists would be useful. In JS an improper list could be represented by a wrapped final element instead of it being ‘naked’, unless you use a full custom ‘list’ type that is actually a proper linked list (2-tuple cons cells in JavaScript), in which case you can just use that directly instead of pretending that lists on the wire are just JS arrays (which do have different semantics).

Yep, you’d need to hold a cache of that, which is a good thing to do anyway as you could have a library on both sides and actually use the atom cache part of ETF to shrink the on-wire size of the message too!

#10

Yeah, I am just wondering whether there would be much need for that. My main point is that it is supposed to be an inter-language format, and there are languages that do not have the concept of a “symbol” at all. So would it be worth supporting atoms?

Yes, they have different semantics, but as I said earlier, I want it to be inter-language, so I was thinking about using whatever sequential data container the target language has; for JS/Rust/etc. that would be an array.

That would be quite painful in the end, as it would require passing a map of expected symbols with the appropriate Symbol instances. I wanted to avoid that problem by disallowing atoms in general and forcing a strings-only approach.

#11

Eh, if someone wants their on-wire format to be language agnostic, then using a base representation of things is fine, so JSON is perfectly fine for that. Using something like BERT/ETF is not even always a space optimization; rather, it is a type optimization, so you can send types over more accurately. So I’d always opt for the most ‘accurate’ representation possible, falling back to wrappers where needed: for atoms in languages that don’t have an atom/symbol/etc. type that makes sense to map to, have it be some wrapped string type called an ‘atom’ instead. Most languages even allow such wrapping with no overhead (Rust/OCaml/Kotlin/C/C++/etc…). There are huge semantic differences, even if not big encoding differences, between things like atoms and strings, or strings and charlists, or a proper list, an improper list and an array, and those semantic differences can be very important for data encoding purposes.

If someone then wants an atom, they need to wrap it up in a tuple or list and state it as such or something, so you are losing expressibility. Holding a cache is not really painful at all; I’ve done it in both Python and C++ with ETF data transfer so as to minimize network load. Each side just sends the full atom as normal the first time, marking in its cache that it has done so along with the registration id for it, and then just sends the atom id after that. It’s a simple check: if in-cache then send-id, else add-to-cache-and-send-atom-and-id. There is no need to send a map of expected symbols, although that is entirely an option for optimization reasons; it’s not something I did myself, I just sent the mappings on demand.
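As a rough illustration of the wrapper idea, here is a minimal Python sketch. The Atom class and describe function are hypothetical names, and in Python the wrapper is not free the way a Rust or C++ newtype can be; the point is only that the decoded value keeps its "atom-ness".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    """Hypothetical wrapper marking a name as an atom rather than a string."""
    name: str


def describe(term):
    """Show that atoms and strings stay distinguishable after decoding."""
    if isinstance(term, Atom):
        return f"atom {term.name}"
    if isinstance(term, str):
        return f"string {term!r}"
    raise TypeError(f"unsupported term: {term!r}")


print(describe(Atom("ok")))
print(describe("ok"))
print(Atom("ok") == Atom("ok"), Atom("ok") == "ok")
```

Two atoms with the same name compare equal and can be used as dictionary keys, while an atom never compares equal to the plain string with the same characters.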
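The send-once-then-reference scheme described above can be sketched in a few lines of Python. This is not the atom cache wire format from the Erlang distribution protocol; the message tuples and the class names AtomSender and AtomReceiver are assumptions made purely for illustration.

```python
class AtomSender:
    """Sends the full atom the first time and only its id afterwards."""

    def __init__(self):
        self._ids = {}

    def encode(self, name):
        if name in self._ids:
            return ("ref", self._ids[name])
        atom_id = len(self._ids)  # next free registration id
        self._ids[name] = atom_id
        return ("new", atom_id, name)


class AtomReceiver:
    """Learns id-to-name mappings on demand as messages arrive."""

    def __init__(self):
        self._names = {}

    def decode(self, message):
        if message[0] == "new":
            _, atom_id, name = message
            self._names[atom_id] = name
            return name
        _, atom_id = message
        return self._names[atom_id]


sender, receiver = AtomSender(), AtomReceiver()
for atom in ["ok", "error", "ok"]:
    wire = sender.encode(atom)
    print(wire, "->", receiver.decode(wire))
```

Both sides have to keep state for the lifetime of the connection, which is the trade-off against a stateless, inline-only encoding.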

#12

This is 100% something I do not want. I do not want the atom cache that is present in the original ETF spec. If you want something like that, then BERT-ng isn’t the solution for you. BERT-ng is more a replacement for msgpack than a full standardisation of ETF.

#13

Ah, I didn’t realize it was meant more as a msgpack replacement. What does it bring over msgpack in that case, especially considering how much support msgpack has now?

#14

Fewer types and fewer possible encodings, no bit mangling for storing 7-bit integers, a TLV form of encoding used almost everywhere, and built-in support for big ints. Other than that, not much, except that it is literally a subset of ETF, so encoding is as simple as validating the data and calling :erlang.term_to_binary/1.
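To show what the tagged integer forms and the built-in big ints look like to a client, here is a minimal Python decoding sketch. It assumes the standard ETF tags for integers; decode_integer is a hypothetical name, and error handling for truncated input is omitted.

```python
import struct

SMALL_INTEGER_EXT = 97  # 1 unsigned byte
INTEGER_EXT = 98        # 4 bytes, signed, big-endian
SMALL_BIG_EXT = 110     # 1-byte digit count, sign byte, little-endian digits
LARGE_BIG_EXT = 111     # 4-byte digit count, sign byte, little-endian digits


def decode_integer(buf, pos=0):
    """Decode one integer term at pos; return (value, next_pos)."""
    tag = buf[pos]
    if tag == SMALL_INTEGER_EXT:
        return buf[pos + 1], pos + 2
    if tag == INTEGER_EXT:
        (value,) = struct.unpack_from(">i", buf, pos + 1)
        return value, pos + 5
    if tag == SMALL_BIG_EXT:
        count, sign = buf[pos + 1], buf[pos + 2]
        start = pos + 3
    elif tag == LARGE_BIG_EXT:
        (count,) = struct.unpack_from(">I", buf, pos + 1)
        sign = buf[pos + 5]
        start = pos + 6
    else:
        raise ValueError(f"not an integer tag: {tag}")
    # Big integers store the magnitude as base-256 digits, least significant first.
    magnitude = int.from_bytes(buf[start:start + count], "little")
    return (-magnitude if sign else magnitude), start + count


two_to_64 = bytes([SMALL_BIG_EXT, 9, 0] + [0] * 8 + [1])
print(decode_integer(two_to_64)[0] == 2**64)
```

Every form starts with a tag byte and an explicit or implied length, so no bits of the tag itself carry payload.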

#15

That is why I disallowed improper lists and force all lists to be proper.

Maybe loudly disallowing them in the spec, yet making the encoder smart enough to rewrite them as proper lists, would make sense (possibly only with a relevant flag/option, so it’s not completely hidden)?
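One way such an opt-in rewrite could behave, sketched in Python. Everything here is hypothetical: ImproperList stands in for an Erlang term like [1, 2 | 3], and appending the tail as a last element is just one possible rewrite, which loses the information that the list was improper.

```python
from dataclasses import dataclass


@dataclass
class ImproperList:
    """Hypothetical stand-in for an Erlang improper list such as [1, 2 | 3]."""
    items: list
    tail: object


def normalize(term, rewrite_improper=False):
    """Return a term containing only proper lists, or raise by default."""
    if isinstance(term, ImproperList):
        if not rewrite_improper:
            raise ValueError("improper lists are not allowed")
        # One possible rewrite: treat the tail as a final element.
        items = [normalize(x, True) for x in term.items]
        return items + [normalize(term.tail, True)]
    if isinstance(term, list):
        return [normalize(x, rewrite_improper) for x in term]
    return term


print(normalize(ImproperList([1, 2], 3), rewrite_improper=True))
```

Without the flag the same call raises ValueError, so the strict behaviour stays the default and the rewrite is never silent.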

Solid implementations in other languages would rock!
