Acceptance of Erlang's 'term_to_binary' and vice versa in Elixir

joeerl · December 22, 2017, 11:29am

I’m playing with Elixir - It’s fun. I think @rvirding does give Elixir courses these days.

Re: files and database - when I give Erlang courses I say the three best things about Erlang are processes, links and term_to_binary, and the same should be true of Elixir (since these have nothing to do with the surface syntax and are properties of the beam VM).

I may be wrong but I don’t see much use of term_to_binary in Elixir - this is a great way to store anything on disk or communicate between systems. Is this a well-known way of storing and retrieving data to disk?

The post above has been split into a new thread, for context, here is Joe’s original post that this conversation stems from:

What makes a web application fast?

You have to remember that any SQL database will be a bottleneck, since there is an impedance mismatch between the way data is represented in a database and the way it is represented in the beam VM.

Databases are basically rectangular tables of cells, where the cells contain very simple types like strings and integers - every time you access a row of an external database this list of cells has to be converted to beam internal data structures - this conversion is extremely expensive.

The best way to persist data is in a process - then no conversion is needed, but this is not fault tolerant - so you need to keep a trail of updates to the data and store this on disk.
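A minimal sketch of such an update trail, assuming each update is appended to a file as a length-prefixed term and the state is rebuilt by replaying the file on restart. The module name UpdateLog, its function names and the 32-bit length prefix are all hypothetical choices for illustration, not something from the original post:

```elixir
defmodule UpdateLog do
  # Hypothetical helper: append one update to the log file as
  # <<size::32, term_to_binary(update)>> so records can be split again later.
  def append(path, update) do
    bin = :erlang.term_to_binary(update)
    File.write!(path, <<byte_size(bin)::32, bin::binary>>, [:append])
  end

  # Rebuild the state by applying every logged update, oldest first,
  # to the initial state. A missing file means "no updates yet".
  def replay(path, initial, fun) do
    case File.read(path) do
      {:ok, data} -> fold(data, initial, fun)
      {:error, :enoent} -> initial
    end
  end

  defp fold(<<size::32, bin::binary-size(size), rest::binary>>, state, fun) do
    fold(rest, fun.(:erlang.binary_to_term(bin), state), fun)
  end

  # End of file, or a partial trailing record from an interrupted write.
  defp fold(_rest, state, _fun), do: state
end
```

The owning process would call UpdateLog.append/2 before applying each change and UpdateLog.replay/3 once at startup; compacting the log is left out of this sketch.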

Often you don’t need a database. For example, you might like to have a system where you store all the user data in the file system with “one file per user”. This will scale very nicely - just move the files to a new machine if you need more capacity.

Erlang has two primitives term_to_binary and the inverse binary_to_term that serialise any term and reconstruct it - so storing complex terms on disk is really easy.
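As a small illustration from the Elixir side, here is a round trip through both primitives. The sample term is made up; any term would do:

```elixir
# A nested term mixing maps, lists, atoms, tuples and strings (made-up sample data).
term = %{id: 42, name: "joe", tags: [:admin, :staff], joined: {2017, 12, 22}}

# Serialise to a plain binary that can be written to disk or sent over a socket.
bin = :erlang.term_to_binary(term)
true = is_binary(bin)

# Reconstruct the term; the match fails loudly if the round trip loses anything.
restored = :erlang.binary_to_term(bin)
true = restored == term
```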

I have mixed feelings about databases, they are great for aggregate operations (for example, find all users that have these attributes) but terrible for operations on individual users (where a single file per user is far better).

If I were designing a new system I’d go for ‘one file per user’ as much as possible and try to limit databases for operations over all users.

If you look at how many programs are designed you’ll see they follow this principle. Apple stores all images in the file system (hidden a bit) and has a database with metadata about the files. This is good since the database is small and many operations can be performed with minimal use of the database. What they do not do is put all the data in a database - there are good reasons for this.

NobbZ · December 22, 2017, 11:52am

I’m not sure if this is well known in elixir world, but I had to sidestep over erlang after starting with elixir, because my targeted system had only OTP 16 available then and Elixir needed 17 at least. So I was using erlang exclusively for a year then. I learned about those nice little helpers back then.

But from elixir it’s hard to find them, since they are not available in the auto-imported modules. You have to know they are there and you have to explicitly import or remote-call them, either as import :erlang, only: [binary_to_term: 1, term_to_binary: 1] or :erlang.term_to_binary("foo").

I think if they were documented and autoinlined from Kernel, their discoverability in elixir would be much better and they would be used more.

joeerl · December 22, 2017, 1:04pm

I thought so.

When I teach I go on and on and on about the greatness of term_to_binary and the inverse. These are incredibly useful.

They are also blindingly fast compared to any JSON/XML type serialisation.

Something like:

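defmodule Term do
  # Serialise any term and write it to the given path.
  def store(anything, path) do
    bin = :erlang.term_to_binary(anything)
    File.write!(path, bin)
  end

  # Read the file back and reconstruct the original term.
  def fetch(path) do
    File.read!(path) |> :erlang.binary_to_term()
  end
end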

Should do the trick

NobbZ · December 22, 2017, 2:09pm

I made a proposal on elixir-core. It won’t make it into 1.6 though: José once said he plans to release it early in January, and I think I’ve read somewhere that he is already preparing its release… Anyway, the proposal needs to get accepted first; then the implementation should be straightforward…

LostKobrakai · December 22, 2017, 2:15pm

At least in my head there’s also always that mantra of “the filesystem is slow”, which of course is not that big an issue if the filesystem is not read for each request. So besides discoverability it’s probably also a case of educating/informing people.

NobbZ · December 22, 2017, 2:23pm

Of course, the filesystem is slow, but storing and loading in binary format straight into a file might be much faster than doing the same with JSON or stringified in an arbitrary database protocol.

Also, at least for me, filesystem does often just mean a dedicated area of memory, aka a ram-drive. It’s easier to share filenames on a ram drive with external processes than doing everything via stdin/out in a port. Sometimes I even have applications external to the beam that read those files.

Also, when you want to persist you need to write to the filesystem, either you do it from the beam directly, or you push your data to a database, which will then persist to disk as well…

So somewhere in the process of persisting data you will always hit the disk.

ryh · December 22, 2017, 2:24pm

Yes. I’m using it to snapshot process state on disk. It works like a charm.
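A rough sketch of what such snapshotting could look like, assuming the state lives in an Agent. The Snapshot module and its function names are hypothetical, not taken from the post above:

```elixir
defmodule Snapshot do
  # Hypothetical helper: copy the agent's current state and write it to disk.
  def save(agent, path) do
    state = Agent.get(agent, & &1)
    File.write!(path, :erlang.term_to_binary(state))
  end

  # Start a new agent from the last snapshot, or from `default`
  # when no snapshot can be read.
  def restore(path, default) do
    state =
      case File.read(path) do
        {:ok, bin} -> :erlang.binary_to_term(bin)
        {:error, _reason} -> default
      end

    Agent.start_link(fn -> state end)
  end
end
```

Calling Snapshot.save/2 on a timer or before shutdown and Snapshot.restore/2 at boot would be one way to wire it in; writing to a temporary file and renaming it would make the snapshot safer against crashes mid-write, which this sketch does not do.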

gon782 · December 22, 2017, 3:13pm

I don’t know if I find that to be a great reason to not expose term_to_binary, seeing as it’s not actually only for writing to disk. It’s just a serialization of any erlang term, so you can use it on the wire as well.

NobbZ · December 22, 2017, 3:26pm

Oh, and especially as filesystem and network is slow, I strongly prefer a binary serialisation format over a human readable plaintext format like JSON or XML (at least when I do not expect humans to read the output).

term_to_binary even has built-in compression (via the compressed option of term_to_binary/2), so binarified chunks of data can be much smaller than the same term serialised to JSON or XML.
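For reference, compression is something you ask for through the options list of :erlang.term_to_binary/2. A small sketch with made-up, deliberately repetitive sample data:

```elixir
# Made-up sample: 1,000 identical maps, so there is plenty to compress.
term = List.duplicate(%{status: "active", role: "member"}, 1_000)

plain = :erlang.term_to_binary(term)
packed = :erlang.term_to_binary(term, [:compressed])

IO.puts("plain: #{byte_size(plain)} bytes, compressed: #{byte_size(packed)} bytes")

# binary_to_term detects the compressed form on its own; no option is needed.
true = :erlang.binary_to_term(packed) == term
```

How much is saved depends entirely on the data, so it is worth measuring on real payloads.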

LostKobrakai · December 22, 2017, 3:27pm

I’m not arguing against that. My comments were on the topic of people using databases instead of file based persistence directly from elixir/erlang.

NobbZ · December 22, 2017, 3:29pm

And we are just explaining why that mantra is obviously correct, but also why, precisely because it is true, we want to use term_to_binary. Please do not take offense; it’s just that you were the one who actually said it out loud.

AstonJ · December 22, 2017, 4:08pm

From my perspective I think it is known as I’ve seen it mentioned a few times in the books and courses I’ve done - pretty sure @sasajuric mentions it a few times in Elixir in Action.

However, what I don’t think is well known is what you said in your other post - that it could well be a much better way to store certain kinds of data.

Norbert has split your post into a dedicated thread so I’ll add your original post as a quote to your post above.

sasajuric · December 22, 2017, 4:18pm

Yeah, I’ve used :erlang.term_to_binary to implement a very simple ad-hoc database. My main motivation in the book was to keep things simple. Using a full-blown database would have required installing some piece of software, introducing a mix project and an OTP application, and adding another dependency. I definitely didn’t want to deal with all that in the chapter which explains GenServer.

I also occasionally reach for term_to_binary in real life, for some simple nice-to-have short-term persistency. I think it’s a great no-ceremony, no-impedance-mismatch fit for such scenarios.

We almost ended up storing encoded terms to PostgreSQL in one case, but we decided against it, since we were worried about possible future changes to the format. I saw somewhere (can’t remember where though), that the format rarely changes, but that it can still happen.

chrismccord · December 22, 2017, 4:22pm

The phoenix long poll transport for channels uses term_to_binary to encode the long poll server pid and send it back to the client. When they repoll, we binary_to_term back into a pid to ask the server if it has any messages for us, which has been a fun way to use these features.
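The general idea can be sketched like this. It is not Phoenix's actual code; the Base64 step is an assumption made here so the token is safe to put in a URL or header:

```elixir
# Encode the current process's pid into a URL-safe string token.
pid = self()
token = pid |> :erlang.term_to_binary() |> Base.url_encode64()

# Later, turn the token back into a pid and talk to that process.
decoded = token |> Base.url_decode64!() |> :erlang.binary_to_term()
true = decoded == pid

send(decoded, :any_messages?)
```

Anything handed to a client like this can be tampered with, so a real transport would also need to sign or otherwise protect the token; that part is outside this sketch.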

sasajuric · December 22, 2017, 4:25pm

Agreed! At Aircloak we have two Elixir systems chatting over a socket connection (detailed explanation is in this post). For a long time, we just shipped JSON over the wire, but at some point we noticed that encoding takes a long time for large payloads. After some measurements, we replaced it with term_to_binary/binary_to_term, since it was much faster (even faster than jiffy). As an added bonus there is no impedance mismatch between the data being exchanged.

tme_317 · December 22, 2017, 6:00pm

These functions existed in Kernel ~4 years ago and were removed here probably to make the standard library more concise: https://github.com/elixir-lang/elixir/issues/2003

I agree with your proposal term_to_binary and binary_to_term should be re-added to Kernel but also understand it’s tough to know where to draw the line in regards to the breadth of the stdlib… especially auto-imported Kernel functions.

michalmuskala · December 22, 2017, 7:08pm

One should take care, though, when using binary_to_term on data received from the network. The deserialization itself can create resources that are limited in the system, thus leading to DoS (atoms being the primary element). There’s also the issue that it allows for a gzip compression of the data, so it is potentially susceptible to a zip-bomb attack.
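To make the atom case concrete, here is a sketch that hand-builds a payload naming an atom the VM has never seen and decodes it with the :safe option of :erlang.binary_to_term/2. The atom name is generated at run time so that it cannot already exist, and the byte layout follows the external term format (131 is the version tag, 119 is SMALL_ATOM_UTF8_EXT):

```elixir
# Build the name as a string only; no atom is created by this line.
name = "never_seen_atom_#{System.unique_integer([:positive])}"

# External term format for a small UTF-8 atom: tag, one length byte, then the name.
payload = <<131, 119, byte_size(name), name::binary>>

# With :safe, decoding refuses to create the new atom and raises ArgumentError.
result =
  try do
    :erlang.binary_to_term(payload, [:safe])
  rescue
    ArgumentError -> :rejected
  end

true = result == :rejected
```

This only covers the atom exhaustion case; it does not put any limit on how large a compressed payload may expand, so size checks on untrusted input still have to happen separately.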

While neither function is often used directly in applications, they are used quite frequently in libraries. Off the top of my head, from the stuff we use in our production app, one example would be Phoenix.Token, and I’m pretty sure there are some others.

cmkarlsson · December 22, 2017, 9:32pm

Would have liked a safe option to binary_to_term that was indeed safe. The documentation says it can be used when receiving binaries from an untrusted source. However someone showed me that at least in Elixir this can be used for bad things regardless. Can’t remember the specifics though :/. Does anyone know why binary_to_term is not considered safe even though documentation hints it can be used with the safe option?

Deserialization can be problematic in many formats. XML and Yaml both suffer from DoS attacks. (Billion laughs attack - Wikipedia).

This is especially true if you are using erlang records.

rvirding · December 22, 2017, 10:47pm

Sorry, I don’t get your meaning here. Records haven’t changed format in the roughly 25 years since they were added to the language. They haven’t changed syntax, Erlang syntax at least, in that time either. They are partly (some say mainly) my fault (so you know who to blame).

jeremyjh · December 22, 2017, 11:19pm

The post you are replying to gave you two examples. You can DoS a server by filling its memory with atoms that cannot be garbage collected. Or a zip-bomb.