Storing embedded schemas using :erlang.term_to_binary (ETF) instead of JSON?

I have a data warehousing project where I need to store graphs of structs imported from hundreds of different data sources in a table. I know the typical way to do that (short of creating hundreds of associated tables) is to use Ecto’s embedded schemas and add a payload field of type :map; in PostgreSQL this is stored as JSONB under the hood, and Ecto serializes/deserializes it automatically.
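For context, a minimal sketch of that conventional approach, assuming a hypothetical `Warehouse.Import` schema (the module, table, and field names are illustrative, not from any real project):

```elixir
defmodule Warehouse.Import do
  use Ecto.Schema

  schema "imports" do
    field :source, :string
    # Backed by a JSONB column in PostgreSQL; Ecto handles the
    # map <-> JSON round trip automatically on load/dump.
    field :payload, :map
    timestamps()
  end
end
```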

Since I never need Postgres to query the internals of the stored graphs, and just want to load the structs at runtime for rendering, I was thinking about using a bytea column and a custom Ecto type to persist my arbitrary Elixir structs with compression (these highly nested structs can be huge): serialize with :erlang.term_to_binary/2, then deserialize with Plug.Crypto.non_executable_binary_to_term/2 when reading. I can also de-dupe by storing the SHA-256 of the payload as the primary key of this massive table, since many of these imported data sources change very slowly and I will have to pull periodic updates from them.
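A stdlib-only sketch of the encode/hash/decode round trip described above (the `PayloadCodec` module and function names are hypothetical; in the real custom Ecto type these would live in `dump/1` and `load/1` callbacks):

```elixir
defmodule PayloadCodec do
  # Compress while encoding; `compressed: 9` trades CPU for size,
  # which pays off on large, repetitive nested structs.
  def encode(term), do: :erlang.term_to_binary(term, compressed: 9)

  # Content-addressed key for de-duplication of slow-changing sources.
  def key(binary), do: :crypto.hash(:sha256, binary)

  # `[:safe]` refuses to create new atoms. For genuinely untrusted
  # input, Plug.Crypto.non_executable_binary_to_term/2 goes further
  # and also rejects funs, pids, ports, and references.
  def decode(binary), do: :erlang.binary_to_term(binary, [:safe])
end

graph = %{source: "feed-1", rows: [1, 2, 3]}
bin = PayloadCodec.encode(graph)
true = PayloadCodec.decode(bin) == graph
# SHA-256 digests are always 32 bytes.
32 = byte_size(PayloadCodec.key(bin))
```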

Of course I plan to test how much performance and storage usage differ vs JSON, but I don’t see why I should convert to JSON and back when the data stays in Elixir for its entire lifetime and is never sent to JS or other external systems. The only thing that seems a bit scary is upgrading OTP versions, but I guess I can always deserialize, re-serialize, and recompute the new hash if/when the ETF encoding changes in the future.

Is this an anti-pattern? I don’t see much evidence of this approach being used in the community. I’ve found several threads on this forum related to term_to_binary, but none exactly applies to my idea.


Seems sensible to me. Rihanna stores job data using term_to_binary and it works well so far, though not being able to query it from Postgres can be annoying (although this isn’t an issue for you).

I remember Joe Armstrong saying that term_to_binary is one of the best things about Erlang and that it’s rather underused. :slight_smile:


Thanks, really appreciate the feedback and also the example of storing terms in DB from a community library.

Off-topic, but I’m really loving what you’re doing creating Gleam!


We use term_to_binary and binary_to_term in a few specific places. Both functions are very useful. You do need to ensure that you aren’t going to deserialize anything potentially malicious, since it’s possible to load functions and other executable terms.
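A small demonstration of the risk: a fun survives the round trip through plain binary_to_term, which is exactly why blindly deserializing untrusted input is dangerous (Plug.Crypto.non_executable_binary_to_term/2 raises on payloads like this instead):

```elixir
# Serialize an anonymous function, then decode it with no filtering.
evil = :erlang.term_to_binary(fn -> :boom end)
f = :erlang.binary_to_term(evil)

# The attacker-supplied fun comes back as a live, callable value.
true = is_function(f)
```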

You should also make sure to benchmark against Jason, jiffy, or your JSON serializer of choice. We tested against some of our payloads and term_to_binary was quite a bit slower than just using JSON. Those situations were definitely edge cases, but you’ll want to test to make sure you’re getting the benefits you want.
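A stdlib-only starting point for that kind of micro-benchmark, comparing plain vs compressed term_to_binary on a synthetic payload (for a real test, add your JSON encoder, e.g. Jason.encode!/1, as a third contender and use your actual data; the payload here is made up):

```elixir
# Synthetic, repetitive payload; compression should do well on it.
payload = %{rows: Enum.map(1..1_000, &%{id: &1, name: "row #{&1}"})}

bench = fn label, fun ->
  {us, bin} = :timer.tc(fn -> fun.(payload) end)
  IO.puts("#{label}: #{us} us, #{byte_size(bin)} bytes")
  bin
end

plain = bench.("term_to_binary", &:erlang.term_to_binary/1)
packed = bench.("term_to_binary compressed", &:erlang.term_to_binary(&1, compressed: 9))

# On repetitive data the compressed encoding should be smaller.
true = byte_size(packed) < byte_size(plain)
```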


I’d also like to point out that ETF is versioned, and an updated version could break stored data, forcing a complex data migration. Look at the docs for more info; I seem to remember this being discussed in the distribution section.
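On that note, every ETF binary carries the external format version tag as its first byte (131 for the current format), so a cheap guard on read can detect a format change before you attempt a migration. A minimal check:

```elixir
# All current ETF binaries start with the version tag 131; matching on
# it when reading stored payloads catches a future format bump early.
<<131, _rest::binary>> = :erlang.term_to_binary(:ok)
```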